The Population Haplotyping problem
![Page 1: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/1.jpg)
The Population Haplotyping problem
![Page 2: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/2.jpg)
[Figure: pairs of haplotypes over two SNPs for several individuals, with the corresponding genotypes (0*, **, 10, 1*, 11, *0)]
NOTATION: each SNP takes only two values in a population (a biological fact). Call them 0 and 1. Also, write * when a site is heterozygous
HAPLOTYPE: string over {0, 1}
GENOTYPE: string over {0, 1, *}, where 0 = {0}, 1 = {1}, * = {0, 1}
![Page 3: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/3.jpg)
[Figure: pairs of haplotypes over two SNPs, each pair summed into a genotype (0*, **, 10, 1*, *0)]
ALGEBRA OF HAPLOTYPES:
Homozygous sites: 0 + 0 = 0, 1 + 1 = 1
Heterozygous (ambiguous) sites: 0 + 1 = *, 1 + 0 = *
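The site-by-site sum above can be sketched in a few lines of Python. The function name `combine` is only an illustrative choice, not something defined in the slides.

```python
def combine(h1: str, h2: str) -> str:
    """Sum two haplotypes site by site.

    Equal alleles (0+0, 1+1) are kept; different alleles (0+1, 1+0) give '*'.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))


print(combine("11101", "10000"))  # 1**0*
```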
![Page 4: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/4.jpg)
Phasing the alleles
The genotype 1**0* admits four phasings:
11101 + 10000
11100 + 10001
11001 + 10100
11000 + 10101
For k heterozygous (ambiguous) sites, there are 2^(k-1) possible phasings
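As a rough illustration of the count 2^(k-1), the sketch below enumerates all phasings of a genotype. The helper name `phasings` is hypothetical. It fixes the first heterozygous site to 1 in the first haplotype so that a pair and its swapped copy are not counted twice.

```python
from itertools import product


def phasings(g: str) -> list[tuple[str, str]]:
    """Return all unordered haplotype pairs (h1, h2) with h1 + h2 = g."""
    het = [i for i, c in enumerate(g) if c == "*"]
    if not het:
        return [(g, g)]  # no ambiguous site: a single trivial phasing
    result = []
    # First heterozygous site is always 1 in h1; the other k-1 sites are free.
    for bits in product("10", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("1",) + bits):
            h1[i] = b
            h2[i] = "0" if b == "1" else "1"
        result.append(("".join(h1), "".join(h2)))
    return result


for h1, h2 in phasings("1**0*"):
    print(h1, "+", h2)
```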
![Page 5: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/5.jpg)
THE PHASING (or HAPLOTYPING) PROBLEM
Given genotypes of k individuals, determine the phasings
of all heterozygous sites.
Determining haplotypes directly is too expensive
It is much cheaper to determine genotypes, and then infer haplotypes in silico:
This yields a set H of (at most) 2k haplotypes. H is a resolution of G.
![Page 6: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/6.jpg)
The input is GENOTYPE data
00011
11011
*1**1
****1
11**1
INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }
![Page 7: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/7.jpg)
The input is GENOTYPE data
INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }
OUTPUT: H = { 11011, 11101, 00011, 01101 }
Each genotype is resolved by two haplotypes:
11**1 = 11011 + 11101
****1 = 00011 + 11101
*1**1 = 11011 + 01101
11011 = 11011 + 11011
00011 = 00011 + 00011
We will define some objectives for H
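One way to check that a candidate set H resolves G is to sum every pair of haplotypes in H and look up each genotype. This is a minimal sketch; the names `combine` and `resolution` are illustrative assumptions.

```python
from itertools import combinations_with_replacement


def combine(h1: str, h2: str) -> str:
    """Site-by-site sum: equal alleles are kept, different alleles give '*'."""
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))


def resolution(H, G):
    """Map each genotype in G to one pair from H that sums to it (None if no pair does)."""
    by_genotype = {}
    for h1, h2 in combinations_with_replacement(sorted(H), 2):
        by_genotype.setdefault(combine(h1, h2), (h1, h2))
    return {g: by_genotype.get(g) for g in G}


G = ["11**1", "****1", "11011", "*1**1", "00011"]
H = {"11011", "11101", "00011", "01101"}
for g, pair in resolution(H, G).items():
    print(g, pair)
```

If any genotype maps to `None`, H is not a resolution of G.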
![Page 8: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/8.jpg)
OBJECTIVES
- without objectives/constraints, the haplotyping problem would be (mathematically) trivial
E.g., always put 0 above and 1 below:
**0*1 = 00001 + 11011
1*0** = 10000 + 11011
- the objectives/constraints must be “driven by biology”
![Page 9: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/9.jpg)
OBJECTIVES
1°) Clark’s inference rule
2°) Perfect Phylogeny
3°) Disease Association
4°) (parsimony): minimize |H|
![Page 10: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/10.jpg)
1st Obj: Clark’s rule
![Page 11: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/11.jpg)
Inference Rule for a compatible pair h, g
1011001011 (known haplotype h)
+ ??????????
= 1**1001*1* (known (ambiguous) genotype g)
![Page 12: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/12.jpg)
Inference Rule for a compatible pair h, g
1011001011 (known haplotype h)
+ 1101001110 (new (derived) haplotype h’)
= 1**1001*1* (known (ambiguous) genotype g)
We write h + h’ = g
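The inference rule can be sketched as follows. The function names `compatible` and `derive` are illustrative. Compatibility is taken here to mean that h agrees with g on every non-ambiguous site, which is how the example above reads.

```python
def compatible(h: str, g: str) -> bool:
    """h is compatible with g if it matches g on every site that is not '*'."""
    return len(h) == len(g) and all(c == "*" or c == a for a, c in zip(h, g))


def derive(h: str, g: str) -> str:
    """Return h' such that h + h' = g: copy h on resolved sites, flip it on '*' sites."""
    if not compatible(h, g):
        raise ValueError("h and g are not compatible")
    flip = {"0": "1", "1": "0"}
    return "".join(flip[a] if c == "*" else a for a, c in zip(h, g))


print(derive("1011001011", "1**1001*1*"))  # 1101001110
```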
![Page 13: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/13.jpg)
1st Objective (Clark, 1990)
1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while
![Page 14: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/14.jpg)
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
![Page 15: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/15.jpg)
Example: G = { 0000, 1000, **00, 11** }
![Page 16: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/16.jpg)
Example: G = { 0000, 1000, **00, 11** }
Derived haplotype: 1100 (0000 + 1100 = **00)
![Page 17: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/17.jpg)
Example: G = { 0000, 1000, **00, 11** }
Derived haplotypes: 1100, then 1111 (1100 + 1111 = 11**): SUCCESS
![Page 18: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/18.jpg)
Example: G = { 0000, 1000, **00, 11** }
![Page 19: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/19.jpg)
Example: G = { 0000, 1000, **00, 11** }
Derived haplotype: 0100 (1000 + 0100 = **00)
![Page 20: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/20.jpg)
Example: G = { 0000, 1000, **00, 11** }
Derived haplotype: 0100, then FAILURE (can’t resolve 11**)
![Page 21: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/21.jpg)
1st Objective (Clark, 1990)
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic: the algorithm could end without explaining all genotypes even if an explanation was possible.
The number of genotypes solved depends on the order of application.
OBJ: find the order of rule applications that leaves the fewest elements in G
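The order dependence can be seen on the four-site example from the slides. The sketch below is a simplified reading of Clark's loop, not a reference implementation. The `prefer` callback is a hypothetical device that makes the non-deterministic choice at step 3 explicit.

```python
def compatible(h, g):
    """h matches g on every site that is not '*'."""
    return len(h) == len(g) and all(c == "*" or c == a for a, c in zip(h, g))


def derive(h, g):
    """Return h' with h + h' = g (flip h on the '*' sites of g)."""
    flip = {"0": "1", "1": "0"}
    return "".join(flip[a] if c == "*" else a for a, c in zip(h, g))


def clark(bootstrap, ambiguous, prefer):
    """Run the loop; `prefer` picks one applicable (h, g) pair at step 3."""
    H, G = list(bootstrap), list(ambiguous)
    while True:
        candidates = [(h, g) for g in G for h in H if compatible(h, g)]
        if not candidates:
            break
        h, g = prefer(candidates)
        h_new = derive(h, g)
        if h_new not in H:
            H.append(h_new)
        G.remove(g)
    return H, G  # G empty means SUCCESS


bootstrap = ["0000", "1000"]   # the unambiguous genotypes
ambiguous = ["**00", "11**"]
print(clark(bootstrap, ambiguous, lambda c: c[0]))   # uses 0000 first: SUCCESS
print(clark(bootstrap, ambiguous, lambda c: c[-1]))  # uses 1000 first: FAILURE
```

Choosing 0000 to resolve **00 derives 1100, which then resolves 11**. Choosing 1000 derives 0100, and 11** is left unresolved.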
![Page 22: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/22.jpg)
The problem was studied by Gusfield (ISMB 2000, and Journal of Comp. Biol., 2001)
- problem is APX-hard
- it corresponds to finding the largest forest in a graph with haplotypes as nodes and arcs for possible derivations
- solved via an ILP of exponential size (practical for small real instances)
![Page 23: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/23.jpg)
2nd Obj: Perfect Phylogeny
![Page 24: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/24.jpg)
- Parsimony does not take into account mutations/evolution of haplotypes
- parsimony is very reliable on “small” haplotype blocks
- when haplotypes are large (span several SNPs), we should consider evolutionary events and recombination
- the cleanest model for evolution is the perfect phylogeny
![Page 25: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/25.jpg)
- A phylogeny explains a set of binary features (e.g. flies, has fur…) with a tree
- Leaf nodes are labeled with species
- Each feature labels an edge leading to a subtree that possesses it
The 2nd objective is based on perfect phylogeny
![Page 26: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/26.jpg)
[Figure: example tree whose edges are labeled “has 2 legs”, “has tail”, “flies”]
![Page 27: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/27.jpg)
[Figure: example tree whose edges are labeled “has 2 legs”, “has tail”, “flies”]
But… a new species may come along so that no perfect phylogeny is possible…
![Page 28: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/28.jpg)
Theorem: such a matrix has a perfect phylogeny (p.p.) iff it does not contain the 4x2 minor with rows 00, 10, 01, 11
Columns: two legs, tail, flies
Human 1 0 0
Mouse 0 1 0
Spider 0 0 0
Eagle 1 0 1
![Page 29: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/29.jpg)
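Theorem: such a matrix has a perfect phylogeny (p.p.) iff it does not contain the 4x2 minor with rows 00, 10, 01, 11
Columns: two legs, tail, flies
Human 1 0 0
Mouse 0 1 0
Spider 0 0 0
Eagle 1 0 1
Mickey mouse 1 1 0
With Mickey mouse, the columns “two legs” and “tail” contain all of 00, 10, 01, 11, so there is no perfect phylogeny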
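The condition in the theorem, as stated on the slide, can be checked by looking at every pair of columns. The sketch below assumes a 0/1 matrix given as a list of rows. The function name is illustrative.

```python
from itertools import combinations


def has_perfect_phylogeny(rows):
    """True iff no pair of columns shows all four combinations 00, 01, 10, 11."""
    n_cols = len(rows[0])
    for i, j in combinations(range(n_cols), 2):
        seen = {(r[i], r[j]) for r in rows}
        if len(seen) == 4:  # forbidden 4x2 minor found in columns i, j
            return False
    return True


# Columns: two legs, tail, flies
species = {
    "Human": [1, 0, 0],
    "Mouse": [0, 1, 0],
    "Spider": [0, 0, 0],
    "Eagle": [1, 0, 1],
}
print(has_perfect_phylogeny(list(species.values())))                # True
print(has_perfect_phylogeny(list(species.values()) + [[1, 1, 0]]))  # False (Mickey mouse added)
```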
![Page 30: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/30.jpg)
We can consider each SNP as a binary feature
Objective: We want the solution to admit a perfect phylogeny
(Rationale : we assume haplotypes have evolved independently along a tree)
![Page 31: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/31.jpg)
Genotypes:
0 1 * 0
* 1 0 *
* 0 * 0
![Page 32: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/32.jpg)
Genotypes and a candidate resolution:
0 1 * 0 = 0 1 0 0 + 0 1 1 0
* 1 0 * = 1 1 0 1 + 0 1 0 0
* 0 * 0 = 1 0 0 0 + 0 0 1 0
![Page 33: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/33.jpg)
Genotypes and a candidate resolution:
0 1 * 0 = 0 1 0 0 + 0 1 1 0
* 1 0 * = 1 1 0 1 + 0 1 0 0
* 0 * 0 = 1 0 0 0 + 0 0 1 0
NOT a perfect phylogeny solution!
![Page 34: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/34.jpg)
0 1 * 0 0 1 0 *0 0 0 *
![Page 35: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/35.jpg)
0 1 0 0 0 1 1 00 1 0 0 1 1 0 1 0 0 0 00 0 0 1
A perfect phylogeny
0 1 * 0 0 1 0 *0 0 0 *
![Page 36: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/36.jpg)
Theorem: The Perfect Phylogeny Haplotyping problem is solvable in polynomial time
![Page 37: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/37.jpg)
Algorithms are of combinatorial nature
- There is a graph whose nodes are the SNPs (columns) and whose edges are of two types (forced and free)
- forced edges connect pairs of SNPs that must be phased in the same way
A pair ** is phased either as 00 + 11 or as 01 + 10
- a complex visit of the graph decides how to phase free SNPs
![Page 38: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/38.jpg)
3rd Obj: Disease Association
![Page 39: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/39.jpg)
Some diseases may be due to a gene which has “faulty” configurations
RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased one must have both copies faulty. With only one faulty copy, one is a carrier of the disease
DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be diseased it is enough to have one faulty copy
Two individuals of which one is healthy and the other diseased may have the same genotype.
The explanation of the disease lies in a difference in their haplotypes
![Page 40: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/40.jpg)
INPUT: GD = {11**1, *1**1, 0*011} (diseased), GH = {11**1, 0**01, 00011} (healthy)
![Page 41: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/41.jpg)
INPUT: GD = {11**1, *1**1, 0*011}, GH = {11**1, 0**01, 00011}
OUTPUT: H = {11011, 01011, 00001, 11111, 11101, 00011, 01101}
Resolving pairs shown:
11**1 = 11011 + 11101
11**1 = 11001 + 11111
*1**1 = 11011 + 01101
0*011 = 01011 + 00011
0**01 = 01101 + 00001
00011 = 00011 + 00011
H contains HD, s.t. each diseased individual has >= 1 haplotype in HD and each healthy individual has none
![Page 42: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/42.jpg)
![Page 43: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/43.jpg)
Theorem 1 is proved via a reduction from 3-SAT
Theorem 2 has a mathematical proof (coloring argument) with little relation to biology: there is R (depending on input) s.t. a haplotype is healthy if the sum of its bits is congruent to R modulo 3
This means the model must be refined!
![Page 44: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/44.jpg)
![Page 45: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/45.jpg)
4th Obj: Max Parsimony. See separate slides…
![Page 46: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/46.jpg)
Summary:
- haplotyping in silico is needed for economic reasons
- several objectives, all biologically driven
- nice combinatorial problems (mostly due to binary nature of SNPs)
- these problems are technology-dependent and may become obsolete (hopefully after we have retired)
![Page 47: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/47.jpg)
[Figure: example genotypes written over 0, 1, 2 (022, 222, 012, 221, …), each shown with a resolving pair of haplotypes. Here 2 plays the role of *, i.e. a heterozygous site]
![Page 48: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/48.jpg)
2nd Objective (parsimony): minimize |H|
![Page 49: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/49.jpg)
1. The problem is APX-Hard
Reduction from VERTEX-COVER
![Page 50: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/50.jpg)
[Figure: graph G on nodes A, B, C, D, E]
![Page 51: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/51.jpg)
[Figure: graph G on nodes A, B, C, D, E]
Columns (SNPs): A B C D E *
![Page 52: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/52.jpg)
[Figure: graph G on nodes A, B, C, D, E]
Columns (SNPs): A B C D E *
Genotype rows (one per edge): AB, BC, AE, DE, AD
![Page 53: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/53.jpg)
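[Figure: graph G on nodes A, B, C, D, E]
Columns (SNPs): A B C D E *
Genotype rows (one per edge): AB, BC, AE, DE, AD
Haplotype rows (one per node): A, B, C, D, E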
![Page 54: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/54.jpg)
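[Figure: graph G on nodes A, B, C, D, E]
Columns (SNPs): A B C D E *
Genotype rows AB, BC, AE, DE, AD: entry 2 in the columns of the two endpoints
Haplotype rows (one per node): A, B, C, D, E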
![Page 55: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/55.jpg)
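[Figure: graph G on nodes A, B, C, D, E]
Columns (SNPs): A B C D E *
Genotype rows AB, BC, AE, DE, AD: entry 2 in the columns of the two endpoints
Haplotype rows A, B, C, D, E: entry 0 in the node’s own column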
![Page 56: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/56.jpg)
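[Figure: graph G on nodes A, B, C, D, E]
Columns (SNPs): A B C D E *
Genotype rows AB, BC, AE, DE, AD: entry 2 in the columns of the two endpoints and in column *
Haplotype rows A, B, C, D, E: entry 0 in the node’s own column and in column *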
![Page 57: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/57.jpg)
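[Figure: graph G on nodes A, B, C, D, E]
Columns (SNPs): A B C D E *
Genotypes (one per edge):
AB 2 2 1 1 1 2
BC 1 2 2 1 1 2
AE 2 1 1 1 2 2
DE 1 1 1 2 2 2
AD 2 1 1 2 1 2
Haplotypes (one per node):
A 0 1 1 1 1 0
B 1 0 1 1 1 0
C 1 1 0 1 1 0
D 1 1 1 0 1 0
E 1 1 1 1 0 0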
![Page 58: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/58.jpg)
[Figure: graph G on nodes A, B, C, D, E]
Columns (SNPs): A B C D E *
Genotypes (one per edge):
AB 2 2 1 1 1 2
BC 1 2 2 1 1 2
AE 2 1 1 1 2 2
DE 1 1 1 2 2 2
AD 2 1 1 2 1 2
Haplotypes (one per node):
A 0 1 1 1 1 0
B 1 0 1 1 1 0
C 1 1 0 1 1 0
D 1 1 1 0 1 0
E 1 1 1 1 0 0
G = (V,E) has a node cover X of size k if and only if there is a set H of |V| + k haplotypes that explain all genotypes
![Page 59: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/59.jpg)
![Page 60: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/60.jpg)
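[Figure: graph G on nodes A, B, C, D, E]
Columns (SNPs): A B C D E *
Genotypes (one per edge):
AB 2 2 1 1 1 2
BC 1 2 2 1 1 2
AE 2 1 1 1 2 2
DE 1 1 1 2 2 2
AD 2 1 1 2 1 2
Haplotypes (one per node):
A 0 1 1 1 1 0
B 1 0 1 1 1 0
C 1 1 0 1 1 0
D 1 1 1 0 1 0
E 1 1 1 1 0 0
Additional haplotypes (one per node of the cover {A, B, E}):
A’ 0 1 1 1 1 1
B’ 1 0 1 1 1 1
E’ 1 1 1 1 0 1
G = (V,E) has a node cover X of size k if and only if there is a set H of |V| + k haplotypes that explain all genotypes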
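The construction shown in these matrices can be rebuilt and checked mechanically. This is a sketch of the easy direction only (a cover of size k yields |V| + k explaining haplotypes), using only the rows visible on the slides. The function names are illustrative.

```python
def combine(h1, h2):
    """Site-by-site sum in the 0/1/2 notation: different alleles give '2'."""
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


def reduction(nodes, edges):
    """Build node haplotypes, primed haplotypes and edge genotypes over columns nodes + '*'."""
    cols = list(nodes) + ["*"]
    hap = {v: "".join("0" if c in (v, "*") else "1" for c in cols) for v in nodes}
    primed = {v: "".join("0" if c == v else "1" for c in cols) for v in nodes}
    gen = {(u, v): "".join("2" if c in (u, v, "*") else "1" for c in cols) for u, v in edges}
    return hap, primed, gen


def explain_with_cover(nodes, edges, cover):
    """Return |V| + |cover| haplotypes explaining all edge genotypes, or None if cover misses an edge."""
    hap, primed, gen = reduction(nodes, edges)
    H = set(hap.values()) | {primed[v] for v in cover}
    for (u, v), g in gen.items():
        c, other = (u, v) if u in cover else (v, u)
        # Edge uv is explained by the primed haplotype of a covering endpoint plus the other endpoint.
        if c not in cover or combine(primed[c], hap[other]) != g:
            return None
    return H


edges = [("A", "B"), ("B", "C"), ("A", "E"), ("D", "E"), ("A", "D")]
H = explain_with_cover("ABCDE", edges, {"A", "B", "E"})
print(len(H))  # 5 + 3 = 8
```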
![Page 61: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/61.jpg)
A basic ILP formulation
![Page 62: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/62.jpg)
A basic ILP formulation
Expand your input G in all possible ways
G = { 220, 120, 022 }
![Page 63: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/63.jpg)
A basic ILP formulation
Expand your input G in all possible ways
220 = 010 + 100, 000 + 110
120 = 100 + 110
022 = 000 + 011, 001 + 010
![Page 64: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/64.jpg)
A basic ILP formulation
Expand your input G in all possible ways
220 = 010 + 100, 000 + 110
120 = 100 + 110
022 = 000 + 011, 001 + 010
Variables: x_h for each haplotype h, and y_{h1,h2} for each pair h1, h2
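For an instance this small, the effect of the expansion can be seen without an ILP solver. The sketch below is a brute-force stand-in for the integer program, not the formulation itself. It tries every combination of one expansion per genotype and keeps the smallest haplotype set. Function names are illustrative.

```python
from itertools import product


def expansions(g):
    """All unordered pairs (h1, h2) with h1 + h2 = g, in the 0/1/2 notation."""
    het = [i for i, c in enumerate(g) if c == "2"]
    if not het:
        return [(g, g)]
    out = []
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(het, ("0",) + bits):  # first heterozygous site fixed to 0 in h1
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        out.append(("".join(h1), "".join(h2)))
    return out


def min_haplotypes(G):
    """Exhaustively pick one expansion per genotype, minimizing the number of distinct haplotypes."""
    best = None
    for choice in product(*(expansions(g) for g in G)):
        H = {h for pair in choice for h in pair}
        if best is None or len(H) < len(best):
            best = H
    return best


print(sorted(min_haplotypes(["220", "120", "022"])))
```

The search is exponential in the number of genotypes, which is why an ILP is used in practice.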
![Page 65: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/65.jpg)
The resulting Integer Program (IP1):
![Page 66: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/66.jpg)
![Page 67: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/67.jpg)
Other ILP formulations are possible, e.g. POLY-SIZE ILP formulations
![Page 68: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/68.jpg)
![Page 69: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/69.jpg)
The input is GENOTYPE data
INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }
OUTPUT: H = { 11011, 11101, 00011, 01101 }
Each genotype is explained by two haplotypes:
11**1 = 11011 + 11101
****1 = 00011 + 11101
*1**1 = 11011 + 01101
11011 = 11011 + 11011
00011 = 00011 + 00011
We will define some objectives for H
![Page 70: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/70.jpg)
1st Objective (open research problem):
minimize |H|
2nd Objective based on inference rule:
![Page 71: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/71.jpg)
1st Objective (parsimony):
minimize |H|
An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes; hence, for n genotypes, we need at least LB = SQRT(n) haplotypes.
BUT any greedy algorithm can find 2 haplotypes to explain each genotype, giving a solution of <= 2n haplotypes, i.e. <= 2 * SQRT(n) * LB
It’s difficult, but not impossible, to come up with better approximations, such as constant-factor ones (Lancia, Pinotti, Rizzi ’02)
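The counting argument can be written out as follows. This sketch assumes n counts distinct genotypes, each explained by a pair of distinct haplotypes, and writes H_greedy (a name introduced here) for the set returned by the trivial algorithm.

```latex
\frac{k(k-1)}{2} \ge n
\;\Longrightarrow\; k^2 > k(k-1) \ge 2n
\;\Longrightarrow\; k > \sqrt{2n} \ge \sqrt{n} = LB,
\qquad
\frac{|H_{\mathrm{greedy}}|}{LB} \le \frac{2n}{\sqrt{n}} = 2\sqrt{n}.
```

Since any optimal solution has at least LB haplotypes, the trivial solution is within a factor of 2·SQRT(n) of the optimum.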
![Page 72: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/72.jpg)
2nd Objective based on inference rule:
![Page 73: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/73.jpg)
Inference Rule
xoxxooxoxx (known haplotype h)
+ **********
= x??xoox?x? (known (ambiguous) genotype g)
![Page 74: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/74.jpg)
Inference Rule
xoxxooxoxx (known haplotype h)
+ xxoxooxxxo (new (derived) haplotype h’)
= x??xoox?x? (known (ambiguous) genotype g)
![Page 75: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/75.jpg)
Inference Rule
xoxxooxoxx (known haplotype h)
+ xxoxooxxxo (new (derived) haplotype h’)
= x??xoox?x? (known (ambiguous) genotype g)
We write h + h’ = g
g and h must be compatible to derive h’
![Page 76: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/76.jpg)
2nd Objective (Clark, 1990)
1. Start with H = nonambiguous genotypes
2. while exists ambiguous genotype g in G
3. take h in H compatible with g and let h + h’ = g
4. set H = H + {h’} and G = G - {g}
5. end while
![Page 77: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/77.jpg)
If, at end, G is empty, SUCCESS, otherwise FAILURE
Step 3 is non-deterministic
![Page 78: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/78.jpg)
Example: G = { oooo, xooo, ??oo, xx?? }
![Page 79: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/79.jpg)
Example: G = { oooo, xooo, ??oo, xx?? }
Derived haplotype: xxoo (oooo + xxoo = ??oo)
![Page 80: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/80.jpg)
Example: G = { oooo, xooo, ??oo, xx?? }
Derived haplotypes: xxoo, then xxxx (xxoo + xxxx = xx??): SUCCESS
![Page 81: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/81.jpg)
Example: G = { oooo, xooo, ??oo, xx?? }
![Page 82: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/82.jpg)
Example: G = { oooo, xooo, ??oo, xx?? }
Derived haplotype: oxoo (xooo + oxoo = ??oo)
![Page 83: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/83.jpg)
Example: G = { oooo, xooo, ??oo, xx?? }
Derived haplotype: oxoo, then FAILURE (can’t resolve xx??)
OBJ: find the order of rule applications that leaves the fewest elements in G
![Page 84: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/84.jpg)
- Problem is APX-hard (Gusfield, 00)
- Graph-Model + Integer Programming for practical solution (Gusfield, 01)
![Page 85: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/85.jpg)
1. expand genotypes
x??o?
![Page 86: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/86.jpg)
1. expand genotypes
x??o? expands to: xxxox, xxxoo, xxoox, xxooo, xoxox, xooox, xoxoo, xoooo
![Page 87: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/87.jpg)
1. expand genotypes
x??o? expands to: xxxox, xxxoo, xxoox, xxooo, xoxox, xooox, xoxoo, xoooo
2. create arc (h, h’) if there exists g s.t. h’ can be derived from g and h
3. Largest number of nodes in a forest rooted at unambiguous genotypes = largest number of ambiguous genotypes resolved
Hence, find the largest number of nodes in a forest rooted at unambiguous genotypes. Use an I.P. model with vars x(ij).
This reduction is exponential. Is there a better practical approach?
![Page 88: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/88.jpg)
3rd Objective (open research problem): Disease Detection
INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }
![Page 89: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/89.jpg)
3rd Objective (open research problem): Disease Detection
INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }
OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox }
xx??x = xxoxx + xxxox
????x = oooxx + xxxox
?x??x = xxoxx + oxxox
??oxx = xxoxx + oooxx
oooxx = oooxx + oooxx
H contains H’, s.t. each diseased individual has one haplotype in H’ and each healthy individual has none
minimize |H|
![Page 90: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/90.jpg)
Genome Rearrangements and Evolutionary Distances
![Page 91: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/91.jpg)
Each species has a genome (organized in pairs of chromosomes)
tcgtgatggat………………ttgatggattga
tcgattatggat………………ttttgatatcca
Genomes evolve by means of
• Insertions
• Deletions
• Inversions
• Transpositions
• Translocations
of DNA regions
![Page 92: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/92.jpg)
![Page 93: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/93.jpg)
deletion
![Page 94: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/94.jpg)
deletion, insertion
![Page 95: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/95.jpg)
deletion, insertion, translocation
![Page 96: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/96.jpg)
deletion, insertion, translocation, inversion
![Page 97: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/97.jpg)
deletion, insertion, translocation, inversion, transposition
![Page 98: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/98.jpg)
Combinatorial problem: given two permutations P, Q and a set F of operators, find a shortest sequence f1, ..., fk of operators in F such that Q = fk(fk-1(…(f1(P))))
Very difficult problem! We focus on operators all of the same type (e.g. inversions), which is still difficult. Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transpositions, …)
We focus on inversions, which are the most important in Nature
Example: 5 6 4 8 3 2 1 9 7
1 2 3 8 4 6 5 9 7
1 2 3 8 4 5 6 9 7
1 2 3 6 5 4 8 9 7
1 2 3 6 5 4 8 7 9
1 2 3 4 5 6 8 7 9
1 2 3 4 5 6 7 8 9
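The example above can be replayed step by step. The following is a minimal sketch, not taken from the slides: the helper name `invert` and the 0-based segment endpoints in `steps` are assumptions chosen to reproduce the six permutations listed.

```python
def invert(p, i, j):
    """Return a copy of p with the segment p[i..j] (0-based, inclusive) reversed."""
    return p[:i] + p[i:j + 1][::-1] + p[j + 1:]

# Starting permutation from the slide.
p = [5, 6, 4, 8, 3, 2, 1, 9, 7]

# Hypothetical encoding of the six inversions of the example as (i, j) segments.
steps = [(0, 6), (5, 6), (3, 6), (7, 8), (3, 5), (6, 7)]

trace = [p]
for i, j in steps:
    trace.append(invert(trace[-1], i, j))

for q in trace:
    print(" ".join(map(str, q)))
```

The last permutation printed is the identity, so this sequence sorts the example with 6 inversions; the sketch does not show that 6 is the minimum.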
![Page 99: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/99.jpg)
There is also a SIGNED VERSION of the problem!
Example: +5 +6 -4 -8 -3 -2 -1 -9 +7
+1 +2 +3 +8 +4 -6 -5 -9 +7
+1 +2 +3 +8 +4 +5 +6 -9 +7
+1 +2 +3 -6 -5 -4 -8 -9 +7
+1 +2 +3 -6 -5 -4 -8 -7 +9
+1 +2 +3 +4 +5 +6 -8 -7 +9
+1 +2 +3 +4 +5 +6 +7 +8 +9
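In the signed version an inversion reverses a segment and also flips the sign of every element in it. A minimal sketch under that standard convention follows; the function name `signed_invert` and the 0-based segment list are assumptions, chosen to reproduce the signed example.

```python
def signed_invert(p, i, j):
    """Reverse p[i..j] (0-based, inclusive) and flip the sign of each element in it."""
    segment = [-x for x in reversed(p[i:j + 1])]
    return p[:i] + segment + p[j + 1:]

# Signed starting permutation from the slide.
sp = [5, 6, -4, -8, -3, -2, -1, -9, 7]

# Hypothetical encoding of the six signed inversions as (i, j) segments.
signed_steps = [(0, 6), (5, 6), (3, 6), (7, 8), (3, 5), (6, 7)]

signed_trace = [sp]
for i, j in signed_steps:
    signed_trace.append(signed_invert(signed_trace[-1], i, j))

for q in signed_trace:
    print(" ".join(f"{x:+d}" for x in q))
```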
![Page 100: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/100.jpg)
(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)
Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)
The complexity of Sorting by Transpositions, e.g., is unknown
![Page 101: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/101.jpg)
The concept of breakpoint
0 5 7 8 2 1 4 3 6 9 10
Breakpoint at position i if | p(i) - p(i+1) | > 1
![Page 102: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/102.jpg)
The concept of breakpoint
0 5 7 8 2 1 4 3 6 9 10
Breakpoint at position i if | p(i) - p(i+1) | > 1
d(p) = inversion distance
b(p) = # breakpoints
TRIVIAL BOUND: d(p) >= b(p) / 2 (an inversion changes only two adjacencies, so it removes at most 2 breakpoints)
Example: d(p) >= 6 / 2 = 3
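The breakpoint count and the trivial bound for the example can be checked with a few lines. This is a sketch only: the function name `breakpoints` is an assumption, and the permutation is framed with 0 and n+1 as on the slide.

```python
def breakpoints(p):
    """Positions i of the framed permutation (0, p, n+1) where |p(i) - p(i+1)| > 1."""
    ext = [0] + list(p) + [len(p) + 1]
    return [i for i in range(len(ext) - 1) if abs(ext[i] - ext[i + 1]) > 1]

p = [5, 7, 8, 2, 1, 4, 3, 6, 9]
bp = breakpoints(p)
b = len(bp)
trivial_bound = (b + 1) // 2  # ceil(b / 2), since d(p) is an integer

print("breakpoint positions:", bp)
print("b(p) =", b, " so d(p) >=", trivial_bound)
```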
![Page 103: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/103.jpg)
The Breakpoint Graph
0 5 7 8 2 1 4 3 6 9 10
![Page 104: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/104.jpg)
![Page 105: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/105.jpg)
![Page 106: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/106.jpg)
![Page 107: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/107.jpg)
The Breakpoint Graph
0 5 7 8 2 1 4 3 6 9 10
Each node has degree 0, 2 or 4, hence the graph can be decomposed into cycles!
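The degree claim can be checked on the example. The slides show the breakpoint graph only as a picture, so the sketch below assumes the usual definition for unsigned permutations: a black edge joins two elements that are adjacent in the framed permutation but not consecutive in value, and a gray edge joins v and v+1 when they are not adjacent in the permutation. The function names are hypothetical.

```python
from collections import Counter

def breakpoint_graph(p):
    """Black and gray edges of the breakpoint graph of the framed permutation (0, p, n+1)."""
    ext = [0] + list(p) + [len(p) + 1]
    pos = {v: i for i, v in enumerate(ext)}
    # Black edges: adjacent in the permutation, values not consecutive.
    black = [(ext[i], ext[i + 1]) for i in range(len(ext) - 1)
             if abs(ext[i] - ext[i + 1]) > 1]
    # Gray edges: consecutive values, not adjacent in the permutation.
    gray = [(v, v + 1) for v in range(len(ext) - 1)
            if abs(pos[v] - pos[v + 1]) > 1]
    return black, gray

def degrees(p):
    """Total (black + gray) degree of every node 0..n+1."""
    black, gray = breakpoint_graph(p)
    deg = Counter()
    for u, v in black + gray:
        deg[u] += 1
        deg[v] += 1
    return {v: deg[v] for v in range(len(p) + 2)}

p = [5, 7, 8, 2, 1, 4, 3, 6, 9]
black, gray = breakpoint_graph(p)
deg = degrees(p)
print("black:", black)
print("gray: ", gray)
print("degrees:", deg)
```

Under this definition every node has as many black as gray edges, which is why the total degree is even and an alternating cycle decomposition exists.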
![Page 108: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/108.jpg)
The Breakpoint Graph
0 5 7 8 2 1 4 3 6 9 10
Alternating cycle decomposition
![Page 109: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/109.jpg)
![Page 110: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/110.jpg)
The Breakpoint Graph
0 5 7 8 2 1 4 3 6 9 10
Alternating cycle decomposition
c(p) = max # cycles in alternating decomposition
VERY STRONG BOUND: d(p) >= b(p) - c(p)
Example: c(p) = 2 and d(p) >= 6 - 2 = 4
![Page 111: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/111.jpg)
The Breakpoint Graph
0 5 7 8 2 1 4 3 6 9 10
The best algorithm for this problem is based on an Integer Programming formulation of the max cycle decomposition
A variable $x_C$ for each cycle $C$ (exponential # of vars…)
A constraint $\sum_{C \ni e} x_C = 1$ for each edge $e$
Objective: maximize $\sum_C x_C$
![Page 112: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/112.jpg)
PRIMAL
$\max \sum_C x_C$
$\sum_{C \ni e} x_C = 1$ for all edges $e$
$x_C \in \{0,1\}$ for all alt. cycles $C$
DUAL (of the LP relaxation)
$\min \sum_e y_e$
$\sum_{e \in C} y_e \ge 1$ for all alt. cycles $C$
$y_e \in \mathbb{R}$ for all edges $e$
![Page 113: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/113.jpg)
![Page 114: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/114.jpg)
0 5 7 8 2 1 4 3 6 9 10
Pricing out the cycles for which y*(C) < 1
![Page 115: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/115.jpg)
Split the graph in two copies
[Figure: two copies of the breakpoint graph on nodes 0 5 7 8 2 1 4 3 6 9 10]
![Page 116: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/116.jpg)
Connect twins
[Figure: two copies of the breakpoint graph on nodes 0 5 7 8 2 1 4 3 6 9 10]
![Page 117: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/117.jpg)
A perfect matching corresponds to (a set of) alternating cycles
[Figure: two copies of the breakpoint graph on nodes 0 5 7 8 2 1 4 3 6 9 10]
![Page 118: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/118.jpg)
![Page 119: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/119.jpg)
![Page 120: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/120.jpg)
![Page 121: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/121.jpg)
![Page 122: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/122.jpg)
The weight of the matching is the y*-weight of the cycles
[Figure: two copies of the breakpoint graph on nodes 0 5 7 8 2 1 4 3 6 9 10, with edge weights .2, .4, .5, 1, .6, 0]
![Page 123: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/123.jpg)
Forcing a cycle to use a certain node
[Figure: two copies of the breakpoint graph on nodes 0 5 7 8 2 1 4 3 6 9 10, with edge weights .2, .4, .5, 1, .6, 100000]
![Page 124: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/124.jpg)
- These cycles would not use the same node twice, but with a simple trick it is possible to model this (details omitted)
BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999, 2001)
BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)
KS can solve instances up to n = 40, and takes days for n = 50
CLN can solve instances up to n = 200, and takes a few seconds (say 5) for n = 100
NP-hard problem practically solved to optimality!
![Page 125: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/125.jpg)
Statistical view of evolution
• Genomes evolve by random inversions
• It’s like a random walk on a huge graph with a node for each permutation and an edge for each inversion
• It is not clear why the shortest solution should be the one followed by Nature (in fact, often it isn’t)
• We want to find the most likely number of inversions that lead from (1 2 … n) to p
• We use the expected number of breakpoints after k inversions as a way to guess the # of inversions
![Page 126: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/126.jpg)
Let B(k) be the (r.v.) number of breakpoints after k random inversions from (1 … n)
Given a p obtained by h random inversions from (1 … n), we want to estimate h
The inversion distance is only a lower bound: h >= d(p), but the gap could be big
We estimate E[B(k)]. Then, faced with some p, we pick h such that E[B(h)] is as close as possible to b(p) (maximum likelihood). CL, 2000, have shown:
$E[B(k)] = (n-1)\left(1 - \left(\frac{n-3}{n-1}\right)^{k}\right)$
Question: estimate E[D(k)], where D(k) is the (r.v.) inversion distance after k random inversions
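The estimation rule described above (pick h so that E[B(h)] is as close as possible to b(p)) can be sketched directly from the closed form E[B(k)] = (n - 1)(1 - ((n - 3)/(n - 1))^k). The function names and the search cap `k_max` are assumptions made for this illustration, not part of the slides.

```python
def expected_breakpoints(n, k):
    """E[B(k)] = (n - 1) * (1 - ((n - 3) / (n - 1)) ** k)."""
    return (n - 1) * (1 - ((n - 3) / (n - 1)) ** k)

def estimate_inversions(n, b, k_max=None):
    """Return the h in 0..k_max whose expected breakpoint count is closest to b.

    k_max is a hypothetical search cap; it defaults to 10 * n.
    """
    if k_max is None:
        k_max = 10 * n
    return min(range(k_max + 1),
               key=lambda k: abs(expected_breakpoints(n, k) - b))

n = 200
for b in (16, 34):
    print("b =", b, "-> estimated number of inversions:", estimate_inversions(n, b))
```

Since E[B(k)] increases with k and saturates at n - 1, the estimate becomes unreliable once b(p) is close to n - 1.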
![Page 127: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/127.jpg)
Example: n = 200, k (u.a.r. in 1…n) inversions
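| k | k’ | d(p) | b |
|---|---|---|---|
| 8 | 8 | 8 | 16 |
| 19 | 19 | 19 | 34 |
| 68 | 67 | 67 | 98 |
| 69 | 73 | 68 | 104 |
| 73 | 79 | 73 | 109 |
| 85 | 91 | 83 | 120 |
| 86 | 85 | 83 | 115 |
| 87 | 90 | 84 | 119 |
| 118 | 117 | 109 | 138 |
| 184 | 184 | 135 | 168 |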
![Page 128: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/128.jpg)
Protein Structure Alignments: the Maximum Contact Map Overlap Problem
![Page 129: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/129.jpg)
A Protein is a complex molecule with a primary, linear structure (a sequence of amino acids) and a 3-Dimensional structure (the protein fold).
Protein STRUCTURE determines its FUNCTION
For instance, the Drug Design problem calls for constructing peptides with a 3D shape complementary to a protein, so as to dock onto it.
![Page 130: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/130.jpg)
Motivation: Structure Alignment is Important for:
- Discovery of Protein Function (shape determines function)
- Search in 3D data bases
- Protein Classification and Evolutionary Studies
- ...
Problem: Align two 3D protein structures
![Page 131: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/131.jpg)
Contact Maps
![Page 132: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/132.jpg)
Unfolded protein
CONTACT MAPS
![Page 133: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/133.jpg)
Unfolded protein
Folded protein = contacts
CONTACT MAPS
![Page 134: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/134.jpg)
Unfolded protein
Folded protein = contacts
Contact map = graph
CONTACT MAPS
![Page 135: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/135.jpg)
CONTACT MAPS
Unfolded protein
Folded protein = contacts
Contact map = graph
OBJECTIVE: align 3d folds of proteins = align contact maps
![Page 136: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/136.jpg)
Contact Map Alignments
![Page 137: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/137.jpg)
Non-crossing Alignments
Protein 1
Protein 2
non-crossing map of residues in protein 1 and protein 2
![Page 138: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/138.jpg)
The value of an alignment
![Page 139: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/139.jpg)
The value of an alignment
![Page 140: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/140.jpg)
The value of an alignment
![Page 141: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/141.jpg)
Value = 3
The value of an alignment
![Page 142: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/142.jpg)
Value = 3
We want to maximize the value
The value of an alignment
![Page 143: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/143.jpg)
NP-Hard
The value of an alignment
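The value of an alignment can be made concrete with a small sketch. It assumes the standard contact map overlap definition: a contact (i, p) of protein 1 is shared if i and p are aligned to residues j and q and (j, q) is a contact of protein 2. The function names and the toy contact maps are hypothetical.

```python
def is_noncrossing(alignment):
    """True if, sorted by first residue, second residues strictly increase too."""
    pairs = sorted(alignment)
    return all(a[0] < b[0] and a[1] < b[1] for a, b in zip(pairs, pairs[1:]))

def alignment_value(contacts1, contacts2, alignment):
    """Number of contacts of protein 1 mapped onto contacts of protein 2."""
    if not is_noncrossing(alignment):
        raise ValueError("alignment must be non-crossing")
    image = dict(alignment)                      # residue of protein 1 -> residue of protein 2
    target = {frozenset(c) for c in contacts2}   # undirected contacts of protein 2
    return sum(1 for i, p in contacts1
               if i in image and p in image
               and frozenset((image[i], image[p])) in target)

# Toy data, invented for illustration only.
contacts1 = [(1, 3), (2, 4), (3, 5)]
contacts2 = [(1, 2), (2, 4), (3, 5)]
alignment = [(1, 1), (2, 2), (3, 3), (4, 4), (5, 5)]
print("value =", alignment_value(contacts1, contacts2, alignment))
```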
![Page 144: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/144.jpg)
Integer Programming Formulation
![Page 145: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/145.jpg)
Integer Programming Formulation
0-1 VARIABLES: $y_{ef}$ for e and f contacts
[Figure: contacts e and f, with variable $y_{ef}$]
![Page 146: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/146.jpg)
Integer Programming Formulation
0-1 VARIABLES: $y_{ef}$ for e and f contacts
CONSTRAINTS: $y_{ef} + y_{e'f'} \le 1$
[Figure: contacts e, e’ and f, f’, with variable $y_{ef}$]
![Page 147: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/147.jpg)
Integer Programming Formulation
0-1 VARIABLES: $y_{ef}$ for e and f contacts
CONSTRAINTS: $y_{ef} + y_{e'f'} \le 1$
OBJECTIVE: $\max \sum_e \sum_f y_{ef}$
[Figure: contacts e, e’ and f, f’, with variable $y_{ef}$]
![Page 148: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/148.jpg)
Independent Set Problem
It’s just a huge max independent set problem in Gy:
• a node for each sharing
• an edge for each pair of incompatible sharings
[Figure: contacts e, e’, e’’ and f, f’, f’’; nodes ef, e’f’, e’’f’’ of Gy]
![Page 149: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/149.jpg)
Independent Set Problem
It’s just a huge max independent set problem in Gy:
• a node for each sharing
• an edge for each pair of incompatible sharings
[Figure: contacts e, e’, e’’ and f, f’, f’’; nodes ef, e’f’, e’’f’’ of Gy]
|Gy|=|E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each)
The best exact algorithm for independent set can solve for at most a few hundred nodes
![Page 150: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/150.jpg)
Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
NEW VARIABLES: $x_{ij}$ for i and j residues
[Figure: contacts e and f with variable $y_{ef}$; residues i and j with variable $x_{ij}$]
![Page 151: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/151.jpg)
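Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
NEW VARIABLES: $x_{ij}$ for i and j residues
NEW CONSTRAINTS: $x_{ij} + x_{i'j'} \le 1$
[Figure: contacts e and f with variable $y_{ef}$; lines [i, j] and [i’, j’] with variables $x_{ij}$ and $x_{i'j'}$]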
![Page 152: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/152.jpg)
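Node to Node Variables
New variables x provide an easy check for the non-crossing conditions
NEW VARIABLES: $x_{ij}$ for i and j residues
NEW CONSTRAINTS:
$x_{ij} + x_{i'j'} \le 1$
$y_{(ip)(jq)} \le x_{ij}$ and $y_{(ip)(jq)} \le x_{pq}$
[Figure: contacts e = (i, p) and f = (j, q) with variable $y_{ef}$; lines [i, j] and [i’, j’] with variables $x_{ij}$ and $x_{i'j'}$]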
![Page 153: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/153.jpg)
Clique Constraints
Variables x define a graph Gx:
• A node for each line
• An edge between each pair of crossing lines
[Figure: lines [i, j] and [i’, j’]; nodes ij and i’j’ of Gx]
![Page 154: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/154.jpg)
Clique Constraints
Variables x define a graph Gx:
• A node for each line
• An edge between each pair of crossing lines
• Gx is much smaller than Gy
• Gx has nice properties (it’s a perfect graph)
• It’s easier to find large independent sets in Gx
[Figure: lines [i, j] and [i’, j’]; nodes ij and i’j’ of Gx]
![Page 155: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/155.jpg)
Clique Constraints
Non-crossing constraints can be extended to CLIQUE CONSTRAINTS
$\sum_{[i,j] \in M} x_{ij} \le 1$
for all sets M of mutually incompatible (i.e. crossing) lines
All clique constraints satisfied (and Gx perfect) imply a strong bound!
![Page 156: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/156.jpg)
Structure of Maximal cliques in Gx
1. Pick two subsets of same size
![Page 157: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/157.jpg)
Structure of Maximal cliques in Gx
2. Connect them in a zig-zag fashion
![Page 158: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/158.jpg)
Structure of Maximal cliques in Gx
![Page 159: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/159.jpg)
Structure of Maximal cliques in Gx
![Page 160: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/160.jpg)
Structure of Maximal cliques in Gx
![Page 161: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/161.jpg)
Structure of Maximal cliques in Gx
![Page 162: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/162.jpg)
Structure of Maximal cliques in Gx
![Page 163: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/163.jpg)
Structure of Maximal cliques in Gx
![Page 164: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/164.jpg)
Structure of Maximal cliques in Gx
![Page 165: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/165.jpg)
Structure of Maximal cliques in Gx
![Page 166: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/166.jpg)
Structure of Maximal cliques in Gx
![Page 167: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/167.jpg)
Structure of Maximal cliques in Gx
3. Throw in all lines included in a zig or a zag
![Page 168: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/168.jpg)
Structure of Maximal cliques in Gx
3. Throw in all lines included in a zig or a zag
![Page 169: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/169.jpg)
Structure of Maximal cliques in Gx
The result is a maximal clique in Gx
![Page 170: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/170.jpg)
Separation of Clique Inequalities
![Page 171: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/171.jpg)
Separation of Clique Inequalities
PROBLEM
There exist exponentially many such cliques ($O(2^{2n})$ inequalities).
We need to generate in polynomial time a clique inequality when needed, i.e., when violated by the current LP solution x*:
$\sum_{[i,j] \in M} x^*_{ij} > 1$
THEOREM
We can find the most violated clique inequality in time $O(n^2)$
![Page 172: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/172.jpg)
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path
![Page 173: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/173.jpg)
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path
1 2 3 4 5 6 7 8
![Page 174: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/174.jpg)
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path
2) Flip one graph: zigzag -> left-right
1 2 3 4 5 6 7 8
8 7 6 5 4 3 2 1
![Page 175: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/175.jpg)
Separation of Clique Inequalities
PROOF (sketch)
1) Clique = zigzag path
2) Flip one graph: zigzag -> left-right
1 2 3 4 5 6 7 8
8 7 6 5 4 3 2 1
3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use Dyn. Progr. to find longest path in grid, time O(n^2)
![Page 176: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/176.jpg)
Separation of cliques
Create n1 x n2 grid
Orient all edges and give weights
[Figure: grid with indices 1, 2, …, n1 on one side and 1, 2, …, n2 on the other; generic node (i, u)]
![Page 177: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/177.jpg)
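Separation of cliques
Create n1 x n2 grid
Orient all edges and give weights
[Figure: grid with indices 1, 2, …, n1 on one side and 1, 2, …, n2 on the other; arcs at the generic node (i, u) carry weight x*iu]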
![Page 178: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/178.jpg)
Separation of cliques
Create n1 x n2 grid
Orient all edges and give weights
There is a violated clique iff the longest A,B path has length > 1
A=(1,n2)
B=(n1,1)
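The longest-path computation can be sketched as a dynamic program over the grid. This is an illustration under stated assumptions, not the authors' implementation: the weight x*[i][u] is charged to each grid node visited instead of to arcs, indices are 0-based so that A = (0, n2-1) and B = (n1-1, 0), and a path may only increase i or decrease u. Any two lines on such a staircase path cross or share an endpoint, so the visited nodes form a clique of Gx. The function name is hypothetical.

```python
def most_violated_clique(x):
    """Heaviest staircase path from A=(0, n2-1) to B=(n1-1, 0) in the grid of x* values.

    x[i][u] is the LP value of line [i, u]. Returns (weight, path); the clique
    inequality of the path is violated when weight > 1. Runs in O(n1 * n2) time.
    """
    n1, n2 = len(x), len(x[0])
    best = [[0.0] * n2 for _ in range(n1)]
    prev = [[None] * n2 for _ in range(n1)]
    for i in range(n1):
        for u in range(n2 - 1, -1, -1):
            candidates = []
            if i > 0:
                candidates.append((best[i - 1][u], (i - 1, u)))   # arrived by increasing i
            if u < n2 - 1:
                candidates.append((best[i][u + 1], (i, u + 1)))   # arrived by decreasing u
            if candidates:
                weight, prev[i][u] = max(candidates)
                best[i][u] = weight + x[i][u]
            else:
                best[i][u] = x[i][u]                              # start node A
    # Trace the path back from B to A.
    path, node = [], (n1 - 1, 0)
    while node is not None:
        path.append(node)
        node = prev[node[0]][node[1]]
    path.reverse()
    return best[n1 - 1][0], path

# Toy fractional LP solution, invented for illustration.
x_star = [[0.5, 0.6],
          [0.0, 0.7]]
weight, path = most_violated_clique(x_star)
print("weight:", weight, "path:", path, "violated:", weight > 1)
```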
![Page 179: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/179.jpg)
Gx is a Perfect Graph
We show why polynomial separation is possible:
Gx is weakly triangulated (no chordless cycles of length >= 5 in Gx or in its complement)
=> Gx is perfect (Hayward, 1985)
![Page 180: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/180.jpg)
Gx is a Perfect Graph
PROOF (Sketch, for Gx)
[Figure: cycle L1, L2, …, L7 in Gx]
L1 and L3 don’t cross. Wlog RIGHT(L3, L1)
![Page 181: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/181.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; lines L1 and L3 drawn]
L1 and L3 don’t cross. Wlog RIGHT(L3, L1)
![Page 182: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/182.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; lines L1 and L3 drawn]
For i = 4, 5, …, Li crosses Li-1 but not L1 => RIGHT(Li, L1)
![Page 183: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/183.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; lines L1, L3 and L4 drawn]
For i = 4, 5, …, Li crosses Li-1 but not L1 => RIGHT(Li, L1)
![Page 184: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/184.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; lines L1, L4 and L5 drawn]
For i = 4, 5, …, Li crosses Li-1 but not L1 => RIGHT(Li, L1)
![Page 185: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/185.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; lines L1, L5 and L6 drawn]
For i = 4, 5, …, Li crosses Li-1 but not L1 => RIGHT(Li, L1)
![Page 186: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/186.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; L1 drawn to the left of L3, L4, L5, L6]
We get LEFT(L1, {L3, L4, L5, L6})
![Page 187: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/187.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; L1 drawn to the left of L3, L4, L5, L6]
A symmetric argument started at L6, with LEFT(L1, L6), implies LEFT(Li, L6) for i = 2, 3, 4, 5
![Page 188: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/188.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; L1 drawn to the left of L3, L4, L5, L6, and L2, L3, L4, L5 drawn to the left of L6]
A symmetric argument started at L6, with LEFT(L1, L6), implies LEFT(Li, L6) for i = 2, 3, 4, 5
![Page 189: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/189.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; L1 drawn to the left of L3, L4, L5, L6, and L2, L3, L4, L5 drawn to the left of L6]
Then {L3, L4, L5} are between L1 and L6
![Page 190: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/190.jpg)
Gx is a Perfect Graph
[Figure: cycle L1, L2, …, L7 in Gx; L3, L4, L5 drawn between L1 and L6, with L7 crossing]
Then {L3, L4, L5} are between L1 and L6
But L7 crosses L1 and L6, and so should cross them all!
![Page 191: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/191.jpg)
The approach just seen is due to Lancia, Carr, Istrail, Walenz (2001). It can be applied to small or moderate proteins (up to 80 residues/150 contacts)
In 2002, a new approach by Caprara and Lancia, based on LAGRANGIAN RELAXATION, was introduced. The approach is borrowed from Quadratic Assignment. With the new approach we can solve important proteins (up to 150 residues/300 contacts)
![Page 192: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/192.jpg)
What about Heuristics? E.g., genetic algorithms…
![Page 193: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/193.jpg)
Genetic Algorithm Overview
• A Population of candidate solutions that evolve (improve) over time
• Recombination creates new candidate solutions via crossover and mutation
[Figure: Population at time t, Recombination operators, Evaluation function, Population at time t+1]
![Page 194: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/194.jpg)
Crossover
• Crossover selects pieces from both parents and creates two offspring solutions
[Figure: Blue Parent, Red Parent, Offspring]
![Page 195: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/195.jpg)
Crossover
• Crossover selects pieces from both parents and creates two offspring solutions
– Select a set of edges in one parent to copy to the child
![Page 196: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/196.jpg)
![Page 197: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/197.jpg)
Crossover
• Crossover selects pieces from both parents and creates two offspring solutions
– Select a set of edges in one parent to copy to the child
– Copy as many edges as possible from the other parent
![Page 198: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/198.jpg)
Crossover
• Crossover selects pieces from both parents and creates two offspring solutions
– Select a set of edges in one parent to copy to the child
– Copy as many edges as possible from the other parent
These edges conflict with existing edges and are not copied
![Page 199: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/199.jpg)
Crossover
• Crossover selects pieces from both parents and creates two offspring solutions
– Select a set of edges in one parent to copy to the child
– Copy as many edges as possible from the other parent
– Add random edges to fill any remaining space
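The three crossover steps can be sketched as follows. The slides give no code, so everything here is an assumption made for illustration: an alignment is represented as a set of edges (i, j) between residue i of protein 1 and residue j of protein 2, and the names `compatible`, `crossover`, `keep_prob` and `fill_tries` are hypothetical.

```python
import random

def compatible(e, f):
    """Two alignment edges are compatible if they neither cross nor share an endpoint."""
    return (e[0] < f[0] and e[1] < f[1]) or (e[0] > f[0] and e[1] > f[1])

def add_if_compatible(child, e):
    """Add edge e to child only if it conflicts with no edge already there."""
    if all(compatible(e, f) for f in child):
        child.add(e)

def crossover(parent_a, parent_b, n1, n2, rng, keep_prob=0.5, fill_tries=20):
    """Build one offspring; swap the parents' roles to get the second one."""
    # Step 1: select a set of edges in one parent to copy to the child.
    child = {e for e in sorted(parent_a) if rng.random() < keep_prob}
    # Step 2: copy as many edges as possible from the other parent;
    # edges that conflict with existing edges are not copied.
    for e in sorted(parent_b):
        add_if_compatible(child, e)
    # Step 3: add random edges to fill any remaining space.
    for _ in range(fill_tries):
        add_if_compatible(child, (rng.randrange(n1), rng.randrange(n2)))
    return child

rng = random.Random(0)
blue = {(0, 0), (1, 1), (2, 2)}
red = {(0, 1), (1, 2), (3, 3)}
child1 = crossover(blue, red, 5, 5, rng)
child2 = crossover(red, blue, 5, 5, rng)
print(sorted(child1), sorted(child2))
```

Because every edge is added only if it is compatible with the edges already present, each offspring is again a valid non-crossing alignment.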
![Page 200: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/200.jpg)
![Page 201: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/201.jpg)
Mutation
• Mutation introduces small changes to existing solutions by shifting edge endpoints
![Page 202: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/202.jpg)
Mutation
• Mutation introduces small changes to existing solutions by shifting edge endpoints
– Select a set of endpoints to shift
![Page 203: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/203.jpg)
![Page 204: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/204.jpg)
Mutation
• Mutation introduces small changes to existing solutions by shifting edge endpoints
– Select a set of endpoints to shift
This edge “fell off” the end of the contact map and is removed
![Page 205: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/205.jpg)
Mutation
• Mutation introduces small changes to existing solutions by shifting edge endpoints
– Select a set of endpoints to shift
– Randomly add new edges
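A minimal sketch of the mutation operator, under the same assumed representation (a list of non-crossing alignment edges `(i, j)`). Several details are assumptions made for illustration and are not stated on the slides: only the endpoint in the second protein is shifted, the shift is by one position, and a shifted edge that now conflicts with another edge is dropped. The names `mutate`, `shift_prob` and `add_tries` are hypothetical.

```python
import random

def compatible(edge, edges):
    """True if `edge` neither shares an endpoint with nor crosses any edge in `edges`."""
    i, j = edge
    return all((i < a and j < b) or (i > a and j > b) for a, b in edges)

def mutate(edges, n1, n2, rng, shift_prob=0.3, add_tries=5):
    mutated = []
    for i, j in sorted(edges):
        # Select some endpoints to shift (assumed step: one position left or right).
        if rng.random() < shift_prob:
            j += rng.choice((-1, 1))
        # An edge whose endpoint "fell off" the end of the contact map is removed.
        if not 0 <= j < n2:
            continue
        # Assumption: a shifted edge that now conflicts with a kept edge is dropped.
        if compatible((i, j), mutated):
            mutated.append((i, j))
    # Randomly add new edges wherever they fit.
    for _ in range(add_tries):
        e = (rng.randrange(n1), rng.randrange(n2))
        if compatible(e, mutated):
            mutated.append(e)
    return sorted(mutated)

print(mutate([(0, 0), (2, 3), (5, 6)], 7, 7, random.Random(0)))
```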
![Page 206: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/206.jpg)
![Page 207: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/207.jpg)
Computational Results
![Page 208: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/208.jpg)
Computational Results
• 269 proteins
– 70 to 100 residues
– 80 to 140 contacts
• Picked 10,000 pairs of proteins out of 36,046 possible
• Took a weekend on a PC
• 500 were solved to optimality
• 2,500 had a gap <= 10 contacts
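The figure of 36,046 possible pairs is the number of unordered pairs among 269 proteins. A quick check of that arithmetic and of the proportions quoted above (variable names are illustrative only):

```python
import math

n_proteins = 269
possible_pairs = math.comb(n_proteins, 2)  # 269 * 268 / 2
sampled = 10_000

print(possible_pairs)                 # 36046
print(sampled / possible_pairs)       # fraction of all pairs that were sampled
print(500 / sampled, 2500 / sampled)  # 0.05 0.25
```

So roughly a quarter of all pairs were sampled; 5% of the sampled pairs were solved to optimality and 25% ended with a gap of at most 10 contacts.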
![Page 209: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/209.jpg)
Skolnick Clustering Test
![Page 210: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/210.jpg)
Skolnick Results
• Four Families
1 Flavodoxin-like fold, CheY-related
2 Plastocyanin
3 TIM Barrel
4 Ferritin
Family 1:
• alpha-beta
• 8 structures
• up to 124 residues
• 15-30% sequence similarity
• < 3Å RMSD
![Page 211: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/211.jpg)
Skolnick Results
• Four Families
1 Flavodoxin-like fold, CheY-related
2 Plastocyanin
3 TIM Barrel
4 Ferritin
Family 2:
• beta
• 8 structures
• up to 99 residues
• 35-90% sequence similarity
• < 2Å RMSD
![Page 212: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/212.jpg)
Skolnick Results
• Four Families
1 Flavodoxin-like fold, CheY-related
2 Plastocyanin
3 TIM Barrel
4 Ferritin
Family 3:
• alpha-beta
• 11 structures
• up to 250 residues
• 30-90% sequence similarity
• < 2Å RMSD
![Page 213: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/213.jpg)
Skolnick Results
• Four Families
1 Flavodoxin-like fold, CheY-related
2 Plastocyanin
3 TIM Barrel
4 Ferritin
Family 4:
• alpha
• 6 structures
• up to 170 residues
• 7-70% sequence similarity
• < 4Å RMSD
![Page 214: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/214.jpg)
Skolnick Results

| Family | Style | Residues (up to) | Seq. Sim. | RMSD | Proteins |
|---|---|---|---|---|---|
| 1 | alpha-beta | 124 | 15-30% | < 3Å | 1b00, 1dbw, 1nat, 1ntr, 1qmp, 1rnl, 3cah, 4tmy |
| 2 | beta | 99 | 35-90% | < 2Å | 1baw, 1byo, 1kdi, 1nin, 1pla, 3b3i, 2pcy, 2plt |
| 3 | alpha-beta | 250 | 30-90% | < 2Å | 1amk, 1aw2, 1b9b, 1btm, 1hti, 1tmh, 1tre, 1tri, 1ydv, 3ypi, 8tim |
| 4 | alpha | 170 | 7-70% | < 4Å | 1b71, 1bcf, 1dps, 1fha, 1ier, 1rcd |

• Four Families
1 Flavodoxin-like fold, CheY-related
2 Plastocyanin
3 TIM Barrel
4 Ferritin
![Page 215: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/215.jpg)
Clustering
Define score(P1, P2) as

$$0 \le \mathrm{score}(P_1, P_2) = \frac{\#\,\text{shared contacts}}{\min(\#\,\text{contacts of } P_1,\ \#\,\text{contacts of } P_2)} \le 1$$

Put P1, P2 in the same family if score(P1, P2) >= threshold
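The score and the family rule can be sketched as follows. This is an illustration under assumptions: each contact map is a list of residue-index pairs, an alignment is a list of matched residues `(i, j)`, and a contact of P1 counts as shared when both of its residues are aligned and their images form a contact of P2. The names `shared_contacts`, `score` and `same_family`, and the threshold values in the example, are hypothetical.

```python
def shared_contacts(contacts1, contacts2, alignment):
    """Count contacts of P1 whose aligned image is a contact of P2."""
    image = dict(alignment)                      # residue of P1 -> residue of P2
    c2 = {frozenset(c) for c in contacts2}       # unordered contacts of P2
    return sum(
        1
        for i, k in contacts1
        if i in image and k in image and frozenset((image[i], image[k])) in c2
    )

def score(contacts1, contacts2, alignment):
    """Shared contacts divided by the smaller contact count, a value in [0, 1]."""
    return shared_contacts(contacts1, contacts2, alignment) / min(len(contacts1), len(contacts2))

def same_family(contacts1, contacts2, alignment, threshold):
    return score(contacts1, contacts2, alignment) >= threshold

c1 = [(0, 2), (1, 3), (2, 4)]
c2 = [(0, 2), (1, 4), (3, 5)]
al = [(0, 0), (1, 1), (2, 2), (3, 4)]
print(score(c1, c2, al))             # 2 shared contacts out of min(3, 3)
print(same_family(c1, c2, al, 0.5))  # True
```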
![Page 216: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/216.jpg)
Clustering
Define score(P1, P2) as

$$0 \le \mathrm{score}(P_1, P_2) = \frac{\#\,\text{shared contacts}}{\min(\#\,\text{contacts of } P_1,\ \#\,\text{contacts of } P_2)} \le 1$$

Put P1, P2 in the same family if score(P1, P2) >= threshold
If P1, P2 are too big, use G.A. and local search to compute the score
L.P. then gives bounds:
HEUR score <= OPT score <= LP bound
and we know how far off OPT we are
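The sandwich HEUR <= OPT <= LP bound gives a computable gap. A small sketch, assuming both values are measured in shared contacts; since that count is an integer, the LP bound can be rounded down before comparing. The function name `optimality_gap` is hypothetical.

```python
import math

def optimality_gap(heur_contacts, lp_bound):
    """Upper bound on how many contacts the heuristic solution is away from OPT."""
    # Shared-contact counts are integers, so the LP bound can be rounded down.
    # The small epsilon guards against floating-point noise just below an integer.
    upper = math.floor(lp_bound + 1e-9)
    return upper - heur_contacts

print(optimality_gap(95, 100.7))   # 5
print(optimality_gap(100, 100.0))  # 0, so the heuristic solution is provably optimal
```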
![Page 217: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/217.jpg)
Clustering validation
We got some known families from biologists, PDB.
Experiment: Take a family F of proteins and align them against each other and against the remaining proteins.
![Page 218: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/218.jpg)
Clustering validation
We got some known families from biologists, PDB.
Experiment: Take a family F of proteins and align them against each other and against the remaining proteins.
TYPICAL BEHAVIOUR

| score | proteins were… |
|---|---|
| 0.05 | MISMATCH |
| 0.1 | MISMATCH |
| 0.15 | MISMATCH |
| 0.2 | MISMATCH |
| 0.25 | MISMATCH |
| 0.3 | MISMATCH |
| 0.35 | MATCH |
| … | … |
| 1.0 | MATCH |
![Page 219: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/219.jpg)
Skolnick Results
• Performance
– 528 alignments
– 1.3% false negative
– 0.0% false positive
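The count of 528 alignments is consistent with aligning every pair among the 33 structures of the four families (8 + 8 + 11 + 6). That reading is an inference from the family sizes on the earlier slides, not something the slide states. A quick check, with illustrative variable names:

```python
import math

family_sizes = [8, 8, 11, 6]                         # structures per family
n = sum(family_sizes)                                # 33 proteins
all_pairs = math.comb(n, 2)                          # every unordered pair
within = sum(math.comb(s, 2) for s in family_sizes)  # same-family pairs
between = all_pairs - within                         # different-family pairs

print(n, all_pairs, within, between)  # 33 528 126 402
```

Under this reading, 126 of the 528 alignments are within a family and 402 are between families.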
![Page 220: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/220.jpg)
Clustering
Computed, for the first time, provably optimal alignments for 150 pairs (inter-family)
Used the CMO value to cluster: retrieves the clusters.
Set S(i,j) = 1 if CMO >= a, S(i,j) = 0 otherwise
Use TSP to find a block diagonal structure for S
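A sketch of the thresholding and reordering steps. The slide does not say how the TSP instance is defined or solved; this example assumes the distance between two proteins is the Hamming distance between their rows of S, and it replaces an exact TSP solver with a simple nearest-neighbour tour. All function names are hypothetical.

```python
def threshold_matrix(cmo, a):
    """S(i,j) = 1 if CMO >= a, 0 otherwise."""
    return [[1 if v >= a else 0 for v in row] for row in cmo]

def hamming(r1, r2):
    return sum(x != y for x, y in zip(r1, r2))

def nearest_neighbour_order(S, start=0):
    """Greedy tour: always move to the unvisited protein with the most similar row."""
    order = [start]
    remaining = set(range(len(S))) - {start}
    while remaining:
        last = order[-1]
        nxt = min(remaining, key=lambda k: (hamming(S[last], S[k]), k))
        order.append(nxt)
        remaining.remove(nxt)
    return order

def permute(S, order):
    """Apply the same ordering to rows and columns."""
    return [[S[i][j] for j in order] for i in order]

# Toy scores: proteins 0 and 2 are similar, as are proteins 1 and 3.
cmo = [
    [1.0, 0.1, 0.8, 0.2],
    [0.1, 1.0, 0.2, 0.9],
    [0.8, 0.2, 1.0, 0.1],
    [0.2, 0.9, 0.1, 1.0],
]
S = threshold_matrix(cmo, 0.35)
order = nearest_neighbour_order(S)
for row in permute(S, order):
    print(row)
```

On this toy input the reordered matrix has two 2x2 blocks of ones on the diagonal, one per cluster.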
![Page 221: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/221.jpg)
Clustering
![Page 222: The Population Haplotyping problem](https://reader036.fdocuments.in/reader036/viewer/2022062323/568165c3550346895dd8cf37/html5/thumbnails/222.jpg)
Last Open Problem
? ?