The Population Haplotyping problem


10 11

01 10

01 00

11 11

10 00

10 00

10 10

0*

**

10

1* 11

*0

*0

NOTATION: each SNP takes only two values in a population (a biological fact). Call them 0 and 1. Also, write * to mark a heterozygous site

HAPLOTYPE: string over {0, 1}

GENOTYPE: string over {0, 1, *}, where 0 = {0}, 1 = {1}, * = {0, 1}

10 11

01 10

01 00

11 00

00 10

10 10

0*

**

10

1*

**

*0

ALGEBRA OF HAPLOTYPES:

0 + 0 = 0

1 + 1 = 1

0 + 1 = 1 + 0 = *
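The site-by-site sum of two haplotypes can be sketched in a few lines of Python. This is only an illustration of the algebra; the function name `combine` is an assumption, not part of the slides.

```python
def combine(h1: str, h2: str) -> str:
    """Sum two haplotypes site by site: equal alleles stay, different alleles give '*'."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have the same length")
    if set(h1 + h2) - {"0", "1"}:
        raise ValueError("haplotypes are strings over 0 and 1")
    return "".join(a if a == b else "*" for a, b in zip(h1, h2))


# Two haplotypes of one individual and the genotype they produce.
genotype = combine("1011001011", "1101001110")
print(genotype)
```

A genotype built this way is homozygous exactly at the sites where the two haplotypes agree.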

Homozygous sites Heterozygous (ambiguous) sites

1**0*

1110110000

1110010001

1100110100

1100010101

Phasing the alleles

For k heterozygous (ambiguous) sites, there are 2^(k-1) possible phasings
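To make the count concrete, here is a small sketch that enumerates all phasings of a genotype. The helper name `phasings` is hypothetical. The first heterozygous site is fixed to 1 in the first haplotype so that each unordered pair is listed once.

```python
from itertools import product


def phasings(genotype: str):
    """List all unordered haplotype pairs (h1, h2) with h1 + h2 = genotype."""
    het = [i for i, c in enumerate(genotype) if c == "*"]
    if not het:
        return [(genotype, genotype)]
    pairs = []
    # Fix the first heterozygous site to 1 in h1 to avoid listing (h1, h2) and (h2, h1).
    for bits in product("10", repeat=len(het) - 1):
        h1, h2 = list(genotype), list(genotype)
        for pos, bit in zip(het, ("1",) + bits):
            h1[pos] = bit
            h2[pos] = "0" if bit == "1" else "1"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


for h1, h2 in phasings("1**0*"):
    print(h1, h2)
```

For the genotype 1**0* (k = 3) this lists the four pairs shown in the figure.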

THE PHASING (or HAPLOTYPING) PROBLEM

Given genotypes of k individuals, determine the phasings of all heterozygous sites.

It is too expensive to determine haplotypes directly

Much cheaper to determine genotypes, and then infer haplotypes in silico:

This yields a set H, of (at most) 2k haplotypes. H is a resolution of G.

The input is GENOTYPE data

00011

11011

*1**1

****1

11**1

INPUT: G = { 11**1, ****1, 11011, *1**1, 00011 }


11**1 = 11011 + 11101

****1 = 00011 + 11101

11011 = 11011 + 11011

*1**1 = 11011 + 01101

00011 = 00011 + 00011

OUTPUT: H = { 11011, 11101, 00011, 01101}


Each genotype is resolved by two haplotypes

We will define some objectives for H

OBJECTIVES

- without objectives/constraints, the haplotyping problem would be (mathematically) trivial. E.g., always put 0 above and 1 below:

**0*1 = 00001 + 11011

1*0** = 10000 + 11011

- the objectives/constraints must be “driven by biology”

1°) Clark’s inference rule

2°) Perfect Phylogeny

3°) Disease Association

4°) (parsimony): minimize |H|

1st Obj: Clark’s rule

Inference Rule for a compatible pair h, g

1011001011 +     known haplotype h
?????????? =
1**1001*1*       known (ambiguous) genotype g

1011001011 +     known haplotype h
1101001110 =     new (derived) haplotype h’
1**1001*1*       known (ambiguous) genotype g

We write h + h’ = g

1st Objective (Clark, 1990)

1. Start with H = “bootstrap” haplotypes
2. while Clark’s rule applies to a pair (h, g) in H x G
3. apply the rule to any such (h, g) obtaining h’
4. set H = H + {h’} and G = G - {g}
5. end while

If, at end, G is empty, SUCCESS, otherwise FAILURE

Step 3 is non-deterministic
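The procedure above can be sketched as follows. This is a minimal illustration, not Clark's original implementation: the function names and the `choose` parameter (which models the non-deterministic choice in step 3) are assumptions, and G here holds only the ambiguous genotypes.

```python
def compatible(h: str, g: str) -> bool:
    """h can resolve g if it agrees with g at every homozygous site."""
    return all(gc == "*" or gc == hc for hc, gc in zip(h, g))


def derive(h: str, g: str) -> str:
    """Return h' such that h + h' = g (complement h at the '*' sites)."""
    return "".join(("1" if hc == "0" else "0") if gc == "*" else gc
                   for hc, gc in zip(h, g))


def clark(bootstrap, genotypes, choose=lambda candidates: candidates[0]):
    """Apply Clark's rule until it no longer applies; return (H, unresolved G)."""
    H, G = list(bootstrap), list(genotypes)
    while True:
        candidates = [(h, g) for g in G for h in H if compatible(h, g)]
        if not candidates:
            return H, G
        h, g = choose(candidates)
        h_new = derive(h, g)
        if h_new not in H:
            H.append(h_new)
        G.remove(g)


bootstrap = ["0000", "1000"]
ambiguous = ["**00", "11**"]
print(clark(bootstrap, ambiguous))                             # one order of application
print(clark(bootstrap, ambiguous, choose=lambda c: c[-1]))     # another order
```

With the first choice rule every genotype is resolved; with the second, 11** is left unresolved, which shows how the outcome depends on the order of application.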





Example: genotypes 0000, 1000, **00, 11**

- resolve **00 with 0000, deriving 1100; then resolve 11** with 1100, deriving 1111: SUCCESS

- resolve **00 with 1000, deriving 0100: FAILURE (can’t resolve 11**)

Step 3 is non-deterministic: the algorithm could end without explaining all genotypes even if an explanation was possible.

The number of genotypes solved depends on the order of application.

OBJ: find the order of rule applications that leaves the fewest elements in G

The problem was studied by Gusfield (ISMB 2000, and Journal of Comp. Biol., 2001)

- the problem is APX-hard

- it corresponds to finding the largest forest in a graph with haplotypes as nodes and arcs for possible derivations

- solved via an ILP of exponential size (practical for small real instances)

2nd Obj: Perfect Phylogeny

- Parsimony does not take into account mutations/evolution of haplotypes

- parsimony is very reliable on “small” haplotype blocks

- when haplotypes are large (span several SNPs), we should consider evolutionary events and recombination

- the cleanest model for evolution is the perfect phylogeny

- A phylogeny explains a set of binary features (e.g. flies, has fur…) with a tree

- Leaf nodes are labeled with species

- Each feature labels an edge leading to a subtree that possesses it

2nd objective is based on perfect phylogeny

[Figure: phylogeny tree with edges labeled “has 2 legs”, “has tail”, “flies”]

But… a new species may come along so that no perfect phylogeny is possible…

Theorem: such a matrix has a p.p. iff there is no 4x2 minor with rows 00, 10, 01, 11

               two legs   tail   flies
Human             1        0       0
Mouse             0        1       0
Spider            0        0       0
Eagle             1        0       1

With a new species, the columns “two legs” and “tail” contain all of 00, 10, 01, 11, so no p.p. exists:

Mickey mouse      1        1       0
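The condition in the theorem can be checked directly by looking at every pair of columns. The sketch below is an illustration under the theorem as stated on the slide; the function name `admits_perfect_phylogeny` is hypothetical.

```python
from itertools import combinations


def admits_perfect_phylogeny(rows) -> bool:
    """Return False if some pair of columns shows all four patterns 00, 01, 10, 11."""
    n_cols = len(rows[0])
    for a, b in combinations(range(n_cols), 2):
        patterns = {(row[a], row[b]) for row in rows}
        if len(patterns) == 4:
            return False
    return True


# Columns: two legs, tail, flies
species = {
    "Human": (1, 0, 0),
    "Mouse": (0, 1, 0),
    "Spider": (0, 0, 0),
    "Eagle": (1, 0, 1),
}
print(admits_perfect_phylogeny(list(species.values())))

species["Mickey mouse"] = (1, 1, 0)
print(admits_perfect_phylogeny(list(species.values())))
```

The check takes time quadratic in the number of features, which is enough for small matrices like this one.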

We can consider each SNP as a binary feature

Objective: We want the solution to admit a perfect phylogeny

(Rationale : we assume haplotypes have evolved independently along a tree)




Genotypes:

0 1 * 0
* 1 0 *
* 0 * 0

A resolution:

0 1 0 0
0 1 1 0
1 1 0 1
0 1 0 0
1 0 0 0
0 0 1 0

NOT a perfect phylogeny solution!

0 1 * 0* 1 0 ** 0 * 0




0 1 * 0 0 1 0 *0 0 0 *




0 1 0 0 0 1 1 00 1 0 0 1 1 0 1 0 0 0 00 0 0 1

A perfect phylogeny

0 1 * 0 0 1 0 *0 0 0 *

Theorem: The Perfect Phylogeny Haplotyping problem is polynomial


Algorithms are of combinatorial nature

- There is a graph for which SNPs are columns and edges are of two types (forced and free)

- forced edges connect pairs of SNPs that must be phased in the same way

** 00 + 11 or ** 01 + 10

- a complex visit of the graph decides how to phase free SNPs

3rd Obj: Disease Association

Some diseases may be due to a gene which has “faulty” configurations

RECESSIVE DISEASE (e.g. cystic fibrosis, sickle cell anemia): to be diseased one must have both copies faulty. With one copy one is a carrier of the disease

DOMINANT DISEASE (e.g. Huntington’s disease, Marfan’s syndrome): to be diseased it is enough to have one faulty copy

Two individuals of which one is healthy and the other diseased may have the same genotype.

The explanation of the disease lies in a difference in their haplotypes

00011

0*011 *1**1

0**01

11**1

INPUT: GD = {11**1,*1**1,0*011}, GH = {11**1,0**01,00011}

11**1

1101111101

0110100001

1101101101

0101100011

0001100011

OUTPUT: H = { 11011,01011,00001,11111,11101,00011,01101}

H contains HD, s.t. each diseased individual has >= 1 haplotype in HD and each healthy individual has none

1100111111


Theorem 1 is proved via a reduction from 3 SAT

Theorem 2 has a mathematical proof (coloring argument) with little relation to biology: there is R (depending on the input) s.t. a haplotype is healthy if the sum of its bits is congruent to R modulo 3

This means the model must be refined!

4th Obj: Max Parsimony. See separate slides…

Summary:

- haplotyping in silico is needed for economic reasons

- several objectives, all biologically driven

- nice combinatorial problems (mostly due to the binary nature of SNPs)

- these problems are technology-dependent and may become obsolete (hopefully after we have retired)

011101

111111

011000

010001

010011

111111

022

222

012

221

011111 022211

012022

012

222

minimize |H|

2nd Objective (parsimony) :

1. The problem is APX-Hard

Reduction from VERTEX-COVER

[Figure: graph G with nodes A, B, C, D, E and edges AB, BC, AE, DE, AD]

Genotypes, one per edge (columns A B C D E *):

AB   2 2 1 1 1 2
BC   1 2 2 1 1 2
AE   2 1 1 1 2 2
DE   1 1 1 2 2 2
AD   2 1 1 2 1 2

Haplotypes, one per node (columns A B C D E *):

A    0 1 1 1 1 0
B    1 0 1 1 1 0
C    1 1 0 1 1 0
D    1 1 1 0 1 0
E    1 1 1 1 0 0

Additional haplotypes for the cover nodes A, B, E:

A’   0 1 1 1 1 1
B’   1 0 1 1 1 1
E’   1 1 1 1 0 1

G = (V,E) has a node cover X of size k iff there is a set H of |V| + k haplotypes that explain all genotypes
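The easy direction of this equivalence (a cover of size k yields |V| + k haplotypes) can be checked mechanically on the example graph. The sketch below only rebuilds the matrices shown on the slide; all function names are hypothetical, and it uses the slide's 0/1/2 notation, with 2 for a heterozygous site.

```python
def node_haplotype(nodes, v, primed=False):
    """Haplotype of node v: 0 in column v, 1 elsewhere; last column is 1 only if primed."""
    return tuple(0 if u == v else 1 for u in nodes) + ((1 if primed else 0),)


def edge_genotype(nodes, edge):
    """Genotype of an edge: 2 in the columns of its endpoints and in the last column."""
    return tuple(2 if u in edge else 1 for u in nodes) + (2,)


def combine(h1, h2):
    """Site-by-site sum of two haplotypes (2 marks a heterozygous site)."""
    return tuple(a if a == b else 2 for a, b in zip(h1, h2))


def haplotypes_for_cover(nodes, cover):
    """One haplotype per node plus one primed haplotype per cover node."""
    H = [node_haplotype(nodes, v) for v in nodes]
    H += [node_haplotype(nodes, v, primed=True) for v in cover]
    return H


def explains_all(H, genotypes):
    return all(any(combine(a, b) == g for a in H for b in H) for g in genotypes)


nodes = "ABCDE"
edges = ["AB", "BC", "AE", "DE", "AD"]
genotypes = [edge_genotype(nodes, e) for e in edges]
H = haplotypes_for_cover(nodes, "ABE")
print(len(H), explains_all(H, genotypes))
```

An edge genotype is explained by the primed haplotype of a covered endpoint together with the unprimed haplotype of the other endpoint, which is why a set that is not a cover leaves some genotype unexplained.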

A basic ILP formulation

Expand your input G in all possible ways

220 = 010 + 100, 000 + 110

120 = 100 + 110

022 = 000 + 011, 001 + 010

Variables: x_h for each haplotype h, and y_{h1,h2} for each expansion pair (h1, h2)


The resulting Integer Program (IP1):

Other ILP formulations are possible, e.g. POLY-SIZE ILP formulations



1st Objective (parsimony; open research problem):

minimize |H|

An easy SQRT(n) approximation: k haplotypes can explain at most k(k-1)/2 genotypes, hence we need at least LB = SQRT(n) haplotypes.

BUT any greedy algorithm can find 2 haplotypes to explain a genotype, giving a solution of <= 2n haplotypes, i.e. <= 2 * SQRT(n) * LB

It’s difficult, but not impossible, to come up with better approximations, like constants (Lancia, Pinotti, Rizzi ’02)

2nd Objective based on inference rule:

Inference Rule

xoxxooxoxx +     known haplotype h
xxoxooxxxo =     derived haplotype h’
x??xoox?x?       genotype g

We write h + h’ = g

g and h must be compatible to derive h’

2nd Objective (Clark, 1990)

1. Start with H = nonambiguous genotypes
2. while there exists an ambiguous genotype g in G
3. take h in H compatible with g and let h + h’ = g
4. set H = H + {h’} and G = G - {g}
5. end while




Example: genotypes oooo, xooo, ??oo, xx??

- derive xxoo from ??oo, then xxxx from xx??: SUCCESS

- derive oxoo from ??oo: FAILURE (can’t resolve xx??)

OBJ: find order of application rule that leaves the fewest elements in G

- Problem is APX-hard (Gusfield,00)

- Graph-Model + Integer Programming for practical solution (G.,01)



1. expand genotypes

x??o? expands to xxxox, xxxoo, xxoox, xxooo, xoxox, xooox, xoxoo, xoooo

2. create arc (h, h’) if there exists g s.t. h’ can be derived from g and h

3. Largest number of nodes in a forest rooted at unambiguous genotypes = largest number of ambiguous genotypes resolved

Hence, find the largest number of nodes in a forest rooted at unambiguous genotypes. Use an I.P. model with vars x(ij).

This reduction is exponential. Is there a better practical approach?

3rd Objective (open research problem): Disease Detection

oooxx

??oxx

?x??x

????x

xx??x

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }


xxoxxxxxox

oooxx

oooxxxxxox

xxoxxoxxox

xxoxxoooxx

oooxxoooxx

??oxx

?x??x

????x

xx??x

OUTPUT: H = { xxoxx, xxxox, oooxx, oxxox}

H contains H’, s.t. each diseased individual has one haplotype in H’ and each healthy individual has none. Minimize |H|

INPUT: G = { xx??x, ????x, ??oxx, ?x??x, oooxx }

Genome Rearrangements and Evolutionary Distances

Each species has a genome (organized in pairs of chromosomes)

tcgtgatggat………………ttgatggattga

tcgattatggat………………ttttgatatcca

Genomes evolve by means of

• Insertions
• Deletions
• Inversions
• Transpositions
• Translocations

of DNA regions

[Figure: examples of deletion, insertion, translocation, inversion and transposition]

Combinatorial problem: given 2 permutations P, Q and operators in a set F, find a shortest sequence f1, …, fk of operators such that Q = fk(fk-1(…(f1(P))))

Very difficult problem! We focus on operators all of the same type (e.g. inversions) (…still difficult…). Wlog we can take Q = (1 2 … n). Hence we talk of sorting by … (inversions, transpositions…)

We focus on inversions, which are the most important in Nature

Example:

5 6 4 8 3 2 1 9 7
1 2 3 8 4 6 5 9 7
1 2 3 8 4 5 6 9 7
1 2 3 6 5 4 8 9 7
1 2 3 6 5 4 8 7 9
1 2 3 4 5 6 8 7 9
1 2 3 4 5 6 7 8 9
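The example above can be replayed with a short sketch. The function name `invert` and the 0-based, inclusive segment indices in `steps` are assumptions made for this illustration; the index pairs were read off the sequence of permutations in the example.

```python
def invert(perm, i, j):
    """Return a copy of perm with the segment perm[i..j] reversed (0-based, inclusive)."""
    return perm[:i] + perm[i:j + 1][::-1] + perm[j + 1:]


perm = [5, 6, 4, 8, 3, 2, 1, 9, 7]
steps = [(0, 6), (5, 6), (3, 6), (7, 8), (3, 5), (6, 7)]

history = [perm]
for i, j in steps:
    perm = invert(perm, i, j)
    history.append(perm)

for p in history:
    print(" ".join(map(str, p)))
```

Each printed line is obtained from the previous one by a single inversion, ending with the identity permutation.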

There is also a SIGNED VERSION of the problem!

Example:

+5 +6 -4 -8 -3 -2 -1 -9 +7
+1 +2 +3 +8 +4 -6 -5 -9 +7
+1 +2 +3 +8 +4 +5 +6 -9 +7
+1 +2 +3 -6 -5 -4 -8 -9 +7
+1 +2 +3 -6 -5 -4 -8 -7 +9
+1 +2 +3 +4 +5 +6 -8 -7 +9
+1 +2 +3 +4 +5 +6 +7 +8 +9

(Unsigned) Sorting by Inversions is NP-hard (longstanding question, settled by Caprara ‘98)

Surprisingly, Signed Sorting by Inversions is Polynomial (beautiful theory, by Hannenhalli and Pevzner)

The complexity of Sorting by Transpositions, e.g., is unknown

The concept of breakpoint

0 5 7 8 2 1 4 3 6 9 10     (p extended with 0 and n+1 = 10)

Breakpoint at position i if | p(i) - p(i+1) | > 1

d(p) = inversion distance

b(p) = # breakpoints

TRIVIAL BOUND: d(p) >= b(p) / 2

Example: d(p) >= 6 / 2 = 3
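Counting breakpoints and evaluating the trivial bound takes only a few lines. This is a minimal sketch; the function names are hypothetical, and the permutation is extended with 0 and n+1 as on the slide.

```python
def breakpoints(perm) -> int:
    """Count positions i of the extended permutation with |p(i) - p(i+1)| > 1."""
    ext = [0] + list(perm) + [len(perm) + 1]
    return sum(1 for a, b in zip(ext, ext[1:]) if abs(a - b) > 1)


def trivial_lower_bound(perm) -> int:
    """An inversion removes at most 2 breakpoints, so d(p) >= ceil(b(p) / 2)."""
    return -(-breakpoints(perm) // 2)


p = [5, 7, 8, 2, 1, 4, 3, 6, 9]
print(breakpoints(p), trivial_lower_bound(p))
```

The bound holds because one inversion changes only the two adjacencies at the ends of the reversed segment.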

The Breakpoint Graph

[Figure: breakpoint graph of 0 5 7 8 2 1 4 3 6 9 10]

Each node has degree 0, 2 or 4, hence the graph can be decomposed into cycles!

Alternating cycle decomposition

c(p) = max # cycles in alternating decomposition

VERY STRONG BOUND: d(p) >= b(p) - c(p)

Example: c(p) = 2 and d(p) >= 6 - 2 = 4

The best algorithm for this problem is based on an Integer Programming formulation of the max cycle decomposition

A variable x_C for each alternating cycle C (exponential # of vars…)

A constraint \sum_{C \ni e} x_C = 1 for each edge e

Objective: maximize \sum_C x_C

PRIMAL

max \sum_C x_C

\sum_{C \ni e} x_C = 1 for all edges e

x_C \in {0,1} for all alt. cycles C

DUAL (of the LP relaxation)

min \sum_e y_e

\sum_{e \in C} y_e >= 1 for all alt. cycles C

y_e \in R for all edges e

Pricing out the cycles for which y*(C) < 1

Split the graph in two copies

Connect twins

A perfect matching corresponds to (a set of) alternating cycles

The weight of the matching is the y*-weight of the cycles

[Figure: edge weights .2, .4, .5, 1, .6, 0]

Forcing a cycle to use a certain node

[Figure: same edge weights, with the weight 0 replaced by 100000]

- These cycles would not use the same node twice, but with a simple trick it is possible to model this (OMISSIS)

BRANCH&PRICE algorithm by Caprara, Lancia, Ng (1999,2001)

BRANCH&BOUND combinatorial algorithm by Kececioglu, Sankoff (1996)

KS can solve at most n=40. Takes days for n=50

CLN can solve for n=200. Takes a few seconds (say 5) for n=100

NP-hard problem practically solved to optimality!

Statistical view of evolution

• Genomes evolve by random inversions
• It’s like a random walk on a huge graph with a node for each permutation and an edge for each inversion
• It is not clear why the shortest solution should be the one followed by Nature (in fact, often it isn’t)
• We want to find the most likely number of inversions that lead from (1 2 … n) to p
• We use the expected number of breakpoints after k inversions as a way to guess the # of inversions

Let B(k) be the (r.v.) number of breakpoints after k random inversions from (1 … n)

Given a p obtained by h random inversions from (1 … n ) we want to estimate h

The inversion distance is only a lower bound: h >= d(p) but the gap could be big

We estimate E[B(k)]. Then, faced with some p, we pick h such that E[B(h)] is as close as possible to b(p) (maximum likelihood). CL, 2000, have shown:

E[B(k)] = (n - 1) (1 - ((n - 3)/(n - 1))^k)

Question: estimate E[D(k)], the (r.v.) inversion distance after k random inversions
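The estimation step can be sketched as a simple search over k. The sketch assumes the closed form E[B(k)] = (n - 1)(1 - ((n - 3)/(n - 1))^k); the function names and the search limit `k_max` are hypothetical choices for this illustration.

```python
def expected_breakpoints(n: int, k: int) -> float:
    """E[B(k)] under the assumed closed form (n - 1) * (1 - ((n - 3) / (n - 1)) ** k)."""
    return (n - 1) * (1 - ((n - 3) / (n - 1)) ** k)


def estimate_inversions(n: int, b: int, k_max=None) -> int:
    """Pick the k whose expected number of breakpoints is closest to the observed b."""
    if k_max is None:
        k_max = 10 * n  # arbitrary search limit, an assumption of this sketch
    return min(range(k_max + 1), key=lambda k: abs(expected_breakpoints(n, k) - b))


# n = 200 and an observed number of breakpoints b = 16
print(estimate_inversions(200, 16))
```

Since E[B(k)] increases with k and saturates at n - 1, the estimate becomes unreliable when b is close to n - 1.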

Example: n = 200, k (u.a.r. in 1…n) inversions

k      k’     d(p)   b
8      8      8      16
19     19     19     34
68     67     67     98
69     73     68     104
73     79     73     109
85     91     83     120
86     85     83     115
87     90     84     119
118    117    109    138
184    184    135    168

Protein Structure Alignments: the Maximum Contact Map Overlap Problem

A protein is a complex molecule with a primary, linear structure (a sequence of amino acids) and a 3-dimensional structure (the protein fold).

Protein STRUCTURE determines its FUNCTION

For instance, the Drug Design problem calls for constructing peptides with a 3D shape complementary to a protein, so as to dock onto it.

Motivation: Structure Alignment is Important for:

- Discovery of Protein Function (shape determines function)

- Search in 3D data bases

- Protein Classification and Evolutionary Studies

- ...

Problem: Align two 3D protein structures

CONTACT MAPS

Unfolded protein

Folded protein = contacts

Contact map = graph

OBJECTIVE: align 3d folds of proteins = align contact maps

Contact Map Alignments

Non-crossing Alignments

Protein 1

Protein 2

non-crossing map of residues in protein 1 and protein 2

The value of an alignment

[Figure: a non-crossing alignment with Value = 3]

We want to maximize the value

NP-Hard
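The value of a given alignment is easy to compute, even though finding the best alignment is hard. The sketch below counts shared contacts, i.e. contacts of protein 1 whose two endpoints are aligned to the endpoints of a contact of protein 2. The data layout (contacts as pairs of residue indices, the alignment as a dictionary) and the toy instance are assumptions for illustration.

```python
def alignment_value(contacts1, contacts2, alignment) -> int:
    """Count contacts (i, p) of protein 1 mapped onto a contact of protein 2.

    alignment maps residues of protein 1 to residues of protein 2 and must be
    non-crossing, i.e. strictly increasing.
    """
    pairs = sorted(alignment.items())
    for (_, j), (_, j_next) in zip(pairs, pairs[1:]):
        if j_next <= j:
            raise ValueError("alignment must be non-crossing")
    targets = {frozenset(c) for c in contacts2}
    return sum(
        1
        for i, p in contacts1
        if i in alignment and p in alignment
        and frozenset((alignment[i], alignment[p])) in targets
    )


contacts1 = [(1, 3), (2, 4), (3, 5)]
contacts2 = [(1, 2), (2, 4), (3, 5), (4, 6)]
alignment = {1: 2, 2: 3, 3: 4, 5: 6}
print(alignment_value(contacts1, contacts2, alignment))
```

In this toy instance residue 4 of protein 1 is unaligned, so the contact (2, 4) cannot be shared.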


Integer Programming Formulation


0-1 VARIABLES

y_ef for e and f contacts (e in protein 1, f in protein 2)

CONSTRAINTS

y_ef + y_e’f’ <= 1

[Figure: contacts e, e’ in protein 1 and f, f’ in protein 2 whose sharings are incompatible]

OBJECTIVE

max \sum_e \sum_f y_ef

Independent Set Problem

It’s just a huge max independent set problem in Gy:

• a node for each sharing
• an edge for each pair of incompatible sharings

[Figure: contacts e, e’, e’’ and f, f’, f’’ with the sharings ef, e’f’, e’’f’’ as nodes of Gy]

|Gy|=|E1|*|E2| (approximately 5000 for two proteins with 50 residues and 75 contacts each)

The best exact algorithm for independent set can solve for at most a few hundred nodes

Node to Node Variables

New variables x provide an easy check for the non-crossing conditions

NEW VARIABLES

x_ij for i and j residues (i in protein 1, j in protein 2)

NEW CONSTRAINTS

x_ij + x_i’j’ <= 1 for crossing lines [i,j] and [i’,j’]

y_(ip)(jq) <= x_ij and y_(ip)(jq) <= x_pq

Clique Constraints

Variables x define a graph Gx:

• A node for each line
• An edge between each pair of crossing lines

• Gx is much smaller than Gy
• Gx has nice properties (it’s a perfect graph)
• It’s easier to find large independent sets in Gx

Non-crossing constraints can be extended to CLIQUE CONSTRAINTS

\sum_{[i,j] in M} x_ij <= 1

for all sets M of mutually incompatible (i.e. crossing) lines

All clique constraints satisfied (and Gx perfect) imply a strong bound!

Structure of Maximal cliques in Gx

1. Pick two subsets of same size


2. Connect them in a zig-zag fashion


3. Throw in all lines included in a zig or a zag


The result is a maximal clique in Gx

Separation of Clique Inequalities

PROBLEM

There exist exponentially many such cliques (O(2^(2n)) inequalities).

We need to generate in polynomial time a clique inequality when needed, i.e., when violated by the current LP solution x*:

\sum_{[i,j] in M} x*_ij > 1

THEOREM

We can find the most violated clique inequality in time O(n^2)

Separation of Clique Inequalities: PROOF (sketch)

1) Clique = zigzag path

2) Flip one graph: zigzag becomes left-right

1 2 3 4 5 6 7 8
8 7 6 5 4 3 2 1

3) Define a grid with lengths for arcs so that length(P) = x*(clique(P)). Use Dyn. Progr. to find the longest path in the grid, time O(n^2)

Separation of cliques

Create an n1 x n2 grid. Orient all edges and give weights (x*_iu at grid position (i, u))

There is a violated clique iff the longest A,B path has length > 1, where A = (1, n2) and B = (n1, 1)

Gx is a Perfect Graph

We show why polynomial separation is possible:

Gx is weakly triangulated (no chordless cycles of length >= 5 in Gx or in its complement)

=> Gx is perfect (Hayward, 1985)


PROOF (Sketch, for Gx)

[Figure: a chordless cycle of lines L1, L2, L3, L4, L5, L6, L7]

L1 and L3 don’t cross. Wlog RIGHT(L3, L1)

For i=4,5,… Li crosses Li-1 but not L1 => RIGHT(Li, L1)

We get LEFT(L1, {L3, L4, L5, L6})

A symmetric argument started at L6, with LEFT(L1, L6), implies LEFT(Li, L6) for i=2,3,4,5

Then {L3, L4, L5} are between L1 and L6

But L7 crosses L1 and L6, and so should cross them all!

The approach just seen is due to Lancia, Carr, Istrail, Walenz (2001). It can be applied to small or moderate proteins (up to 80 residues/150 contacts)

In 2002, a new approach by Caprara and Lancia, based on LAGRANGIAN RELAXATION. Approach borrowed from Quadratic Assignment. With the new approach we can solve important proteins (up to 150 residues/300 contacts)

What about Heuristics? E.g., genetic algorithms…

Genetic Algorithm Overview

• A Population of candidate solutions that evolve (improve) over time

• Recombination creates new candidate solutions via crossover and mutation

[Figure: Population at time t, Recombination operators, Evaluation function, Population at time t+1]

Crossover

• Crossover selects pieces from both parents and creates two offspring solutions
– Select a set of edges in one parent to copy to the child
– Copy as many edges as possible from the other parent (edges that conflict with existing edges are not copied)
– Add random edges to fill any remaining space

[Figure: Blue Parent, Red Parent, Offspring]

Mutation

• Mutation introduces small changes to existing solutions by shifting edge endpoints
– Select a set of endpoints to shift (an edge that “fell off” the end of the contact map is removed)
– Randomly add new edges

Computational Results

• 269 proteins
– 70-100 residues
– 80 to 140 contacts
• Picked 10,000 pairs of proteins out of 36046 possible
• Took a weekend on PC
• 500 were solved to optimality
• 2500 had a gap <= 10 contacts

Skolnick Clustering Test

Skolnick Results

• Four Families

Family  Name                                 Style       Structures  Residues (up to)  Seq. Sim.  RMSD   Proteins
1       Flavodoxin-like fold Che-Y related   alpha-beta  8           124               15-30%     < 3Å   1b00, 1dbw, 1nat, 1ntr, 1qmp, 1rnl, 3cah, 4tmy
2       Plastocyanin                         beta        8           99                35-90%     < 2Å   1baw, 1byo, 1kdi, 1nin, 1pla, 3b3i, 2pcy, 2plt
3       TIM Barrel                           alpha-beta  11          250               30-90%     < 2Å   1amk, 1aw2, 1b9b, 1btm, 1hti, 1tmh, 1tre, 1tri, 1ydv, 3ypi, 8tim
4       Ferritin                             alpha       6           170               7-70%      < 4Å   1b71, 1bcf, 1dps, 1fha, 1ier, 1rcd

Clustering

Define score(P1, P2) as

0 <= (# shared contacts) / (min # of contacts of P1, P2) <= 1

Put P1, P2 in same family if score(P1, P2) >= threshold

If P1, P2 too big, use G.A. and local search to compute score

L.P. then gives bounds:

HEUR score <= OPT score <= LP bound

and we know how far off OPT we are
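The score and the family test can be written down directly. This is only a sketch: the function names are hypothetical, and the default threshold of 0.35 is an assumed value for illustration, not a recommendation.

```python
def score(shared_contacts: int, contacts_p1: int, contacts_p2: int) -> float:
    """Shared contacts divided by the smaller contact count; always in [0, 1]."""
    smaller = min(contacts_p1, contacts_p2)
    if smaller == 0:
        raise ValueError("both proteins need at least one contact")
    if not 0 <= shared_contacts <= smaller:
        raise ValueError("shared contacts cannot exceed the smaller contact count")
    return shared_contacts / smaller


def same_family(shared_contacts, contacts_p1, contacts_p2, threshold=0.35) -> bool:
    """Put P1, P2 in the same family if the score reaches the threshold."""
    return score(shared_contacts, contacts_p1, contacts_p2) >= threshold


print(score(30, 100, 120), same_family(30, 100, 120))
print(score(80, 100, 120), same_family(80, 100, 120))
```

When the number of shared contacts comes from a heuristic, the resulting score is a lower bound on the optimal score, so a positive answer from `same_family` remains valid for the optimum.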

Clustering validation

We got some known families from biologists, PDB.

Experiment: Take a family F of proteins and align them against each other and against the remaining.

TYPICAL BEHAVIOUR

score    proteins were…
0.05     MISMATCH
0.1      MISMATCH
0.15     MISMATCH
0.2      MISMATCH
0.25     MISMATCH
0.3      MISMATCH
0.35     MATCH
……       ……
1.0      MATCH

Skolnick Results

• Performance
– 528 alignments
– 1.3% false negative
– 0.0% false positive

Clustering

Computed, for the 1st time, provably optimal alignments for 150 pairs (inter-family)

Used the CMO value to cluster: retrieves the clusters.

Set S(i,j) = 1 if CMO >= a, S(i,j) = 0 otherwise

Use TSP to find a block diagonal structure for S

Clustering

Last Open Problem

? ?
