Two Solutions in Search of Killer Apps.

Two Solutions in Search of Killer Apps. Dimacs workshop on Algorithms in Human Population Genomics Dan Gusfield UC Davis


Page 1: Two Solutions in Search of Killer Apps.

Two Solutions in Search of Killer Apps.

Dimacs workshop on Algorithms in Human Population Genomics

Dan Gusfield, UC Davis

Page 2: Two Solutions in Search of Killer Apps.

Two Algorithmic Topics

We have new algorithmic tools for a) computing the Minimum Mosaic of a set of recombinants, and for b) multi-state Perfect Phylogeny with missing data, that should be of use in Population Genomics and Phylogenetics. These tools were developed `on spec’:

We have `hand-waving’ arguments for their utility, but no actual (biological data-set) applications.

Suggestions wanted.

Page 3: Two Solutions in Search of Killer Apps.

Topic I: Improved Algorithms for Inferring the Minimum Mosaic of a Set of Recombinants

Yufeng Wu and Dan Gusfield, UC Davis

From CPM 2007

Page 4: Two Solutions in Search of Killer Apps.

Recombination

• Recombination: one of the principal genetic forces shaping sequence variations within species.
• Two equal length sequences generate a new equal length sequence.

[Figure: the parent sequences 110001111111001 and 000110000001111 recombine at a single breakpoint. The prefix 11000 comes from the first parent and the suffix 0000001111 from the second.]
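A minimal sketch of the single-crossover operation described on this slide, using the two parent sequences from the figure. The function name `recombine` and the parameter `cut` (the number of sites taken from the first parent) are illustrative choices, not names from the talk.

```python
def recombine(prefix_parent: str, suffix_parent: str, cut: int) -> str:
    """Single crossover: first `cut` sites from one parent, the rest from the other."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 <= cut <= len(prefix_parent):
        raise ValueError("cut must lie inside the sequence")
    return prefix_parent[:cut] + suffix_parent[cut:]


# Parents from the slide; the breakpoint falls after the fifth site.
child = recombine("110001111111001", "000110000001111", 5)
print(child)
```

The child has the same length as its parents, which is the property the second bullet relies on.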

Page 5: Two Solutions in Search of Killer Apps.

Founders and Mosaic

• Current sequences are descendants of a small number of founders.
– A current sequence is composed of blocks from the founders, due to recombination.
– No mutations since formation of founders.

[Figure: two founders, 000000 and 111111, recombine at breakpoints over the generations. The sampled sequences in the current population are 000000, 001111, 111100 and 011100. The mosaic shows each sampled sequence as blocks taken from the founders.]

Page 6: Two Solutions in Search of Killer Apps.

The Minimum Mosaic Problem

• Given a set of aligned binary sequences in the current population, and assuming the number of founders is known to be Kf, find a set of founders and the mosaic with the minimum number of breakpoints.

Input sequences: 1101101, 1010001, 0111111, 0110100, 1100011

Assume Kf = 3

Three founders: 1101111, 1010001, 0110100

Four breakpoints: the minimum over all possible sets of three founders

Page 7: Two Solutions in Search of Killer Apps.

Status of the Minimum Mosaic Problem

• First studied by E. Ukkonen (WABI 2002). Later WABI 2007.
– Dynamic programming method. Not practical when the number of rows is more than 20 and Kf > 2.
• No polynomial-time algorithm was known even when Kf is small. No NP-completeness result is known.
• Our results:
– A simple polynomial-time algorithm for the Kf = 2 case.
– Exact and practical method for data of medium range for Kf ≥ 3.

Page 8: Two Solutions in Search of Killer Apps.


Three or More Founders: Assuming Known Founders

Input sequences: 1101101, 1010001, 0111111, 0110100, 1100011

Three founders: 1101111, 1010001, 0110100

With known founders, we can minimize the breakpoints for each sequence separately, and thus also minimize the total number of breakpoints.

For each input sequence, starting from the left, insert a breakpoint at the end of the longest segment matching one founder.

Founder mapping: for each position c in any input sequence s, the founder from which s[c] takes its value.

[Figure: founder mapping of the input sequence 1101101, which is copied from Founder 1 and then, after one breakpoint, from Founder 2.]
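The greedy rule for known founders can be sketched as follows, using the input sequences and founders from this slide. This is an illustrative reading of the rule, not the authors' code; `min_breakpoints` is a hypothetical name.

```python
def min_breakpoints(seq, founders):
    """Greedy left-to-right scan: extend each segment as far as one founder allows."""
    pos, breaks, n = 0, 0, len(seq)
    while pos < n:
        # Length of the longest run starting at pos that matches some founder.
        best = 0
        for f in founders:
            k = 0
            while pos + k < n and f[pos + k] == seq[pos + k]:
                k += 1
            best = max(best, k)
        if best == 0:
            raise ValueError(f"no founder matches position {pos}")
        pos += best
        if pos < n:
            breaks += 1  # a new segment starts here, so this is a breakpoint
    return breaks


founders = ["1101111", "1010001", "0110100"]
sequences = ["1101101", "1010001", "0111111", "0110100", "1100011"]
per_sequence = [min_breakpoints(s, founders) for s in sequences]
total = sum(per_sequence)
print(per_sequence, total)
```

The total agrees with the four breakpoints quoted for this example on the earlier slide.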

Page 9: Two Solutions in Search of Killer Apps.

Enumerating Founders for Founder-Unknown Case

In reality, founders are not known. A straightforward way is to simply enumerate all possible sets of founders, and then run the previous method to find the minimum mosaic.

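[Figure: the six founder settings for one column with three founders: 100, 001, 011, 101, 110, 010.]

At each column, there are $2^{K_f} - 2$ founder settings.

Let m be the number of columns. Fully enumerating all possible sets of founders takes $O(2^{m K_f})$ time, which is infeasible when m or Kf is large.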

Need more ideas to develop a practical method. First, we do the enumeration in the form of search paths in a search tree.
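To make the size of the enumeration concrete, here is a small sketch that lists the founder settings for one column. It assumes the two excluded settings are all-0 and all-1 (a column on which every founder agrees), which matches the six settings shown for three founders. The names `column_settings` and `search_space_bound` are illustrative.

```python
from itertools import product


def column_settings(num_founders):
    """All 0/1 assignments to the founders at one column, skipping all-0 and all-1."""
    return ["".join(bits) for bits in product("01", repeat=num_founders)
            if len(set(bits)) > 1]


def search_space_bound(num_columns, num_founders):
    """The bound on the number of founder sets: 2^(m * Kf)."""
    return 2 ** (num_columns * num_founders)


settings = column_settings(3)
print(settings)                    # the six settings for three founders
print(search_space_bound(36, 5))   # 36 columns, five founders: 2^180
```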

Page 10: Two Solutions in Search of Killer Apps.

Search Paths and Search Tree

It works, but the number of search paths blows up exponentially!

An obvious idea to reduce the search space: branch and bound (compute a lower bound and …).

But we found that a different idea is more useful.

On-line computation: compute the partial solution up to the current column for speedup.

The founder-known method can be run with partially-known founders!

[Figure: search tree, assuming three founders. Each level (c1, c2, c3) corresponds to one column. Each node holds the founder setting chosen at that column, for example 001 at column one, together with the number of total breakpoints up to the current column. A path from the root gives the founder settings up to column 3.]

Page 11: Two Solutions in Search of Killer Apps.

Dropping Search Paths that are Beaten by Another Search Path

P1 and P2 are two search paths up to column 2.

Can we say P1 is better than P2? Not really, because maybe P2 can lead to fewer breakpoints later on.

But suppose the number of input sequences is 5. We can then say P1 beats P2 (and so drop P2). Why?

[Figure: search paths P1 and P2 with their founder configurations, assuming three founders. The breakpoint counts shown at column 2 are 0 and 6. An optimal search path following P2 ends at 40 breakpoints. The alternative path is labeled "<= 5 bkpts" and ">= 0 bkpts", and ends at "<= 39".]
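One reading of this slide's argument (an interpretation, since the slide leaves the "Why?" to the audience): P1 can adopt every founder setting that P2 uses after the current column, and doing so costs at most one extra breakpoint per input sequence at the switch. So if P1 is ahead by at least the number of input sequences, it can never end up worse than P2. A minimal sketch of that test, with the hypothetical name `beats`:

```python
def beats(breaks_p1, breaks_p2, num_sequences):
    """True if path P1 can safely replace path P2 at the same column.

    Assumption: switching to P2's later founder settings costs P1 at most
    one extra breakpoint per input sequence.
    """
    return breaks_p1 + num_sequences <= breaks_p2


# Numbers read off the slide: 0 and 6 breakpoints so far, 5 input sequences.
print(beats(0, 6, 5))

# If the best completion of P2 reaches 40 breakpoints in total, the same
# completion started from P1 reaches at most this many:
p2_total = 40
p1_bound = 0 + 5 + (p2_total - 6)
print(p1_bound)
```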

Page 12: Two Solutions in Search of Killer Apps.

A more powerful beating rule

We use a more powerful, but more complex, rule to identify paths that will be beaten, and we use rules that avoid generating beaten paths and redundant, symmetric paths.

These methods reduce the enumeration enormously, allowing practical computation of the optimal.

Page 13: Two Solutions in Search of Killer Apps.

How Practical Is Our Method?

Source of data and image: UNC Chapel Hill

Five founders

20 rows, 36 columns

UNC’s heuristic solution: 54 breakpoints

Enumerating $2^{180}$ founder states is impossible!

Our method takes 5 minutes to find the optimal solution: 53 breakpoints. It is also practical for a 50x50 matrix with four founders.

Page 14: Two Solutions in Search of Killer Apps.

Another example

The data from Ukkonen’s 2007 WABI paper (4 founders, 20 sequences, 40 sites) was solved in 5 secs and used one fewer crossover than used in that paper.

Page 15: Two Solutions in Search of Killer Apps.

Applications?

• Founders on an island
• Founders in microbial communities
• ???

Page 16: Two Solutions in Search of Killer Apps.

Topic II: Multi-State Perfect Phylogeny with Missing and Removable Data

To appear in Recomb 2009, May 09

Dan Gusfield

Page 17: Two Solutions in Search of Killer Apps.

The Perfect Phylogeny Model for binary sequences

[Figure: a rooted tree over sites 1 to 5. The root is labeled with the ancestral sequence 00000, each edge is labeled with the site that mutates on it, and the extant sequences are at the leaves.]

The tree derives the set M:

    10100
    10000
    01011
    01010
    00010

Only one mutation per site allowed (infinite sites)
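As a sanity check on the example, here is a sketch of the standard pairwise test for the rooted binary case with an all-zero ancestral sequence: a perfect phylogeny exists exactly when, for every pair of sites, the sets of rows carrying a 1 are disjoint or nested. The talk does not present this test; it is included only to make the model concrete.

```python
def has_rooted_binary_perfect_phylogeny(rows):
    """Pairwise test for binary rows, assuming the ancestral sequence is all zeros."""
    m = len(rows[0])
    # For each site, the set of rows that carry the derived state 1.
    ones = [{i for i, r in enumerate(rows) if r[c] == "1"} for c in range(m)]
    for a in range(m):
        for b in range(a + 1, m):
            A, B = ones[a], ones[b]
            # Overlapping but not nested: the two sites cannot share a tree.
            if A & B and not (A <= B or B <= A):
                return False
    return True


M = ["10100", "10000", "01011", "01010", "00010"]
print(has_rooted_binary_perfect_phylogeny(M))
print(has_rooted_binary_perfect_phylogeny(["10", "11", "01"]))
```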

Page 18: Two Solutions in Search of Killer Apps.

What is a Perfect Phylogeny for k-state characters?

• Input consists of n sequences M with m sites (characters) each, where each site can take one of k > 2 states.

• In a Perfect Phylogeny T for M, each node of T is labeled with an m-length sequence where each site has a value from 1 to k.

• T has n leaves, one for each sequence in M, labeled by that sequence.

• Arbitrarily root T at some node, and direct all the edges away from the root. Then, for any character C with b states, there are at most b-1 edges where character C mutates, and for any state s of C, there is at most one edge where character C mutates to state s. This reflects the infinite alleles model more than the infinite sites model.

Page 19: Two Solutions in Search of Killer Apps.

Example: A perfect phylogeny for input M

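Table M (n = 5, m = 3, k = 3):

        A B C
    1   3 2 1
    2   2 3 2
    3   3 2 3
    4   1 1 3
    5   1 2 3

[Figure: a perfect phylogeny for M with a marked root. The node labels are (3,2,1), (2,3,2), (3,2,3), (1,2,3), (1,1,3), (1,2,3) and (3,2,3); the five rows of M label the leaves.]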

Page 20: Two Solutions in Search of Killer Apps.

A more standard definition

• For each character-state pair (C,s), the nodes of T that are labeled with state s for character C, form a connected subtree of T.

• It follows that the subtrees for any C are node-disjoint. This condition is called the convexity requirement.
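The convexity requirement can be checked directly on a labeled tree. The sketch below uses the five rows of the example table M as leaves. The tree shape (two internal nodes `x` and `y` joined by an edge) is an assumption made for illustration, since the transcript does not preserve the edges of the figure.

```python
from collections import defaultdict


def is_convex(labels, edges):
    """True if, for every character and state, the nodes with that state are connected."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    num_chars = len(next(iter(labels.values())))
    for c in range(num_chars):
        for s in {lab[c] for lab in labels.values()}:
            nodes = {v for v, lab in labels.items() if lab[c] == s}
            # Depth-first search restricted to nodes carrying state s.
            start = next(iter(nodes))
            seen, stack = {start}, [start]
            while stack:
                u = stack.pop()
                for w in adj[u]:
                    if w in nodes and w not in seen:
                        seen.add(w)
                        stack.append(w)
            if seen != nodes:
                return False
    return True


# Leaves 1..5 are the rows of M; "x" and "y" are assumed internal nodes.
labels = {1: (3, 2, 1), 2: (2, 3, 2), 3: (3, 2, 3), 4: (1, 1, 3), 5: (1, 2, 3),
          "x": (3, 2, 3), "y": (1, 2, 3)}
edges = [(1, "x"), (2, "x"), (3, "x"), ("x", "y"), (4, "y"), (5, "y")]
print(is_convex(labels, edges))

# Relabeling an internal node can break convexity.
bad_labels = dict(labels, y=(2, 2, 3))
print(is_convex(bad_labels, edges))
```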

Page 21: Two Solutions in Search of Killer Apps.

Example

Table M:

        A B C
    1   3 2 1
    2   2 3 2
    3   3 2 3
    4   1 1 3
    5   1 2 3

n = 5 number of taxa
m = 3 number of sites
k = 3 number of states

[Figure: the same perfect phylogeny, with node labels (3,2,1), (2,3,2), (3,2,3), (1,2,3), (1,1,3), (1,2,3) and (3,2,3). The subtree for State 2 of Character B is highlighted.]

Page 22: Two Solutions in Search of Killer Apps.

Perfect Phylogeny Problems

Existence Problem: Given M, is there a Perfect Phylogeny for M?

Missing Data Problem: For a given k, if there are cells in M without values, can values less than or equal to k be imputed so that the resulting matrix M’ has a perfect phylogeny?

Handling missing data extends the utility of the perfect-phylogeny model.

Page 23: Two Solutions in Search of Killer Apps.

Status of the Existence Problem

Poly-time algorithm for 3 states, Dress-Steel 1991

Poly-time algorithm for 3 or 4 states, Kannan-Warnow

Poly-time algorithm for any fixed number of states - polynomial in n and m, but exponential in k, Agarwala and Fernandez-Baca

Speed up of the method by Kannan-Warnow

When k is not fixed, the existence problem is NP-hard

Page 24: Two Solutions in Search of Killer Apps.

Status of missing data problem

NP-complete even for k = 2; effective integer programming approaches for k = 2.

Polynomial-time methods for a `directed’ variant of k = 2.

No literature on the missing data problem for k > 2.

New work here: specialized ILP methods for k = 3, 4, 5 and a general ILP solution for any fixed k. In this talk I will only discuss the general solution.

Page 25: Two Solutions in Search of Killer Apps.

New approach to existence and missing data

Based on an old theorem and newer techniques.

Old theorem: Buneman’s theorem relating Perfect-Phylogeny to chordal graphs.

Newer techniques and theorems: Minimal triangulations of a non-chordal graph to make it chordal.

Page 26: Two Solutions in Search of Killer Apps.

Chordal Graphs

A graph G is called Chordal if every cycle of length four or more contains a chord.

Page 27: Two Solutions in Search of Killer Apps.

Another Classic (1970s) Characterization

A graph G is chordal if and only if it is the intersection graph of a set S of subtrees of a tree T. Each node of G is a member of S.

[Figure: a graph G on the nodes a, b, c, d, e, f, g, and a tree T whose nodes are labeled {b,c}, {b,c,d}, {c,d,e,g}, {a,e}, {e,f,g} and {a,e,g}.]

Page 28: Two Solutions in Search of Killer Apps.

Relation to Perfect Phylogeny

In a perfect phylogeny T for a table M, for any character C and any state s of character C, the sub-forest of T induced by the nodes labeled (C,s) forms a single, connected subtree of T.

So, there is a natural set of subtrees of T induced by M, which hints at the relationship of perfect phylogeny to chordal graphs. Buneman’s theorem makes this precise.

Page 29: Two Solutions in Search of Killer Apps.

Buneman’s Approach to Perfect Phylogeny

Table M (columns are characters 1, 2, 3):

    3 2 1
    2 3 2
    3 2 3
    1 1 3
    1 2 3

Each row of table M induces a clique in the Partition-Intersection graph G(M).

The Partition-Intersection Graph G(M) has one node for each character-state pair in M, and an edge between two nodes if and only if there is a row in M with both those character-state pairs.

[Figure: G(M), with one node for each of the states 1, 2, 3 of each of the three characters.]
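The definition of G(M) translates almost line for line into code. The sketch below builds the graph for the table M on this slide; nodes are written as (character, state) pairs, and the function name is illustrative.

```python
from itertools import combinations


def partition_intersection_graph(M):
    """Nodes are (character, state) pairs; each row of M contributes a clique."""
    nodes, edges = set(), set()
    for row in M:
        pairs = [(c + 1, s) for c, s in enumerate(row)]
        nodes.update(pairs)
        # Every two character-state pairs that share a row are joined.
        edges.update(frozenset(e) for e in combinations(pairs, 2))
    return nodes, edges


M = [(3, 2, 1), (2, 3, 2), (3, 2, 3), (1, 1, 3), (1, 2, 3)]
nodes, edges = partition_intersection_graph(M)
print(len(nodes), len(edges))
```

Because every pair in a row comes from a different column, no edge ever joins two states of the same character.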

Page 30: Two Solutions in Search of Killer Apps.

Note that if table M has m columns, then G(M) is an m-partite graph. Nodes in the same class of G(M) are said to have the same color. Two nodes with the same color are called a mono-chromatic pair.

An edge (u,v) not in G(M) is legal if u and v are in different classes of the partition, i.e. are not a mono-chromatic pair.

Page 31: Two Solutions in Search of Killer Apps.

Buneman’s Theorem

Theorem (Buneman 1971?): There is a perfect phylogeny for M if and only if legal edges can be added to graph G(M) to make it chordal.

If there is such a chordal graph, denote it G’(M).

G’(M) is called a legal triangulation of G(M).

Page 32: Two Solutions in Search of Killer Apps.

From Chordal Graph to Perfect Phylogeny

Fact: Given a legal triangulation G’(M), a perfect phylogeny for M can be constructed in linear time.

The algorithms are based on `perfect elimination orders’ and `clique trees’. Many citations.

Page 33: Two Solutions in Search of Killer Apps.

Example

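Table M (columns are characters 1, 2, 3):

    A: 0 0 2
    B: 0 1 0
    C: 1 1 1
    D: 1 2 2

[Figure: the graph for M, labeled G’(M) on the slide, with nodes written as character,state: 1,0 1,1 2,0 2,1 2,2 3,0 3,1 3,2. The cliques induced by rows A, B, C and D are marked.]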

Page 34: Two Solutions in Search of Killer Apps.

A legal triangulation

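Table M (columns are characters 1, 2, 3):

    A: 0 0 2
    B: 0 1 0
    C: 1 1 1
    D: 1 2 2

[Figure: G’(M) on the nodes 1,0 1,1 2,0 2,1 2,2 3,0 3,1 3,2, with the cliques A, B, C, D and the additional cliques X and Y marked.]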
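Buneman's theorem can be tried out on this four-row table. The sketch below builds G(M), tests chordality by repeatedly deleting a simplicial vertex (a simple, slow test chosen for brevity, not the linear-time methods the talk alludes to), and then adds one legal edge. The choice of the edge between (2,1) and (3,2) is worked out by hand for this example and is not taken from the slide text.

```python
from itertools import combinations


def build_graph(M):
    """Adjacency sets of the partition-intersection graph of M."""
    adj = {}
    for row in M:
        pairs = [(c + 1, s) for c, s in enumerate(row)]
        for p in pairs:
            adj.setdefault(p, set())
        for u, v in combinations(pairs, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def is_chordal(adj):
    """A graph is chordal iff simplicial vertices can be deleted until none remain."""
    adj = {v: set(nb) for v, nb in adj.items()}
    while adj:
        simplicial = next((v for v, nb in adj.items()
                           if all(b in adj[a] for a, b in combinations(nb, 2))), None)
        if simplicial is None:
            return False
        for nb in adj[simplicial]:
            adj[nb].discard(simplicial)
        del adj[simplicial]
    return True


def is_legal(u, v):
    """An added edge is legal if its ends belong to different characters."""
    return u[0] != v[0]


M = [(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)]  # rows A, B, C, D
G = build_graph(M)
before = is_chordal(G)

u, v = (2, 1), (3, 2)  # hand-picked legal chord of the chordless 4-cycle
G[u].add(v)
G[v].add(u)
after = is_chordal(G)
print(before, is_legal(u, v), after)
```

The other chord of the same 4-cycle joins (1,0) and (1,1), two states of character 1, so it is not legal.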

Page 35: Two Solutions in Search of Killer Apps.

Yields a Perfect Phylogeny

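Table M (columns are characters 1, 2, 3):

    A: 0 0 2
    B: 0 1 0
    C: 1 1 1
    D: 1 2 2

One node in T for each maximal clique in G’(M)

[Figure: the perfect phylogeny T. The nodes for rows A, B, C and D are labeled 002, 010, 111 and 122, and the nodes for the cliques X and Y are labeled 012 and 112.]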

Page 36: Two Solutions in Search of Killer Apps.

What about Missing Data?

If M is missing data, build the partition intersection graph G(M) using the known data in M. Buneman’s theorem still holds:

There is a perfect phylogeny for some imputation of missing data in M, if and only if there is a legal triangulation of G(M).

The legal triangulation gives a perfect phylogeny T for M with some imputed data, and then the imputed M’ can be obtained from T.

Page 37: Two Solutions in Search of Killer Apps.

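Example

Table M (columns are characters 1, 2, 3; ? marks a missing value):

    A: 0 0 2
    B: 0 ? 0
    C: 1 1 1
    D: 1 2 2

[Figure: G(M) built from the known entries only, on the nodes 1,0 1,1 2,0 2,1 2,2 3,0 3,1 3,2, with the cliques A, B, C and D marked.]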

Page 38: Two Solutions in Search of Killer Apps.

The Key Problem

So the key problem in this approach to both the Existence and the Missing Data problems is how to find a legal triangulation, if there is one.

But, there is a robust and still expanding literature on efficient algorithms to find a minimal triangulation of a non-chordal graph.

Some triangulation problems are NP-hard (Tree-width, Minimum fill-in).

Page 39: Two Solutions in Search of Killer Apps.

Minimal triangulation

A triangulation of a non-chordal graph G is minimal if no proper subset of the added edges is a triangulation of G.

Clearly, if there is a legal triangulation G’(M) of G(M), then there is one which is a minimal triangulation.

So we can take advantage of the minimal triangulation technology. The minimal vertex separators are the key objects.

Page 40: Two Solutions in Search of Killer Apps.

Minimal vertex separators

A set of vertices S whose removal separates vertices u and v is called a u,v separator. S is a minimal u,v separator if no proper subset of S is a u,v separator.

S is a minimal separator if it is a minimal u,v separator for some u,v.

Minimal separator S crosses minimal separator S’ if S separates some pair of nodes in S’.

Crossing is a symmetric relation for minimal separators.

Page 41: Two Solutions in Search of Killer Apps.

Example

[Figure: G(M) on the nodes 1,0 1,1 2,0 2,1 2,2 3,0 3,1 3,2, with the cliques A, B, C and D marked.]

{(2,1), (3,2)} and {(1,0), (1,1)} are crossing minimal separators.

{(2,1), (1,1)} and {(1,0), (3,2)} are non-crossing minimal separators.
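The two statements in this example can be checked mechanically. The sketch below rebuilds G(M) for the four-row table (rows A to D) and implements "separates" and "crosses" as defined on the previous slide. The helper names are illustrative.

```python
from itertools import combinations


def build_graph(M):
    """Adjacency sets of the partition-intersection graph of M."""
    adj = {}
    for row in M:
        pairs = [(c + 1, s) for c, s in enumerate(row)]
        for p in pairs:
            adj.setdefault(p, set())
        for u, v in combinations(pairs, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def separates(adj, S, u, v):
    """True if u and v lie in different components once S is removed."""
    S = set(S)
    if u in S or v in S:
        return False
    seen, stack = {u}, [u]
    while stack:
        x = stack.pop()
        for w in adj[x]:
            if w not in S and w not in seen:
                seen.add(w)
                stack.append(w)
    return v not in seen


def crosses(adj, S, T):
    """S crosses T if S separates some pair of nodes of T."""
    return any(separates(adj, S, u, v) for u, v in combinations(sorted(T), 2))


G = build_graph([(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)])
S1, S2 = {(2, 1), (3, 2)}, {(1, 0), (1, 1)}
S3, S4 = {(2, 1), (1, 1)}, {(1, 0), (3, 2)}
print(crosses(G, S1, S2), crosses(G, S2, S1))
print(crosses(G, S3, S4), crosses(G, S4, S3))
```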

Page 42: Two Solutions in Search of Killer Apps.

The structure of a minimal triangulation in G

Completing a minimal separator K means adding all the missing edges to make K a clique.

Capstone Theorem (P,S 1997): Every minimal triangulation of G is obtained by completing each minimal separator in a maximal set of pairwise non-crossing minimal separators of G. Conversely, completing every minimal separator in a maximal set of pairwise non-crossing minimal separators yields a triangulation of G.

Page 43: Two Solutions in Search of Killer Apps.

A legal triangulation

Table M (columns are characters 1, 2, 3):

    A: 0 0 2
    B: 0 1 0
    C: 1 1 1
    D: 1 2 2

[Figure: G’(M) on the nodes 1,0 1,1 2,0 2,1 2,2 3,0 3,1 3,2, with the cliques A, B, C, D, X and Y marked.]

There are 6 minimal separators. 5 are pairwise non-crossing.

Page 44: Two Solutions in Search of Killer Apps.

Back to Perfect Phylogeny

A minimal separator S in the partition intersection graph G(M) is called legal if it does not contain two nodes of the same color and illegal if it does.

The P,S Theorem can be used to prove the Main New Results

Theorem 1: There is a perfect phylogeny for M, even if M is missing data, if and only if there is a set of pairwise non-crossing legal minimal separators in G(M) that separate every mono-chromatic pair of nodes in G(M).

Page 45: Two Solutions in Search of Killer Apps.

A legal triangulation

Table M (columns are characters 1, 2, 3):

    A: 0 0 2
    B: 0 1 0
    C: 1 1 1
    D: 1 2 2

[Figure: G’(M) on the nodes 1,0 1,1 2,0 2,1 2,2 3,0 3,1 3,2, with the cliques A, B, C, D, X and Y marked.]

Page 46: Two Solutions in Search of Killer Apps.

Corollaries

Cor 1: If there is a mono-chromatic pair of nodes in G(M) that is not separated by any legal minimal separator, then M has no perfect phylogeny.

Cor 2: If G(M) has no illegal minimal separators, then M has a perfect phylogeny.

Cor 3: If every mono-chromatic pair of nodes is separated by some legal minimal separator, and no legal minimal separators cross, then M has a perfect phylogeny.

Page 47: Two Solutions in Search of Killer Apps.

How to solve the existence and missing data problems

Given M, find all minimal separators in G(M); determine which are legal and which are illegal; for each legal minimal separator, determine which mono-chromatic pairs of nodes it separates.

Determine if any of the Corollaries hold.

If not, set up and solve an integer linear program to find a set Q of pairwise non-crossing legal minimal separators that separate every mono-chromatic pair of nodes in G(M).
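For a graph as small as the four-row example, the first two steps can be sketched by brute force. The code below is an illustration only: it enumerates vertex subsets rather than using the fast separator-listing algorithms the talk relies on, and it stops at the Corollary 3 check instead of building an ILP. It uses the standard characterization that S is a minimal separator exactly when removing S leaves at least two components in which every node of S has a neighbor.

```python
from itertools import combinations


def build_graph(M):
    adj = {}
    for row in M:
        pairs = [(c + 1, s) for c, s in enumerate(row)]
        for p in pairs:
            adj.setdefault(p, set())
        for u, v in combinations(pairs, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def components(adj, removed):
    """Connected components of the graph after deleting the nodes in `removed`."""
    left, comps = set(adj) - set(removed), []
    while left:
        start = left.pop()
        comp, stack = {start}, [start]
        while stack:
            x = stack.pop()
            for w in adj[x]:
                if w in left:
                    left.remove(w)
                    comp.add(w)
                    stack.append(w)
        comps.append(comp)
    return comps


def minimal_separators(adj):
    """Brute force: S qualifies if at least two components touch every node of S."""
    found = []
    for r in range(1, len(adj)):
        for S in combinations(sorted(adj), r):
            full = [C for C in components(adj, S) if all(adj[s] & C for s in S)]
            if len(full) >= 2:
                found.append(frozenset(S))
    return found


def separates(adj, S, u, v):
    if u in S or v in S:
        return False
    return not any(u in C and v in C for C in components(adj, S))


def crosses(adj, S, T):
    return any(separates(adj, S, u, v) for u, v in combinations(sorted(T), 2))


G = build_graph([(0, 0, 2), (0, 1, 0), (1, 1, 1), (1, 2, 2)])
seps = minimal_separators(G)
legal = [S for S in seps if len({c for c, _ in S}) == len(S)]
mono_pairs = [(u, v) for u, v in combinations(sorted(G), 2) if u[0] == v[0]]

all_separated = all(any(separates(G, S, u, v) for S in legal) for u, v in mono_pairs)
none_cross = not any(crosses(G, S, T) for S, T in combinations(legal, 2))
print(len(seps), len(legal), all_separated, none_cross)
```

On this example both conditions of Corollary 3 hold, so no integer program is needed.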

Page 48: Two Solutions in Search of Killer Apps.

Conceptually nice, but

Does it work in practice?

Page 49: Two Solutions in Search of Killer Apps.

It works surprisingly (shockingly) well

Simulations based on ms, with datasets of sizes that are characteristic of many current applications in phylogenetics and population genetics - but not genomic scale or tree-of-life scale.

Page 50: Two Solutions in Search of Killer Apps.

Surprising empirical results

The minimal separators are found quickly by existing algorithms.

The numbers of minimal separators are small.

There are few crossing minimal separators.

Until a large percentage of the data is missing, most problems are solved by the Corollaries, without the need for an ILP.

The created ILPs are tiny.

The ILPs solve quickly - all have solved in 0.00 CPLEX-reported seconds (CPLEX 11 on a 2.5 GHz machine). Most solve in the CPLEX pre-processor.

Page 51: Two Solutions in Search of Killer Apps.
Page 52: Two Solutions in Search of Killer Apps.

So

Although the chordal graph approach may at first seem impractical, it works on a large range of data, with sizes and degrees of missing data that are typical of current phylogenetics problems. So missing data can be handled.

But what are the biological applications of Multi-State Perfect Phylogeny?

Page 53: Two Solutions in Search of Killer Apps.

Application Requirements

• Must be multi-state data - ubiquitous
• The probability of mutating to any given state more than once must be very small - less common.

Page 54: Two Solutions in Search of Killer Apps.

Possible applications

• Micro-satellite data
• Transposable elements as characters and positions of elements as states
• Discretized quantitative traits
• Infinite alleles model

Page 55: Two Solutions in Search of Killer Apps.

All software to replicate these results will be available on my website by the time of Recomb 2009