Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

93

description

Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination. Dan Gusfield UC Davis. Different parts of this work are joint with Satish Eddhu, Charles Langley, Dean Hickerson, Yun Song, Yufeng Wu. Reconstructing the Evolution of Binary Bio-Sequences. - PowerPoint PPT Presentation

Transcript of Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Page 1: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination
Page 2: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with

RecombinationDan Gusfield

UC DavisDifferent parts of this work are joint with Satish Eddhu, CharlesLangley, Dean Hickerson, Yun Song, Yufeng Wu.

Page 3: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Reconstructing the Evolution of Binary Bio-Sequences

• Perfect Phylogeny (tree) model• Phylogenetic Networks (DAG) with recombination• Phylogenetic Networks with disjoint cycles: Galled-

Trees• Phylogenetic Networks with unconstrained cycles:

Blobbed-Trees• Combinatorial Structure and Efficient Algorithms• Efficiently Computed Lower and Upper bounds on

the number of recombinations needed

Page 4: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Geneological or Phylogenetic Networks

• The major biological motivation comes from genetics and attempts to reconstruct the history of recombination in populations.

• Also relates to phylogenetic-based haplotyping.• Some of the algorithmic and mathematical results

have phylogenetic applications, for example in hybrid speciation, lateral gene transfer.

Page 5: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Why binary?

Single nucleotide polymorphisms are the keydata that we use. SNPs imply that the sequences are binary, and also that the order of the sites isfixed (on a chromosome). This is in contrast toa set of taxonomic characters, which may be binary, but where the given orderis arbitrary.

Page 6: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

The Perfect Phylogeny Model for binary sequences

00000

1

2

4

3

510100

1000001011

00010

01010

12345sitesAncestral sequence

Extant sequences at the leaves

Site mutations on edgesThe tree derives the set M:1010010000010110101000010

Page 7: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

The converse problem

Given a set of sequences M we want to find, if possible,a perfect phylogeny that derives M. Remember thateach site can change state from 0 to 1 only once.

n will denote the number of sequences in M, and m willdenote the length of each sequence in M.

n

m

M 011010011110010110101011

Page 8: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Classic NASC: Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs:

0,0 and 0,1 and 1,0 and 1,1

This is the 4-Gamete Test

When can a set of sequences be derived on a perfect phylogeny?

Page 9: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

A richer model

00000

1

2

4

3

510100

1000001011

00010

01010

12345101001000001011010100001010101 added

pair 4, 5 fails the threegamete-test. The sites 4, 5``conflict”.

Real sequence histories often involve recombination.

Page 10: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

10100 01011

5

10101

The first 4 sites come from P (Prefix) and the sitesfrom 5 onward come from S (Suffix).

P S

Sequence Recombination

A recombination of P and S at recombination point 5.

Single crossover recombination

Page 11: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Network with Recombination

00000

1

2

4

3

510100

1000001011

00010

01010

12345101001000001011010100001010101 new

10101

The previous tree with onerecombination event now derivesall the sequences.

5

P

S

Page 12: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Multiple Crossover Recombination

4-crossovers

2-crossovers = ``gene conversion”

Page 13: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Elements of a Phylogenetic Network (single crossover

recombination)• Directed acyclic graph. • Integers from 1 to m written on the edges. Each integer

written only once. These represent mutations.• A choice of ancestral sequence at the root.• Every non-root node is labeled by a sequence obtained

from its parent(s) and any edge label on the edge into it.• A node with two edges into it is a ``recombination node”,

with a recombination point r. One parent is P and one is S.• The network derives the sequences that label the leaves.

Page 14: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

A Phylogenetic Network

00000

52

3

3

4S

p

PS

1

4

a:00010

b:10010

c:00100

10010

01100

d:10100

e:01100

00101

01101

f:01101

g:00101

00100

00010

Page 15: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Which Phylogenetic Networks are meaningful?

Given M we want a phylogenetic network that derives M, but which one?

A: A perfect phylogeny (tree) if possible. As little deviationfrom a tree, if a tree is not possible. Use as little recombinationor gene-conversion as possible.

Page 16: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Minimizing recombinations

• Any set M of sequences can be generated by a phylogenetic network with enough recombinations, and one mutation per site. This is not interesting or useful.

• However, the number of (observable) recombinations is small in realistic sets of sequences. ``Observable” depends on n and m relative to the number of recombinations.

• Two algorithmic problems: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations (Hein’s problem). Find a network generating M that has some biologically-motivated structural properties.

Page 17: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Minimization is NP-hard

The problem of finding a phylogenetic network that creates a given set of sequences M, and minimizes the number of recombinations, is NP-hard. (Wang et al 2000) (Semple 2004)

Wang et al. explored the problem of finding a phylogenetic network where the recombination cycles are required to be node disjoint, if possible.

They gave a sufficient but not a necessary condition to recognize cases when this is possible. O(nm + n^4) time.

Page 18: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Recombination Cycles

• In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.

• The cycle specified by those two paths is called a ``recombination cycle”.

Page 19: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Galled-Trees

A recombination cycle in a phylogenetic network is called a “gall” if it shares no node with any other recombination cycle.

A phylogenetic network is called a “galled-tree” if every recombination cycle is a gall.

Page 20: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

4

1

3

2 5

a: 00010

b: 10010

d: 10100

c: 00100

e: 01100

f: 01101

g: 00101

A galled-tree generatingthe sequences generatedby the prior network.

3

4

p s

ps

Page 21: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination
Page 22: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Sales pitch for Galled-TreesGalled-trees represent a small deviation from true trees.

There are sufficient applications where it is plausible that a galled tree exists that generates the sequences.

Observable recombinations tend to be recent; block structure of human DNA; recombination is sparse, so the true history of observable

recombinations may be a galled-tree.

The number of recombinations is never more than m/2. Moreover,when M can be derived on a galled-tree, the number of recombinations used is the minimum number over any phylogenetic network, even if multiple cross-overs at a recombination event are counted as a single recombination.

A galled-tree for M is ``almost unique” - implications for reconstructing thecorrect history.

Page 23: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Old (Aug. 2003) Results

• O(nm + n^3)-time algorithm to determine whether or not M can be derived on a galled-tree with all-0 ancestral sequence.

• Proof that the galled-tree produced by the algorithm is a “nearly-unique” solution.

• Proof that the galled-tree (if one exists) produced by the algorithm minimizes the number of recombinations used, over all phylogenetic-networks with all-0 ancestral sequence.

Page 24: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

New work

We derive the galled-tree results in a more general setting that addresses unconstrained

recombination cycles and multiplecrossover recombination. This also solves theproblem of finding the ``most tree-like” network when a perfect phylogeny is not possible. In this algorithm, no ancestral

sequence is known in advance.

Page 25: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Blobbed-trees: generalizing galled-trees

• In a phylogenetic network a maximal set of intersecting cycles is called a blob.

• Contracting each blob results in a directed, rooted tree, otherwise one of the “blobs” was not maximal.

• So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree.

The blobs are the non-tree-like parts of the network.

Page 26: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Ugly tanglednetwork insidethe blob.

Every network is a tree of blobs. How do the tree partsand the blobs relate?

How can we exploitthis relationship?

Page 27: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Incompatible Sites

A pair of sites (columns) of M that fail the

4-gametes test are said to be incompatible.

A site that is not in such a pair is compatible.

Page 28: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

0 0 0 1 01 0 0 1 00 0 1 0 01 0 1 0 00 1 1 0 00 1 1 0 10 0 1 0 1

1 2 3 4 5abcdefg

1 3

4

2 5

Two nodes are connected iff the pairof sites are incompatible, i.e, fail the 4-gamete test.

Incompatibility Graph

M

THE MAIN TOOL: We represent the pairwise incompatibilities in a incompatibility graph.

Page 29: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

The connected components of G(M) are very informative

• The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network (Bafna, Bansal; Gusfield, Hickerson).

• When each blob is a single-cycle (galled-tree case) all the incompatible sites in a blob must come from a single connected component C, and that blob must contain all the sites from C. Compatible sites need not be inside any blob. (Gusfield et al 2003-5)

Page 30: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Simple Fact

If sites two sites i and j are incompatible, then the sites must be together on some recombination cycle whose recombination point is between the two sites i and j.

(This is a general fact for all phylogenetic networks.)

Ex: In the prior example, sites 1, 3 are incompatible, as are 1, 4; as are 2, 5.

Page 31: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

A Phylogenetic Network

00000

52

3

3

4S

p

PS

1

4

a:00010

b:10010

c:00100

10010

01100

d:10100

e:01100

00101

01101

f:01101

g:00101

00100

00010

Page 32: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Simple Consequence of the simple fact

All sites on the same (non-trivial) connected component of the incompatibility graph

must be on the same blob in any blobbed-tree.Follows by transitivity.

So we can’t subdivide a blob into a tree-like structure if it only contains sites from a single connected component of the incompatibility graph.

Page 33: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Key Result about Galls: For galls, the converse of the simple

consequence is also true.Two sites that are in different (non-trivial) connected

components cannot be placed on the same gall inany phylogenetic network for M.

Hence, in a galled-tree T for M each gall contains all and only the sites of one (non-trivial) connected component of the incompatibility graph. All compatible sites can be put on edges outside of the galls.

This is the key to the galled-tree solution.

Page 34: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

4

1

3

2 5

a: 00010

b: 10010

d: 10100

c: 00100

e: 01100

f: 01101

g: 00101

A galled-tree generatingthe sequences generatedby the prior network.

2

4

p s

ps

1 3

4

2 5

Incompatibility Graph

Page 35: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Motivated by the one-one correspondence between gallsand non-trivial connected components, we ask:

To what extent does this one-one correspondence holdin general blobbed-trees, i.e. with no constraints onhow recombination cycles interweave?

Page 36: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

The Decomposition Theorem (Recomb 2005)

For any set of sequences M, there is a blobbed-tree T(M) that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob.

A blobbed-tree with this structure is called fully-decomposed.

Page 37: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

General Structure

So, for any set of sequences M, there is a phylogenetic network T(M) that is fully decomposed.

Moreover, the tree part of T(M) is unique. And it is easy to find the tree part.

Page 38: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

4

1

3

2 5

a: 00010

b: 10010

d: 10100

c: 00100

e: 01100

f: 01101

g: 00101

A fully-decomposed network for the sequences generated by the prior network.

2

4

p s

ps

1 3

4

2 5

Incompatibility Graph

Page 39: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Moreover

Since all sites from a single connected component must be together on some blob in any phylogenetic network, no network is more decomposed than the fully decomposed network.

Page 40: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Proof Ideas

Let C be a connected component of G(M). Define M[C] as the sequences in M restricted to the sites in C.

Page 41: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

0 0 0 1 01 0 0 1 00 0 1 0 01 0 1 0 00 1 1 0 00 1 1 0 10 0 1 0 1

1 2 3 4 5abcdefg

1 3

4

M

a 0 0 1b 1 0 1c 0 1 0d 1 1 0e 0 1 0f 0 1 0g 0 1 0

1 3 4

C1

2 5

C2

abcdefg

0 00 00 00 01 01 10 1

2 5

B1 B2

M[C1] M[C2]

Page 42: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Faux Proof

Pick one site from each connected component C in G(M) to ``represent” C. No pair of those sites are incompatible, so by the NASC for a perfect phylogeny, there will be a perfect phylogeny T for the sites. Expand each node to a network generating the sequences in M[C].

Incorrect, because the structure of T can be wrong. We need to use information about all the sites in each C.

Page 43: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

a 0 0 1b 1 0 1c 0 1 0d 1 1 0e 0 1 0f 0 1 0g 01 0

1 3 4abcdefg

0 00 00 00 01 01 10 1

2 5

M[C1] M[C2]

abcdefg

1 0 0 0 1 0 0 00 1 0 0 1 0 0 00 0 1 0 1 0 0 00 0 0 1 1 0 0 00 0 1 0 0 1 0 00 0 1 0 0 0 1 00 0 1 0 0 0 0 1

W

1 2 3 4 5 6 7 8

1234333

5555678

Now for each connected component C in G(M), call each distinct sequence in M[C] a supercharacter, and let W be the indicatormatrix for the supercharacters. So W indicates which rows of Mcontain which particular supercharacters.

Page 44: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Proof Ideas

Lemma: No pair of supercharacters are incompatible.

So by the NASC for a Perfect Phylogeny, there is a unique perfect phylogeny T for W.

Page 45: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Proof Ideas

For each connected component C of G(M), all supercharacters that originate from C label edges in T that are incident with one single node v[C] in T. So, if we expand each node v[C] to be a network that generates the supercharacters from C (the sequences in M[C]), and connect each network correctly to the edges in T, the resulting network is a fully-decomposed blobbed-tree that generates M.

Page 46: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Algorithmically, T is easy to find and is the tree resulting from contracting each blob in the fully-decomposed blobbed-tree T(M) for M. T can be constructed from M in O(nm^2) time.

Page 47: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Broader Biological Applications

Our major interest is in recombination, but the proof of the decomposition theorem does not explicitly use recombination. So it holds for whatever biological phenomena caused the incompatibility of sites. For example, back or recurrent mutation, gene-conversion, lateral gene transfer etc.

Page 48: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

The main open question

The Decomposition Theorem says there is always a fully-decomposed blobbed-tree for any M, but

Is there always a fully-decomposed blobbed-tree that minimizes the number of recombinations over all possible phylogenetic networks for M?

Page 49: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

We conjecture the answer is yes.

If true, then we can decompose the problem of minimizing the total number of recombinations into separate problems on each connected component, and also find lower bounds on the needed number of recombinations, in each component separately, adding those bounds to get a valid overall lower bound for M. This computation of lower bounds is known to be correct for certain lower bounds (Bafna, Bansal 2004).

Page 50: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Progress on Proving the Conjecture

Definition: If N is a phylogenetic network for M, and a node v in N is labeled with a sequence in M, then v is said to be visible in N.

Theorem: If every node in N is visible, then there is a fully-decomposed network for M where the number of recombinations is at most the number in N.

Corollary: The conjecture is true for any M where the Haplotype or History lower bounds (S. Myers) on the number of recombinations needed to generate M, is tight.

Page 51: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Optimality of Galled-Trees

Theorem: (G,H,B,B) The minimum number of recombination nodesin any phylogenetic network for M is at least the number of non-trivial connected components of the incompatibility graph.

Hence, when the sequences on each blob on T(M) can be generatedwith a single recombination node, the blobbed-tree minimizesthe number of recombination nodes over all phylogeneticnetworks and all choices of ancestral sequence. This solvesthe root-unknown galled-tree problem in polynomial time.Code is on the web.

Page 52: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

The number of arrangements on agall (all-0 ancestral sequence)

By analysing the algorithm to layout the sites on a gall (not discussed here), one canprove that the number of arrangements of any gall isat most three, and this happens only if the gall has two sites.

If the gall has more than two sites, then the number ofarrangements is at most two.

If the gall has four or more sites, with at least two siteson each side of the recombination point (not the side ofthe gall) then the arrangement is forced and unique.

Page 53: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Computing close bounds on the minimum number of

recombinationsDan Gusfield UCD

Y. Song, Y. F. Wu, D. Gusfield (ISMB2005)

D. Gusfield, D. Hickerson (Dis. Appl. Math 2005)

D. Gusfield, V. Bansal (Recomb 2005)

Page 54: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

The grandfather of all lower bounds - HK 1985

• Arrange the nodes of the incompatibility graph on the line in order that the sites appear in the sequence. This bound requires a linear order.

• The HK bound is the minimum number of vertical lines needed to cut every edge in the incompatibility graph. Weak bound, but widely used - not only to bound the number of recombinations, but also to suggest their locations.

Page 55: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Justification for HK

If two sites are incompatible, there must have been some recombination where the crossover point is between the two sites.

Page 56: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

1 2 3 4 5

HK Lower Bound

Page 57: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

1 2 3 4 5

HK Lower Bound = 1

Page 58: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

More general view of HK

Given a set of intervals on the line, and for each interval I, a number N(I), define the composite problem: Find the minimum number of vertical lines so that every interval I intersects at least N(I) of the vertical lines.

In HK, each incompatibility defines an interval I where N(I) = 1.

The composite problem is easy to solve by a left-to-right myopicplacement of vertical lines: Sort the intervals by right end-point;Process the intervals left to right in that order; when the rightendpoint of an interval I is reached, place there (if needed) additional vertical so that N(I) lines intersect I.

Page 59: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

This general approach is called the Composite Method(Simon Myers 2002).

If each N(I) is a ``local” lower bound on the number ofrecombinations needed in interval I, then the solution tothe composite problem is a valid lower bound for thefull sequences. The resulting bound is called the compositebound given the local bounds.

Page 60: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Haplotype Bound (Simon Myers)

• Rh = Number of distinct sequences (rows) - Number of distinct sites (columns) -1 (folklore)

• Before computing Rh, remove any site that is compatible with all other sites. A valid lower bound results - generally increases the bound.

• Generally really bad bound, often negative, when used on large intervals, but Very Good when used as local bounds in the Composite Interval Method, and other methods.

Page 61: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Composite Interval Method using RH bounds

Compute Rh separately for each of the C(m,2) intervals of the m sites; let N(I) = Rh(I) be the local lower bound for interval I. Then compute the composite bound using these local bounds.

Polynomial time and gives bounds that often double HK in our simulations.

Page 62: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Composite Subset Method (Myers)

• Let S be subset of sites, and Rh(S) be the haplotype bound for subset S. If the leftmost site in S is L and the rightmost site in S is R, then use Rh(S) as a local bound N(I) for interval I = [S,L].

• Compute Rh(S) on many subsets, and then solve the composite problem to find a composite bound.

Page 63: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

RecMin (Myers)

• World Champion Lower Bound Program (until now). Often RecMin gives a bound three times as large as HK.

• Computes Rh on subsets of sites, but limits the size and the span of the subsets. Default parameters are s = 6, w = 15 (s = size, w = span).

• Generally, impractical to set s = w = m, so generally one doesn’t know if increasing the parameters would increase the bound. (example: Myers bound of 70 on the LPL data).

Page 64: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Optimal RecMin Bound (ORB)

• The Optimal RecMin Bound is the lower bound that RecMin would produce if both parameters were set to their maximum values (s = w= m).

• In general, RecMin cannot compute (in practical time) the ORB.

• Practical computation of the ORB is our first contribution.

Page 65: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Computing the ORB

• Gross Idea: For each interval I, use ILP to find a subset of sites S that maximizes Rh(S) over all subsets in interval I. Call the result Opt(I).

• Set N(I) = Opt(I), and compute the composite bound using those local bounds.

• The composite bound is the ORB. -- the result one would get by using all 2^m subsets in RecMin, with s = w = m.

Page 66: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

We have moved from doing an exponential number of simplecomputations (computing Rh for each subset), to solving a quadratic number of (possibly expensive) ILPs.

Is this a good trade-off in practice? Our experience - very much so!

Page 67: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

How to compute Opt(I) by ILPCreate one variable Xi for each row i; one variable Yj for each column j in interval I. All variables are 0/1 variables.

Define S(i,i’) as the set of columns where rows i and i’ have different values. Each column in S(i,i’) is a witness that rows i and i’ differ.

For each pair of rows (i,i’), create the constraint: Xi + Xi’ <= 1 + ∑ [Yj: jin S(i,i’)]

Objective Function: Maximize ∑ Xi - ∑ Yj -1

Page 68: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Alternate way to compute Opt(I) by ILP

Create one variable Yj for each column jin interval I. All variables are 0/1 variables.

S(i,i’) as before.

For each pair of rows (i,i’) create the constraint: 1 <= ∑ [Yj: jin S(i,i’)]

Objective Function: Maximize N - ∑(Yj) -1

Finds the smallest number of columns to distinguish the rows.

First remove any duplicate rows. Let N be the number of rowsremaining.

Page 69: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Two critical tricks

1) Use the second ILP formulation to compute Opt(I). It solves much faster than the first (why?)

2) Reduce the number of intervals where Opt(I) is computed:

If the solution to Opt(I) uses columns thatonly span interval L, then there is noneed to find Opt(I’) in any interval I’ containing L. Each ILP computationdirectly spawns at most 4 other ILPs. Apply this idea recursively.

I

L

Page 70: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

With the second trick we need to find Opt(I) for only 0.5% - 35% of all the C(m,2) intervals (empirical result). Surprisingly fast in practice (with either the GNU solver or CPLEX).

Page 71: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Bounds Higher Than the Optimal RecMin Bound

• Often the ORB underestimates the true minimum, e.g. Kreitman’s ADH data: 6 v. 7

• How to derive sharper bound? Idea: In the composite method, check if each local bound N(I) = Opt(I) is tight, and if not, increase N(I) by one.

• Small increases in local bounds can increase the composite bound.

Page 72: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Bounds Sharper Than Optimal RecMin Bound

• A set of sequences M is called self-derivable if there is a network generating M where the sequence at every node (leaf and intermediate) is in M.

• Observation: The haplotype bound for a set of sequences is tight if and only if the sequences are self-derivable.

• So for each interval I where Opt(I) is computed, we check self-derivability of the sequences induced by columns S*, where S* are the columns specified by Opt(I). Increase N(I) by 1 if the sequences are not self-derivable.

Page 73: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Algorithm for Self-Derivability

• Solution is easy when there are no mutations --only recombinations are allowed and one initial pair of sequences is chosen as ``reached”.

• Two reached sequences S1 and S2 can reach a third sequence S3, if S3 can be created by recombining S1 and S2.

• Do BFS to see if all sequences can be reached by successive application of the ``reach operation”.

• Clearly polynomial time and can be optimized with suffix-trees etc. (Kececiouglu, Gusfield)

Page 74: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Self-derivability Test with mutations allowed

• For each site i, construct set MUT(i) of sequence pairs (S1, S2) in M where S1 and S2 differ at only site.

• Try each sequence in M as root (which is the only reached sequence initially).

• For each root, enumerate all ways of choosing exactly one ordered pair of sequences (Sp, Sq) from each MUT(i). Sp is allowed to ``reach” Sq.

• Run the prior self-derivability algorithm with these new permitted reach relations, to test if all sequences in M can be reached. If so, then M is self-derivable, otherwise it is not.

Page 75: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Checking if N(I) should be increased by two

If the set of sequences is not self-derivable,we can test if adding one new sequence makes

it self-derivable.the number of candidate sequences is

polynomial and for each one added to M we check self-derivability. N(I) should be increased by two if none of these sets of sequences is self-derivable.

Page 76: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Program HapBound

• HapBound –S. Checks each Opt(I) subset for self-derivability. Increase N(I) by 1 or 2 if possible. This often beats ORB and is still practical for most datasets.

• HapBound –M. Explicitly examine each interval directly for self-derivability. Increase local bound if possible. Derives lower bound of 7 for Kreitman’s ADH data in this mode.

Page 77: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

HapBound vs. RecMin on LPL from Clark et al.

Program Lower Bound Time

RecMin (default) 59 3s

RecMin –s 25 –w 25 75 7944s

RecMin –s 48 –w 48 No result 5 days

HapBound 75 31s

HapBound -S 78 1643s

2 Ghz PC

Page 78: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Example where RecMin has difficulity in Finding Optimal Bound

on a 25 by 376 Data MatrixProgram Bound Time

RecMin default 36 1s

RecMin –s 30 –w 30 42 3m 25s

RecMin –s 35 –w 35 43 24m 2s

RecMin –s 40 –w 40 43 2h 9m 4s

RecMin –s 45 –w 45 43 10h 20m 59s

HapBound 44 2m 59s

HapBound -S 48 39m 30s

Page 79: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Frequency of HapBound –S Bound Sharper Than Optimal

RecMin Boundms param. Rho=1 Rho=5 Rho=10 Rho=20

Theta=1 0.0% 0.4% 0.5% 1.5%

Theta=5 0.7% 4.0% 10.4% 27.0%

Theta=10 1.4% 9.2% 17.8% 40.4%

Theta=20 1.4% 10.5% 27.8% 45.4%

For every ms parameters, 1000 data sets are used.

Page 80: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Computing Upper Bounds

• The method is an adaptation of the ``history” lower bound (Myers).

• A non-informative column is one with fewer than two 0’s or fewer than two 1’s.

Page 81: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Single History Computation

1) Set W = 02) Collapse identical rows together, and remove non-

informative columns. Repeat until neither is possible.3) Let A be the data at this point. If A is empty, stop, else

remove some row r from A, and set W = W + 1. Go to step 2).

Note that the choice of r is arbitrary in Step 3), so the resulting W can vary.

Page 82: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

History Lower Bound

Theorem (Myers) Let W* be the minimum W obtained from all possible single history computations.

Then W* is a valid lower bound on the number of recombinations needed.

Naïve time: theta(n!) (RecMin), but can bereduced to theta(2^n) (Bafna, Bansal).

Page 83: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Converting the History Lower Bound to an Upper Bound

• Given a set of rows A and a single row r, define w(r | A - r) as the minimum number of recombinations needed to create r from A-r (well defined in our application).

• w(r | A-r) can be computed in linear time by a greedy-type algorithm.

Page 84: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Upper Bound Computation

1) Set W = 02) Collapse identical rows together, and remove non-

informative columns. Repeat until neither is possible.3) Let A be the data at this point. If A is empty, stop, else

remove some row r from A, and set W = W + W(r | A-r). Go to step 2).

Note that the choice of r is arbitrary in Step 3), so the resulting W can vary.

This is the Single History Computation, with a change instep 3).

Page 85: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Note, even a single execution of the upper bound computation gives a valid upper bound, and a way to construct a network. This is in contrast to the History Bound which requires finding the minimum W over all histories.

However, we also would like to find the lowest upper bound possible with this approach. This can be done in O(2^n) time by DP. In practice, we can use branch and bound to find the lowest upper bound, but we have also found that branching on the best local choice, or randomizing gives good results.

Page 86: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Branch and Bound

(Branching) In Step 3) choose r to minimize w(r | A-r) + L(A-r), where L(A-r) is some

fast lower bound on the number of recombinations needed for the set A-r. Even HK is good for this purpose.

(Bounding) Let C be the min for an full solution found so far; If W + L(A) >= C, then backtrack.

Page 87: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Kreitman’s 1983 ADH Data

• 11 sequences, 43 segregating sites

• Both HapBound and SHRUB took only a fraction of a second to analyze this data.

• Both produced 7 for the number of detected recombination events

Therefore, independently of all other methods, our lower and upper bound methods together imply that 7 is the minimum number of recombination events.

Page 88: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

A minimal ARG for Kreitman’s data

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

SHRUB produces code that can be input to an open source program to display the constructed ARG

Page 89: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

The Human LPL Data (Nickerson et al. 1998)

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Our new lower and upper bounds

Optimal RecMin Bounds

(We ignored insertion/deletion, unphased sites, and sites with missing data.)

(88 Sequences, 88 sites)

Page 90: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Study on simulated data: Exact-Match frequency for varying parameters

= Scaled mutation rate = Scaled recombination rate

• n = Number of sequences

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

n = 25

For < 5, our lower and upper bounds match more than 95% of the time.

Used Hudson’s MS to generate1000 simulated datasets for each pair of and

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

n = 15

Page 91: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Exact-Match frequency for varying number of sequences

Match frequency does not depend on n as much as it does on or

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

Page 92: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

A closer look at the deviation

Average ratio of lower bound to upper bound when they do not match

QuickTime™ and aTIFF (LZW) decompressor

are needed to see this picture.

For n = 25:

The numerical difference between lower and upper bounds grows as or increases, but their ratio is more stable.

Page 93: Reconstructing Ancestral Recombination Graphs - or Phylogenetic Networks with Recombination

Papers and Software

wwwcsif.cs.ucdavis.edu/~gusfield/