ReCombinatorics: Phylogenetic Networks with Recombination

ReCombinatorics: Phylogenetic Networks with Recombination CPM, June 18, 2008 Pisa, Italy Two recent results and Two Open Questions


Page 1: ReCombinatorics: Phylogenetic Networks with Recombination

ReCombinatorics: Phylogenetic Networks with Recombination

CPM, June 18, 2008, Pisa, Italy

Two recent results and Two Open Questions

Page 2: ReCombinatorics: Phylogenetic Networks with Recombination

What is population genomics?

• The Human genome “sequence” is done.

• Now we want to sequence many individuals in a population to correlate similarities and differences in their sequences with genetic traits (e.g. disease or disease susceptibility).

• Presently, we can’t sequence large numbers of individuals, but we can sample the sequences at SNP sites.

Page 3: ReCombinatorics: Phylogenetic Networks with Recombination

SNP Data

• A SNP is a Single Nucleotide Polymorphism - a site in the genome where two different nucleotides appear with sufficient frequency in the population (say each with 5% frequency or more).

• SNP maps have been compiled with a density of about 1 site per 1000.

• SNP data is what is mostly collected in populations - it is much cheaper to collect than full sequence data, and focuses on variation in the population, which is what is of interest.
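The frequency criterion in the SNP definition can be sketched in a few lines. This is only an illustration: the function name `is_snp`, the 5% default threshold, and the representation of a site as a string of observed nucleotides (one per sampled individual) are assumptions, not part of the talk.

```python
from collections import Counter

def is_snp(column, min_freq=0.05):
    """Return True if exactly two nucleotides occur at this site,
    each with frequency at least min_freq (hypothetical helper)."""
    counts = Counter(column)
    if len(counts) != 2:
        return False
    n = len(column)
    return all(count / n >= min_freq for count in counts.values())

# A site where 10% of the sample carries G qualifies; 1% does not.
common = is_snp("A" * 90 + "G" * 10)
rare = is_snp("A" * 99 + "G")
```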

Page 4: ReCombinatorics: Phylogenetic Networks with Recombination

Haplotype Map Project: HAPMAP

• NIH-led project ($100M) to find common SNP haplotypes (“SNP sequences”) in the Human population.

• Association mapping: HAPMAP used to try to associate genetically influenced diseases with specific SNP haplotypes, to either find causal haplotypes, or to find the region near causal mutations.

• The key to the logic of Association mapping is historical recombination in populations. Nature has done the experiments, now we try to make sense of the results.

Page 5: ReCombinatorics: Phylogenetic Networks with Recombination

The Perfect Phylogeny Model for SNP sequences

[Figure: a rooted tree over sites 1 2 3 4 5. The ancestral sequence 00000 is at the root, site mutations 1 to 5 label the edges, and the extant sequences are at the leaves.]

The tree derives the set M:
10100
10000
01011
01010
00010

Only one mutation per site is allowed.
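As a rough sketch of how a perfect phylogeny derives M, the leaves can be generated by applying the mutations met on each root-to-leaf path. The path lists in `LEAF_PATHS` are an assumption: they were reconstructed from the five sequences in M and the all-zero ancestral sequence, not read off the original figure.

```python
ROOT = "00000"

# Sites mutated along the path from the root to each leaf (assumed topology).
# Paths that share a site share the edge carrying that mutation, so each
# site still mutates only once in the tree.
LEAF_PATHS = [[1], [1, 3], [4], [4, 2], [4, 2, 5]]

def apply_mutations(root, sites):
    """Flip each listed site (1-indexed) from 0 to 1, one mutation per site."""
    seq = list(root)
    for site in sites:
        if seq[site - 1] != "0":
            raise ValueError(f"site {site} mutated twice on one path")
        seq[site - 1] = "1"
    return "".join(seq)

leaves = [apply_mutations(ROOT, path) for path in LEAF_PATHS]
```

Under that assumed topology, `leaves` contains exactly the five sequences of M.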

Page 6: ReCombinatorics: Phylogenetic Networks with Recombination

When can a set of sequences be derived on a perfect phylogeny?

Classic NASC (necessary and sufficient condition): Arrange the sequences in a matrix. Then (with no duplicate columns), the sequences can be generated on a unique perfect phylogeny if and only if no two columns (sites) contain all four pairs:

0,0 and 0,1 and 1,0 and 1,1

This is the 4-Gamete Test.
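The 4-gamete test is simple to state in code. The sketch below assumes the sequences are given as equal-length strings over 0 and 1; the function names are illustrative, not taken from any of the talk's software.

```python
from itertools import combinations

def incompatible(matrix, i, j):
    """True if columns i and j (0-indexed) contain all four pairs 00, 01, 10, 11."""
    return len({(row[i], row[j]) for row in matrix}) == 4

def passes_four_gamete_test(matrix):
    """True if no pair of columns contains all four gametes."""
    ncols = len(matrix[0])
    return not any(incompatible(matrix, i, j)
                   for i, j in combinations(range(ncols), 2))

M = ["10100", "10000", "01011", "01010", "00010"]
tree_like = passes_four_gamete_test(M)                   # M from the tree slide
with_extra = passes_four_gamete_test(M + ["10101"])      # M plus one more sequence
```

For binary data a pair of columns can show at most four distinct pairs, so checking that the set of observed pairs has size four is the whole test.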

Page 7: ReCombinatorics: Phylogenetic Networks with Recombination

A richer model

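[Figure: the perfect phylogeny from the previous slide (ancestral sequence 00000, site mutations 1 to 5 on the edges).]

M, with sites 1 2 3 4 5:
10100
10000
01011
01010
00010
10101 (added)

Pair 4, 5 fails the four-gamete test. The sites 4, 5 are incompatible.

Real sequence histories often involve recombination.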

Page 8: ReCombinatorics: Phylogenetic Networks with Recombination

Sequence Recombination

Single crossover recombination

P = 10100, S = 01011

A recombination of P and S at recombination point 5 gives 10101.

The first 4 sites come from P (Prefix) and the sites from 5 onward come from S (Suffix).
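A single-crossover recombination is just a prefix of one parent glued to a suffix of the other. A minimal sketch, assuming sequences are strings and the recombination point is 1-indexed as on the slide (the function name is hypothetical):

```python
def recombine(prefix_parent, suffix_parent, point):
    """Single-crossover recombination: sites before `point` come from the
    prefix parent P, sites from `point` onward come from the suffix parent S."""
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have the same number of sites")
    if not 1 <= point <= len(prefix_parent):
        raise ValueError("recombination point out of range")
    return prefix_parent[:point - 1] + suffix_parent[point - 1:]

child = recombine("10100", "01011", 5)  # the slide's P, S and point
```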

Page 9: ReCombinatorics: Phylogenetic Networks with Recombination

Network with Recombination: ARG

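[Figure: the previous tree (ancestral sequence 00000, site mutations 1 to 5 on the edges) with one recombination node added. Its parents are P = 10100 and S = 01011, the recombination point is 5, and the node is labeled 10101.]

M, with sites 1 2 3 4 5:
10100
10000
01011
01010
00010
10101 (new)

The previous tree with one recombination event now derives all the sequences.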

Page 10: ReCombinatorics: Phylogenetic Networks with Recombination

A Min ARG for Kreitman’s data


ARG created by SHRUB

Page 11: ReCombinatorics: Phylogenetic Networks with Recombination

An illustration of why we are interested in recombination:

Association Mapping of Complex Diseases Using ARGs

Page 12: ReCombinatorics: Phylogenetic Networks with Recombination

Association Mapping

• A major strategy being practiced to find genes influencing disease from haplotypes of a subset of SNPs.

– Disease mutations: unobserved.

• A simple example to explain association mapping and why ARGs are useful, assuming the true ARG is known.

[Figure: a SNP haplotype 0 1 0 0 1, with the unobserved disease mutation site marked among the SNPs.]

Page 13: ReCombinatorics: Phylogenetic Networks with Recombination

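Very Simplistic Mapping of the Unobserved Mutation of Mendelian Diseases with ARGs

[Figure: an ARG over sites 1 2 3 4 5 with ancestral sequence 00000, site mutations 1 to 5 on the edges, and two recombination nodes, each with a prefix parent (P) and a suffix parent (S). Interior node labels include 10010, 01100, 00101, 01101, 00100 and 00010. Some leaves are marked as diseased.]

Leaves:
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101

Assumption (for now): A sequence is diseased iff it carries the single disease mutation.

Where is the disease mutation?

What part of 01100 do d, e, f inherit?

The single disease mutation occurs between sites 2 and 3!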

Page 14: ReCombinatorics: Phylogenetic Networks with Recombination

Mapping Disease Gene with Inferred ARGs

• “..the best information that we could possibly get about association is to know the full coalescent genealogy…” – Zollner and Pritchard, 2005

• But we do not know the true ARG!

• Goal: infer ARGs from SNP data for association mapping

– Not easy and often approximation (e.g. Zollner and Pritchard)

– Improved results to do the inference: Y. Wu (RECOMB 2007)

Page 15: ReCombinatorics: Phylogenetic Networks with Recombination

Results on Reconstructing the Evolution of SNP Sequences

• Part I: Clean mathematical and algorithmic results: Galled-Trees, near-uniqueness, graph-theory lower bound, and the Decomposition theorem, Forest Theorem and the History Lower bound.

• Part II: Practical computation of Lower and Upper bounds on the number of recombinations needed. Construction of (optimal) phylogenetic networks; uniform sampling; haplotyping with ARGs; LD mapping …

• Part III: Varied Biological Applications

• Part IV: Extension to Gene Conversion

• Part V: The Minimum Mosaic Model of Recombination (CPM 2007)

This talk will discuss two topics in Part I

Page 16: ReCombinatorics: Phylogenetic Networks with Recombination

Minimizing Recombinations in unconstrained networks

• Problem: given a set of sequences M, find a phylogenetic network generating M, minimizing the number of recombinations used to generate M, allowing only one mutation per site. This has biological meaning in appropriate contexts.

• The general minimization problem is NP-hard.

• We can solve this problem in poly-time for the special case of Galled-Trees, to be defined.

Page 17: ReCombinatorics: Phylogenetic Networks with Recombination

The Decomposition Theorem

Since the minimization problem is NP-hard, we want to break up a problem into subproblems that can be solved separately and combined.

Page 18: ReCombinatorics: Phylogenetic Networks with Recombination

Incompatible Sites

A pair of sites (columns) of M that fails the 4-gamete test is said to be incompatible.

A site that is not in such a pair is compatible.

Page 19: ReCombinatorics: Phylogenetic Networks with Recombination

Incompatibility Graph G(M)

M:
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1

[Figure: the graph G(M), with one node per site. Nodes 1, 3, 4 form one connected component and nodes 2, 5 form the other.]

Two nodes are connected iff the pair of sites is incompatible, i.e., fails the 4-gamete test.

THE MAIN TOOL: We represent the pairwise incompatibilities in an incompatibility graph.
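The incompatibility graph for the example M can be built directly from the 4-gamete test. The following is a minimal sketch, not the talk's software: the function names are illustrative, sites are numbered from 1 as on the slide, and the components are found with a plain depth-first search.

```python
from itertools import combinations

M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]  # rows a..g

def incompatibility_edges(matrix):
    """Pairs of sites (1-indexed) whose columns contain all four gametes."""
    ncols = len(matrix[0])
    return [(i + 1, j + 1) for i, j in combinations(range(ncols), 2)
            if len({(row[i], row[j]) for row in matrix}) == 4]

def connected_components(num_sites, edges):
    """Connected components of the graph, each as a sorted list of sites."""
    adj = {s: set() for s in range(1, num_sites + 1)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, comp = [start], []
        while stack:
            node = stack.pop()
            comp.append(node)
            for nb in adj[node] - seen:
                seen.add(nb)
                stack.append(nb)
        comps.append(sorted(comp))
    return comps

edges = incompatibility_edges(M)
components = connected_components(len(M[0]), edges)
```

For this M the edges come out as 1-3, 1-4 and 2-5, which gives the two components {1, 3, 4} and {2, 5}.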

Page 20: ReCombinatorics: Phylogenetic Networks with Recombination

The connected components of G(M) are very informative

For example we have the Theorem:

The number of non-trivial connected components is a lower-bound on the number of recombinations needed in any network. We will see that the non-trivial connected components are the key to the finest possible decomposition, and have other essential uses.
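The component lower bound can be computed with a small union-find over the sites. This sketch assumes that "non-trivial" means a component with at least two sites, and the function name is hypothetical.

```python
from itertools import combinations

def component_lower_bound(matrix):
    """Number of connected components of G(M) that contain at least two sites."""
    ncols = len(matrix[0])
    parent = list(range(ncols))

    def find(x):
        # Follow parent pointers to the representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Union every pair of sites that fails the 4-gamete test.
    for i, j in combinations(range(ncols), 2):
        if len({(row[i], row[j]) for row in matrix}) == 4:
            parent[find(i)] = find(j)

    sizes = {}
    for site in range(ncols):
        root = find(site)
        sizes[root] = sizes.get(root, 0) + 1
    return sum(1 for size in sizes.values() if size >= 2)

M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]
bound = component_lower_bound(M)
```

On the seven-row example M the bound is 2, one for each of the components {1, 3, 4} and {2, 5}; on data that passes the 4-gamete test it is 0.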

Page 21: ReCombinatorics: Phylogenetic Networks with Recombination

Recombination Cycles

• In a Phylogenetic Network, with a recombination node x, if we trace two paths backwards from x, then the paths will eventually meet.

• The cycle specified by those two paths is called a “recombination cycle”.

Page 22: ReCombinatorics: Phylogenetic Networks with Recombination

A maximal set of intersecting cycles forms a Blob

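[Figure: an ARG with ancestral sequence 00000, site mutations 1 to 5 on the edges, and two recombination nodes (parents marked P and S). Node labels include 10010, 01100, 00101, 01101, 00100 and 00010.]

If directions on the edges are removed, a blob is a bi-connected component of the network.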

Page 23: ReCombinatorics: Phylogenetic Networks with Recombination

Blobbed Trees

• Contracting each blob in a network results in a directed, rooted tree, otherwise one of the “blobs” was not maximal. Simple, but key insight.

• So every phylogenetic network can be viewed as a directed tree of blobs - a blobbed-tree.

The blobs are the non-tree-like parts of the network.

A blob that is just a single cycle is called a “gall”, and a network where all blobs are galls is called a “Galled-Tree”.

Page 24: ReCombinatorics: Phylogenetic Networks with Recombination

[Figure: a tree of blobs. One blob is annotated: "Ugly tangled network inside the blob."]

Every network is a tree of blobs.

A network where every blob is a single cycle is a Galled-Tree.

Page 25: ReCombinatorics: Phylogenetic Networks with Recombination

A Simple Observation

In any network N for M, all sites from the same non-trivial connected component of G(M) must appear together in a single blob in N.

Page 26: ReCombinatorics: Phylogenetic Networks with Recombination

The Decomposition Theorem

Theorem: For any set of sequences M, there is a phylogenetic network that derives M, where each blob contains all and only the sites in one non-trivial connected component of G(M). The compatible sites can always be put on edges outside of any blob. This “fully-decomposed” network is the finest decomposition possible.

Page 27: ReCombinatorics: Phylogenetic Networks with Recombination

Example: Network for input M with one blob

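[Figure: a network for M in which all recombination cycles lie in a single blob. Ancestral sequence 00000, site mutations 1 to 5 on the edges, two recombination nodes (parents marked P and S). Interior node labels include 10010, 01100, 00101, 01101, 00100 and 00010.]

Leaves:
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101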

Page 28: ReCombinatorics: Phylogenetic Networks with Recombination

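[Figure, left: the incompatibility graph, with sites 1, 3, 4 in one connected component and sites 2, 5 in the other. Right: the fully-decomposed network for M, with one blob carrying sites 1, 3, 4 and another blob carrying sites 2, 5; recombination parents are marked p and s.]

Leaves of the fully-decomposed network:
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 01101
g: 00101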

Page 29: ReCombinatorics: Phylogenetic Networks with Recombination

Moreover, the backbone tree, the partition of sites into blobs, and the sequences exported from any blob are all invariant features of the fully-decomposed networks for M, and can be determined in polynomial time.

Page 30: ReCombinatorics: Phylogenetic Networks with Recombination

So, we can find a network for M by solving the (rooted) recombination minimization problem for each connected component of G(M) separately, and then connecting those subnetworks in an invariant way.

The resulting network will be a network with the fewest recombination nodes over all fully-decomposed networks for M.

Page 31: ReCombinatorics: Phylogenetic Networks with Recombination

Algorithmically

• Finding the tree part of the blobbed-tree is easy.

• Determining the sequences labeling the exterior nodes on any blob is easy.

• Determining a “good” structure inside a blob B is the problem of generating the sequences of the exterior nodes of B.

• It is easy to test whether the exterior sequences on B can be generated with only a single recombination. The original galled-tree problem is now just the problem of testing whether one single-crossover recombination is sufficient for each blob.

• That can be solved by successively removing each exterior sequence and testing if the remaining sequences can be generated on a perfect phylogeny of the correct form.

Page 32: ReCombinatorics: Phylogenetic Networks with Recombination

Proof Ideas

Let C be a connected component of G(M). Define M[C] as the sequences in M restricted to the sites in C.

Page 33: ReCombinatorics: Phylogenetic Networks with Recombination

M:
   1 2 3 4 5
a  0 0 0 1 0
b  1 0 0 1 0
c  0 0 1 0 0
d  1 0 1 0 0
e  0 1 1 0 0
f  0 1 1 0 1
g  0 0 1 0 1

[Figure: G(M) with connected components C1 = {1, 3, 4} and C2 = {2, 5}.]

M[C1]:
   1 3 4
a  0 0 1
b  1 0 1
c  0 1 0
d  1 1 0
e  0 1 0
f  0 1 0
g  0 1 0

M[C2]:
   2 5
a  0 0
b  0 0
c  0 0
d  0 0
e  1 0
f  1 1
g  0 1
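Restricting M to the sites of a component is a column selection. A minimal sketch, assuming rows are strings and sites are 1-indexed as on the slides (the helper name `restrict` is illustrative):

```python
def restrict(matrix, sites):
    """Return M[C]: every row of the matrix restricted to the given sites."""
    return ["".join(row[s - 1] for s in sites) for row in matrix]

M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]  # rows a..g
M_C1 = restrict(M, [1, 3, 4])
M_C2 = restrict(M, [2, 5])
```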

Page 34: ReCombinatorics: Phylogenetic Networks with Recombination

M[C1], with the supercharacter of each row:
   1 3 4   supercharacter
a  0 0 1   1
b  1 0 1   2
c  0 1 0   3
d  1 1 0   4
e  0 1 0   3
f  0 1 0   3
g  0 1 0   3

M[C2], with the supercharacter of each row:
   2 5   supercharacter
a  0 0   5
b  0 0   5
c  0 0   5
d  0 0   5
e  1 0   6
f  1 1   7
g  0 1   8

W:
   1 2 3 4 5 6 7 8
a  1 0 0 0 1 0 0 0
b  0 1 0 0 1 0 0 0
c  0 0 1 0 1 0 0 0
d  0 0 0 1 1 0 0 0
e  0 0 1 0 0 1 0 0
f  0 0 1 0 0 0 1 0
g  0 0 1 0 0 0 0 1

Now for each connected component C in G(M), call each distinct sequence in M[C] a supercharacter, and let W be the indicator matrix for the supercharacters. So W indicates which rows of M contain which particular supercharacters.
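The construction of W can be sketched as follows. The numbering of supercharacters by order of first appearance within each component is an assumption chosen to match the slide, and the function names are hypothetical. The second helper checks that no two columns of W contain all four gametes.

```python
from itertools import combinations

def supercharacter_matrix(matrix, components):
    """Indicator matrix W: one column per distinct sequence of each M[C]."""
    columns = []
    for sites in components:
        restricted = ["".join(row[s - 1] for s in sites) for row in matrix]
        # dict.fromkeys keeps distinct sequences in order of first appearance.
        for supercharacter in dict.fromkeys(restricted):
            columns.append([1 if r == supercharacter else 0 for r in restricted])
    return [list(row) for row in zip(*columns)]

def pairwise_compatible(W):
    """True if no two columns of W contain all four of 00, 01, 10, 11."""
    ncols = len(W[0])
    return all(len({(row[i], row[j]) for row in W}) < 4
               for i, j in combinations(range(ncols), 2))

M = ["00010", "10010", "00100", "10100", "01100", "01101", "00101"]  # rows a..g
W = supercharacter_matrix(M, [[1, 3, 4], [2, 5]])
compatible = pairwise_compatible(W)
```

For the example, `W` reproduces the 7 by 8 matrix shown on the slide and `compatible` is `True`.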

Page 35: ReCombinatorics: Phylogenetic Networks with Recombination

Proof Ideas

Lemma: No pair of supercharacters are incompatible.

So by the NASC for a Perfect Phylogeny, there is a unique perfect phylogeny T for W.

Page 36: ReCombinatorics: Phylogenetic Networks with Recombination

Proof Ideas

For each connected component C of G(M), all supercharacters that originate from C label edges in T that are incident with one single node v[C] in T. So, if we expand each node v[C] to be a network that generates the supercharacters from C (the sequences in M[C]), and connect each network correctly to the edges in T, the resulting network is a fully-decomposed blobbed-tree that generates M.

Page 37: ReCombinatorics: Phylogenetic Networks with Recombination

However …

While fully-decomposed networks always exist, they do not necessarily minimize the number of recombination nodes, over all possible networks.

That is, sometimes it pays to put sites from different connected components together on the same blob.

Page 38: ReCombinatorics: Phylogenetic Networks with Recombination

Sufficient Conditions

But we can prove several useful sufficient conditions for when there is a fully-decomposed network that minimizes the number of recombinations, over all possible networks.

The deepest result:

Theorem: Let N be a phylogenetic network for input M, let L be the set of sequences that label the nodes of N, and let G(L) be the incompatibility graph for L. If G(L) and G(M) have the same number of connected components, then there is a fully-decomposed network for M with the same number of recombinations as in N.

JCB December 2007

Page 39: ReCombinatorics: Phylogenetic Networks with Recombination

Corollary

A fully-decomposed network exists that minimizes the number of recombinations, unless every optimal network uses some recombination node(s) labeled by sequence(s) not in M, and the addition of those sequences to M creates an incompatibility between sites in different components of G(M).

Page 40: ReCombinatorics: Phylogenetic Networks with Recombination

[Figure: a network with ancestral sequence 000000, site mutations 1 to 6 on the edges, and recombination nodes whose parents are marked p and s. Node labels: 001000, 011010, 010010, 100001, 100101, 000100 and 100010.]

Sequences in M are in black. Sequence 100010 is not in M.

G(M) has two components. Each requires two recombinations, but this combined network needs only three.

G(L) has one component. The addition of sequence 100010 reduces the number of components from 2 to 1.
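The effect of the extra sequence on the incompatibility graph can be checked with a short sketch. The sequence set used for M here is an assumption: it is read off the node labels that survive from the figure, taking every labeled sequence other than 100010 (including the ancestral 000000) to be in M. The function name is also illustrative.

```python
from itertools import combinations

def nontrivial_components(matrix):
    """Connected components of the incompatibility graph with >= 2 sites."""
    ncols = len(matrix[0])
    label = list(range(ncols))  # label[s] is the component id of site s
    for i, j in combinations(range(ncols), 2):
        if len({(row[i], row[j]) for row in matrix}) == 4:
            old, new = label[j], label[i]
            label = [new if x == old else x for x in label]  # merge components
    groups = {}
    for site, lab in enumerate(label):
        groups.setdefault(lab, []).append(site + 1)  # report sites 1-indexed
    return sorted(g for g in groups.values() if len(g) >= 2)

# Assumed M, reconstructed from the figure's labels.
M = ["000000", "001000", "011010", "010010", "100001", "100101", "000100"]
before = nontrivial_components(M)              # G(M)
after = nontrivial_components(M + ["100010"])  # G(L) with the interior sequence
```

Under that assumption, G(M) has the components {1, 4, 6} and {2, 3, 5}, and adding 100010 makes sites 1 and 5 incompatible, which joins them into a single component.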

Page 41: ReCombinatorics: Phylogenetic Networks with Recombination

[Figure: G(M) for the original data, on sites 1 2 3 4 5 6.]

Two components, so two blobs. Each blob requires two recombinations, by the HK lower bound theorem, so a fully-decomposed network needs at least four recombinations.

Page 42: ReCombinatorics: Phylogenetic Networks with Recombination

[Figure: G(L), on sites 1 2 3 4 5 6.]

G(L) created from the original data, and the addition of the new interior sequence 100010. G(L) has only one connected component compared to two components for G(M).

Page 43: ReCombinatorics: Phylogenetic Networks with Recombination

A Practical Sufficient Condition

If M can be derived on a network N in which every edge contains at most one site, and every node is labeled with a sequence in M, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks for M.

Page 44: ReCombinatorics: Phylogenetic Networks with Recombination

Another Practical Sufficient Condition

If M can be derived on a network N where the number of recombinations equals the (poly-computable) Haplotype Lower Bound, then there is a fully-decomposed network for M which minimizes the number of recombinations over all possible networks.

Page 45: ReCombinatorics: Phylogenetic Networks with Recombination

Theorem: For any K, there is a dataset where the best fully-decomposed network uses K recombinations more than optimal.

In that construction, the ratio of the number of recombinations in the best fully-decomposed network to the optimal is constant as K grows.

Open Question: Construct examples that show that the ratio can be arbitrarily large.

Page 46: ReCombinatorics: Phylogenetic Networks with Recombination

A New Recombination Lower Bound and The Minimum Perfect Phylogenetic Forest Problem

Yufeng Wu and Dan Gusfield, UC Davis

COCOON’07 July 16, 2007

Page 47: ReCombinatorics: Phylogenetic Networks with Recombination

History Bound (Myers & Griffiths 2003)

Iterate the following operations:
1. Remove a column with a single 0 or 1
2. Remove a duplicate row
3. Remove any row

History bound: the minimum number of type-3 operations needed to reduce the matrix to empty.

Example, starting from M:
000
100
010
011
111

One type-3 operation leaves:
000
100
010
011

A type-1 operation then gives:
00
10
01
01

A type-2 operation gives:
00
10
01

Further type-1 operations reduce the matrix to empty.
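The definition above can be turned into a small exhaustive computation. This is only a sketch under stated assumptions, not the Myers and Griffiths program or the Bafna and Bansal DP: it treats a column with at most one 0 or at most one 1 as removable, it always applies type-1 and type-2 operations greedily before trying a type-3 operation, and it memoizes on the set of remaining rows. All names are hypothetical, and the search is exponential in the number of rows.

```python
from functools import lru_cache

def reduce_rows(rows):
    """Apply type-1 (column) and type-2 (duplicate row) operations until stuck."""
    rows = set(rows)  # a set drops duplicate rows (type-2)
    changed = True
    while changed and rows:
        ncols = len(next(iter(rows)))
        # Keep only columns with at least two 0s and at least two 1s.
        keep = [j for j in range(ncols)
                if 2 <= sum(r[j] for r in rows) <= len(rows) - 2]
        changed = len(keep) < ncols
        rows = {tuple(r[j] for j in keep) for r in rows}
    return frozenset(rows)

@lru_cache(maxsize=None)
def _min_type3(rows):
    rows = reduce_rows(rows)
    if not rows or len(next(iter(rows))) == 0:
        return 0  # no columns left: the matrix is empty
    # Try every row as the next type-3 removal and keep the best outcome.
    return 1 + min(_min_type3(rows - {r}) for r in rows)

def history_bound(sequences):
    """Minimum number of type-3 operations needed to empty the matrix."""
    return _min_type3(frozenset(tuple(int(c) for c in s) for s in sequences))

slide_example = history_bound(["000", "100", "010", "011", "111"])
seven_rows = history_bound(
    ["00010", "10010", "00100", "10100", "01100", "01101", "00101"])
```

On the five-row example this returns 1, matching the single type-3 operation shown on the slide; on the seven-row matrix M used earlier in the talk it returns 2.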

Page 48: ReCombinatorics: Phylogenetic Networks with Recombination

Graphical interpretation of history bound (HistB)

• Each operation in the history computation corresponds to an operation that deconstructs the optimal, but unknown ARG.

• We deconstruct the optimal ARG by removing tree parts as long as possible; then remove an exposed recombination node; repeat.

• Removing an exposed recombination node in the ARG corresponds to a single type-3 operation. So when deconstructing the optimal ARG, the number of recombination nodes = number of type-3 operations.

• Since the optimal ARG is unknown, the history bound is the minimum number of type-3 operations needed to make the matrix empty.

Page 49: ReCombinatorics: Phylogenetic Networks with Recombination

Operations on M correspond to operations on the optimal ARG

[Figure: an ARG with site mutations 1 to 5 on the edges and a recombination node (parents marked p and s), shown next to the matrix M that it derives.]

M:
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 00101
g: 00101

Page 50: ReCombinatorics: Phylogenetic Networks with Recombination

Type-2 operation

[Figure: the ARG and the matrix after duplicate row g is removed.]

Sites 1 2 3 4 5:
a: 00010
b: 10010
c: 00100
d: 10100
e: 01100
f: 00101

Page 51: ReCombinatorics: Phylogenetic Networks with Recombination

Type-1 operations

[Figure: the ARG and the matrix after columns 2 and 5 are removed.]

Sites 1 3 4:
a: 001
b: 101
c: 010
d: 110
e: 010
f: 010

Page 52: ReCombinatorics: Phylogenetic Networks with Recombination

Type-2 operations

[Figure: the ARG and the matrix after duplicate rows e and f are removed.]

Sites 1 3 4:
a: 001
b: 101
c: 010
d: 110

Page 53: ReCombinatorics: Phylogenetic Networks with Recombination

Type-3 operation

[Figure: the ARG and the matrix after row d is removed.]

Sites 1 3 4:
a: 001
b: 101
c: 010

Then three more Type-1 operations fully reduce M and the ARG.

Page 54: ReCombinatorics: Phylogenetic Networks with Recombination

History bound

• Initially required trying all n! permutations of the rows to choose the type-3 operations.

• The bound can be computed by DP in O(2^n) time (Bafna, Bansal).

• On datasets where it can be computed, the history bound is observed to be higher than (or equal to) all studied lower bounds (about ten of them).

• There is no static definition for what the history bound is -- it is only defined by the algorithms that compute it! The work in this part of the talk comes out of an attempt to find a simple static definition.

Page 55: ReCombinatorics: Phylogenetic Networks with Recombination

Why a static definition matters

• We want a definition of what is being computed, independent of how it is computed, so that we can reason about it and find alternative ways to compute or approximate it.

• For example, with no static definition of the history bound, we don’t know how to formulate an integer linear program to compute it.

Page 56: ReCombinatorics: Phylogenetic Networks with Recombination

Intro. to Forest Bound: Decompose an Optimal ARG to A Forest of Trees, removing recombination edges

An ARG with three recombinations

After removing recombination edges, four trees result.

The number of trees is precisely the number of recombinations plus one

Page 57: ReCombinatorics: Phylogenetic Networks with Recombination

Idea behind the Forest Bound (FB)

Each tree created in this way contains at most one occurrence of any site, and each site occurs in at most one of the trees. So the trees form a forest of related perfect phylogenies.

Page 58: ReCombinatorics: Phylogenetic Networks with Recombination

Forest Bound

Given a set of sequences M, partition M into the fewest subsets so that each subset of sequences can be derived on a tree, where each site occurs at most once in the forest of trees. The number of trees, minus one, is a valid lower bound on Rmin.

Page 59: ReCombinatorics: Phylogenetic Networks with Recombination

Comparing the Forest Bound (FB) to:

• History Bound (HistB)

• Optimal Haplotype Bound (OhapB): The currently best lower bound that can be computed in practice for biological data.

• Theorem: On any data, OhapB <= FB <= HistB. On some data, OhapB < FB < HistB. Thus the FB is the highest lower bound with a static definition.

Page 60: ReCombinatorics: Phylogenetic Networks with Recombination

Computing the Forest Bound is NP-Hard

• Optimal haplotype bound is quite good, but NP-hard to compute.

• If the forest bound could be computed efficiently, we would not need to use the optimal haplotype bound at all.

• Unfortunately, the forest bound is NP-hard to compute.

• Reduction from Exact-cover-by-3 sets.

Page 61: ReCombinatorics: Phylogenetic Networks with Recombination

Integer Programming Formulation for the Forest Bound

• For sequences with m sites, consider the hypercube of all possible 2^m sequences.

• Minimizing F is equivalent to reducing the number of Steiner nodes in the forests.

• We also need to ensure the edge linking two nodes in a tree is only labeled with columns that do not appear in other trees.

• Can easily incorporate the missing data in the input.

• The IP formulation has exponential size, but is practical when the number of columns is relatively small.

Page 62: ReCombinatorics: Phylogenetic Networks with Recombination

Empirical Results

• On randomly generated datasets with 15 rows and 7 columns, FB > OhapB on 10% of the data. On more biologically meaningful data (generated with the simulation program ms), however, OhapB = FB more often.

• On datasets generated by ms with missing entries, FB more often outperforms an approximate optimal Rh bound:

– 30 rows and 7 columns and 30% missing entries: FB was strictly larger in 8% of the data.

– When the level of missing entries is lower, the approx. OhapB matches the FB more often.

Page 63: ReCombinatorics: Phylogenetic Networks with Recombination

Open Problem

Find a static definition of the history bound, one that can be translated into an objective function independent of any algorithm; one that can be solved by ILP, for example.

Page 64: ReCombinatorics: Phylogenetic Networks with Recombination

Papers and software are at: wwwcsif.cs.ucdavis.edu/~gusfield

Thank you.