Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies Charles Semple Biomathematics Research Centre Department of Mathematics and Statistics University of Canterbury New Zealand Joint work with Magnus Bordewich, Catherine McCartin e-Science Institute, Edinburgh 2007

Transcript of Approximating the subtree distance between phylogenies. C Semple

Page 1: Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies

Charles Semple

Biomathematics Research Centre

Department of Mathematics and Statistics

University of Canterbury

New Zealand

Joint work with Magnus Bordewich, Catherine McCartin

e-Science Institute, Edinburgh 2007

Page 2: Approximating the subtree distance between phylogenies. C Semple

Subtree Prune and Regraft (SPR)

Example. [Figure: a rooted binary tree S on leaves a, b, c, d with root r, together with trees T1 and T2, each obtained from S by 1 SPR operation.]

Page 3: Approximating the subtree distance between phylogenies. C Semple

Applications of SPR

Used

I. As a search tool for selecting the best tree in reconstruction algorithms.

II. To quantify the dissimilarity between two phylogenetic trees.

III. To provide a lower bound on the number of reticulation events in the case of non-tree-like evolution.

For II and III, one wants the minimum number of SPR operations to transform one phylogeny into another.

This number is the SPR distance between two phylogenies S and T.

Page 4: Approximating the subtree distance between phylogenies. C Semple

The Mathematical Problem

MINIMUM SPR

Instance: Two rooted binary phylogenetic trees S and T.

Goal: Find a minimum-length sequence of single SPR operations that transforms S into T.

Measure: The length of the sequence.

Notation: Use dSPR(S, T) to denote this minimum length.

Theorem (Bordewich, S 2004) MINIMUM SPR is NP-hard.

Overriding goal is to find (with no restrictions) the exact solution or a heuristic solution with a guarantee of closeness.

Page 5: Approximating the subtree distance between phylogenies. C Semple

Algorithms for NP-Hard Problems

Fixed-parameter algorithms are a practical way to find optimal solutions if the parameter measuring the hardness of the problem is small.

Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution.

Page 6: Approximating the subtree distance between phylogenies. C Semple

Agreement Forests

A forest of T is a disjoint collection of phylogenetic subtrees whose union of leaf sets is X ∪ {r}.

Example. [Figure: a tree S on leaves a, b, c, d, e, f with root r, and a forest F1 of S.]

Page 8: Approximating the subtree distance between phylogenies. C Semple

Agreement Forests

An agreement forest for S and T is a forest of both S and T.

Example. [Figure, built up over pages 8-10: trees S and T on leaves a, b, c, d, e, f with root r, and forests F1 and F2.]

Page 11: Approximating the subtree distance between phylogenies. C Semple

Theorem. (Bordewich, S, 2004) Let S and T be two binary phylogenetic trees. Then

dSPR(S,T) = size of maximum-agreement forest - 1.

o It’s fast to construct from a maximum-agreement forest for S and T a sequence of SPR operations that transforms S into T.

Page 12: Approximating the subtree distance between phylogenies. C Semple

Reducing the Size of the Instance

Subtree Reduction Chain Reduction

Page 13: Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter Algorithms

The underlying idea is to find an algorithm whose running time separates the size of the problem instance from the parameter of interest.

One way to obtain such an algorithm is to reduce the size of the problem instance, while preserving the optimal value (kernelizing the problem).

Are the subtree and chain reductions enough to kernelize the problem?

Page 14: Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter Algorithms

Lemma. If n’ denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then

n’ < 28 dSPR(S,T).

Corollary. (Bordewich, S 2004) MINIMUM SPR is fixed-parameter tractable.

1. Repeatedly apply the subtree and chain rules.

2. Exhaustively find a maximum-agreement forest by deleting edges in S and comparing with T.

Running time is O((56k)^k + p(n)) compared with O((2n)^k), where k = dSPR(S,T) and p(n) is the polynomial bound for reducing the trees using the subtree and chain reductions.
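To get a feel for the two bounds, the sketch below evaluates only their exponential terms, (56k)^k and (2n)^k. The function names and the sample values n = 1000 and k = 3 are illustrative assumptions; constants hidden by the O-notation and the polynomial p(n) are ignored.

```python
def kernel_search_space(k):
    """Exponential term of the kernelized bound, (56k)^k."""
    return (56 * k) ** k


def naive_search_space(n, k):
    """Exponential term of the bound without kernelization, (2n)^k."""
    return (2 * n) ** k


n, k = 1000, 3  # illustrative values only
print(kernel_search_space(k))     # 168 ** 3
print(naive_search_space(n, k))   # 2000 ** 3
```

The kernelized term depends only on k, so it stays fixed however large n becomes.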

Page 15: Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter Algorithms

A second way to obtain a fixed-parameter algorithm is using the method of bounded search trees.

Preliminaries

o A rooted triple is a binary tree with three leaves.

o A tree S displays ab|c if the last common ancestor of a and b is a proper descendant of the last common ancestor of a and c.

[Figure: the rooted triple ab|c on leaves a, b, c.]

Page 16: Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter Algorithms

Observation. If F is an agreement forest for S and T, then F can be obtained from S by deleting |F|-1 edges.

Recall. If F is maximum, then |F|-1 = dSPR(S,T).

[Figure: trees S and T on leaves a, b, c, d, e, f with root r.]
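The observation that a forest arises from S by deleting edges can be illustrated as follows. This is a sketch under stated assumptions: trees are nested tuples, the root r is treated as an ordinary leaf label, each deleted edge is named by the subtree hanging below it, and the function name forest_after_cuts is hypothetical. The toy tree is not the S of the slide's figure.

```python
def forest_after_cuts(tree, cuts):
    """Leaf sets of the components left after deleting the edge above each subtree in `cuts`."""
    components = []

    def walk(t):
        # Leaves of t still connected to t's top vertex after the cuts below it.
        if isinstance(t, str):
            attached = {t}
        else:
            attached = walk(t[0]) | walk(t[1])
        if t in cuts:
            # The edge above t is deleted, so t's leaves form their own component.
            if attached:
                components.append(frozenset(attached))
            return set()
        return attached

    top = walk(tree)
    if top:
        components.append(frozenset(top))
    return components


S = (((("a", "b"), "c"), "d"), "r")
F = forest_after_cuts(S, {"a", "d"})
print(F)
print(len(F) - 1)  # number of deleted edges
```

Deleting two edges here leaves three components, matching the count |F|-1 in the observation.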

Page 18: Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter Algorithms

1. Find a minimal triple xy|z in S not in T.

2. Branch into 4 computational paths e_x, e_y, e_z, e_r, and repeat for each path until no such triple.

3. For each path, find a pair of overlapping components t_s and t_t and branch into 2 paths e_s, e_t. Repeat until no such components.

[Figure, built up over pages 18-35: trees S and T on leaves a, b, c, d, e, f with root r, with a search tree whose branches are labelled e_a, e_b, e_d, e_e, e_r and, at a vertex v_st, e_s and e_t.]
Page 36: Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter Algorithms

Key Lemma. dSPR(S,T) <= k if and only if one of the computational paths of length at most k results in an agreement forest for S and T.

Theorem. (Bordewich, McCartin, S 2007)

The bounded search tree algorithm has running time O(4^k n^4).

Combining the methods of kernelization and bounded search gives an FPT algorithm with running time O(4^k k^4 + p(n)).

Page 37: Approximating the subtree distance between phylogenies. C Semple

Approximation Algorithms

Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution.

An r-approximation algorithm A for MINIMUM SPR means that the size of an agreement forest (minus 1) for S and T returned by A is at most r times dSPR(S,T).

Example. If r=3, then the size of any agreement forest (minus 1) for S and T returned by A is at most 3 times dSPR(S,T).

Based on a pairwise comparison, Bonet, St. John, Mahindru, Amenta 2006 establish a 5-approximation algorithm (O(n)) for MINIMUM SPR.

Page 38: Approximating the subtree distance between phylogenies. C Semple

Approximation Algorithms

1. Find a minimal triple xy|z in S not in T.

2. Delete each of e_x, e_y, e_z, e_r, and repeat until no such triple.

3. Find a pair of overlapping components t_s and t_t and delete e_s, e_t. Repeat until no such components.

Theorem. (Bordewich, McCartin, S 2007)

This gives a 4-approximation algorithm for MINIMUM SPR (O(n^5)).

Deleting only e_x, e_z, e_r gives a 3-approximation algorithm.
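One way to read the factor of 4 is sketched below. It rests on an assumption that is not stated on the slide: each round of steps 2 and 3 deletes at most 4 edges, and at least one of them lowers the optimum of the remaining instance by 1. This is intuition only, not the proof from the cited paper, and the symbol for the number of rounds is introduced here for illustration.

```latex
% Assumption: every round deletes at most 4 edges, and at least one of
% them reduces the optimum of the remaining instance by 1.
\[
  \#\text{rounds} \le d_{\mathrm{SPR}}(S,T)
  \quad\Longrightarrow\quad
  |F| - 1 \;\le\; 4 \cdot \#\text{rounds} \;\le\; 4\, d_{\mathrm{SPR}}(S,T).
\]
```

Under the same assumption, deleting only 3 edges per round would give the factor 3.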

Page 39: Approximating the subtree distance between phylogenies. C Semple

[Figure: a scale of approximation ratios r.

Exact algorithms: r=1.

Maximum independent set in planar graphs (PTAS).

No r, unless P=NP: 1. Travelling salesman problem. 2. Maximum clique.

MINIMUM SPR: 1 + 1/2112 (B, S 07); 3 (B, M, S 07); 5 (B, S, M, A 06).]

Page 40: Approximating the subtree distance between phylogenies. C Semple

Modelling Hybridization Events with SPR Operations

Reticulation processes cause species to be a composite of DNA regions derived from different ancestors.

Processes include

o horizontal gene transfer,

o hybridization, and

o recombination.

… molecular phylogeneticists will have failed to find the ‘true tree’, not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot be properly represented as a tree.

Ford Doolittle, 1999 (Dalhousie University)

Page 41: Approximating the subtree distance between phylogenies. C Semple

Modelling Hybridization Events with SPR Operations

A single SPR operation models a single hybridization event (Maddison 1997).

Example. [Figure, built up over pages 41-44: trees S and T on leaves a, b, c, d with root r, and a hybridization network H; deleting hybrid edges from H gives a forest F.]

Page 45: Approximating the subtree distance between phylogenies. C Semple

A Fundamental Problem for Biologists

Given an initial set of data that correctly represents the tree-like evolution of different parts of various species genomes,

what is the smallest number of reticulation events required that simultaneously explains the evolution of the species?

This smallest number

o Provides a lower bound on the number of such events.

o Indicates the extent of the effect that hybridization has had on the evolutionary history of the species under consideration.

Since the 1930s, botanists have asked the question: How significant has the effect of hybridization been on the New Zealand flora?

Page 46: Approximating the subtree distance between phylogenies. C Semple

Trees and Hybridization Networks

H explains T if T can be obtained from a rooted subtree of H by suppressing degree-2 vertices.

Example. [Figure, built up over pages 46-48: trees S and T on leaves a, b, c, d, and hybridization networks H1 and H2.]

Page 49: Approximating the subtree distance between phylogenies. C Semple

The Mathematical Problem

MINIMUM HYBRIDIZATION

Instance: Two rooted binary phylogenetic trees S and T.

Goal: Find a hybridization network H that explains S and T, and minimizes the number of hybridization vertices.

Measure: The number of hybridization vertices in H.

Notation: Use h(S, T) to denote this minimum number.

Page 50: Approximating the subtree distance between phylogenies. C Semple

Example: Arbitrary SPR operations not sufficient.

[Figure: the forest F1 on leaves a, b, c, d, e, f with root r.]

Page 51: Approximating the subtree distance between phylogenies. C Semple

o A sequence of SPR operations that avoids creating directed cycles to make a hybridization network that explains S and T.

o If one minimizes the length of an (acyclic) sequence, does the resulting network minimize the number of hybridization events to explain S and T?

o YES, and such a sequence corresponds to an acyclic-agreement forest.

Page 52: Approximating the subtree distance between phylogenies. C Semple

Theorem. (Baroni, Grünewald, Moulton, S, 2005) Let S and T be two binary phylogenetic trees. Then

h(S,T) = size of maximum-acyclic agreement forest - 1.

o It’s fast to construct from a maximum-acyclic agreement forest for S and T a hybridization network that realizes h(S,T).

Theorem. (Bordewich, S, 2007)

MINIMUM HYBRIDIZATION is NP-hard.

Page 53: Approximating the subtree distance between phylogenies. C Semple

Reducing the Size of the Instance

Subtree Reduction Chain Reduction

Page 54: Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter Algorithms

Are the subtree and chain reductions enough to kernelize the problem?

Lemma. If n’ denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then

n’ < 14 h(S,T).

Corollary. (Bordewich, S 2007) MINIMUM HYBRIDIZATION is fixed-parameter tractable.

Running time is O((28k)^k + p(n)) compared with O((2n)^k), where k = h(S,T) and p(n) is the polynomial bound for reducing the trees using the subtree and chain reductions.

Page 55: Approximating the subtree distance between phylogenies. C Semple

Reducing the Size of the Instance

Cluster Reduction (Baroni 2004)


Page 56: Approximating the subtree distance between phylogenies. C Semple

A Grass (Poaceae) Dataset (Grass Phylogeny Working Group, Düsseldorf)

o Ellstrand, Whitkus, Rieseberg 1996 (Distribution of spontaneous plant hybrids)

o For each sequence, used fastDNAml to reconstruct a phylogenetic tree (H. Schmidt).

Page 57: Approximating the subtree distance between phylogenies. C Semple

[Figure, built up over pages 57-63: two trees, titled "Chloroplast (phyB) sequences" and "Nuclear (ITS) sequences"; the final build is annotated h=3, h=1, h=4, h=0.]

Page 64: Approximating the subtree distance between phylogenies. C Semple

pairwise combination   # overlapping taxa   h(S,T)        running time (2000 MHz CPU, 2GB RAM)
ndhF   phyB            40                   14            11h
ndhF   rbcL            36                   13            11.8h
ndhF   rpoC2           34                   12            26.3h
ndhF   waxy            19                   9             320s
ndhF   ITS             46                   at least 15
phyB   rbcL            21                   4             1s
phyB   rpoC2           21                   7             180s
phyB   waxy            14                   3             1s
phyB   ITS             30                   8             19s
rbcL   rpoC2           26                   13            29.5h
rbcL   waxy            12                   7             230s
rbcL   ITS             29                   at least 9
rpoC2  waxy            10                   1             1s
rpoC2  ITS             31                   at least 10
waxy   ITS             15                   8             620s

Bordewich, Linz, St John, S, 2007

Page 65: Approximating the subtree distance between phylogenies. C Semple

Computing dSPR(S,T) and h(S,T)

dSPR(S,T)

1. FPT using kernelization and bounded search tree methods (O(4^k k^4 + p(n))).

2. No cluster-based reduction.

3. 3-approximation algorithm.

h(S,T)

1. FPT using kernelization only (O((28k)^k + p(n))). Unknown if a bounded search tree method exists.

2. Cluster-based reduction.

3. Unknown if there even is an approximation algorithm.

Page 66: Approximating the subtree distance between phylogenies. C Semple

Reducing the Size of the Instance

Subtree Reduction Chain Reduction

[Figure: trees S and T containing a common chain a1, a2, ..., an, and reduced trees S’ and T’ containing a chain x1, x2, x3.]