Approximating the subtree distance between phylogenies. C Semple

Fixed-Parameter and Approximation Algorithms for the Subtree Distance Between Phylogenies

Charles Semple

Biomathematics Research Centre

Department of Mathematics and Statistics

University of Canterbury

New Zealand

Joint work with Magnus Bordewich, Catherine McCartin

e-Science Institute, Edinburgh 2007

Subtree Prune and Regraft (SPR)

Example. [Figure: rooted trees S, T1 and T2 on leaves a, b, c, d with root r, related by single SPR operations.]
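A single SPR operation can be sketched in a few lines of code. The Python below is a minimal illustration, not code from the talk: it assumes trees are written as nested pairs with string leaves, and that a subtree (or the edge above it) is identified by the set of leaves below it. The function names leaves, find, remove, graft and spr are hypothetical, and the sketch does not check that the regraft target lies outside the pruned subtree.

```python
def leaves(tree):
    """Leaf labels below a nested-pair tree (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def find(tree, cluster):
    """Return the subtree whose leaf set equals `cluster`, or None."""
    if leaves(tree) == cluster:
        return tree
    if isinstance(tree, str):
        return None
    for child in tree:
        hit = find(child, cluster)
        if hit is not None:
            return hit
    return None

def remove(tree, cluster):
    """Prune the subtree with leaf set `cluster` and suppress the
    degree-2 vertex left behind."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    if leaves(left) == cluster:
        return right
    if leaves(right) == cluster:
        return left
    return (remove(left, cluster), remove(right, cluster))

def graft(tree, target, subtree):
    """Subdivide the edge above the vertex with leaf set `target`
    and attach `subtree` at the new vertex."""
    if leaves(tree) == target:
        return (tree, subtree)
    if isinstance(tree, str):
        return tree
    return tuple(graft(child, target, subtree) for child in tree)

def spr(tree, pruned, target):
    """One subtree prune and regraft operation."""
    subtree = find(tree, pruned)
    return graft(remove(tree, pruned), target, subtree)

# Hypothetical example: prune leaf a and regraft it above the pair (b, c).
S = ((("a", "b"), "c"), "d")
T = spr(S, {"a"}, {"b", "c"})   # T is ((("b", "c"), "a"), "d")
```

Identifying edges by leaf sets keeps the sketch short; a practical implementation would more likely use explicit node objects with parent pointers.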

Applications of SPR

Used

I. As a search tool for selecting the best tree in reconstruction algorithms.

II. To quantify the dissimilarity between two phylogenetic trees.

III. To provide a lower bound on the number of reticulation events in the case of non-tree-like evolution.

For II and III, one wants the minimum number of SPR operations to transform one phylogeny into another.

This number is the SPR distance between two phylogenies S and T.

The Mathematical Problem

MINIMUM SPR

Instance: Two rooted binary phylogenetic trees S and T.

Goal: Find a minimum length sequence of single SPR operations that transforms S into T.

Measure: The length of the sequence.

Notation: Use dSPR(S, T) to denote this minimum length.

Theorem (Bordewich, S 2004) MINIMUM SPR is NP-hard.

Overriding goal is to find (with no restrictions) the exact solution or a heuristic solution with a guarantee of closeness.

Algorithms for NP-Hard Problems

Fixed-parameter algorithms are a practical way to find optimal solutions if the parameter measuring the hardness of the problem is small.

Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution.

Agreement Forests

A forest of T is a disjoint collection of phylogenetic subtrees whose union of leaf sets is X ∪ {r}.

Example. [Figure: tree S on leaves a, b, c, d, e, f with root r, and a forest F1 of S.]

Agreement Forests

An agreement forest for S and T is a forest of both S and T.

Example. [Figure: trees S and T on leaves a, b, c, d, e, f with root r, and forests F1 and F2.]


Theorem. (Bordewich, S, 2004) Let S and T be two binary phylogenetic trees. Then

dSPR(S,T) = size of maximum-agreement forest - 1.

o It’s fast to construct from a maximum-agreement forest for S and T a sequence of SPR operations that transforms S into T.
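The relation between agreement forests and dSPR can be checked by brute force on very small trees. The Python below is a sketch under stated assumptions, not the method of the talk: trees are nested pairs with string leaves and an explicit root leaf "r"; a forest is given as a partition of the leaf set; and it is a forest of a tree when the minimal subtrees spanning its blocks are vertex-disjoint. All function names are hypothetical, and the search over all partitions is exponential.

```python
def leaves(tree):
    """Leaf labels below a nested-pair tree (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def restrict(tree, keep):
    """Subtree induced by the leaves in `keep`, degree-2 vertices
    suppressed, in an order-independent canonical form."""
    if isinstance(tree, str):
        return tree if tree in keep else None
    parts = [p for p in (restrict(c, keep) for c in tree) if p is not None]
    if not parts:
        return None
    return parts[0] if len(parts) == 1 else frozenset(parts)

def vertices(tree):
    """Yield (cluster, child clusters) for every vertex of the tree."""
    if isinstance(tree, str):
        yield leaves(tree), []
        return
    yield leaves(tree), [leaves(child) for child in tree]
    for child in tree:
        yield from vertices(child)

def is_forest_of(tree, blocks):
    """True if the minimal subtrees spanning the blocks are vertex-disjoint."""
    for cluster, kids in vertices(tree):
        users = 0
        for block in blocks:
            # The vertex is at or below the block's last common ancestor
            # exactly when no single child contains the whole block.
            at_or_below_lca = not any(block <= kid for kid in kids)
            if cluster & block and at_or_below_lca:
                users += 1
        if users > 1:
            return False
    return True

def is_agreement_forest(s, t, blocks):
    """True if the partition `blocks` is a forest of both s and t."""
    blocks = [frozenset(b) for b in blocks]
    covered = frozenset().union(*blocks)
    if sum(map(len, blocks)) != len(covered):
        return False
    if covered != leaves(s) or covered != leaves(t):
        return False
    same_shape = all(restrict(s, b) == restrict(t, b) for b in blocks)
    return same_shape and is_forest_of(s, blocks) and is_forest_of(t, blocks)

def partitions(items):
    """Yield every partition of a list into non-empty blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        yield [[first]] + part
        for i, block in enumerate(part):
            yield part[:i] + [[first] + block] + part[i + 1:]

def d_spr_bruteforce(s, t):
    """Fewest components of any agreement forest, minus 1."""
    labels = sorted(leaves(s))
    return min(len(p) for p in partitions(labels)
               if is_agreement_forest(s, t, p)) - 1

# Hypothetical example trees that differ by one SPR operation.
S = ("r", ((("a", "b"), "c"), "d"))
T = ("r", (("a", ("b", "c")), "d"))
print(d_spr_bruteforce(S, T))
```

Here "maximum" is read as "fewest components", which is why the sketch minimizes the number of blocks.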

Reducing the Size of the Instance

[Figure: the subtree reduction and the chain reduction.]

Fixed-Parameter Algorithms

The underlying idea is to find an algorithm whose running time separates the size of the problem instance from the parameter of interest.

One way to obtain such an algorithm is to reduce the size of the problem instance, while preserving the optimal value (kernelizing the problem).

Are the subtree and chain reductions enough to kernelize the problem?


Lemma. If n’ denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then

n’ < 28dSPR(S,T).

Corollary. (Bordewich, S 2004) MINIMUM SPR is fixed-parameter tractable.

1. Repeatedly apply the subtree and chain rules.

2. Exhaustively find a maximum-agreement forest by deleting edges in S and comparing with T.

Running time is O((56k)^k + p(n)) compared with O((2n)^k), where k = dSPR(S,T) and p(n) is the polynomial bound for reducing the trees using the subtree and chain reductions.
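To get a feel for the two bounds, the exponential terms can be evaluated for sample values. The snippet below is only an arithmetic illustration: the function names and the chosen values of n and k are hypothetical, and constant factors and the polynomial term p(n) are ignored.

```python
def kernel_search_size(k, factor=56):
    """(factor*k)^k: the exhaustive-search term after kernelization."""
    return (factor * k) ** k

def direct_search_size(n, k):
    """(2n)^k: the exhaustive-search term on the unreduced trees."""
    return (2 * n) ** k

# Hypothetical instance sizes (n leaves, distance k).
for n, k in [(100, 2), (1000, 3), (50, 3)]:
    print(n, k, kernel_search_size(k), direct_search_size(n, k))
```

The kernelized term does not depend on n, so it is the smaller of the two exactly when n > 28k, which matches the kernel-size bound in the lemma.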


A second way to obtain a fixed-parameter algorithm is using the method of bounded search trees.

Preliminaries

o A rooted triple is a binary tree with three leaves.

o A tree S displays ab|c if the last common ancestor of a and b is a proper descendant of the last common ancestor of a and c.

[Figure: the rooted triple ab|c on leaves a, b, c.]
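The definition of a displayed triple translates directly into a test on last common ancestors. The Python below is a minimal sketch, assuming trees are nested pairs with string leaves; the names leaves, lca and displays are hypothetical. It uses the fact that, for two ancestors of the same leaf, one is a proper descendant of the other exactly when its leaf set is a proper subset.

```python
def leaves(tree):
    """Leaf labels below a nested-pair tree (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def lca(tree, x, y):
    """Subtree rooted at the last common ancestor of leaves x and y."""
    node = tree
    while not isinstance(node, str):
        # Descend while a single child still contains both leaves.
        below = [c for c in node if {x, y} <= leaves(c)]
        if not below:
            break
        node = below[0]
    return node

def displays(tree, a, b, c):
    """True if `tree` displays the rooted triple ab|c."""
    return leaves(lca(tree, a, b)) < leaves(lca(tree, a, c))

# Hypothetical example trees.
S = ((("a", "b"), "c"), "d")
T = (("a", ("b", "c")), "d")
print(displays(S, "a", "b", "c"), displays(T, "a", "b", "c"))
```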


Observation. If F is an agreement forest for S and T, then F can be obtained from S by deleting |F|-1 edges.

Recall. If F is maximum, then |F|-1 = dSPR(S,T).

[Figure: the example trees S and T on leaves a, b, c, d, e, f with root r.]


1. Find a minimal triple xy|z in S not in T.

2. Branch into 4 computational paths e_x, e_y, e_z, e_r, and repeat for each path until no such triple.

3. For each path, find a pair of overlapping components t_s and t_t and branch into 2 paths e_s, e_t. Repeat until no such components.
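Step 1 relies on finding a triple of S that T does not display. The Python below sketches only that ingredient, under assumptions: trees are nested pairs with string leaves, and the names leaves, triples and conflicts are hypothetical. It lists every conflicting triple and does not implement the minimality condition, the branching, or the overlapping-components step.

```python
from itertools import combinations

def leaves(tree):
    """Leaf labels below a nested-pair tree (a leaf is a string)."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))

def triples(tree):
    """All rooted triples xy|z displayed by a binary tree,
    encoded as (frozenset({x, y}), z)."""
    if isinstance(tree, str):
        return set()
    left, right = tree
    found = triples(left) | triples(right)
    # Triples whose three leaves first meet at this vertex: two leaves
    # on one side form the pair, the third leaf comes from the other side.
    for side, other in ((left, right), (right, left)):
        for x, y in combinations(sorted(leaves(side)), 2):
            for z in leaves(other):
                found.add((frozenset((x, y)), z))
    return found

def conflicts(s, t):
    """Triples displayed by s but not by t."""
    return triples(s) - triples(t)

# Hypothetical example trees.
S = ((("a", "b"), "c"), "d")
T = (("a", ("b", "c")), "d")
for pair, z in conflicts(S, T):
    print("".join(sorted(pair)) + "|" + z)
```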

[Figure: step-by-step animation of the search on the example trees S and T; edge labels shown include e_a, e_b, e_d, e_e, e_r, e_s and e_t, together with a vertex v_st.]

Key Lemma. dSPR(S,T) <= k if and only if one of the computational paths of length at most k results in an agreement forest for S and T.

Theorem. (Bordewich, McCartin, S 2007)

The bounded search tree algorithm has running time O(4^k n^4).

Combining the methods of kernelization and bounded search gives an FPT algorithm with running time O(4^k k^4 + p(n)).
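One way to read the combined bound is to run the bounded search on the reduced trees. The derivation below is a sketch of that reading, not a statement taken from the talk; it substitutes the kernel-size bound n’ < 28k, with k = dSPR(S,T), into the O(4^k n^4) bound.

```latex
\begin{align*}
n' &< 28k \\
4^k (n')^4 &< 4^k (28k)^4 = 28^4 \cdot 4^k k^4 = O(4^k k^4)
\end{align*}
```

Adding the polynomial cost p(n) of performing the reductions gives O(4^k k^4 + p(n)).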


Approximation Algorithms

Polynomial-time approximation algorithms can efficiently find feasible solutions that are sometimes arbitrarily close to the optimal solution.

An r-approximation algorithm A for MINIMUM SPR means that the size of an agreement forest (minus 1) for S and T returned by A is at most r times dSPR(S,T).

Example. If r=3, then the size of any agreement forest (minus 1) for S and T returned by A is at most 3 times dSPR(S,T).

Based on a pairwise comparison, Bonet, St. John, Mahindru, Amenta 2006 establish a 5-approximation algorithm (O(n)) for MINIMUM SPR.

Approximation Algorithms


1. Find a minimal triple xy|z in S not in T.

2. Delete each of e_x, e_y, e_z, e_r, and repeat until no such triple.

3. Find a pair of overlapping components t_s and t_t and delete e_s, e_t. Repeat until no such components.

Theorem. (Bordewich, McCartin, S 2007)

This gives a 4-approximation algorithm for MINIMUM SPR (O(n^5)).

Deleting only e_x, e_z, e_r gives a 3-approximation algorithm.


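[Figure: problems placed on a scale of achievable approximation ratio r.]

o Exact algorithms: r=1.

o Maximum independent set in planar graphs (PTAS).

o MINIMUM SPR: 1 + 1/2112 (B, S 07); 3 (B, M, S 07); 5 (B, S, M, A 06).

o No r, unless P=NP: 1. Travelling salesman problem; 2. Maximum clique.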

Modelling Hybridization Events with SPR Operations

Reticulation processes cause species to be a composite of DNA regions derived from different ancestors.

Processes include

o horizontal gene transfer,

o hybridization, and

o recombination.

… molecular phylogeneticists will have failed to find the ‘true tree’, not because their methods are inadequate or because they have chosen the wrong genes, but because the history of life cannot be properly represented as a tree.

Ford Doolittle, 1999 (Dalhousie University)


A single SPR operation models a single hybridization event (Maddison 1997).

Example. [Figure: trees S and T on leaves a, b, c, d with root r, and a hybridization network H.]


[Figure: the same trees S and T and network H; deleting hybrid edges from H gives the forest F.]

A Fundamental Problem for Biologists

Given an initial set of data that correctly represents the tree-like evolution of different parts of various species’ genomes,

what is the smallest number of reticulation events required that simultaneously explains the evolution of the species?

This smallest number

o Provides a lower bound on the number of such events.

o Indicates the extent of the effect that hybridization has had on the evolutionary history of the species under consideration.

Since the 1930s, botanists have asked the question: How significant has the effect of hybridization been on the New Zealand flora?

Trees and Hybridization Networks

H explains T if T can be obtained from a rooted subtree of H by suppressing degree-2 vertices.

Example. [Figure: trees S and T on leaves a, b, c, d, and hybridization networks H1 and H2.]

The Mathematical Problem

MINIMUM HYBRIDIZATION

Instance: Two rooted binary phylogenetic trees S and T.

Goal: Find a hybridization network H that explains S and T, and minimizes the number of hybridization vertices.

Measure: The number of hybridization vertices in H.

Notation: Use h(S, T) to denote this minimum number.

Example: Arbitrary SPR operations not sufficient.

[Figure: forest F1 on leaves a, b, c, d, e, f with root r.]

o A sequence of SPR operations that avoids creating directed cycles to make a hybridization network that explains S and T.

o If one minimizes the length of an (acyclic) sequence, does the resulting network minimize the number of hybridization events to explain S and T?

o YES, and such a sequence corresponds to an acyclic-agreement forest.

Theorem. (Baroni, Grünewald, Moulton, S, 2005) Let S and T be two binary phylogenetic trees. Then

h(S,T) = size of maximum-acyclic agreement forest - 1.

o It’s fast to construct from a maximum-acyclic agreement forest for S and T a hybridization network that realizes h(S,T).

Theorem. (Bordewich, S, 2007)

MINIMUM HYBRIDIZATION is NP-hard.


Are the subtree and chain reductions enough to kernelize the problem?

Lemma. If n’ denotes the size of the leaf sets of the fully reduced trees using the subtree and chain reductions, then

n’ < 14h(S,T).

Corollary. (Bordewich, S 2007) MINIMUM HYBRIDIZATION is fixed-parameter tractable.

Running time is O((28k)^k + p(n)) compared with O((2n)^k), where k = h(S,T) and p(n) is the polynomial bound for reducing the trees using the subtree and chain reductions.
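The exponential term is consistent with running the O((2n)^k) exhaustive search on the reduced trees. The derivation below is a sketch of that reading, not a statement taken from the talk; the second line applies the same substitution to the MINIMUM SPR kernel bound.

```latex
\begin{align*}
n' < 14k \;&\Longrightarrow\; (2n')^k < (2 \cdot 14k)^k = (28k)^k
  && \text{with } k = h(S,T) \\
n' < 28k \;&\Longrightarrow\; (2n')^k < (2 \cdot 28k)^k = (56k)^k
  && \text{with } k = d_{\mathrm{SPR}}(S,T)
\end{align*}
```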


Cluster Reduction (Baroni 2004)


A Grass (Poaceae) Dataset (Grass Phylogeny Working Group,Düsseldorf)

o Ellstrand, Whitkus, Rieseberg 1996 (Distribution of spontaneous plant hybrids)

o For each sequence, used fastDNAml to reconstruct a phylogenetictree (H. Schmidt).

[Figure: trees for the chloroplast (phyB) sequences and the nuclear (ITS) sequences, with parts annotated h=3, h=1, h=4 and h=0.]

pairwise combination    # overlapping taxa    h(S,T)         running time (2000 MHz CPU, 2GB RAM)
ndhF    phyB            40                    14             11h
ndhF    rbcL            36                    13             11.8h
ndhF    rpoC2           34                    12             26.3h
ndhF    waxy            19                    9              320s
ndhF    ITS             46                    at least 15
phyB    rbcL            21                    4              1s
phyB    rpoC2           21                    7              180s
phyB    waxy            14                    3              1s
phyB    ITS             30                    8              19s
rbcL    rpoC2           26                    13             29.5h
rbcL    waxy            12                    7              230s
rbcL    ITS             29                    at least 9
rpoC2   waxy            10                    1              1s
rpoC2   ITS             31                    at least 10
waxy    ITS             15                    8              620s

Bordewich, Linz, St John, S, 2007

Computing dSPR(S,T) and h(S,T)

dSPR(S,T)

1. FPT using kernelization and bounded search tree methods (O(4^k k^4 + p(n))).

2. No cluster-based reduction.

3. 3-approximation algorithm.

h(S,T)

1. FPT using kernelization only (O((28k)^k + p(n))). Unknown if a bounded search tree method exists.

2. Cluster-based reduction.

3. Unknown if there even is an approximation algorithm.



[Figure: trees S and T with leaves a1, a2, ..., an, and trees S’ and T’ with leaves x1, x2, x3.]
