Approximating the subtree distance between phylogenies. C Semple
-
Upload
roderic-page -
Category
Technology
-
view
628 -
download
3
Transcript of Approximating the subtree distance between phylogenies. C Semple
Fixed-Parameter and Approximation Algorithms forthe Subtree Distance Between Phylogenies
Charles Semple
Biomathematics Research Centre
Department of Mathematics and Statistics
University of Canterbury
New Zealand
Joint work with Magnus Bordewich, Catherine McCartin
e-Science Institute, Edinburgh 2007
Subtree Prune and Regraft (SPR)
Example. r
Sda cb
1 SPR
bdcaT2
r
daT1
cb
r
1 SPR
Applications of SPR
Used
I. As a search tool for selecting the best tree inreconstruction algorithms.
II. To quantify the dissimilarity between two phylogenetictrees.
III. To provide a lower bound on the number of reticulationevents in the case of non-tree-like evolution.
For II and III, one wants the minimum number of SPR operations totransform one phylogeny into another.
This number is the SPR distance between two phylogenies S and T.
The Mathematical Problem
MINIMUM SPRInstance: Two rooted binary phylogenetic trees S and T.Goal: Find a minimum length sequence of single SPR operations that
transforms S into T.Measure: The length of the sequence.
Notation: Use dSPR(S, T) to denote this minimum length.
Theorem (Bordewich, S 2004)MINIMUM SPR is NP-hard.
Overriding goal is to find (with no restrictions) the exact solution or aheuristic solution with a guarantee of closeness.
Algorithms for NP-Hard Problems
Fixed-parameter algorithms are a practical way to find optimalsolutions if the parameter measuring the hardness of the problemis small.
Polynomial-time approximation algorithms can efficiently find feasiblesolutions that are sometimes arbitrarily close to the optimalsolution.
Agreement Forests
A forest of T is a disjointcollection of phylogeneticsubtrees whose union of leafsets is X U r.
Example.
dcbaS
e f
r
r
dcbaF1
e f
Agreement Forests
A forest of T is a disjointcollection of phylogeneticsubtrees whose union of leafsets is X U r.
Example.
dcbaS
e f
r
r
dcbaF1
e f
Agreement Forests
An agreement forest for S and T is a forest of both S and T.
Example.
r
dcbaF1
e f dcbaF2
e f
r
afedT
b c
r
dcbaS
e f
r
Agreement Forests
An agreement forest for S and T is a forest of both S and T.
Example.
dcbaS
e f
r
r
dcbaF1
e f
afedT
b c
r
dcbaF2
e f
r
Agreement Forests
An agreement forest for S and T is a forest of both S and T.
Example.
dcbaS
e f
r
r
dcbaF1
e f
afedT
b c
r
dcbaF2
e f
r
Theorem. (Bordewich, S, 2004)Let S and T be two binary phylogenetic trees. Then
dSPR(S,T) = size of maximum-agreement forest - 1.
o It’s fast to construct from a maximum-agreement forest for Sand T a sequence of SPR operations that transforms S into T.
Reducing the Size of the Instance
Subtree Reduction Chain Reduction
Fixed-Parameter Algorithms
The underlying idea is to find an algorithm whose running timeseparates the size of the problem instance from the parameter ofinterest.
One way to obtain such an algorithm is to reduce the size of theproblem instance, while preserving the optimal value (kernalizingthe problem).
Are the subtree and chain reductions enough to kernalize theproblem?
Fixed-Parameter Algorithms
Lemma. If n’ denotes the size of the leaf sets of the fully reducedtrees using the subtree and chain reductions, then
n’ < 28dSPR(S,T).
Corollary. (Bordewich, S 2004) MINIMUM SPR is fixed-parametertractable.
1. Repeatedly apply the subtree and chain rules.2. Exhaustively find a maximum-agreement forest by deleting edges
in S and comparing with T.
Running time is O((56k)k + p(n)) compared with O((2n)k), wherek=dSPR(S,T) and p(n) is the polynomial bound for reducing thetrees using the subtree and chain reductions.
Fixed-Parameter Algorithms
A second way to obtain a fixed-parameter algorithm is using themethod of bounded search trees.
Preliminaries
o A rooted triple is a binary tree with three leaves.
o A tree S displays ab|c if the last common ancestor of ab is adescendant of the last common ancestor of ac.
cbaab|c
Fixed-Parameter Algorithms
Observation. If F is an agreementforest for S and T, then F canbe obtained from S bydeleting |F|-1 edges.
Recall. If F is maximum, then|F|-1=dSPR(S,T).
dcbaS
e f
r
afedT
b c
r
Fixed-Parameter Algorithms
Observation. If F is an agreementforest for S and T, then F canbe obtained from S bydeleting |F|-1 edges.
Recall. If F is maximum, then|F|-1=dSPR(S,T).
dcbaS
e f
r
afedT
b c
r
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
ea eb
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
er
ea eb
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
er
edea eb
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
er
edea eb
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
er
edea eb
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
er
ea eb
ee
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
er
ea eb
ee
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
er
edea eb
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
er
edea eb
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
vst
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
vst
et
es
ed
eaeb
er
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
vst
et
es
ed
eaeb
eres
et
Fixed-Parameter Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Branch into 4 computationalpaths ex, ey, ez, er, and repeatfor each path until no suchtriple.
3. For each path, find a pair ofoverlapping components ts and ttand branch into 2 paths es, et.Repeat until no suchcomponents.
dcbaS
e f
r
afedT
b c
r
ed
eaeb
eres
et
Fixed-Parameter Algorithms
Key Lemma. dSPR(S,T) <= k if and onlyif one of the computationalpaths of length at most kresults in an agreement forestfor S and T.
Theorem. (Bordewich, McCartin, S2007)
The bounded search tree algorithmhas running time O(4kn4).
Combining the methods ofkernalization and boundedsearch gives an FPT algorithmwith running time O(4kk4+p(n)).
dcbaS
e f
r
afedT
b c
r
ed
eaeb
eres
et
Approximation Algorithms
Polynomial-time approximation algorithms can efficiently find feasiblesolutions that are sometimes arbitrarily close to the optimalsolution.
An r-approximation algorithm A for MINIMUM SPR means that thesize of an agreement forest (minus 1) for S and T returned by A isat most r times dSPR(S,T).
Example. If r=3, then the size of any agreement forest (minus 1) for Sand T returned by A is at most 3 times dSPR(S,T).
Based on a pairwise comparison, Bonet, St. John, Mahindru, Amenta2006 establish a 5-approximation algorithm (O(n)) for MINIMUMSPR.
Approximation Algorithms
1. Find a minimal triple xy|z in Snot in T.
2. Delete each of ex, ey, ez, er, andrepeat until no such triple.
3. Find a pair of overlappingcomponents ts and tt and deletees, et. Repeat until no suchcomponents.
Theorem. (Bordewich, McCartin, S2007)
This gives a 4-approximation alg. forMINIMUM SPR (O(n5)).
Deleting only ex, ez, er gives a 3-approximation algorithm.
dcbaS
e f
r
afedT
b c
r
ed
eaeb
eres
et
Exactalgorithmsr=1
Maximumindependentset in planargraphs (PTAS)
1. Travellingsalesmanproblem
2. Maximumclique
No r, unlessP=NP
MINIMUM SPR
(B, S 07)1 + 1/2112
(B, M,S 07)3
5(B S,M, A 06)
Modelling Hybridization Events with SPR Operations
Reticulation processes causespecies to be a composite ofDNA regions derived fromdifferent ancestors.
Processes include
o horizontal gene transfer,
o hybridization, and
o recombination.
… molecular phylogeneticists willhave failed to find the `truetree’, not because theirmethods are inadequate orbecause they have chosenthe wrong genes, butbecause the history of lifecannot be properlyrepresented as a tree.
Ford Doolittle, 1999(Dalhousie University)
Modelling Hybridization Events with SPR Operations
A single SPR operation models a single hybridization event (Maddison1997).
Example.r
Sda cb da
Tcb
r
r
Hda cb
Modelling Hybridization Events with SPR Operations
A single SPR operation models a single hybridization event (Maddison1997).
Example.r
Sda cb da
Tcb
r
r
Hda cb
Modelling Hybridization Events with SPR Operations
A single SPR operation models a single hybridization event (Maddison1997).
Example.r
Sda cb da
Tcb
r
r
Hda cb
Modelling Hybridization Events with SPR Operations
A single SPR operation models a single hybridization event (Maddison1997).
Example.r
Sda cb da
Tcb
r
deletinghybrid edges
r
Hda cb
F
r
da cb
A Fundamental Problem for Biologists
Given an initial set of data that correctly repesents the tree-likeevolution of different parts of various species genomes,
what is the smallest number of reticulation events required thatsimultaneously explains the evolution of the species?
This smallest number
o Provides a lower bound on the number of such events.
o Indicates the extent that hybridization has had on theevolutionary history of the species under consideration.
Since 1930’s botantists have asked the question: How significant hasthe effect of hybridization been on the New Zealand flora?
Trees and Hybridization Networks
H explains T if T can be obtained from a rooted subtree of H bysuppressing degree-2 vertices.
Example.
dcbaS
dcbaH1
dbcaT
dcbaH2
Trees and Hybridization Networks
H explains T if T can be obtained from a rooted subtree of H bysuppressing degree-2 vertices.
Example.
dcbaS
dcbaH1
dbcaT
dcbaH2
Trees and Hybridization Networks
H explains T if T can be obtained from a rooted subtree of H bysuppressing degree-2 vertices.
Example.
dcbaS
dcbaH1
dbcaT
dcbaH2
The Mathematical Problem
MINIMUM HYBRIDIZATION
Instance: Two rooted binary phylogenetic trees S and T.
Goal: Find a hybridization network H that explains S and T, andminimizes the number of hybridization vertices.
Measure: The number of hybridization vertices in H.
Notation: Use h(S, T) to denote this minimum number.
Example: Arbitrary SPR operations not sufficient.
r
dcbaF1
e f
o A sequence of SPR operations that avoids creating directed cyclesto make a hybridization network that explains S and T.
o If one minimizes the length of an (acyclic) sequence, does theresulting network minimize the number of hybridization events toexplain S and T?
o YES, and such a sequence corresponds to an acyclic-agreementforest.
Theorem. (Baroni, Grünewald, Moulton, S, 2005)Let S and T be two binary phylogenetic trees. Then
h(S,T) = size of maximum-acyclic agreement forest - 1.
o It’s fast to construct from a maximum-acyclic agreement for Sand T a hybridization network that realizes h(S,T).
Theorem. (Bordewich, S, 2007)
MINIMUM HYBRIDIZATION is NP-hard.
Reducing the Size of the Instance
Subtree Reduction Chain Reduction
Fixed-Parameter Algorithms
Are the subtree and chain reductions enough to kernalize theproblem?
Lemma. If n’ denotes the size of the leaf sets of the fully reducedtrees using the subtree and chain reductions, then
n’<14h(S,T).
Corollary. (Bordewich, S 2007) MINIMUM HYBRIDIZATION isfixed-parameter tractable.
Running time is O((28k)k + p(n)) compared with O((2n)k), wherek=h(S,T) and p(n) is the polynomial bound for reducing the treesusing the subtree and chain reductions.
Reducing the Size of the Instance
Cluster Reduction (Baroni 2004)
+
A Grass (Poaceae) Dataset (Grass Phylogeny Working Group,Düsseldorf)
o Ellstrand, Whitkus, Rieseburg 1996 (Distribution of spontaneousplant hybrids)
o For each sequence, used fastDNAml to reconstruct a phylogenetictree (H. Schmidt).
Chloroplast (phyB) sequences Nuclear (ITS) sequences
Nuclear (ITS) sequencesChloroplast (phyB) sequences
Chloroplast (phyB) sequences Nuclear (ITS) sequences
Chloroplast (phyB) sequences Nuclear (ITS) sequences
Nuclear (ITS) sequencesChloroplast (phyB) sequences
Nuclear (ITS) sequencesChloroplast (phyB) sequences
Nuclear (ITS) sequencesChloroplast (phyB) sequences
h=3
h=1
h=4
h=0
620s815ITSwaxy
at least 1031ITSrpoC2
1s110waxyrpoC2
at least 929ITSrbcL
230s712waxyrbcL
29.5h1326rpoC2rbcL
19s830ITSphyB
1s314waxyphyB
180s721rpoC2phyB
1s421rbcLphyB
at least 1546ITSndhF
320s919waxyndhF
26.3h1234rpoC2ndhF
11.8h1336rbcLndhF
11h1440phyBndhF
running time2000 MHz
CPU, 2GB RAM
h(S,T)# overlappingtaxa
pairwise combination
Bordewich, Linz, St John, S, 2007
Computing dSPR(S,T) and h(S,T)
dSPR(S,T)
1. FPT using kernalization andbounded search treemethods (O(4kk4+p(n))).
2. No cluster-based reduction.
3. 3-approximation algorithm.
h(S,T)
1. FPT using kernalization only(O((28k)k +p(n))). Unknown ifa bounded search treemethod exists.
2. Cluster-based reduction.
3. Unknown if there even is anapproximation algorithm.
Reducing the Size of the Instance
Subtree Reduction Chain Reduction
an
a2
a1
S
an
a2
a1
Tx2
x1
S’
x3
x2
x1
T’
x3