Comparing Phylogenies

Norbert ZehDalhousie University

Kernelization, Depth-Bounded Search and Beyond

Phylogenetic Trees

Halophiles

Methanosarcina

Methanobacterium

Methanococcus

T. celerThermoprot.

Pyrodicticum

EntamoebaeSlimemolds Animals

PlantsCiliates

Flagellates

Trichomonads

Microsporidia

Diplomonads

GreenFilamentous

bacteriaSpirochetesGram

positives

Proteobacteria

Cyanobacteria

Planctomyces

BacteroidesCytophaga

Thermotoga

Aquifex

Archaea EucaryotaBacteria

Norbert Zeh

Reticulation Events

Lateral gene transfer (subtree prune-and-regraft)

a b c d e

Norbert Zeh

Reticulation Events

a b c d e

Norbert Zeh

Reticulation Events

a b d eca b c d e

Norbert Zeh

Reticulation Events

Hybridization

a b d eca b c d e

a b c d eComparing Phylogenies

Norbert Zeh

Reticulation Events

Hybridization

a b d eca b c d e

Norbert Zeh

Reticulation Events

Hybridization

a b d eca b c d e

Norbert Zeh

Reticulation Events

Zeina the ZonkeyOwklawn Farm Zoo, Nova Scotia

Hybridization

a b d eca b c d e

Norbert Zeh

Reconciling Phylogenies

1. Tree distances

• SPR distance: number of SPR operations to transform one tree into theother

NP-hard [Bordewich/Semple 2005]

• Hybridization number: minimum number of nodes with two parents inany network that displays both trees

NP-hard [Bordewich/Semple 2007]

• Robinson-Foulds distance: number of bipartitions that disagree

Linear-time, but . . .

a b c d e f

fe dcba

a b c d e f

f edc ba

Norbert Zeh

2. Supertrees

• MRP supertrees [Ragan 1992]• RF supertrees [Bansal et al. 2010]• SPR supertrees [Whidden/Zeh/Beiko 2012]• . . .

c ge f

a b c d

c ge f

Norbert Zeh

2. Supertrees

c ge f

a b c d

Norbert Zeh

2. Supertrees

c ge f

a b c d

Norbert Zeh

2. Supertrees

c ge f

a b c d

c ge f

Norbert Zeh

2. Supertrees

3. Phylogenetic networks

• DLT networks [Hallet/Lagergren 2011, Doyon et al. 2011]• Recombination networks [Gusfield et al. 2003]• Level-k hybridization networks [van Iersel/Kelk 2011]• MAAF of multiple trees [Chen/Wang 2012]

c ge f

a b c d

c ge f

Norbert Zeh

Computing SPR Distance

Norbert Zeh

Agreement Forests

How many SPR operations does ittake to turn T1 into T2?

a b c d e f

f edcba

a f d b c e

Norbert Zeh

Agreement Forests

[Bordewich/Semple 2005]

What is the largest forest we canobtain from T1 and T2 using edgedeletions and forced contractions?

a b c d e f

f edcba

a f d b c e

Norbert Zeh

Agreement Forests

a b c d e f

f edcba

a f d b c e

Norbert Zeh

Agreement Forests

a b c d e f

f edcba

a f d b c e

What is the smallest hybridizationnetwork that displays both T1 and T2?

a b c d e f

fedcb a

afd bc e

Norbert Zeh

Agreement Forests

a b c d e f

f edcba

a f d b c e

a b c d e f

fedcb a

afd bc e

Norbert Zeh

Agreement Forests

a b c d e f

f edcba

a f d b c e

a b c d e f

fedcb a

afd bc e

Norbert Zeh

Agreement Forests

a b c d e f

f edcba

a f d b c e

What is the largest acyclic agreementforest of T1 and T2?

a b c d e f

fedcb a

afd bc e

Norbert Zeh

Agreement Forests

a b c d e f

f edcba

a f d b c e

a b c d e f

fedcb a

afd bc e

agreement forest

a b c d e f

Norbert Zeh

Agreement Forests

a b c d e f

f edcba

a f d b c e

a b c d e f

fedcb a

afd bc e

agreement forest

a b c d e f

Norbert Zeh

Agreement Forests

a b c d e f

f edcba

a f d b c e

a b c d e f

fedcb a

afd bc e

agreement forest

a b c d e f

acyclicagreement forest

a b c d e f

Norbert Zeh

Agreement Forests

a b c d e f

f edcba

a f d b c e

a b c d e f

fedcb a

afd bc e

agreement forest

a b c d e f

acyclicagreement forest

a b c d e f

Norbert Zeh

Kernelization for Maximum Agreement Forest (SPR Distance)

Rule 1: Prune agreeing subtrees

Rule 2: Compress agreeing chains

Running time: O((56k)k + poly(n)) [Bordewich/Semple 2005]

c d e f c d e f

⇒ A A

Norbert Zeh

Depth-Bounded Search for MAF [Whidden/Zeh 2009]

An MAF of T1 and T2 can be obtained by cutting edges in F2.

Norbert Zeh

Case 1: A whole tree in F2 agrees with a subtree of T1

Norbert Zeh

Case 1: A whole tree in F2 agrees with a subtree of T1

Case 2: Two agreeing subtrees are adjacent in T1 and F2

Norbert Zeh

Case 3: Subtrees A and B are adjacent in T1 but not in F2

One branch per edge

a′ b′

Norbert Zeh

Case 3: Subtrees A and B are adjacent in T1 but not in F2

• Number of recursive calls = 3k

• Each costs O(n) time

Running time: O(3kn)

One branch per edge

a′ b′

Norbert Zeh

Improved Branching Rules [Whidden/Beiko/Zeh 2010]

Case 3.1: a and b belong to different subtrees of F2

2 recursive callswith parameterk− 1

a′ c′

Norbert Zeh

Case 3.1: a and b belong to different subtrees of F2

Case 3.2: One pendant subtree on path from a to b in F2

2 recursive callswith parameterk− 1

1 recursive callwith parameterk− 1

a′ c′

a′ b′

Norbert Zeh

Case 3.3: m≥ 2 pendant subtrees on path from a to b in F2

3 recursive callswith parameters• k− 1• k− 1• k′ ≤ k− 2a c

a′ c′

Norbert Zeh

Case 3.3: m≥ 2 pendant subtrees on path from a to b in F2

3 recursive callswith parameters• k− 1• k− 1• k′ ≤ k− 2

Number of recursive invocations

I(k)≤ 2I(k− 1) + I(k− 2)≤ (1+p

2)k ≈ 2.41k

a′ c′

Norbert Zeh

Multifurcating Trees [Whidden/Beiko/Zeh 2012]

Problem 1: What is a meaningful definition of an agreement forest?

Norbert Zeh

An MAF of two multifurcating phylogenies T1 and T2 is the largest forest thatis an AF of two binary resolutions of T1 and T2.

Norbert Zeh

An MAF of two multifurcating phylogenies T1 and T2 is the largest forest thatis an AF of two binary resolutions of T1 and T2.

Problem 2: Sibling pairs become sibling groups.

a1 a2 a3 a4 a1 a2 a3

Norbert Zeh

It’s FPT, alright . . .

• 5 cases depending on the structure of F2• The worst: I(k) = 1+ 2I(k− 1) + 3I(k− 2)

Norbert Zeh

. . . but it ain’t fast.

Norbert Zeh

a1 a2 a3 a4 a1 a2 a3

B1,1 B3,1B2,1

Norbert Zeh

• Until the protected edges are eliminated, every recursive call becomes a2-way branch.

• Each such sequence of 2-way branches ends in a “1-way branch”.

Running time: O(2.42kn)

a1 a2 a3 a4 a1 a2 a3

B1,1 B3,1B2,1

Norbert Zeh

Binary Trees Even Faster [Whidden/Beiko/Zeh 2012]

• Edge protection idea from the multifurcating algorithm

• A couple of new cases

• A hairy analysis

Running time: O(2kn)

Norbert Zeh

Clustering [Linz/Semple 2009]

An MAF of the two input trees can be computed by computing MAFs of theclusters . . . with a twist.

e f g h

iα1 α2 d

iα1α2

a b c d e f

g h i a bcd e f

Norbert Zeh

Branch and Bound

• For each invocation, compute 3-approximation k′ of number of edges leftto be cut.

• If k′ > 3k, abort.

Added cost per invocation: O(n) [Whidden/Zeh 2009]

k′ > 3k

not explored

Norbert Zeh

Experimental Results

Norbert Zeh

Hybridization [Whidden/Beiko/Zeh 2012]

Observation: While F2 is not an AF of T1 and T2, at least one of the branchesin each case of the MAF algorithm makes progress towards an MAAF.

Norbert Zeh

Case 3.2′: One pendant subtree on path from a to b in F2

2 recursive callswith parameterk− 1a c

a′ c′

a′ b′

Norbert Zeh

Case 3.2′: One pendant subtree on path from a to b in F2

2 recursive callswith parameterk− 1a c

a′ c′

a′ b′

Once an AF is obtained, cut edges to eliminate cycles.

Norbert Zeh

Cycle graph

a b c d e f g h i

a b c d e f gh i

Norbert Zeh

Breaking cycles

• 2k edges between components• For each, may need to eliminate the path to the root of the parent

component

⇒ O(22k · 2.42kn) = O(9.68kn) time

Norbert Zeh

Breaking cycles

• 2k edges between components• For each, may need to eliminate the path to the root of the parent

component

⇒ O(22k · 2.42kn) = O(9.68kn) time

Reducing the number of candidate edges

• Can get away with considering only k of the 2k edges

⇒ O(2k · 2.42kn) = O(4.84kn) time

Norbert Zeh

A better analysis

• If the AF has k′ ≈ k edges, the refinement step considers� k

k−k′�

� 2k

choices• If the AF has k′ ≈ 0 edges, the refinement step considers at most 2k′ � 2k

choices• If the AF has k′ ≈ k/2 edges, the refinement step considers

� kk−k′�

≈ 2k

choices, but this situation can arise only 2.42k′ � 2.42k times

⇒ O(3.18kn) time

Norbert Zeh

Application: SPR Supertrees

Norbert Zeh

SPR Supertrees [Whidden/Zeh/Beiko 2012]

Open problem: Computational complexity of computing an optimal SPRsupertree.

Norbert Zeh

Heuristic

• Build up initial supertree• Iterative improvement using SPR operations

Norbert Zeh

Heuristic

Initial tree construction

• Start with 4-leaf tree consistent with one of the gene trees• Attach one leaf at a time• For each leaf, choose the location that minimizes SPR distance

Norbert Zeh

Heuristic

Initial tree construction

• Start with 4-leaf tree consistent with one of the gene trees• Attach one leaf at a time• For each leaf, choose the location that minimizes SPR distance

Iterative improvement

• Try all O(n2) SPR operations on current supertree and choose the one thatminimizes the SPR distance from gene trees

Norbert Zeh

Limit number of SPR moves to consider

• Consider only SPR operations across r = O(1) edges⇒ O(n) moves

⇒ O(tn) exact SPR computations

• Rank moves based on approximate SPR distance of resulting tree to genetrees

• Try moves in this order and choose the first one that gives an improvement

⇒ O(tn2) approximate SPR computations + O(t) exact SPR computations

Norbert Zeh

Limit number of SPR moves to consider

• Consider only SPR operations across r = O(1) edges⇒ O(n) moves

⇒ O(tn) exact SPR computations

• Rank moves based on approximate SPR distance of resulting tree to genetrees

• Try moves in this order and choose the first one that gives an improvement

⇒ O(tn2) approximate SPR computations + O(t) exact SPR computations

MAF-driven improvements

• In each iteration, every gene tree initiates one SPR move on supertree thatreduces its distance by one

• Choose this move using the MAF of gene tree and supertree

⇒ t exact SPR computations

Norbert Zeh

Conclusions

Norbert Zeh

Ongoing and Future Work

Faster supertree search

• FPT approximation to handle really large trees

Norbert Zeh

Faster M(A)AF algorithms

• Substantially break the 2k barrier to handle trees with 1,000s of leaves

Norbert Zeh

Faster M(A)AF algorithms

• Substantially break the 2k barrier to handle trees with 1,000s of leaves

Compute all M(A)AFs [Abrecht et al. 2012]

• Provide more biological insight

Norbert Zeh

Comparing Phylogenies

Documents

Transcript of Comparing Phylogenies

BME 130 – Genomes Lecture 26 Molecular phylogenies I.

Macroevolution Part I: Phylogenies

Chapter 12 (Strikberger) Molecular Phylogenies and ... · EEES 4150/5150, Evolution Sigler 1 Chapter 12 (Strikberger) Molecular Phylogenies and Evolution METHODS FOR DETERMINING PHYLOGENY

Information Theoretic Approach to Whole Genome Phylogenies

On the Hardness of Inferring Phylogenies from Triplet-Dissimilarities

Evaluating the Fossil Record with Model Phylogenies

Webb Et Al 2002 Phylogenies and Community Ecology

A Statistical Test of Phylogenies Estimated from Sequence Datachagall.med.cornell.edu/BioinfoCourse/PDFs/Lecture6/li1989.pdf · A Statistical Test of Phylogenies Estimated from Sequence

Molecular phylogenies challenge the classification of ... · ORIGINAL ARTICLE Molecular phylogenies challenge the classification of Polymastiidae (Porifera, Demospongiae) based on

Chapter 15 Phylogenies and Classifying Diversity.

Chapter 4 Graphical Methods for Visualizing Comparative ... · Graphical Methods for Visualizing Comparative Data on Phylogenies Liam J. Revell Abstract Phylogenies have emerged as

Lec 10 Phylogenies And Origin Of Behaviors

Phylogenies & Classifying species (AKA Cladistics & Taxonomy)

1 Austronesian language phylogenies: myths and ...indoeuropean.wdfiles.com/local--files/abstract/SS2017_Gray... · 1 Austronesian language phylogenies: myths and misconceptions about

Introduction to Phylogenies Dr Laura Emery Laura.Emery@ebi.ac.uk .

Visualing phylogenies: a personal view

Approximating the subtree distance between phylogenies. C Semple

25 Reconstructing and Using Phylogenies. 25 Phylogenetic Trees Steps in Reconstructing Phylogenies Reconstructing a Simple Phylogeny Biological Classification.

Phylogenies and the Tree of Life

Recombination, Phylogenies and Parsimony 21.11.05