Post on 21-Dec-2015
A Fully Resolved Consensus Between Fully Resolved
Phylogenetic Trees
José Augusto Amgarten Quitzau (Scylla Bioinformatics, Brazil)
João Meidanis (University of Campinas, Brazil)
Phylogeny reconstruction methods
Phylogeny reconstruction methods aim at inferring the phylogenetic tree that best describes the evolutionary history for a set of taxa.
Which tree to choose?
“The field of systematics has been in considerable turmoil as various investigators developed different methods of classification and argued their merits. I guarantee you that no one method or view has all the good points.”
Walter M. Fitch – 1984
Consensus as tree constructor
Consensus trees have been used traditionally in tree comparison and calculation of bootstrap values
We propose the use of consensus as a tree constructor
It can be efficiently implemented as long as we keep trees fully resolved
Splits
Every edge in a phylogenetic tree divides the leaves into two subgroups.
Each such pair of subgroups is a split of the tree.
[Figure: example tree with leaves A, B, C, D, E, F, G, H]
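The definition of a split can be sketched in a few lines of Python. This is only an illustration: the nested-tuple encoding of the tree and the function names `leaves` and `splits` are assumptions made here, not part of the original method.

```python
def leaves(node):
    """Return the frozenset of leaf labels below a nested-tuple node."""
    if isinstance(node, str):
        return frozenset([node])
    return frozenset().union(*(leaves(child) for child in node))


def splits(tree):
    """Return all splits of an unrooted tree written as nested tuples.

    Each split is a frozenset holding its two complementary subgroups.
    """
    all_leaves = leaves(tree)
    found = set()

    def visit(node):
        group = leaves(node)
        if group != all_leaves:  # the top-level tuple is a node, not an edge
            found.add(frozenset([group, all_leaves - group]))
        if not isinstance(node, str):
            for child in node:
                visit(child)

    visit(tree)
    return found


# Hypothetical encoding of an eight-leaf tree with leaves A to H.
example = ((("A", "B"), ("C", "D")), (("E", "F"), "G"), "H")
example_splits = splits(example)
print(len(example_splits))  # 13 edges, hence 13 splits
```

A fully resolved unrooted tree with l leaves has 2l - 3 edges, which gives the 13 splits counted for these eight leaves.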
Tree weight
Our method relies on weighing trees and taking the one with maximum weight
Let the frequency of a split in a collection of trees be the number of trees which contain the split divided by the total number of trees in the collection
Let the weight of an unrooted phylogenetic tree be the product of the frequencies of its splits
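As a minimal sketch of these two definitions, the Python below computes split frequencies and tree weights for a toy collection of four-leaf trees. Representing a tree as a set of splits, the helper `split`, and the toy collection itself are assumptions chosen for illustration.

```python
from collections import Counter
from math import prod

LEAVES = frozenset("ABCD")


def split(side):
    """Build a split from one of its sides (hypothetical helper)."""
    side = frozenset(side)
    return frozenset([side, LEAVES - side])


def split_frequencies(trees):
    """Fraction of trees in the collection that contain each split."""
    counts = Counter(s for tree in trees for s in tree)
    return {s: n / len(trees) for s, n in counts.items()}


def weight(tree, freq):
    """Product of the frequencies of the tree's splits (0 if a split is unseen)."""
    return prod(freq.get(s, 0.0) for s in tree)


# The three fully resolved trees on four leaves, as sets of splits.
trivial = {split(x) for x in LEAVES}
t_ab = trivial | {split("AB")}
t_ac = trivial | {split("AC")}
t_ad = trivial | {split("AD")}

# Toy input: three copies of AB|CD and one copy of AC|BD.
freq = split_frequencies([t_ab, t_ab, t_ab, t_ac])
best = max([t_ab, t_ac, t_ad], key=lambda t: weight(t, freq))
print(weight(t_ab, freq), weight(t_ac, freq), weight(t_ad, freq))
```

In this toy input every trivial split has frequency 1, so each weight is simply the frequency of the tree's single internal split.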
Most probable tree
A most probable tree for a collection of fully resolved phylogenetic trees is a tree that maximizes the weight.
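In symbols, the definitions of frequency, weight, and most probable tree can be written as follows. The notation is an assumption introduced here: $T_1, \dots, T_t$ are the input trees, $S(T)$ is the set of splits of a tree $T$, $f(s)$ is the frequency of split $s$, and the maximum is taken over fully resolved trees on the same leaves.

```latex
f(s) = \frac{\bigl|\{\, i : s \in S(T_i) \,\}\bigr|}{t},
\qquad
w(T) = \prod_{s \in S(T)} f(s),
\qquad
T^{*} \in \operatorname*{arg\,max}_{T} \; w(T)
```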
Example
Solution
w = 0.0703125
Running time
The tree weight formula can be written as a product of the frequencies of the small subgroups
We designed an algorithm that finds all most probable trees for a given set of fully resolved phylogenetic trees
The complexity of the algorithm is $O(l^3 t^2 \log(lt))$, where l is the number of leaves and t is the number of trees
Experiments
Data sets used to test the new method:
Synthetic data: from Gascuel’s LIRMM site
K2P – Kimura 2 Parameter, no MC
K2Pm – Kimura 2 Parameter, with MC
COV – Covarion model, no MC
COVm – Covarion model, with MC
Real data: Ribosomal RNA
Experiments
Programs used to test the new method (19):
Software   Method                      Model
fastMe     Minimum evolution           JC, K2P
Mega       Minimum evolution           JC, K2P, TN
Mega       Maximum parsimony
Mega       Neighbor joining            JC, K2P, TN
dnacomp    DNA compatibility
dnaml      Maximum likelihood
dnapars    Maximum parsimony
neighbor   Neighbor joining            JC, K2P
neighbor   UPGMA                       JC, K2P
weighbor   Weighted neighbor joining   JC, K2P
Most probable = Median
Reflects general tendency
Results: average split distance
Data set    Minimum Distance
K2P         43.44
K2Pm        77.78
COV         52.67
COVm        69.11
Ribosomal   60.71
Consensus consistently yields minimum average split distance
May result in better tree
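The average split distance can be illustrated with a short Python sketch. It assumes that the split distance between two trees is the number of splits found in exactly one of them (a Robinson-Foulds style count); the slides do not state the exact definition or normalization behind the table values, and the toy trees and helper names below are hypothetical.

```python
LEAVES = frozenset("ABCD")


def split(side):
    """Build a split from one of its sides (hypothetical helper)."""
    side = frozenset(side)
    return frozenset([side, LEAVES - side])


def split_distance(tree_a, tree_b):
    """Number of splits present in exactly one of the two trees."""
    return len(tree_a ^ tree_b)


def average_distance(candidate, trees):
    """Mean split distance from a candidate tree to a collection of trees."""
    return sum(split_distance(candidate, t) for t in trees) / len(trees)


# Trees on four leaves, given by their internal splits only; trivial
# splits are shared by all trees and do not affect the distance.
t_ab = frozenset([split("AB")])
t_ac = frozenset([split("AC")])
t_ad = frozenset([split("AD")])

collection = [t_ab, t_ab, t_ab, t_ac]
for name, tree in [("AB|CD", t_ab), ("AC|BD", t_ac), ("AD|BC", t_ad)]:
    print(name, average_distance(tree, collection))
```

In this toy collection the tree AB|CD, which carries the most frequent split, also has the smallest average distance to the inputs.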
Results: distance to “real” tree
Data set    Consensus Not Worse Than ...
K2P         72 %
K2Pm        39 %
COV         78 %
COVm        72 %
Ribosomal   100 %
Consensus consistently no worse than the majority of input trees
… of input trees
Theoretical foundations
[Figure: example tree with leaves A, B, C, D, E, F, G, H]
All splits of a tree
[Figure: example tree with leaves A, B, C, D, E, F, G, H]
A | BCDEFGH
B | ACDEFGH
AB | CDEFGH
C | ABDEFGH
D | ABCEFGH
H | ABCDEFG
G | ABCDEFH
F | ABCDEGH
E | ABCDFGH
CD | ABEFGH
EF | ABCDGH
EFG | ABCDH
ABCD | EFGH
Small subgroup of each split
[Figure: example tree with leaves A, B, C, D, E, F, G, H; in each split below, the small subgroup is the one on the left]
A | BCDEFGH
B | ACDEFGH
AB | CDEFGH
C | ABDEFGH
D | ABCEFGH
H | ABCDEFG
G | ABCDEFH
F | ABCDEGH
E | ABCDFGH
CD | ABEFGH
EF | ABCDGH
EFG | ABCDH
ABCD | EFGH
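Picking the small subgroup of each split could be sketched as below. The tie-breaking rule is an assumption: when both sides have the same size, the sketch takes the alphabetically first side, which is consistent with ABCD being chosen for ABCD | EFGH but is not stated in the slides.

```python
def small_subgroup(split):
    """Return the smaller side of a split given as a pair of sorted strings.

    Ties are broken alphabetically (an assumption, see the note above).
    """
    side_a, side_b = split
    return min(side_a, side_b, key=lambda side: (len(side), side))


# The 13 splits of the example tree; the last one is written with its
# sides swapped to exercise the tie-breaking rule.
SPLITS = [
    ("A", "BCDEFGH"), ("B", "ACDEFGH"), ("AB", "CDEFGH"),
    ("C", "ABDEFGH"), ("D", "ABCEFGH"), ("H", "ABCDEFG"),
    ("G", "ABCDEFH"), ("F", "ABCDEGH"), ("E", "ABCDFGH"),
    ("CD", "ABEFGH"), ("EF", "ABCDGH"), ("EFG", "ABCDH"),
    ("EFGH", "ABCD"),
]

small = [small_subgroup(s) for s in SPLITS]
print(small)
```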
Small subgroups
A, B, AB, C, D, H, G, F, E, CD, EF, EFG, ABCD
Maximal clusters (n-trees)
[Figure: the small subgroups A, B, AB, C, D, H, G, F, E, CD, EF, EFG, ABCD arranged as n-trees]
Fundamental theoretical result
[Figure: the small subgroups A, B, AB, C, D, H, G, F, E, CD, EF, EFG, ABCD arranged as n-trees]
● The small subgroup set of a phylogenetic tree is always a finite set of n-trees
● There are exactly three n-trees in this set, and all n-trees are maximal if and only if the phylogenetic tree is fully resolved
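The result above can be checked on the example subgroups with a small Python sketch. It assumes that an n-tree here means a nested family of clusters (any two clusters are either disjoint or one contains the other) sitting under one maximal cluster; the function names are hypothetical.

```python
SMALL_SUBGROUPS = ["A", "B", "AB", "C", "D", "H", "G", "F", "E",
                   "CD", "EF", "EFG", "ABCD"]


def is_nested(groups):
    """True if any two groups are disjoint or one contains the other."""
    sets = [frozenset(g) for g in groups]
    return all(a.isdisjoint(b) or a <= b or b <= a
               for a in sets for b in sets)


def maximal_clusters(groups):
    """Groups that are not strictly contained in any other group."""
    sets = [frozenset(g) for g in groups]
    return sorted("".join(sorted(s)) for s in sets
                  if not any(s < other for other in sets))


print(is_nested(SMALL_SUBGROUPS))         # True
print(maximal_clusters(SMALL_SUBGROUPS))  # ['ABCD', 'EFG', 'H']
```

For the fully resolved example tree the subgroups fall under exactly three maximal clusters, in line with the statement above.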
Implementation details
[Figure: subgroups D, E, F, G, EF, GH, ABC]
Dynamic programming
[Figure: dynamic programming over the subgroups D, E, F, G, EF, GH, ABC]
Implementation details
[Figure: subgroups D, E, F, G, EF, GH, FGH, DEF, ABC, DE, and L \ ABC]
To Do List
Rooted trees
Polytomies
Non-uniform weights for input trees
Acknowledgments
Scylla Bioinformatics and Institute of Computing, Unicamp, for machine time, infrastructure, and support
Brazilian Research Financing Agency CNPq, grant 470420/2004-9