Phylogenetic Signal with Induction and non-Contradiction - V Berry
CNRS - Université Montpellier 2, France
Phylogenetic Signal with Induction and non-Contradiction:
the PhySIC method for building supertrees
http://atgc.lirmm.fr/SuperTree/PhySIC
Vincent Berry (1), V. Ranwez (2), A. Criscuolo (1,2), P.-H. Fabre (2), S. Guillemot (1), C. Scornavacca (1,2), E.J.P. Douzery (2)
Funded by ACI IMPBIO & BIOSTIC LR
PhySIC: Phylogenetic Signal with Induction and non-Contradiction2
Introduction: use of supertrees
Supertrees are useful for:
- producing well-resolved large phylogenies that provide a framework for broad comparative studies (Gittleman et al 2004)
- quantitative studies of input-tree congruence, identifying outlier taxa by tree-supertree distance measures (Wilkinson et al 2004)
- exploring and identifying agreement and disagreement among sets of input trees. The aim is then to reveal conflicts rather than to resolve them. Conflicts are ultimately resolved from additional data or analyses (Wilkinson et al 2001)
- identifying where limited overlap between the leaf sets of the input trees is an obstacle to their amalgamation, thereby guiding further research (Sanderson et al 1996, Arné et al 2007)
Introduction: dealing with conflicts
Dealing with topological contradictions (“conflicts”) among source trees:
- Voting methods (MRP, MMC, CLANN, …) resolve conflicts based on a voting procedure (optimization approach)
- Veto methods (Strict Consensus, Build, SMAST) do not favor any resolution in case of conflict (consensus approach)
[Figure: two source trees on taxa A, B, C, D with conflicting topologies]
Veto methods
Proceed from an axiomatic approach: proposed supertrees satisfy specified theoretical properties.
Goal: obtain a reliable, if incomplete, picture of how the source trees fit together.
Motivation:
- Full congruence with the source trees can be necessary for further applications such as phylogeography, divergence time estimations, etc.
- Avoid as much as possible the inference of non-supported novel clades, unlike in some existing voting methods
Overview
- Some relevant properties for reliable inference
  - Decomposition of a tree into triplets
  - Identifying a tree
  - Property of Induction (PI)
  - Property of non-Contradiction (PC)
- Algorithms (sketch)
  - BUILD (Aho et al 81)
  - PhySIC_PC
  - PhySIC_PI
- Biological case study: Primate supertree
- Conclusion & prospects
Axiomatic approach: important properties
Police investigation    Supertree
The inspector           The supertree method
The witnesses           The source trees
The testimonies         Phylogenetic information contained within source trees
Deducing the true story:
- Pointing out contradictions in the testimonies
- Deducing new facts by cross-checking
Reliable facts are those that can be induced from testimonies and that are not incompatible with any other.
Decomposition of trees into building blocks
[Figure: source trees T1 (taxa a, b, c, d) and T2 (taxa b, c, d, e), each decomposed into its three-taxon subtrees]
tr(T1) = {bc|d, ac|d, ab|d, ab|c}
tr(T2) = {ed|c, eb|d, eb|c, bd|c}
Triplets (rooted triples): subtrees on 3 taxa
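The decomposition into triplets can be sketched in Python. This is a minimal illustration, not the PhySIC implementation: the nested-tuple encoding of trees, the `(frozenset({x, y}), z)` encoding of a triplet xy|z, and the names `leaves`, `triplets` and `show` are assumptions made for this example. T1 and T2 are assumed to be the caterpillar trees consistent with the triplets listed for tr(T1) and tr(T2).

```python
from itertools import combinations

def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(leaves(child) for child in tree))

def triplets(tree):
    """Return tr(tree) as a set of (frozenset({x, y}), z) pairs, read as xy|z."""
    if isinstance(tree, str):
        return set()
    result = set()
    child_leaves = [leaves(child) for child in tree]
    for i, child in enumerate(tree):
        # Triplets resolved strictly inside this child subtree.
        result |= triplets(child)
        # Triplets xy|z with x and y below this child and z below a sibling.
        outside = set().union(*(ls for j, ls in enumerate(child_leaves) if j != i))
        for x, y in combinations(sorted(child_leaves[i]), 2):
            for z in outside:
                result.add((frozenset((x, y)), z))
    return result

def show(ts):
    """Render triplets as sorted 'xy|z' strings for display."""
    return sorted("".join(sorted(pair)) + "|" + z for pair, z in ts)

T1 = ((("a", "b"), "c"), "d")  # assumed shape of T1
T2 = ((("e", "b"), "d"), "c")  # assumed shape of T2

print(show(triplets(T1)))  # ['ab|c', 'ab|d', 'ac|d', 'bc|d']
print(show(triplets(T2)))  # ['bd|c', 'be|c', 'be|d', 'de|c']
```

Three taxa that sit in three different child subtrees of a node form an unresolved fan, so the sketch emits no triplet for them.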
Properties of interest: identification
A tree T displays a set R of triplets iff R ⊆ tr(T). In such a case R is said to be compatible: all triplets of R can be combined into a tree.
R identifies T iff T displays R AND every tree T’ displaying R contains all the clades of T.
[Figure: for the tree T on taxa a, b, c, d, R = {bc|d, ab|c} identifies T, whereas R’ = {ab|c, ab|d} does not identify T]
Properties of interest: identification
R identifies T, yet R does not contain all triples of tr(T): additional triples are induced by those present in R.
[Figure: R = {bc|d, ab|c} and the tree T on taxa a, b, c, d that it identifies; ab|d and ac|d are induced]
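One way to make induction concrete is the following sketch. It relies on a characterization that is not stated on the slides and is used here as an assumption: for a compatible set R, xy|z is induced by R exactly when neither xz|y nor yz|x can be added to R without making it incompatible. Compatibility is tested with an Aho-style connected-components recursion. The helper names (`t`, `is_compatible`, `induces`) and the single-letter taxon encoding are hypothetical choices for this example.

```python
def t(text):
    """Parse 'ab|c' into (frozenset({'a', 'b'}), 'c'); taxa are single letters."""
    pair, z = text.split("|")
    return (frozenset(pair), z)

def is_compatible(triplets, taxa=None):
    """True iff some rooted tree displays every triplet (Aho-style recursion)."""
    if taxa is None:
        taxa = {x for pair, z in triplets for x in (*pair, z)}
    # Only triplets whose three taxa all lie in the current taxon set matter.
    inside = [(pair, z) for pair, z in triplets if pair <= taxa and z in taxa]
    if not inside:
        return True
    # Connect x and y for every triplet xy|z, tracking connected components.
    comp = {x: {x} for x in taxa}
    for pair, _ in inside:
        x, y = tuple(pair)
        if comp[x] is not comp[y]:
            merged = comp[x] | comp[y]
            for v in merged:
                comp[v] = merged
    components = list({id(c): c for c in comp.values()}.values())
    if len(components) == 1:
        return False  # the taxa cannot be split at the root
    return all(is_compatible(inside, c) for c in components)

def induces(R, triplet):
    """True iff compatible R induces xy|z (both alternatives are incompatible)."""
    pair, z = triplet
    x, y = tuple(pair)
    R = set(R)
    alternatives = [(frozenset((x, z)), y), (frozenset((y, z)), x)]
    return is_compatible(R) and not any(is_compatible(R | {alt}) for alt in alternatives)

R = {t("bc|d"), t("ab|c")}
print(induces(R, t("ab|d")), induces(R, t("ac|d")))  # True True

R_prime = {t("ab|c"), t("ab|d")}
print(induces(R_prime, t("ac|d")))  # False
```

With R = {bc|d, ab|c} both ab|d and ac|d come out as induced, while {ab|c, ab|d} leaves the position of c relative to d open.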
Relevant properties: induction (PI)
We want to infer reliable supertrees: not making arbitrary inferences.
PI: we only accept supertrees T such that tr(T) is present in the data R or induced by hypotheses in R.
[Figure: R = {ab|c, ab|d} on taxa a, b, c, d; candidate supertrees that add resolutions marked “?” (such as ac|d or bc|d) are contrasted with the tree displaying only ab|c and ab|d]
Focusing on a coherent subset of hypotheses
Supertree method? R identifies T
[Figure: R = {ab|c, bc|d, ab|d, ac|d, ad|c, bd|c}, obtained from two source trees on taxa a, b, c, d, and a tree T]
There is no chance that practical data exactly identifies a (super)tree:
- Lack of overlap between the source trees: missing data
- Errors due to gene-specific evolution, systematic errors in the source tree inference (long branch attraction, estimated model of evolution)
However, there is a chance that part of the underlying “correct” tree appears uncorrupted in the data:
find a subset R’ of R identifying a tree (i.e., a subtree of the underlying tree)
Relevant properties: non-contradiction
We search for a subset of R identifying a tree T.
But we want to be reliable: no clade contradicted by the data.
PC: we don’t accept hypotheses that are in direct contradiction with discarded hypotheses, i.e. we reject subsets R’ obtained by keeping xy|z and removing xz|y.
We focus on R(T), the triplets of R resolved by T.
[Figure: R = {ab|c, ab|d, bc|d, ac|d, bd|c, ad|c}; a subset R’ ⊆ R identifies a tree T on taxa a, b, c, d]
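The notions of R(T) and of direct contradiction can be illustrated with a short sketch. It only checks direct contradictions (T displays xy|z while R contains xz|y or yz|x); it is not a full test of PC, which also involves induced triplets. The encoding of xy|z as `(frozenset({x, y}), z)`, the helper names, and the two candidate trees (given directly by their triplet sets) are assumptions made for this example.

```python
def t(text):
    """Parse 'ab|c' into (frozenset({'a', 'b'}), 'c'); taxa are single letters."""
    pair, z = text.split("|")
    return (frozenset(pair), z)

def taxa_of(triplet):
    """Return the three taxa of a triplet as a frozenset."""
    pair, z = triplet
    return frozenset(pair | {z})

def resolved_by(R, tr_T):
    """R(T): the triplets of R whose three taxa are resolved by T."""
    resolved_sets = {taxa_of(tr) for tr in tr_T}
    return {tr for tr in R if taxa_of(tr) in resolved_sets}

def direct_contradictions(R, tr_T):
    """Triplets of R(T) that T resolves differently.

    Assumes tr_T is the full triplet set of one tree, so it holds at most
    one resolution per set of three taxa.
    """
    return {tr for tr in resolved_by(R, tr_T) if tr not in tr_T}

def show(ts):
    return sorted("".join(sorted(pair)) + "|" + z for pair, z in ts)

R = {t(s) for s in "ab|c ab|d bc|d ac|d bd|c ad|c".split()}

# Hypothetical candidate 1: the fully resolved tree (((a,b),c),d).
caterpillar = {t(s) for s in "ab|c ab|d ac|d bc|d".split()}
# Hypothetical candidate 2: the tree ((a,b),c,d), which resolves only a and b.
cherry_only = {t(s) for s in "ab|c ab|d".split()}

print(show(direct_contradictions(R, caterpillar)))  # ['ad|c', 'bd|c']
print(show(resolved_by(R, cherry_only)))            # ['ab|c', 'ab|d']
print(show(direct_contradictions(R, cherry_only)))  # []
```

In this example the fully resolved candidate keeps ac|d and bc|d while R also contains ad|c and bd|c, which is the situation the slide rules out; the less resolved candidate has no such conflict.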
Link between the properties:
“R(T) identifies T” is equivalent to:
- T satisfies PC (property of non-contradiction): for any triplet ab|c displayed by T, R(T) induces neither bc|a nor ac|b
- and T satisfies PI (property of induction): every triplet ab|c displayed by T is induced by R(T)
Given a supertree T and a collection of source trees, PI and PC can be checked in polynomial time.
A given supertree can be modified in polynomial time so that it satisfies PI and PC.
Why not design a supertree method that proposes supertrees satisfying PI and PC from the start? This is the PhySIC method (Phylogenetic Signal with Induction and non-Contradiction).
Overview
- Relevant properties for a veto method (reliable facts)
  - Decomposition of a tree into triplets
  - Tree identification
  - Property of Induction (PI)
  - Property of non-Contradiction (PC)
- Algorithms (sketch)
  - BUILD (Aho et al 81)
  - PhySIC_PC
  - PhySIC_PI
- Biological case study: Primate supertree
- Conclusion & prospects
Algorithmic ideas: BUILD (Aho et al 81)
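[Figure: BUILD run on R = {bc|d, ab|c}, taken from source trees on {a, b, c} and {b, c, d}. The graph on a, b, c, d splits into the components {a,b,c} and {d}; the graph on a, b, c splits into {a,b} and {c}; the output is a tree on taxa a, b, c, d]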
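The recursion behind BUILD can be sketched as follows. This is an illustrative reimplementation under assumed encodings (a triplet xy|z as `(frozenset({x, y}), z)`, trees as nested tuples, hypothetical names `t` and `build`), not the authors' code, and it favors readability over efficiency. At each step, x and y are connected for every triplet xy|z on the current taxa; the connected components become the children of the current node, and a single component means no tree exists.

```python
def t(text):
    """Parse 'ab|c' into (frozenset({'a', 'b'}), 'c'); taxa are single letters."""
    pair, z = text.split("|")
    return (frozenset(pair), z)

def build(triplets, taxa):
    """Return a nested-tuple tree displaying `triplets`, or None if none exists."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))
    # Keep only the triplets whose three taxa are all in the current set.
    inside = [(pair, z) for pair, z in triplets if pair <= taxa and z in taxa]
    # Connect x and y for each triplet xy|z and track connected components.
    comp = {x: {x} for x in taxa}
    for pair, _ in inside:
        x, y = tuple(pair)
        if comp[x] is not comp[y]:
            merged = comp[x] | comp[y]
            for v in merged:
                comp[v] = merged
    components = list({id(c): c for c in comp.values()}.values())
    if len(components) == 1:
        return None  # connected graph: the triplets are incompatible
    # Recurse on each component (sorted only to make the output deterministic).
    subtrees = [build(inside, c) for c in sorted(components, key=min)]
    if any(s is None for s in subtrees):
        return None
    return tuple(subtrees)

R = {t("bc|d"), t("ab|c")}
print(build(R, "abcd"))  # ((('a', 'b'), 'c'), 'd')

R_conflict = {t(s) for s in "bc|d bd|c ac|d ad|c ab|c ab|d".split()}
print(build(R_conflict, "abcd"))  # None
```

The second call shows the limitation of this scheme: as soon as the graph is connected, nothing at all is returned.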
Algorithmic ideas: limits of BUILD
[Figure: BUILD applied to two triplet sets on taxa a, b, c, d, each obtained from two conflicting source trees: R1 = {ab|c, ac|b, bc|d, ab|d, ac|d} and R2 = {bc|d, bd|c, ac|d, ad|c, ab|c, ab|d}]
BUILD returns a tree only when R is compatible.
Algorithmic ideas: PhySIC_PC
Idea: temporarily forget the direct contradictions.
[Figure: R = {bc|d, bd|c, ac|d, ad|c, ab|c, ab|d} from two source trees on taxa a, b, c, d, and R’, the same set once the direct contradictions are set aside]
At each iteration, if there is a single connected component:
- Check if using R’ leads to several connected components
- If so, check that the tree will satisfy PC w.r.t. R
- Or else, propose a multifurcation on those taxa
We thus obtain a more resolved tree satisfying PC: contradictions affecting basal clades do not always prevent deeper clades from being obtained.
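The step of temporarily forgetting direct contradictions can be illustrated with the sketch below. It covers only that step: the subsequent check that the resulting tree satisfies PC w.r.t. R is not shown, and the way R’ is formed here (dropping every triplet whose three taxa receive more than one resolution in R) is an assumption about what "direct contradiction" means on this slide. The names `t`, `taxa_of`, `drop_direct_contradictions` and `components` are hypothetical.

```python
def t(text):
    """Parse 'ab|c' into (frozenset({'a', 'b'}), 'c'); taxa are single letters."""
    pair, z = text.split("|")
    return (frozenset(pair), z)

def taxa_of(triplet):
    """Return the three taxa of a triplet as a frozenset."""
    pair, z = triplet
    return frozenset(pair | {z})

def drop_direct_contradictions(R):
    """Keep only triplets whose three taxa have a single resolution in R."""
    by_taxa = {}
    for tr in R:
        by_taxa.setdefault(taxa_of(tr), set()).add(tr)
    return {tr for tr in R if len(by_taxa[taxa_of(tr)]) == 1}

def components(R, taxa):
    """Connected components of the graph linking x and y for each xy|z in R."""
    comp = {x: {x} for x in taxa}
    for pair, _ in R:
        x, y = tuple(pair)
        if comp[x] is not comp[y]:
            merged = comp[x] | comp[y]
            for v in merged:
                comp[v] = merged
    return sorted(sorted(c) for c in {id(c): c for c in comp.values()}.values())

R = {t(s) for s in "bc|d bd|c ac|d ad|c ab|c ab|d".split()}
R_prime = drop_direct_contradictions(R)

print(components(R, "abcd"))        # [['a', 'b', 'c', 'd']]
print(components(R_prime, "abcd"))  # [['a', 'b'], ['c'], ['d']]
```

With R the graph is connected and a BUILD-style recursion stops, whereas with R’ the taxa split into {a,b}, {c} and {d}, so a partially resolved tree can still be proposed.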
Algorithmic ideas: limits of BUILD (2)
[Figure: R = {ab|c, ef|c}; the graph on a, b, c, e, f has the connected components {a,b}, {c} and {e,f}; the resulting tree raises the question “ef|a ??”]
When the graph contains several connected components, it is necessary to check that the triplets we are about to create are really induced by R.
Branches that create triplets not induced by R are collapsed (use graph algorithms).
Algorithmic ideas - a summary
A supertree draft is proposed by PhySIC_PC, ensuring PC.
If a clade is not “strong enough”, the corresponding branch is collapsed by PhySIC_PI, ensuring PI as well.
PhySIC is a polynomial-time supertree method:
1. Decomposition of the input forest into triplets: O(kn^3)
2. Creation of a tree satisfying PC: O(n^4)
3. Collapsing edges displaying triplets not induced by the source trees: O(n^4)
The algorithm requires O(kn^3 + n^4) computing time.
Overview
- Relevant properties for a veto method
  - Decomposition of a tree into triplets
  - Tree identification
  - Property of Induction (PI)
  - Property of non-Contradiction (PC)
- Algorithms (intuitive presentation)
  - BUILD (Aho et al 81)
  - PhySIC_PC
  - PhySIC_PI
- Biological case study: Primate supertree
- Conclusion & prospects
Primate case study: source trees
- ADRA2B and IRBP study (Poux et al. 04, 06)
- SINEs (Roos et al. 04)
Branches with bootstrap support <50% are collapsed.
[Figure: source trees, with the Anthropoids labeled]
Primate case study: PC & PI in action
[Figure: source trees (ADRA2B, IRBP) next to the PhySIC_PC and PhySIC supertrees]
Platyrrhines are unresolved due to a conflict (PC).
Arbitrary resolution among Anthropoids is removed (PI).
Labels indicating source of problems
PhySIC can tell the reason for the multifurcations it proposes:
- Lack of overlap or information in the source trees (i)
- Local contradictions between the source trees (c)
This guides correction/completion of source trees and primary data.
Pointing out “problems” in other supertrees
E.g., MRP is known to have some undesirable features:
- inferring “novel clades” not supported by any input tree (Bininda-Emonds & Bryant 98, Goloboff & Pol 01, Goloboff 05)
- being affected by a size bias, i.e. when two trees conflict on the resolution of a clade, the tree with the smallest local sampling is ignored (Purvis 95, Bininda-Emonds & Bryant 98, Goloboff 05)
- favoring source trees that are more unbalanced (Wilkinson et al 01)
A supertree already built from a collection of source trees by a usual supertree method can be reanalyzed in the light of PI & PC to identify problems on some dubious nodes.
Primate case study: MRP tree analyzed
[Figure: source trees (ADRA2B, IRBP), the MRP supertree with numbered nodes and a PC annotation, and the filtered MRP supertree]
Online server: http://atgc.lirmm.fr/SuperTree/PhySIC
Contact:
Conclusion & prospects
PI and PC properties; PhySIC method (http://atgc.lirmm.fr/SuperTree/PhySIC), appearing in the November issue of Syst. Biol.
Supertrees satisfying PI and PC (exact) and as resolved as possible (heuristics).
Proposes very reliable supertrees:
- Identified by the data (low type-I error)
- Polynomial-time method
- Localization of conflicts and areas with insufficient overlap
- Makes it possible to check/correct supertrees built by other methods (MRP, …)
Further developments:
- Producing more resolved trees satisfying PC and PI
- Filtering triplets based on their frequencies
- Coupling with a database (TreeBase, …)
Thanks
Emmanuel Douzery
Vincent Ranwez
Alexis Criscuolo
Sylvain Guillemot
Pierre-Henri Fabre
Celine Scornavacca
Vincent Lefort
Equipe Méth. et Algor. pour la bioinf., LIRMM
Equipe Phylogénie Moléculaire, ISEM