Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs &...
Intro Orthologs, Paralogs & Characterization ILP and Results
Why should we care about cograph heuristics? Phylogenomics with paralogs
Marc Hellmuth
Faculty of Mathematics and Computer Science, Saarland University, Germany
Joint work with: Nicolas Wieseke, Peter F. Stadler, Martin Middendorf, Maribel Hernandez Rosales (U Leipzig); Katharina Huber, Vincent Moulton (U East Anglia); Hans-Peter Lenhof (U Saarland); Marcus Lechner (U Marburg)
30TH TBI WINTERSEMINAR, BLED 2015
1. Phylogeny and Basics
2. Orthology, Paralogy and Gene Trees - Cograph Editing
3. Inferring Species Trees
4. Results
Phylogenetics
[Figure: species tree with leaves A, B, C, D and further node labels E, F, G]
• A species is characterized by its genome: a “bag of genes”
• “Genes” evolve along a rooted tree
• unique event labeling t : V^0 → M = {•, □}
Two types of branching events:
1. Gene duplication (□): an offspring has two copies of a single gene of its ancestor
2. Speciation (•): two offspring species inherit the entire genome of their common ancestor
[Figure: event-labeled gene tree with leaves a1, a2, b1, b2, c1, c3, d1, d2, d3 and one position marked x, drawn over the species A, B, C, D; legend: duplication, speciation]
The Problem in Practice
[Figure: event-labeled gene tree with leaves a1, a2, b1, b2, c1, c3, d1, d2, d3 and one position marked x (legend: duplication, speciation), next to the species tree with leaves A, B, C, D and node labels E, F, G]
• Only the subset of leaves of the gene tree corresponding to genes in extant (currently living) species is observable.
• All internal nodes and the event labeling t in the gene tree must be inferred from data.
• The events and the topology of the gene tree can be used (under several constraints) to infer the species tree (reconciliation).
Orthologs and Paralogs
Orthology and paralogy are important concepts in evolutionary biology and are defined in terms of the pair (T, t).
Two genes x and y are
• orthologs if t(lca(x, y)) = • = speciation
• paralogs if t(lca(x, y)) = □ = duplication
[Figure: gene tree illustrating a duplication node (paralogs) and a speciation node (orthologs)]
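The definition above can be sketched in a few lines of Python. The tree encoding (an inner node is a pair of an event label and a list of children, a leaf is a gene name) and the toy tree are assumptions made for illustration only; they do not come from the talk.

```python
from itertools import combinations

SPECIATION, DUPLICATION = "speciation", "duplication"

def leaves(node):
    """Return the set of gene names below a node."""
    if isinstance(node, str):
        return {node}
    _, children = node
    return set().union(*(leaves(c) for c in children))

def relations(node, orthologs=None, paralogs=None):
    """Classify every gene pair by the event label t(lca(x, y))."""
    if orthologs is None:
        orthologs, paralogs = set(), set()
    if isinstance(node, str):
        return orthologs, paralogs
    event, children = node
    target = orthologs if event == SPECIATION else paralogs
    # Two genes in different child subtrees have this node as their lca.
    for left, right in combinations(children, 2):
        for x in leaves(left):
            for y in leaves(right):
                target.add(frozenset((x, y)))
    for child in children:
        relations(child, orthologs, paralogs)
    return orthologs, paralogs

# Hypothetical toy tree: species C branches off first, then a duplication
# happens before the split of species A and B.
tree = (SPECIATION, [
    (DUPLICATION, [
        (SPECIATION, ["a1", "b1"]),
        (SPECIATION, ["a2", "b2"]),
    ]),
    "c1",
])
orthologs, paralogs = relations(tree)
```

In this toy tree a1 and c1 are orthologs, c1 and a2 are orthologs, yet a1 and a2 are paralogs, so the orthology relation is not transitive.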
State-of-the-Art Tree Reconstruction
[Figure: genomes A, B, C and D]
• Find 1:1-orthologs.
• Paralogs = dangerous nuisance that has to be detected and removed.
• Select families of genes that rarely exhibit duplications (e.g. rRNAs, ribosomal proteins).
• Alignments of protein or DNA sequences and standard techniques yield an evolutionary history that is believed to be congruent to that of the respective species.
Pitfalls:
• The set of usable gene sets is strongly restricted (≤ 10%).
• Information on evolutionary events such as paralogs or xenologs is ignored.
• It is often mistakenly assumed that the orthology relation is transitive.
Orthologs and Paralogs
Two genes x and y are
• orthologs if t(lca(x, y)) = • = speciation
• paralogs if t(lca(x, y)) = □ = duplication
⟹ The orthology relation Θ can be estimated directly from the data, without constructing either gene or species trees, e.g. with ProteinOrtho or its extension PoFF.
ProteinOrtho: Detection of (Co)orthologs in large-scale analysis, Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ, BMC Bioinformatics, 2011
Estimating Θ directly from the data
The relation Θ̂ is only an estimate of a “correct” orthology relation Θ.
Aim: Correct the initial estimate Θ̂ to the “closest” orthology relation Θ that fits the data and build the corresponding gene and species trees.
⟹ What is a “closest” orthology relation Θ?
Characterization of Θ
Question: When does the initial estimate Θ̂ fit the data?
Equivalently, we can ask for a “symbolic representation”:
For a given Θ̂, when does there exist a tree T with event labeling t such that
• t(lca(x, y)) = • = speciation for all (x, y) ∈ Θ̂ and
• t(lca(x, y)) = □ = duplication for all (x, y) ∉ Θ̂?
[Figure: the graph G_Θ̂ on the vertices v0, ..., v4 with edge set Θ̂ = {(v0,v2), (v0,v4), (v2,v3), (v3,v4)}, next to a tree with leaves v2, v4, v3, v0, v1]
We used results by Böcker & Dress (1998) on “symbolic ultrametrics”:
Theorem. The following conditions are equivalent:
• There is a symbolic representation for Θ̂.
• G_Θ̂ is a cograph.
Recovering Symbolically Dated, Rooted Trees from Symbolic Ultrametrics, Böcker & Dress, Adv. Math., 1998
Orthology Relations, Symbolic Ultrametrics, and Cographs, Hellmuth M, H.-Rosales M, Huber K, Moulton V, Stadler PF, Wieseke N, J. Math. Biol., 2012
Cograph (= Complement reducible graph)
Corneil et al., 1981:
Cographs are defined recursively (Def. omitted)
G is a cograph iff G is “induced P4-free”, i.e. no four vertices of G induce a path.
[Figure: examples of forbidden (induced P4) and allowed subgraphs]
Complement reducible graphs, Corneil DG, Lerchs H, Stewart Burlingham L, Discr. Appl. Math., 1981
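The P4 criterion can be tested directly. The following is a minimal brute-force sketch (all function names are made up here), applied to the edge set Θ̂ = {(v0,v2), (v0,v4), (v2,v3), (v3,v4)} used as the running example in the talk. It checks all ordered 4-tuples, so it is only meant for small graphs; linear-time cograph recognition algorithms exist but are not shown.

```python
from itertools import permutations

def find_induced_p4(vertices, edges):
    """Return an induced path a-b-c-d if one exists, else None (O(n^4))."""
    edge_set = {frozenset(e) for e in edges}

    def has(u, v):
        return frozenset((u, v)) in edge_set

    for a, b, c, d in permutations(vertices, 4):
        path_present = has(a, b) and has(b, c) and has(c, d)
        chord_present = has(a, c) or has(a, d) or has(b, d)
        if path_present and not chord_present:
            return (a, b, c, d)
    return None

def is_cograph(vertices, edges):
    return find_induced_p4(vertices, edges) is None

V = ["v0", "v1", "v2", "v3", "v4"]
theta_hat = [("v0", "v2"), ("v0", "v4"), ("v2", "v3"), ("v3", "v4")]
path = [("v0", "v2"), ("v2", "v3"), ("v3", "v4")]

example_is_cograph = is_cograph(V, theta_hat)  # 4-cycle plus isolated v1
path_is_cograph = is_cograph(V, path)          # an induced P4
```

The example graph is a 4-cycle together with the isolated vertex v1 and contains no induced P4, whereas dropping the edge (v0,v4) leaves exactly the forbidden path.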
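Every cograph is associated with a unique cotree.
[Figure: the example graph on v0, ..., v4 and its cotree. The root is labeled 0 and has the children v1 and a node labeled 1; the node labeled 1 has two children labeled 0 with leaves {v2, v4} and {v3, v0}.]
(x, y) ∈ E(G) = Θ if and only if lca(x, y) is labeled 1 = •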
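A cotree can be computed by the recursive decomposition that gives complement reducible graphs their name: a disconnected graph yields a node labeled 0 over its components, a graph with disconnected complement yields a node labeled 1 over its co-components, and if both are connected the graph is not a cograph. The sketch below is a simple, non-optimized illustration; the nested-tuple encoding, the function names and the restriction to string vertex names are assumptions made here.

```python
def components(vertices, adjacent):
    """Connected components of `vertices` under the relation `adjacent`."""
    remaining, comps = set(vertices), []
    while remaining:
        stack = [remaining.pop()]
        comp = set(stack)
        while stack:
            u = stack.pop()
            for v in list(remaining):
                if adjacent(u, v):
                    remaining.remove(v)
                    comp.add(v)
                    stack.append(v)
        comps.append(comp)
    return comps

def cotree(vertices, edges):
    """Return the cotree as (label, children), or None if not a cograph."""
    edge_set = {frozenset(e) for e in edges}

    def adj(u, v):
        return frozenset((u, v)) in edge_set

    def non_adj(u, v):
        return not adj(u, v)

    def build(vs):
        if len(vs) == 1:
            return next(iter(vs))
        label, parts = 0, components(vs, adj)
        if len(parts) == 1:
            label, parts = 1, components(vs, non_adj)
            if len(parts) == 1:
                return None  # graph and complement both connected
        children = [build(p) for p in parts]
        if any(c is None for c in children):
            return None
        return (label, children)

    return build(set(vertices))

def leaf_set(t):
    if isinstance(t, str):
        return {t}
    return set().union(*(leaf_set(c) for c in t[1]))

def lca_label(t, x, y):
    """Label (0 or 1) of the last common ancestor of the leaves x and y."""
    for child in t[1]:
        if {x, y}.issubset(leaf_set(child)):
            return lca_label(child, x, y)
    return t[0]

V = ["v0", "v1", "v2", "v3", "v4"]
theta_hat = [("v0", "v2"), ("v0", "v4"), ("v2", "v3"), ("v3", "v4")]
tree = cotree(V, theta_hat)
```

For the example graph, `lca_label(tree, x, y)` equals 1 exactly for the four pairs in Θ̂, which is the correspondence stated on the slide.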
Characterization of Θ
Idea: Correct the initial estimate Θ̂ to the “closest” orthology relation Θ that fits the data.
Theorem. There is a symbolic representation (T, t) for Θ̂ ⟺ G_Θ̂ is a cograph.
There is a symbolic representation (T, t) for any symbolic relation (= colored graph G) ⟺ each monochromatic subgraph is a cograph and at most 2 colors are used on each triangle in G.
[Figure: the example graph on v0, ..., v4 and the corresponding tree with leaves v2, v4, v3, v0, v1]
Orthology Relations, Symbolic Ultrametrics, and Cographs, Hellmuth M, H.-Rosales M, Huber K, Moulton V, Stadler PF, Wieseke N, J. Math. Biol., 2012
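To make the notion of a “closest” relation concrete, the sketch below assumes that closeness is measured by the number of gene pairs whose status is flipped (edges inserted into or deleted from G_Θ̂). It searches exhaustively over sets of flips of increasing size, which is exponential and only feasible for a handful of genes; it is an illustration of the problem, not the method used in the talk.

```python
from itertools import combinations, permutations

def is_cograph(vertices, edges):
    """True if the graph (edges as a set of frozensets) has no induced P4."""
    for a, b, c, d in permutations(vertices, 4):
        path = {frozenset((a, b)), frozenset((b, c)), frozenset((c, d))}
        chords = {frozenset((a, c)), frozenset((a, d)), frozenset((b, d))}
        if path.issubset(edges) and chords.isdisjoint(edges):
            return False
    return True

def closest_cograph(vertices, edges):
    """Find a cograph reachable with the fewest pair flips (brute force)."""
    edges = {frozenset(e) for e in edges}
    pairs = [frozenset(p) for p in combinations(vertices, 2)]
    for k in range(len(pairs) + 1):
        for flips in combinations(pairs, k):
            edited = edges ^ set(flips)  # flip: insert or delete each pair
            if is_cograph(vertices, edited):
                return edited, set(flips)

# Hypothetical noisy estimate: an induced P4, hence not a cograph.
V = ["v0", "v1", "v2", "v3"]
theta_hat = [("v0", "v1"), ("v1", "v2"), ("v2", "v3")]
theta_star, flips = closest_cograph(V, theta_hat)
```

The search always terminates, because flipping every edge yields the empty graph, which is a cograph. Several optimal solutions may exist; the sketch returns the first one it encounters.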
Finding the species trees
[Figure: event-labeled gene tree with leaves a1, a2, b1, b2, c1, c3, d1, d2, d3 and one position marked x (legend: duplication, speciation), next to the species tree with leaves A, B, C, D and node labels E, F, G]
Infer local topologies of the species tree from the gene tree: S = {AB|C, AB|D, CD|A, CD|B}
Theorem. Based on S, it can be verified in polynomial time whether there is a species tree into which the gene tree can be embedded. If such a species tree exists, the species tree and the embedding can be computed in polynomial time.
From Event-Labeled Gene Trees to Species Trees, H.-Rosales M, Hellmuth M, Huber K, Moulton V, Wieseke N, Stadler PF, BMC Bioinformatics, 2012
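One standard polynomial-time test for whether a set of rooted triples is consistent is the BUILD algorithm of Aho et al., which the runtime table of this talk also mentions. The sketch below is a compact version under assumed conventions: a triple (a, b, c) encodes ab|c, and trees are nested tuples. It is not claimed to be the implementation used by the authors.

```python
def build(taxa, triples):
    """Return a tree (nested tuples) displaying all triples, or None."""
    taxa = set(taxa)
    if len(taxa) == 1:
        return next(iter(taxa))
    if len(taxa) == 2:
        return tuple(sorted(taxa))
    # Auxiliary graph: join a and b for every triple ab|c inside `taxa`.
    # Its connected components are tracked with a small union-find.
    parent = {x: x for x in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    local = [t for t in triples if set(t).issubset(taxa)]
    for a, b, _ in local:
        parent[find(a)] = find(b)
    groups = {}
    for x in taxa:
        groups.setdefault(find(x), set()).add(x)
    if len(groups) == 1:
        return None  # auxiliary graph connected: triples are inconsistent
    subtrees = []
    for group in groups.values():
        sub = build(group, local)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)

# The triple set S from the slide.
S = [("A", "B", "C"), ("A", "B", "D"), ("C", "D", "A"), ("C", "D", "B")]
species_tree = build(["A", "B", "C", "D"], S)
```

For S the result groups A with B and C with D, which matches the local topologies listed on the slide. A contradictory pair such as AB|C and AC|B makes the auxiliary graph connected, and `build` returns `None`.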
Workflow
[Workflow diagram: Θ → Cograph Editing → Θ* → Species Triple Extraction → S = {(αβ|γ), (αγ|β), (αγ|δ), (βδ|α), (βδ|γ)} → Max. Consistent Triple Set → S* = {(αγ|β), (αγ|δ), (βδ|α), (βδ|γ)} → Build Tree → tree on the leaves α, γ, β, δ]
We formulated all NP-hard problems (CE: cograph editing, MCT: maximal consistent triple set, LRT: least resolved tree) as Integer Linear Programs (ILP):
min F(x) s.t. Ax ≤ b
Phylogenomics with Paralogs, Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF, PNAS, 2015
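As an illustration of the generic form min F(x) s.t. Ax ≤ b, cograph editing can be written as an ILP along the following lines. This is a sketch: the binary variables E_{xy} (1 if the pair is an edge of the edited graph) and the constants Θ̂_{xy} ∈ {0, 1} (1 if the pair is in the initial estimate) are notation assumed here, and the exact formulation implemented by the authors may differ.

```latex
\begin{align*}
\min \quad & \sum_{\{x,y\}} \Bigl( (1-\hat\Theta_{xy})\, E_{xy} + \hat\Theta_{xy}\, (1 - E_{xy}) \Bigr) \\
\text{s.t.} \quad & E_{wx} + E_{xy} + E_{yz} - E_{xz} - E_{wy} - E_{wz} \le 2
  \quad \text{for all ordered tuples } (w,x,y,z) \text{ of distinct genes}, \\
& E_{xy} = E_{yx} \in \{0,1\}.
\end{align*}
```

The objective counts inserted plus deleted edges. The constraint is violated only when the three path edges wx, xy, yz are all present and the three chords xz, wy, wz are all absent, that is, exactly when w, x, y, z induce a P4.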
Results - Simulation
The entire workflow is implemented as an ILP in the software ParaPhylo, using IBM ILOG CPLEX™ Optimizer 12.6.
It is freely available from pacosy.informatik.uni-leipzig.de/paraphylo
Artificial data generated with ALF:
• generate binary species tree
• simulate dupl./loss/HGT history of gene sequences
[Figure: simulated gene trees (legend: duplication, speciation) with leaves a1, a2, a3, b1, b2, b3, c1, c3, c4, c5, d1, d2, d3, d4, d5 and one position marked x, drawn over the species A, B, C, D]
ALF - a simulation framework for genome evolution, Dalquen et al., Mol. Biol. Evol., 2012
Results - Simulation 1
ALF (no HGT)
[Figure: simulated gene trees with leaves a1, a2, a3, b1, b2, b3, c1, c3, c4, c5, d1, d2, d3, d4, d5 over the species A, B, C, D; legend: duplication, speciation]
→ The cograph GΘ is directly accessible
→ Compute the cotree of GΘ
→ Extract the species triple set S (consistent)
→ Compute the least resolved species tree and compare it with the initial species tree
Results - Simulation 1
Accuracy of reconstructed species trees as a function of the number of independent gene families:
[Figure: two panels (10 species, 20 species); x-axis: # gene families (100 to 500); y-axis: TT distance (0.0 to 1.0)]
Simulation with ALF with duplication/loss rate 0.005 (∼ 8% duplications) and no HGT.
TT distance ≙ “number of different triples in initial and reconstructed species tree”
Phylogenomics with Paralogs
In our model: (x, y) ∉ Θ iff the distinct genes x and y are paralogs
[Figure: a graph GΘ on the vertices 0, 1, 2, 3, 4 and the corresponding event-labeled tree (T, t)]
If there are no paralogs → GΘ is a clique → gene tree is a star → no species triples can be inferred.
To obtain fully resolved species trees, a sufficient number of gene duplications must have occurred, since the phylogenetic information utilized by our approach is entirely contained in the duplication events.
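The effect can be reproduced with a small sketch of species triple extraction. The rule used here is an assumption stated for illustration: three genes a, b, c from three distinct species contribute the species triple AB|C if the gene tree displays ab|c and lca(a, b, c) is a speciation node. The tree encoding and the `species_of` helper are hypothetical.

```python
from itertools import combinations

def leaves(node):
    if isinstance(node, str):
        return [node]
    return [g for child in node[1] for g in leaves(child)]

def species_triples(node, species_of, out=None):
    """Collect species triples (frozenset({A, B}), C) meaning AB|C."""
    if out is None:
        out = set()
    if isinstance(node, str):
        return out
    event, children = node
    if event == "speciation":
        per_child = [leaves(c) for c in children]
        for i, inside in enumerate(per_child):
            outside = [g for j, l in enumerate(per_child) if j != i for g in l]
            # a, b below the same child and c below another child:
            # the tree displays ab|c and this node is lca(a, b, c).
            for a, b in combinations(inside, 2):
                for c in outside:
                    sp = (species_of(a), species_of(b), species_of(c))
                    if len(set(sp)) == 3:
                        out.add((frozenset(sp[:2]), sp[2]))
    for child in children:
        species_triples(child, species_of, out)
    return out

def species_of(gene):
    # Assumed naming scheme: gene "a1" belongs to species "A".
    return gene[0].upper()

# No paralogs: the gene tree is a star with a speciation root.
star = ("speciation", ["a1", "b1", "c1"])
# One duplication before the split of A and B.
with_duplication = ("speciation", [
    ("duplication", [
        ("speciation", ["a1", "b1"]),
        ("speciation", ["a2", "b2"]),
    ]),
    "c1",
])
star_triples = species_triples(star, species_of)
dup_triples = species_triples(with_duplication, species_of)
```

The star tree yields no triples at all, while the tree containing a duplication yields AB|C.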
Results - Simulation 1
[Figure: TT distance versus # gene families for 10 species and 20 species]
The average TT distance is always smaller than 0.09 for more than 300 gene families, independent of the number of species.
Deviations from perfect reconstructions are exclusively explained by a lack of perfect resolution.
Results - Simulation - Noise
[Figure: two panels; x-axis: % orthologous noise (left) and % paralogous noise (right), each from 5 to 25; y-axis: TT distance (0.0 to 1.0)]
• ALF (10 species and 1000 gene families), GΘ as before, add noise, then start the ILP pipeline (CE → MCS → LRT).
• orthologous noise (overpredicting): flip paralogs with prob. p
• paralogous noise (underpredicting): flip orthologs with prob. p
• p ∈ [0.05, 0.25]
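The two noise models can be sketched as follows. The function name, the `kind` parameter and the assumption that every non-edge within a gene family counts as a paralogous pair are choices made for this illustration, not details taken from the talk.

```python
import random
from itertools import combinations

def add_noise(genes, orthologs, p, kind, rng=random):
    """Flip gene pairs of one relation type independently with probability p.

    kind="orthologous": paralogous pairs become edges (overprediction).
    kind="paralogous": orthologous pairs lose their edge (underprediction).
    """
    orthologs = {frozenset(e) for e in orthologs}
    noisy = set(orthologs)
    for pair in map(frozenset, combinations(genes, 2)):
        is_edge = pair in orthologs
        if kind == "orthologous" and not is_edge and rng.random() < p:
            noisy.add(pair)
        elif kind == "paralogous" and is_edge and rng.random() < p:
            noisy.discard(pair)
    return noisy

genes = ["v0", "v1", "v2", "v3", "v4"]
theta = [("v0", "v2"), ("v0", "v4"), ("v2", "v3"), ("v3", "v4")]
rng = random.Random(2015)  # fixed seed so that runs are repeatable
over = add_noise(genes, theta, 0.25, "orthologous", rng)
under = add_noise(genes, theta, 0.25, "paralogous", rng)
```

Orthologous noise can only add edges to GΘ and paralogous noise can only remove them, which is why the two models push the graph towards or away from a clique.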
orthologous noise: additional edges in GΘ
→ GΘ becomes more clique-like
→ fewer species triples can be inferred and thus fewer wrong species triples
paralogous noise: edges removed from GΘ
→ GΘ becomes less clique-like
→ more species triples can be inferred and thus more wrong species triples
Results - Runtime
Table: Running time in seconds on 2 Six-Core AMD Opteron™ processors with 2.6 GHz for individual sub-tasks: CE cograph editing, MCS maximal consistent subset of triples, LRT least resolved tree.

Data                   CE       MCS    LRT              Total (a)
Simulations (b)        125 (c)  < 1    < 1 (d)          126
Aquificales (e)        34       < 1    < 1 (6) (g)      34
Enterobacteriales (f)  2673     2      < 1 (1749) (g)   2676

(a) Total time includes triple extraction, parsing input, and writing output files.
(b) Average of 2000 simulations with ALF, 10 species, 1000 gene families. 100 runs for each of 4 noise models with different p ∈ {0.05, 0.1, 0.15, 0.2, 0.25}.
(c) 2,000,000 cographs, 41 not optimally solved within the time limit of 30 min.
(d) In 95.95% of the simulations the LRT could be found using BUILD.
(e) 11 Aquificales species with 2887 gene families.
(f) 19 Enterobacteriales species with 8308 gene families.
(g) A unique tree was obtained using BUILD. The second value indicates the running time with ILP solving enforced.
Horizontal gene transfer (HGT)
HGT refers to the transfer of genes between organisms in a manner other than traditional reproduction (sexual or asexual) and across different species (e.g. as in bacteria).
Legend: • speciation, □ duplication, ▲ horizontal gene transfer
[Figure: gene tree with leaves a, b, c1, c2, d in the species A, B, C, D; four positions are marked X]
22 / 24
![Page 44: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/44.jpg)
Horizontal gene transfer (HGT)
Dependence on the intensity of horizontal gene transfer:
[Figure: three panels (ProteinOrtho, perfect orthology knowledge, perfect paralogy knowledge), each plotting TT distance (y-axis) against % horizontal gene transfer (x-axis: 0.0, 15.6, 28.7, 39.5).]
ALF: 10 species, 1000 gene families, duplication/loss rate 0.005 and HGT rate ranging from 0.0 to 0.0075.
![Page 45: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/45.jpg)
Conclusion
[Pipeline diagram: Θ → Cograph Editing → Θ∗ → Species Triple Extraction → S = {(αβ|γ), (αγ|β), (αγ|δ), (βδ|α), (βδ|γ)} → Max. Consistent Triple Set → S∗ = {(αγ|β), (αγ|δ), (βδ|α), (βδ|γ)} → Build Tree → tree on leaves α, γ, β, δ]
In “classical standard” approaches, paralogs are treated as a dangerous nuisance that has to be detected and removed.
However, paralogy is the key!
Summary of results: Phylogenomics with Paralogs. Hellmuth, Wieseke, Lechner, Lenhof, Middendorf, Stadler, PNAS, 2015
![Page 46: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/46.jpg)
Conclusion
1. Improve orthology inference tools.
2. Develop paralogy inference tools.
3. Efficient heuristics for the cograph editing and least resolved tree problems.
4. On parts of GΘ that are cliques, incorporate “classical” approaches.
5. Generalization of the mathematical phylogenetic framework to deal exactly with HGT and with phylogenetic networks.
![Page 47: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/47.jpg)
THANK YOU!
![Page 48: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/48.jpg)
Symbolic Ultrametrics
The map $\delta : X \times X \to M^{\odot}$ is said to be a symbolic ultrametric (on $X$) if the following conditions are satisfied:
(U0) $\delta(x,y) = \odot$ if and only if $x = y$.
(U1) $\delta(x,y) = \delta(y,x)$ for all $x,y \in X$.
(U2) $|\{\delta(x,y), \delta(x,z), \delta(y,z)\}| \le 2$ for all $x,y,z \in X$; and
(U3) there are no four pairwise distinct elements $x$, $y$, $u$, and $v$ of $X$ such that
$$\delta(x,y) = \delta(y,u) = \delta(u,v) \ne \delta(y,v) = \delta(x,v) = \delta(x,u)$$
Note: every ultrametric induces a symbolic ultrametric.
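The four conditions can be checked mechanically on a small example. The following is a minimal sketch, assuming the map δ is given as a lookup table on unordered pairs; the names `from_table`, `is_symbolic_ultrametric` and the symbols "S" and "D" are illustrative and not taken from the slides.

```python
from itertools import combinations, permutations

ZERO = "⊙"  # the special symbol reserved for delta(x, x)

def from_table(table):
    """Build a symmetric map delta from a dict {frozenset({x, y}): symbol}."""
    def delta(x, y):
        return ZERO if x == y else table[frozenset((x, y))]
    return delta

def is_symbolic_ultrametric(X, delta):
    X = list(X)
    for x in X:
        for y in X:
            if (delta(x, y) == ZERO) != (x == y):   # (U0)
                return False
            if delta(x, y) != delta(y, x):          # (U1)
                return False
    for x, y, z in combinations(X, 3):              # (U2): at most two symbols per triple
        if len({delta(x, y), delta(x, z), delta(y, z)}) > 2:
            return False
    for x, y, u, v in permutations(X, 4):           # (U3): forbidden four-point pattern
        left = {delta(x, y), delta(y, u), delta(u, v)}
        right = {delta(y, v), delta(x, v), delta(x, u)}
        if len(left) == 1 and len(right) == 1 and left != right:
            return False
    return True
```

The check is brute force over all ordered quadruples, which is fine for toy inputs only.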
![Page 49: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/49.jpg)
Sketch: Estimating Θ directly from the Data
• We know the assignment of genes to species, and we can measure the similarity $s(x,y)$ of two genes using sequence alignments and BLAST bit scores.
• $y \in B$ is a (putative) ortholog of $x \in A$, in symbols $(x,y) \in \hat{\Theta}$, if
1. $A \ne B$: orthologs are never found in the same species.
2. $s(x,y) \approx \max_{z \in B} s(x,z)$: if x and y are orthologs, then they do not have (much) closer relatives in the two species.
[Figure: event-labeled gene tree with genes a1, a2, b1, b2, c1, c3, x, d1, d2, d3 in the species A, B, C, D; inner vertices are labeled duplication or speciation.]
The relation $\hat{\Theta}$ is only an estimate of a “correct” orthology relation: $(x,y) \in \Theta$ iff $t(x,y) = \bullet$ = speciation.
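The two rules can be sketched in a few lines. This is only an illustration under stated assumptions: similarities are given as a dict of scores on ordered gene pairs, and the "≈" of rule 2 is modeled by a relative tolerance `tol`, which is a hypothetical parameter and not the criterion used by any particular tool.

```python
def estimate_orthologs(species_of, score, tol=0.05):
    """Return the ordered pairs (x, y) of the estimated orthology relation.

    species_of: dict gene -> species
    score:      dict (x, y) -> similarity, e.g. a bit score (missing pairs count as 0)
    tol:        hypothetical relative tolerance standing in for the "≈" of rule 2
    """
    genes = list(species_of)
    theta_hat = set()
    for x in genes:
        # best similarity of x to any gene of each other species B
        best = {}
        for z in genes:
            B = species_of[z]
            if B != species_of[x]:
                best[B] = max(best.get(B, 0.0), score.get((x, z), 0.0))
        for y in genes:
            B = species_of[y]
            if B == species_of[x]:
                continue                      # rule 1: A != B
            s = score.get((x, y), 0.0)
            if s > 0 and s >= (1 - tol) * best[B]:
                theta_hat.add((x, y))         # rule 2: s(x, y) close to the maximum
    return theta_hat
```

Note that the relation produced this way need not be symmetric; a reciprocal filter (keep (x, y) only if (y, x) is also present) would be a natural additional step.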
![Page 50: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/50.jpg)
ILP - Cograph Editing
$$\min \sum_{(x,y) \in G \times G} (1 - \Theta_{xy})\,E_{xy} + \sum_{(x,y) \in G \times G} \Theta_{xy}\,(1 - E_{xy})$$
$$E_{xy} = 0 \quad \text{for all } x,y \in G \text{ with } \sigma(x) = \sigma(y)$$
$$E_{wx} + E_{xy} + E_{yz} - E_{xz} - E_{wy} - E_{wz} \le 2 \quad \forall \text{ ordered tuples } (w,x,y,z) \text{ of distinct } w,x,y,z \in G$$
This requires $O(|G|^2)$ binary variables and $O(|G|^4)$ constraints; G = gene set.
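To see what the objective and the constraints express, one can evaluate them directly on a candidate solution E. The sketch below does not solve the ILP; it only checks feasibility and computes the objective for given 0/1 matrices, stored as dicts of dicts. The helper names `adjacency`, `edit_cost`, and `is_feasible` are illustrative.

```python
from itertools import permutations

def adjacency(genes, edges):
    """Symmetric 0/1 matrix (dict of dicts) from a list of unordered pairs."""
    A = {x: {y: 0 for y in genes} for x in genes}
    for x, y in edges:
        A[x][y] = A[y][x] = 1
    return A

def edit_cost(theta, E, genes):
    """ILP objective: ordered pairs on which E and Theta differ.
    Each unordered pair is counted twice because the sum runs over G x G."""
    return sum((1 - theta[x][y]) * E[x][y] + theta[x][y] * (1 - E[x][y])
               for x in genes for y in genes if x != y)

def is_feasible(E, genes, sigma):
    # E_xy = 0 whenever x and y belong to the same species
    for x in genes:
        for y in genes:
            if x != y and sigma[x] == sigma[y] and E[x][y] != 0:
                return False
    # the inequality forbids an induced path w - x - y - z, so E is a cograph
    for w, x, y, z in permutations(genes, 4):
        if E[w][x] + E[x][y] + E[y][z] - E[x][z] - E[w][y] - E[w][z] > 2:
            return False
    return True
```

The left-hand side of the inequality equals 3 exactly when w, x, y, z induce a path on four vertices, which is the forbidden subgraph characterizing cographs.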
![Page 51: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/51.jpg)
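ILP - Max. Consistent Triple Set
$$\max \sum_{(\alpha\beta|\gamma) \in S} T'_{(\alpha\beta|\gamma)}$$
$$T'_{(\alpha\beta|\gamma)} + T'_{(\alpha\gamma|\beta)} + T'_{(\beta\gamma|\alpha)} = 1$$
$$2T'_{(\alpha\beta|\gamma)} + 2T'_{(\alpha\delta|\beta)} - T'_{(\beta\delta|\gamma)} - T'_{(\alpha\delta|\gamma)} \le 2$$
$$0 \le T'_{(\alpha\beta|\gamma)} + T_{(\alpha\beta|\gamma)} - 2T^*_{(\alpha\beta|\gamma)} \le 1$$
Weighted version:
$$\max \sum_{(\alpha\beta|\gamma) \in S} T'_{(\alpha\beta|\gamma)} \cdot w(\alpha\beta|\gamma)$$
Rooted species triples: $T_{(\alpha\beta|\gamma)} = 1$ iff $(\alpha\beta|\gamma) \in S$
Max. consistent subset $S^* \subset S$: $T^*_{(\alpha\beta|\gamma)} = 1$ iff $(\alpha\beta|\gamma) \in S^*$
Auxiliary consistent strictly dense species triples $S'$ with $S^* \subseteq S'$: $T'_{(\alpha\beta|\gamma)} = 1$ iff $(\alpha\beta|\gamma) \in S'$
Thus maximizing $|S \cap S'|$ maximizes $|S^*|$ since $S^* = S \cap S'$.
The ILP formulation uses $O(|S|^3)$ variables and $O(|S|^4)$ constraints, where S in these bounds denotes the species set.
Theorem: A strictly dense triple set $R$ on $L$ with $|L| \ge 3$ is consistent if and only if $\mathrm{cl}(\tilde{R}) \subseteq R$ holds for all $\tilde{R} \subseteq R$ with $|\tilde{R}| = 2$.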
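For very small inputs the optimum of this ILP can be cross-checked by exhaustive search. The following sketch is not the ILP itself: it swaps in a brute-force enumeration of subsets combined with a BUILD-style consistency test (connected components of the Aho graph). Triples ab|c are written as tuples (a, b, c); all function names are illustrative.

```python
from itertools import combinations

def consistent(triples, leaves):
    """True iff some tree on `leaves` displays all triples restricted to `leaves`."""
    leaves = set(leaves)
    if len(leaves) <= 2:
        return True
    inside = [t for t in triples if set(t) <= leaves]
    parent = {x: x for x in leaves}          # union-find over the leaves
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for a, b, _ in inside:                   # triple ab|c joins a and b
        parent[find(a)] = find(b)
    comps = {}
    for x in leaves:
        comps.setdefault(find(x), set()).add(x)
    if len(comps) == 1:                      # connected Aho graph: no tree exists
        return False
    return all(consistent(inside, comp) for comp in comps.values())

def max_consistent_subset(triples):
    """Largest consistent subset, found by trying subsets from large to small."""
    triples = list(triples)
    for k in range(len(triples), -1, -1):
        for subset in combinations(triples, k):
            leaves = {x for t in subset for x in t}
            if consistent(subset, leaves):
                return set(subset)
```

The search is exponential in the number of triples, so it only serves to validate an ILP solution on toy instances such as the five triples of the Conclusion slide.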
![Page 52: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/52.jpg)
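ILP - Least Resolved Tree
$$\min \sum_p Y_p$$
$$0 \le Y_p |S| - \sum_{\alpha \in S} M_{\alpha p} \le |S| - 1$$
$$0 \le M_{\alpha p} + M_{\beta p} - 2N_{\alpha\beta,p} \le 1$$
$$1 - |S|\,(1 - T^*_{(\alpha\beta|\gamma)}) \le \sum_p N_{\alpha\beta,p} - \tfrac{1}{2} N_{\alpha\gamma,p} - \tfrac{1}{2} N_{\beta\gamma,p}$$
$$C_{p,q,01} \ge -M_{\alpha p} + M_{\alpha q}$$
$$C_{p,q,10} \ge M_{\alpha p} - M_{\alpha q}$$
$$C_{p,q,11} \ge M_{\alpha p} + M_{\alpha q} - 1$$
$$C_{p,q,01} + C_{p,q,10} + C_{p,q,11} \le 2 \quad \forall p,q$$
Set of clusters $M_{\alpha p}$: $M_{\alpha p} = 1$ iff $\alpha \in S$ is contained in cluster $p \in \{1, \dots, |S|-2\}$.
Cluster p contains both species α and β ($N_{\alpha\beta,p}$): $N_{\alpha\beta,p} = 1$ iff $M_{\alpha p} = 1$ and $M_{\beta p} = 1$.
Compatibility (3-gamete condition): $C_{p,q,\Gamma\Lambda} = 1$ iff clusters p and q have gamete $\Gamma\Lambda \in \{01, 10, 11\}$.
Non-trivial clusters $Y_p$: $Y_p = 1$ iff cluster $p \ne \emptyset$.
This requires $O(|S|^3)$ variables and constraints; S = species set.
“Partial” hierarchy: for all p and q, $p \cap q \in \{p, q, \emptyset\}$ holds (p, q compatible). p and q are incompatible if there are (not necessarily distinct) species $\alpha, \beta, \gamma \in S$ with $\alpha \in p \setminus q$, $\beta \in q \setminus p$, and $\gamma \in p \cap q$. Then $(M_{\alpha p}, M_{\alpha q}) = (1,0)$, $(M_{\beta p}, M_{\beta q}) = (0,1)$, $(M_{\gamma p}, M_{\gamma q}) = (1,1)$.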
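The 3-gamete condition is easy to test outside the ILP. The sketch below assumes clusters are given as Python sets of species names and checks that a family of clusters forms a hierarchy; the function names are illustrative.

```python
from itertools import combinations

def gametes(p, q, species):
    """Gametes (M_ap, M_aq) realized by some species a, ignoring (0, 0)."""
    return {(int(a in p), int(a in q)) for a in species} - {(0, 0)}

def compatible(p, q, species):
    # p and q are incompatible iff all three gametes 10, 01 and 11 occur
    return len(gametes(p, q, species)) < 3

def is_hierarchy(clusters, species):
    return all(compatible(p, q, species) for p, q in combinations(clusters, 2))
```

Two clusters pass the test exactly when their intersection is one of p, q, or the empty set, which matches the “partial” hierarchy condition above.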
![Page 53: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/53.jpg)
Results
[Figure: four panels plotting TT distance (y-axis, 0.0 to 1.0) against the noise level (x-axis, 5 to 25) for % homologous noise, % orthologous noise, % paralogous noise, and % xenologous noise.]
Figure: Accuracy of reconstructed species trees as a function of noise level (p = 5−25%) and noise type in the raw orthology data Θ. Tree distance is measured by the triple metric (TT) for 100 reconstructed phylogenetic trees with ten species.
![Page 54: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/54.jpg)
Trees and triples
[Figure: left tree on the leaves a, b, c, d, e; right tree on the leaves b, d, e.]
For three leaves a, b, c in T we write ab|c if the path from a to b does not intersect the path from c to the root.
Right tree: R(T) = {de|b}
Left tree: R(T) = {ab|c, ab|d, ab|e, de|a, de|b, de|c, cd|a, cd|b, ce|a, ce|b}
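The triple set R(T) can be enumerated directly from a tree. The sketch below assumes trees are written as nested tuples with strings as leaves, and it assumes the left tree is ((a,b),(c,(d,e))), which is the shape implied by the ten triples listed above. A triple ab|c is returned as a pair (frozenset({a, b}), c).

```python
from itertools import combinations

def leaves(t):
    """Leaf set of a nested-tuple tree."""
    return {t} if isinstance(t, str) else set().union(*(leaves(c) for c in t))

def triples(t):
    """All rooted triples ab|c displayed by t."""
    if isinstance(t, str):
        return set()
    out = set()
    child_leaves = [leaves(c) for c in t]
    for i, child in enumerate(t):
        out |= triples(child)
        # leaves hanging off the other children of this vertex
        outside = set().union(*(child_leaves[:i] + child_leaves[i + 1:]))
        for a, b in combinations(sorted(child_leaves[i]), 2):
            for c in outside:
                out.add((frozenset((a, b)), c))   # lca(a, b) lies strictly below lca(a, b, c)
    return out

left = (("a", "b"), ("c", ("d", "e")))
right = ("b", ("d", "e"))
```

Calling `triples(left)` yields the ten triples of the left tree, and `triples(right)` yields the single triple de|b.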
![Page 55: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/55.jpg)
Trees and triples
An arbitrary set of triples R is consistent if there is a tree that displays all triples in R.
Example: R(T) is consistent. R(T) ∪ {eb|d} is not consistent.
![Page 56: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/56.jpg)
Trees and triples
Theorem [Aho, Sagiv, Szymanski, Ullman - 1981, Semple & Steel - 2003]: There is a polynomial time algorithm, called BUILD, that constructs a tree for a given set of triples R or recognizes R as inconsistent.
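A compact recursive version of the BUILD idea can be sketched as follows. It is a simplified illustration, not an optimized implementation: triples ab|c are tuples (a, b, c), the output is a nested tuple, and the auxiliary graph that joins a and b for every triple ab|c on the current leaf set is recomputed in every recursive call.

```python
def build(triples, leaves):
    """Return a tree displaying all triples, or None if the set is inconsistent."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    if len(leaves) == 2:
        return tuple(sorted(leaves))
    inside = [(a, b, c) for a, b, c in triples if {a, b, c} <= leaves]
    # connected components of the graph with an edge {a, b} per triple ab|c
    comp = {x: {x} for x in leaves}
    for a, b, _ in inside:
        if comp[a] is not comp[b]:
            merged = comp[a] | comp[b]
            for x in merged:
                comp[x] = merged
    parts = {frozenset(s) for s in comp.values()}
    if len(parts) == 1:
        return None                      # a single component: no root split exists
    subtrees = []
    for part in sorted(parts, key=sorted):
        sub = build(inside, part)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)
```

Applied to the ten triples of the left tree from the previous slides, this sketch returns the tree ((a,b),(c,(d,e))); after adding eb|d it returns `None`, in line with the inconsistency example.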
![Page 57: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/57.jpg)
Triples for inferring the species tree
[Figure: two gene trees, each on the leaves a ∈ A, b ∈ B, c ∈ C.]
Given an event-labeled gene tree (T, t) and ab|c ∈ R(T). We write ab|c• if
$$t(\mathrm{lca}(a,b,c)) = \bullet = \text{“speciation”}$$
We know the assignment of genes to the species in which they occur. This gives us the triple set:
$$S = \{(AB|C) : \exists\ ab|c^{\bullet} \text{ with } a \in A,\ b \in B,\ c \in C\}$$
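The extraction of S can be sketched on a small event-labeled gene tree. Assumptions made here for illustration: an inner vertex is a pair (event, children) with event "S" for speciation or "D" for duplication, leaves are gene names, `sigma` maps genes to species, and only triples whose three species are pairwise distinct are kept (the slide does not state this restriction explicitly).

```python
from itertools import combinations

def leaves(t):
    """Genes below a vertex of the gene tree."""
    return [t] if isinstance(t, str) else [x for child in t[1] for x in leaves(child)]

def species_triples(t, sigma):
    """Species triples (frozenset({A, B}), C) from gene triples ab|c rooted at a speciation."""
    if isinstance(t, str):
        return set()
    event, children = t
    out = set()
    for child in children:
        out |= species_triples(child, sigma)
    if event == "S":                                   # lca(a, b, c) must be a speciation
        kids = [leaves(child) for child in children]
        for i, inner in enumerate(kids):
            outer = [x for j, k in enumerate(kids) if j != i for x in k]
            for a, b in combinations(inner, 2):
                for z in outer:
                    A, B, C = sigma[a], sigma[b], sigma[z]
                    if len({A, B, C}) == 3:            # assumption: distinct species only
                        out.add((frozenset((A, B)), C))
    return out
```

Gene triples rooted at a duplication contribute nothing, which is how paralogs are prevented from adding misleading species triples.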
![Page 58: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/58.jpg)
Triples for inferring the species tree
$$S = \{(AB|C) : \exists\ ab|c^{\bullet} \text{ with } a \in A,\ b \in B,\ c \in C\}$$
[Figure: gene tree with leaf labels a, a, b, c, d, e and the gene triples ab|c, ab|e, ac|e, ad|e, bc|e, bd|e, cd|e extracted from it.]
S = {AB|C, AB|E, AC|E, AD|E, BC|E, BD|E, CD|E}
![Page 59: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/59.jpg)
Triples for inferring the species tree
$$S = \{(AB|C) : \exists\ ab|c^{\bullet} \text{ with } a \in A,\ b \in B,\ c \in C\}$$
[Figure: gene tree with leaf labels a, b, c, c, d, e, f, f, mapped by µ into the species tree on A, B, C, D, E, F.]
Theorem: There is a species tree for the gene tree (T, t), i.e., for the symbolic representation of Θ ⇐⇒ the triple set S is consistent.
A reconciliation map µ from (T, t) to the species tree S can be constructed in polynomial time.
From Event-Labeled Gene Trees to Species Trees. Hernandez-Rosales M, Hellmuth M, Huber K, Moulton V, Wieseke N, Stadler PF, BMC Bioinformatics, 2012
![Page 65: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/65.jpg)
Inferring the Species Tree in O(|G||S|) time
Given: gene tree (T, t) = ((V, E), t), gene set G ⊆ V, consistent triple set S, species set S, and a map σ : G → S from genes to their respective species.
1. Construct a species tree S = (W, F) from S (e.g. with BUILD).
2. Construct the reconciliation map µ : V → W ∪ F as follows:
• µ(x) = σ(x) for all genes x ∈ G.
• µ(x) = lca_S(σ(L(x))) if t(x) = • = speciation
• µ(x) = [u, lca_S(σ(L(x)))] if t(x) = □ = duplication
[Figure: gene tree with leaf labels a, b, c, c, d, e, f, f, mapped by µ into the species tree on A, B, C, D, E, F.]
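The three rules for µ can be sketched as below. Several points are assumptions for illustration and not stated on the slide: L(x) is read as the set of genes below x, species tree vertices are represented by their leaf sets (clusters), gene tree vertices are triples (name, event, children), and the edge [u, lca] for a duplication is taken to be the edge from the parent u of the lca down to the lca (with u = `None` above the root). The sketch makes no attempt to reach the O(|G||S|) bound.

```python
def clusters(tree):
    """Leaf sets of all vertices of a species tree given as nested tuples."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    found = [c for child in tree for c in clusters(child)]
    found.append(frozenset().union(*found))
    return found

def lca(cl, species):
    """Smallest cluster containing all given species."""
    return min((c for c in cl if species <= c), key=len)

def parent(cl, v):
    """Smallest cluster properly containing v, or None for the root."""
    above = [c for c in cl if v < c]
    return min(above, key=len) if above else None

def reconcile(gene_tree, sigma, species_tree):
    cl = clusters(species_tree)
    mu = {}
    def visit(node):
        if isinstance(node, str):                 # gene x: mu(x) = sigma(x)
            mu[node] = frozenset([sigma[node]])
            return {sigma[node]}
        name, event, children = node
        below = set().union(*(visit(child) for child in children))
        v = lca(cl, frozenset(below))
        # speciation maps to a vertex, duplication to the edge entering that vertex
        mu[name] = v if event == "S" else (parent(cl, v), v)
        return below
    visit(gene_tree)
    return mu
```

A duplication is thus placed on the branch leading into the last common ancestor of the species in which its descendants occur, while a speciation is placed on that ancestor itself.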
![Page 66: Why should we care about cograph heuristics? Phylogenomics ... · Intro Orthologs, Paralogs & Characterization ILP and Results The Problem in Practice a1 a2 b1 b2 c1 c3 x d1 d2 d3](https://reader035.fdocuments.in/reader035/viewer/2022081617/604788850231ed124c4eb665/html5/thumbnails/66.jpg)
Results - Real Life Data
[Figure: species tree of the Aquificales data set with family annotation. Species: Hydrogenobaculum Y04AA1, Sulfurihydrogenibium azorense, Sulfurihydrogenibium Y03AOP1, Thermocrinis ruber, Hydrogenobacter thermophilus, Thermocrinis albus, Hydrogenivirga sp., Aquifex aeolicus, Persephonella marina, Desulfobacterium thermolithotrophum, Thermovibrio ammonificans. Families: Aquificaceae, Hydrogenothermaceae, Desulfurobacteriaceae.]
• Class of bacteria that live in harsh environmental settings, e.g., hot springs, sulfur pools, and thermal ocean vents.
• 11 Aquificales species with 2887 gene families (1372 - 3809 genes per species)
• ProteinOrtho → ILP pipeline (CE → MCS → LRT).