Deciphering Reticulate Evolution Using Phylogenetic Reconciliation · 2018-05-29 · Deciphering...
Transcript of Deciphering Reticulate Evolution Using Phylogenetic Reconciliation · 2018-05-29 · Deciphering...
Deciphering Reticulate Evolution UsingPhylogenetic Reconciliation
Mukul S. Bansal
Department of Computer Science and Engineering,University of Connecticut,
USA
July 9, 2014
Gene Family Evolution
ProblemHow did any given gene family evolve?
I Gene families evolve inside species trees.I Affected by evolutionary events such as gene duplication,
horizontal gene transfer, and gene loss.
A B C D A B C D
y
x
y
z
a1 c1 b1 d1a2 c2
Species Tree S
x
z
Evolution of Gene Family G
Duplication
Loss
Transfer
Gene Tree on G
Definition: DTL Reconciliation
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
a1 c1 b1 d1a2 c2
G with evolutionary history
x,Σ
y,Δ
y,Σ z,Σ D,Θ
A B C D
x
y
z
Duplication
Loss
Transfer
Definition: DTL Reconciliation
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
a1 c1 b1 d1a2 c2
G with evolutionary history
x,Σ
y,Δ
y,Σ z,Σ D,Θ
Input: A gene tree for that gene family, and a trusted rootedspecies tree.Output: An evolutionary history of that gene family showinghorizontal gene transfers, gene duplications, losses, and speciationevents.
DTL Reconciliation Problem Formulation
Typical formulation:
I Costs are assigned to duplications, transfers, and losses.
I Goal: Find the reconciliation that minimizes the total cost.
I Easy to compute cost for a given reconciliation.
a1 c1 b1 d1a2 c2
x,Σ
y,Δ
y,Σ z,Σ D,Θ
Costs: L=1, Δ=2, Θ=3
2 + 3 + 2x1 = 7
1Δ, 1Θ, 2L
Different reconciliation could have different cost.
Applications of DTL Reconciliation
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
a1 c1 b1 d1a2 c2
G with evolutionary history
x,Σ
y,Δ
y,Σ z,Σ D,Θ
I Understanding how gene families evolve.
I Dating gene birth.
I Inferring orthologs/paralogs/xenologs.
I Accurate gene tree reconstruction.
I Whole genome species tree construction.
I Constructing species phylogenetic networks.
Traditional Bottlenecks of DTL Reconciliation
1. Slow algorithms: Fastest algorithms had cubic timecomplexity.
I Could not study large gene families.I No gene tree reconstruction or species tree reconstruction.
2. Inaccurate gene trees: Accuracy of reconciliation deeplyimpacted by accuracy of gene trees.
3. Multiple optima: There are often multiple optimal solutions.I Unclear how to interpret single reconciliation.I Algorithms to output all optimal reconciliations have
exponential time complexity.
4. Event costs: Event costs impact reconciliations.I What is the “correct” event cost assignment.
Recent Advances
Addressing the bottlenecks.
(a) Asymptotically faster algorithms.
(b) Method for accurate prokaryotic gene tree reconstruction.
(c) Handling multiple optima by uniform sampling.
(d) Techniques and tools for studying impact of event costs.
(a) Faster Algorithms
Our Results
For undated species tree:
I O(mn)-time algorithm.
I Factor of n speedup.
For dated species trees:
I O(mn log n)-time algorithm for the fully dated version.(Transfers restricted to co-existing species).
I Factor of n/ log n speedup.
For improved accuracy:
I General O(mn2)-time algorithm that can handledistance-dependent transfer costs.
I Factor of n speedup.
Dynamic Programming Algorithm
A B C D
Species Tree S
a1 c1 b1 d1a2 c2
Gene Tree G
g s
Given any g ∈ V (G ) and s ∈ V (S), let
I c(g , s) = cost of an optimal reconciliation of G (g) with Ssuch that g is mapped to s.
Similarly, define:
I cΣ(g , s): with restriction that g is speciation.
I c∆(g , s): with restriction that g is duplication.
I cΘ(g , s): with restriction that g is transfer.
Dynamic Programming Algorithm
A B C D
Species Tree S
a1 c1 b1 d1a2 c2
Gene Tree G
gs
0inf
c(g , s) =
0 if g ∈ Le(G ) and s = M(g)∞ if g ∈ Le(G ) and s 6= M(g)min{cΣ(g , s), c∆(g , s), cΘ(g , s)} otherwise.
DP Algorithm: Nested Post-Order Traversal
I Nested post-order traversal of the gene tree and species treeto compute all values (cΣ(g , s),c∆(g , s),cΘ(g , s),c(g , s)).
A B C D
Species Tree S
a1 c1 b1 d1a2 c2
Gene Tree G
g sΣ
I Source of speedup is to compute each value in constant time.
DP Algorithm: Termination
A B C D
Species Tree S
a1 c1 b1 d1a2 c2
Gene Tree G g
11
7
I The optimal reconciliation cost of G and S is simply:mins∈V (S) c(rt(G ), s)
Dynamic Programming Algorithm
A B C D
Species Tree S
A B C D
Species Tree S
5
3
Dated species trees:
I Places constraints on transfers. Breaks clean structure.
I Constant time slows to log n time.
Distance dependant transfer costs:
I Each potential transfer event must be evaluated separately.
I Constant time slows to O(n) time.
Drastic Improvement in Runtime and Scalability
Dataset Type Dataset Size RANGER-DTL-U AnGST Mowgli
Simulated50 taxa (100 datasets) 2s 3m:26s 28m:30s
100 taxa (100 datasets) 3s 15m:4s 3h:52m200 taxa (100 datasets) 9s 1h:2m 29h:43m500 taxa (100 datasets) 35s >800h >400h
1,000 taxa (100 datasets) 2m:57s – >6,000h10,000 taxa (1 dataset) 4h:7m – –
Biological 4,733 gene trees, 100 taxa 1m:03s 3h:45m 41h:36m
I 50 taxa: 3m/28m → 2s
I 100 taxa: 15m/3h → 3s
I 200 taxa: 1h/29h → 9s
I 500 taxa: >800h/>400h → 35s
I 1000 taxa: –/>6000h → 3m
I 10000 taxa: –/– → 4h.
Faster Algorithms Enable Many New Applications
I Efficient genome-scale analysis.
I Much larger species tree.
I Species tree reconstruction.
I Accurate gene tree reconstruction.
I Phylogenetic network inference.
→ M. S. Bansal, E. J. Alm, and M. Kellis, “Efficient Algorithms for the Reconciliation Problem with GeneDuplication, Horizontal Transfer, and Loss”. Twentieth Annual International Conference on IntelligentSystems for Molecular Biology (ISMB 2012); Bioinformatics 2012, 28: i283–i291.
(b) Accurate Gene Tree Reconstruction
Gene Tree Reconstruction
I Gene trees are hard to reconstruct accurately.I Eukaryotes: TreeBest, SPIDIR, PrIME-GSR, SPIMAP,
NOTUNG, tt, TreeFix.I Prokaryotes: Recent efforts: AnGST and MowgliNNI. (Also
ALE in probabilistic framework.)
GA ..... CA
AT ..... CA
GT ..... CC
A B C D
Species Tree S
a1 c1 b1 d1a2 c2
Gene Tree
Developing a Principled Approach
Goal:
I Use fast algorithms to develop an efficient and principledapproach for prokaryotic gene tree reconstruction.
Outcome:
I TreeFix-DTL: A highly accurate method for prokaryotic genetree reconstruction.
TreeFix-DTL: Algorithm Overview
Input: ML gene tree, multiple sequence alignment, and rootedspecies tree.Output: Reconstructed (error-corrected) gene tree.
I Statistically informed, fast and scalable, easy to use.
Statistically Equivalent Gene Trees
Species Tree
104
76
91
Dataset Generation
Basic simulation setup:
I 50 taxa Yule species trees.
I Gene trees with low, medium, and high rates of D, T, L.
I Mutation rates (substitutions per site) 1, 3, 5, and 10.
I Simulated amino acid sequences of length 173 and 333.
I Reconstructed using RAxML.
Total of 24 different datasets, each with 100 gene trees.
Performance: TreeFix-DTL is Highly Accurate
RAxML NOTUNG Treefix MowgliNNI AnGST TreeFix-DTL
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
Rate 1 Rate 3 Rate 5 Rate 10
No
rmal
ize
d R
F d
ista
nce
Reconstruction accuracy for different mutation rates(Low rate of DTL, Sequence Length: 333)
Performance: Tree Reconstruction Accuracy
Method Normalized RF distance Perfect reconstructionRAxML 0.097 3.04%
NOTUNG 0.088 13.08%TreeFix 0.079 10.29%
MowgliNNI 0.039 22.17%AnGST 0.032 29.08%
TreeFix-DTL 0.028 38.21%
1. RAxML and eukaryotic methods error prone and ineffective.
2. TreeFix-DTL is highly accurate.
Performance: Greatly Improved Reconciliation Accuracy
Accuracy of inferred duplications and transfers: Averaged across allsimulated datasets.
% p
reci
sion
020
6010
0
80 81 8194
% s
ensi
tivity
020
6010
0
8189 90
98
% p
reci
sion
020
6010
0
3871 76
96
% s
ensi
tivity
020
6010
0
68 68 73
92
RAxML AnGST TreeFix-DTL True
Duplications Transfers Losses
% p
reci
sion
020
6010
0
31
73 7685
% s
ensi
tivity
020
6010
0
7687 87 92
Impact of TreeFix-DTL
I Now possible to reconstruct gene trees very accurately.
I Large impact on accuracy of reconciliation and on anydownstream analyses.
→ M. S. Bansal, Y. Wu, E. J. Alm, and M. Kellis, “Improved Gene Tree Error-Correction for DecipheringMicrobial Evolution”. Under review.
→ Y. Wu, M. D. Rasmussen, M. S. Bansal, M. Kellis. TreeFix: statistically informed gene tree errorcorrection using species trees. Systematic Biology 62(1): 110-120, 2013.
(c) Handling Multiple Optima
Optimal Reconciliations Need Not Be Unique
Costs: L=1, Δ=1, Θ=3
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
A B C D
x
y
z
Duplication
Loss
Transfer
A B C D
x
zTransfer
Transfer
y
I Which one is the “correct” reconciliation?
Number of Optimal Solutions
1 17%
[2,9] 31%
[10,99] 20%
[100, 10k] 17%
[10k, 100k] 5%
[100k, 100Mil] 7%
>=100Mil 3%
Number of optimal solutions (4699 trees)
Exponential Increase with Gene Tree Size
110
1001000
10000100000
100000010000000
1000000001E+091E+101E+111E+121E+131E+141E+151E+16
0 20 40 60 80 100 120 140 160 180 200
Num
ber o
f Sol
utio
ns
Gene Tree Size
Number of Optimal Solutions by Gene Tree Size
Dealing With Non-Uniqueness
Fundamental question:
I How different are the different optimal reconciliations?
Costs: L=1, Δ=1, Θ=3
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
A B C D
x
y
z
Duplication
Loss
Transfer
A B C D
x
zTransfer
Transfer
y
x,Σ
z,Σ
y
D,Θ
In this case, most node mappings and event assignments areconsistent!
Goals
Goals:
I Understand the space of optimal reconciliations.
I Specifically, understand how different the mapping and eventassignments can be in the different optima.
Solution: Sample the space of optimal reconciliations uniformly atrandom.
I Enables systematic study of space of optima for any inputinstance.
I Enables quick estimates of “support” for any event ormapping assignment.
Sample Optimal Reconciliations Uniformly
I We show that this is possible to do using a modification ofthe DP algorithm.
I Idea is to keep track of number of optimal solutions for eachsubproblem, and then weigh different equally optimal choicesduring backtracking step.
I Complexity increases from O(mn) to O(mn2).
Application to Biological Dataset
Dataset:>4700 gene trees from a set of 100 species broadly sampled acrossthe tree of life.
I Ran 500 times on each gene tree and aggregated the 500reconciliations.
Aggregation: For each internal node in the gene tree:
I How consistent are the 500 mapping assignments? (Frequencyof most frequent mapping).
I How consistent are the 500 event assignments? (Frequency ofmost frequent event assignment).
Mapping Assignment Consistency
73.15
0.90 1.29
19.10
5.55
0
10
20
30
40
50
60
70
80
90
100
100 [90, 99] [80, 89] [50, 79] [0, 49]
Frac
tion
of n
odes
(%)
Consistency (%)
I For many nodes, the mapping assignments are 100%consistent.
I But significant fraction of nodes have inconsistent mappingassignments.
Event Assignment Consistency
93.31
0.63 0.79 4.97
0.25 0
10
20
30
40
50
60
70
80
90
100
100 [90, 99] [80, 89] [50, 79] [0, 49]
Frac
tion
of n
odes
(%)
Consistency (%)
I Good news: Event assignments are highly consistent!
Impact of Uniform Sampling on Handling Multiple Optima
I Can efficiently study the space of optimal reconciliations.
I Can be used to determine consistency of the mapping andevent assignment for any gene tree node.
I Event assignments are highly consistent: Great for functionalgenomic studies.
A B C D
y
Species Tree S
x
z
a1 c1 b1 d1a2 c2
Gene Tree on G
a1 c1 b1 d1a2 c2
x,Σ
y,Δ
y,Σ z,Σ D,Θ(80%,60%)
→ M. S. Bansal, E. J. Alm, and M. Kellis, “Reconciliation Revisited: Handling Multiple Optima whenReconciling with Duplication, Transfer, and Loss”. Journal of Computational Biology (JCB), 20(10):738-754, 2013.
(d) Impact of Event Costs
Event Costs Define Optimal Reconciliations
I Which one is the “correct” reconciliation?
Uncertainty in Event Cost Assignments
Fundamental question:I What are the “correct” event costs?
I Difficult to estimate.I “Correct” event costs may not even exist.
I How do reconciliations vary as we change event costs?
Pareto-optimal Reconciliations
I Keep track of all Pareto-optimal Reconciliations.I No other reconciliation is better in all three event counts.I Represents reconciliations that could be optimal for some
assignment of event costs.
I O(m5n logm) time.
I Use to partition event cost space into equivalence regions.
Equivalence RegionsT
rans
fer
cost
rel
ativ
e to
dup
licat
ion
5
4
3
2
1
54321
<6, 1, 2, 3> Count = 3<6, 0, 3, 1> Count = 2<4, 0, 5, 0> Count = 8<4, 1, 4, 0> Count = 2<6, 2, 1, 5> Count = 1<5, 4, 0, 10> Count = 1
Loss cost relative to duplication
Tra
nsfe
r co
st r
elat
ive
to d
uplic
atio
n
Loss cost relative to duplication1 2 3 4 5
1
2
3
4
5(a) (b) (c)
Impact in Practice
I Can visualize equivalence regions.
I Can choose suitable event cost assignments.
I Can be used to determine “consensus” mappings and eventassignments.
→ R. Libeskind-Hadas, Y. Wu, M. S. Bansal, and M. Kellis, “Pareto-Optimal Phylogenetic TreeReconciliation”. ISMB 2014; Bioinformatics 30: i87-i95, 2014.
In Summary
I There now exist effective algorithms and methods to handlethe major bottlenecks:
1. Slow algorithms → Asymptotically faster algorithms2. Inaccurate gene trees → TreeFix-DTL3. Multiple optima → Uniformly random sampling4. Impact of event costs → Pareto-optimality, equivalence regions
I Easy to use, Fast, Accurate, Scalable.
Constructing Phylogenetic Networks
Using existing software:
I Build accurate gene trees.
I Obtain a species tree (can be subset of network).
I Use DTL reconciliation to reconcile individual gene trees.
I Aggregate transfers inferred for gene trees onto species tree.
I Advantages: Scalable, can handle all gene families, not fooledby duplication/loss and ILS.
Future Work:
I Program to aggregate reticulations onto species tree to drawnetworks and infer highways. Use global view to refineindividual reconciliations.
I Reconciliation against species networks.
Thank You!
Questions!