![Page 1: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/1.jpg)
Problems with large-scale phylogeny
Tandy Warnow, UT-Austin
Department of Computer Sciences
Center for Computational Biology and Bioinformatics
![Page 2: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/2.jpg)
Phylogeny
Orangutan Gorilla Chimpanzee Human
From the Tree of Life Website, University of Arizona
![Page 3: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/3.jpg)
DNA Sequence Evolution
[Figure: the ancestral sequence AAGACTT (3 million years ago) evolves down a tree through TGGACTT and AAGGCCT (2 million years ago) and AGGGCAT, TAGCCCT, and AGCACTT (1 million years ago) to the present-day sequences AGGGCAT, TAGCCCA, TAGACTT, AGCACAA, and AGCGCTT.]
![Page 4: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/4.jpg)
Molecular Systematics
[Figure: five observed sequences (TAGCCCA, TAGACTT, TGCACAA, TGCGCTT, AGGGCAT) for the taxa U, V, W, X, Y, and a tree on U, V, W, X, Y.]
![Page 5: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/5.jpg)
Quantifying Error
FN: false negative (missing edge)
FP: false positive (incorrect edge)
[Figure: a true tree and an inferred tree with the FN and FP edges marked; 50% error rate]
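The FN and FP counts can be illustrated with a small Python sketch. It assumes each tree is given as a set of bipartitions (one per internal edge) and that each rate is normalized by the number of internal edges in the corresponding tree; the helper names `split` and `error_rates`, the normalization, and the two example trees on U, V, W, X, Y are all illustrative assumptions, not taken from the slides.

```python
def split(side, taxa):
    """Canonical form of a bipartition: the side that omits the smallest taxon."""
    side = frozenset(side)
    return side if min(taxa) not in side else frozenset(taxa) - side


def error_rates(true_edges, inferred_edges):
    """Return (FN rate, FP rate) for two trees given as sets of bipartitions."""
    true_edges, inferred_edges = set(true_edges), set(inferred_edges)
    missing = true_edges - inferred_edges      # false negatives: true edges not recovered
    incorrect = inferred_edges - true_edges    # false positives: inferred edges not in the true tree
    return len(missing) / len(true_edges), len(incorrect) / len(inferred_edges)


TAXA = "UVWXY"
# Hypothetical trees: each has two internal edges, and they share one of them.
true_tree = {split("UV", TAXA), split("XY", TAXA)}
inferred_tree = {split("UV", TAXA), split("WY", TAXA)}
fn_rate, fp_rate = error_rates(true_tree, inferred_tree)
print(fn_rate, fp_rate)
```

In this made-up example one of the two true edges is missing and one of the two inferred edges is incorrect, so both rates come out at one half.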
![Page 6: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/6.jpg)
Methods and Conjectures
• Popular methods: Neighbor-Joining (polynomial time, distance-based), heuristics for Maximum Parsimony and Maximum Likelihood
• Big debates about which is better, and when
• Our research shows: big differences between NJ and MP, on large enough trees
• Our research also shows that current techniques (in the best software packages) can be sped up, to solve MP and ML faster.
![Page 9: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/9.jpg)
Computational challenges for Assembling the Tree of Life
8 million species for the Tree of Life -- cannot currently analyze more than a few hundred (and even this takes years)
• We need new methods for inferring large phylogenies - hard optimization problems!
• We need new software for visualizing large trees
• We need new database technology
• Not all phylogenies are trees, so we need methods for inferring phylogenetic networks
![Page 10: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/10.jpg)
Our research projects
DCM-boosting phylogenetic reconstruction methods (improving the accuracy of NJ and speeding up MP and ML)
Phylogenetic reconstruction from gene orders
Reticulate evolution detection and phylogenetic network reconstruction
Visualization of large trees
![Page 11: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/11.jpg)
DCM-boosting NJ
Outline:
Convergence rates (how long do the sequences need to be for methods to reconstruct the true tree with high probability?)
DCM-boosting Neighbor-Joining
Experimental study comparing DCM-NJ to NJ on large trees
![Page 12: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/12.jpg)
The Jukes-Cantor model of DNA sequence evolution
• A random DNA sequence evolves down the tree from the root
• The positions within the sequence evolve independently and identically
• If the nucleotide at a particular position changes on an edge, it changes with equal probability to each of the other three nucleotides
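A minimal Python sketch of what these assumptions imply for a single edge. It uses the standard closed form for Jukes-Cantor substitution probabilities, with the edge length `t` measured in expected substitutions per site; the function names and that parameterization are assumptions made for illustration and do not come from the slides.

```python
import math

NUCLEOTIDES = "ACGT"


def jc_transition_matrix(t):
    """Substitution probabilities on an edge of length t (expected substitutions per site)."""
    decay = math.exp(-4.0 * t / 3.0)
    same = 0.25 + 0.75 * decay   # probability the nucleotide is unchanged at the end of the edge
    other = 0.25 - 0.25 * decay  # probability of ending at one specific different nucleotide
    return {a: {b: (same if a == b else other) for b in NUCLEOTIDES} for a in NUCLEOTIDES}


def jc_distance(p):
    """Jukes-Cantor corrected distance from the observed fraction p of differing sites (p < 0.75)."""
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


P = jc_transition_matrix(0.3)
print(P["A"]["A"], P["A"]["C"])
```

The correction in `jc_distance` inverts the same formula, which is how distance-based methods such as NJ typically obtain their input matrix under this model.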
![Page 13: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/13.jpg)
The General Markov model of DNA sequence evolution
• A random DNA sequence evolves down the tree from the root
• The positions within the sequence evolve independently and identically (or under a distribution of rates across sites)
• Each edge has a 4x4 stochastic substitution matrix governing the evolution of a random site on the edge
![Page 14: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/14.jpg)
Statistical Performance Issues
• Statistical consistency: does the reconstruction method return the true tree with high probability from long enough sequences?
• “Convergence Rate”: at what sequence length will the reconstruction method return the true tree with high probability?
• Robustness: if we violate the model conditions, what can we say about the performance of the method?
![Page 15: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/15.jpg)
Absolute fast convergence vs. exponential convergence
![Page 16: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/16.jpg)
Theoretical Comparison of Methods
• Theorem 1 [Warnow et al. 2001]: DCMNJ is absolute fast converging for the GM model.
• Theorem 3 [Atteson 1999]: NJ is exponentially converging for the GM model (but is not known to be absolute fast converging, afc).
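For reference, the two convergence notions can be written out roughly as follows. This is a paraphrase of the usual definitions rather than text from the slides; the symbols f, g, ε, p, c, and k are introduced here for illustration, and the exact form of the exponential bound is an assumption.

```latex
% Model trees: (T, \{\lambda(e)\}) with n leaves and f \le \lambda(e) \le g on every edge e.
% S denotes a set of sequences of length k generated on the model tree.
\textbf{Absolute fast convergence.} A method $M$ is afc for a model if, for all
$f, g, \varepsilon > 0$, there is a polynomial $p$ such that
\[
  \Pr\bigl[M(S) = T\bigr] \ge 1 - \varepsilon
  \quad \text{whenever } k \ge p(n).
\]
\textbf{Exponential convergence.} The same guarantee is only known to hold under a bound
of the form $k \ge c^{\,n}$, for some constant $c > 1$ depending on $f$, $g$, and $\varepsilon$.
```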
![Page 17: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/17.jpg)
DCM1: a divide-and-conquer strategy to improve NJ’s accuracy
Phase I:
Basic step: Divide the dataset into many small-diameter subproblems. Construct NJ trees on each subproblem, and merge subtrees using the “Strict Consensus Merger”. Refine the resultant tree using PAUP*’s constrained search.
Do the basic step for each way of setting the diameter.
Phase II: Pick the “best tree” out of the set of O(n^2) trees.
![Page 18: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/18.jpg)
Strict Consensus Merger
[Figure: example of the Strict Consensus Merger combining two trees on overlapping subsets of taxa 1 through 7 into a single tree.]
![Page 19: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/19.jpg)
DCM-Boosting [Warnow et al. 2001]
• DCM+SQS is a two-phase procedure which reduces the sequence length requirement of methods.
[Diagram: an exponentially converging method passes through DCM and then SQS to yield an absolute fast converging method.]
• DCMNJ+SQS is the result of DCM-boosting NJ.
• We can replace SQS by MP or ML, and get better empirical performance (though not provably afc)
![Page 20: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/20.jpg)
DCM-boosting Neighbor Joining
• DCM-boosting makes distance-based methods more accurate (we have established this for other distance-based methods, too)
[Figure: error rate (0 to 0.8) versus number of taxa (0 to 1600) for NJ and DCM-NJ.]
![Page 21: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/21.jpg)
Summary of DCM-NJ
• These are the first polynomial time methods that improve upon NJ (with respect to topological accuracy) and are never worse than NJ.
• The advantage obtained with DCMNJ+MP and DCMNJ+ML increases with number of taxa, deviation from a molecular clock, and rate of evolution.
• In practice these new methods are slower than NJ (minutes vs. seconds), but still much faster than MP and ML (which can take days).
![Page 22: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/22.jpg)
Time is a bottleneck for MP and ML
[Figure: MP score over the space of phylogenetic trees, showing a local optimum and the global optimum.]
• Systematists tend to prefer trees with the optimal maximum parsimony score or optimal maximum likelihood score; however, both problems are hard to solve
• (Our experimental studies show that NJ doesn’t do as well as MP when trees are big and have high rates of evolution, so NJ and other fast methods aren’t sufficiently reliable.)
![Page 23: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/23.jpg)
MP/ML heuristics
[Figure (fake study): MP score of best trees versus time, showing the performance of a hill-climbing heuristic.]
![Page 24: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/24.jpg)
DCM-boosting Speeding up MP/ML heuristics
[Figure (fake study): MP score of best trees versus time, comparing the performance of a hill-climbing heuristic with the desired performance.]
![Page 25: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/25.jpg)
Maximum Parsimony
[Figure: the input sequences ACT, GTT, ACA, and GTA, and candidate trees with these sequences at the leaves.]
![Page 26: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/26.jpg)
Maximum Parsimony
[Figure: three candidate trees on ACT, GTT, ACA, and GTA with labeled internal nodes and edge costs, giving MP scores of 5, 7, and 4. The tree with MP score = 4 is the optimal MP tree.]
![Page 27: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/27.jpg)
Maximum Parsimony: computational complexity
[Figure: the optimal MP tree on ACT, ACA, GTT, and GTA, with internal nodes labeled ACA and GTA and edge costs 1, 2, 1; MP score = 4.]
Finding the optimal MP tree is NP-hard
Optimal labeling can be computed in linear time O(nk)
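The fixed-tree computation can be sketched with the bottom-up pass of Fitch's algorithm, which visits every node once per site. The sketch assumes n is the number of leaves and k the number of sites, that the tree is rooted arbitrarily and written as nested pairs, and that the leaf names a through d are placeholders for the four example sequences; it returns only the score, since recovering the internal labels needs an additional top-down pass.

```python
def _fitch_site(node, site, sequences):
    """Return (candidate state set, substitution count) for one site in the subtree at node."""
    if not isinstance(node, tuple):
        return {sequences[node][site]}, 0
    left_states, left_cost = _fitch_site(node[0], site, sequences)
    right_states, right_cost = _fitch_site(node[1], site, sequences)
    common = left_states & right_states
    if common:
        # The children can agree on a state, so no substitution is needed here.
        return common, left_cost + right_cost
    # The children disagree: one substitution is charged at this node.
    return left_states | right_states, left_cost + right_cost + 1


def fitch_score(tree, sequences):
    """Parsimony score of a fixed binary tree given as nested 2-tuples of leaf names."""
    k = len(next(iter(sequences.values())))
    return sum(_fitch_site(tree, site, sequences)[1] for site in range(k))


seqs = {"a": "ACT", "b": "ACA", "c": "GTT", "d": "GTA"}
best = fitch_score((("a", "b"), ("c", "d")), seqs)
other = fitch_score((("a", "c"), ("b", "d")), seqs)
print(best, other)  # 4 5
```

Scoring one tree is cheap; the NP-hardness comes from having to search over the tree topologies themselves.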
![Page 28: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/28.jpg)
The DCM technique for speeding up MP/ML searches
![Page 29: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/29.jpg)
DCM2-MP/ML
• Step 1: pick a threshold at which the threshold graph is connected, and divide the dataset into two overlapping subsets.
• Step 2: Compute trees on each subset using a heuristic for MP or ML
• Step 3: Merge subtrees using the Strict Consensus Merger
• Step 4: Refine the resultant tree using PAUP* constrained search
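Step 1 can be illustrated with a short Python sketch. It assumes the threshold graph has one vertex per taxon and an edge between two taxa whenever their pairwise distance is at most the threshold q, and it returns the smallest q at which that graph is connected. Choosing the smallest such threshold, the function names, and the toy distance matrix are assumptions for illustration; the split into two overlapping subsets is not shown.

```python
def is_connected(dist, q):
    """True if the threshold graph (edge i-j iff dist[i][j] <= q) is connected."""
    n = len(dist)
    seen = {0}
    stack = [0]
    while stack:
        i = stack.pop()
        for j in range(n):
            if j not in seen and dist[i][j] <= q:
                seen.add(j)
                stack.append(j)
    return len(seen) == n


def smallest_connecting_threshold(dist):
    """Smallest pairwise distance q at which the threshold graph becomes connected."""
    n = len(dist)
    candidates = sorted({dist[i][j] for i in range(n) for j in range(i + 1, n)})
    for q in candidates:
        if is_connected(dist, q):
            return q
    return None  # only reached when there are fewer than two taxa


# Hypothetical symmetric distance matrix on four taxa.
D = [
    [0, 1, 4, 5],
    [1, 0, 3, 6],
    [4, 3, 0, 2],
    [5, 6, 2, 0],
]
print(smallest_connecting_threshold(D))  # 3
```

With this matrix, thresholds 1 and 2 leave the pairs {0, 1} and {2, 3} in separate components, and the edge of length 3 between taxa 1 and 2 is the first to join them.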
![Page 30: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/30.jpg)
DCM2 vs hill-climbing
Biological dataset of 388 rRNA sequences. Maximum subproblem size = 70%
![Page 31: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/31.jpg)
DCM2 vs hill-climbing
Biological dataset of 503 rRNA sequences. Maximum subproblem size = 64%
![Page 32: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/32.jpg)
DCM2 vs hill-climbing
Biological dataset of 816 rRNA sequences. Maximum subproblem size = 55%
![Page 33: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/33.jpg)
What we see
• Some datasets decompose well, and DCM gives a real advantage
• The bigger the dataset, and the more careful the heuristic search, the less good the decomposition has to be for DCM to give an advantage
• Outlier identification may help
![Page 34: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/34.jpg)
Other projects (briefly)
• Gene order phylogeny: GRAPPA (our free software) is the fastest and most accurate software for reconstructing phylogenies from gene order and content data. Joint project with Bob Jansen (UT) and Bernard Moret (UNM), and others.
• Reticulate evolution inference. Our research shows that no existing method for reconstructing networks works, and that methods (such as ILD) for detecting reticulation fail. Joint project with Randy Linder (UT) and Bernard Moret.
![Page 35: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/35.jpg)
Acknowledgements
• Funding: The David and Lucile Packard Foundation, and The National Science Foundation.
• Collaborators: Bernard Moret (UNM), Daniel Huson (Tubingen), Lisa Vawter (Aventis), Katherine St. John (CUNY), Randy Linder (UT), Bob Jansen (UT)
• Students: Luay Nakhleh, Usman Roshan, Jerry Sun, and Li-San Wang
![Page 36: Problems with large-scale phylogeny Tandy Warnow, UT-Austin Department of Computer Sciences Center for Computational Biology and Bioinformatics.](https://reader035.fdocuments.in/reader035/viewer/2022062504/5a4d1b897f8b9ab0599bdf0a/html5/thumbnails/36.jpg)
Phylolab, U. Texas
Please visit us at http://www.cs.utexas.edu/users/phylo/