Understanding sets of trees CS 394C September 10, 2009.
![Page 1: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/1.jpg)
Understanding sets of trees
CS 394C
September 10, 2009
![Page 2: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/2.jpg)
Basic challenge
• Phylogenetic analyses are sometimes based upon a single marker, but often based upon many markers
• Each marker can be analyzed separately, or the entire set can be combined into one “super-matrix”
• Each matrix (each dataset) can result in many trees (almost no matter how you analyze the matrix)
What to do with huge numbers of trees?
![Page 3: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/3.jpg)
What to do?
• How to estimate evolutionary history from many trees
• How to efficiently store large sets of trees
• How to enable efficient queries of the set of trees
![Page 5: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/5.jpg)
First, a few questions:
• Why are gene trees different from the species tree?
• Why are estimated gene trees different from the true gene tree?
• Under what conditions is the true evolutionary history not a tree? (i.e., what is “reticulation”?)
![Page 6: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/6.jpg)
Reticulation
• Evolutionary histories can be reticulate (meaning non-treelike):
– Horizontal Gene Transfer (HGT)
– Hybrid speciation
– Recombination
• Most phylogeny estimation methods produce trees.
• Good resource about reticulate phylogenies: book chapter by Luay Nakhleh (see 394C webpage for the link)
![Page 7: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/7.jpg)
• We will assume that all evolutionary histories are treelike for the remainder of today’s presentation.
• Later in the course we’ll discuss reticulate evolution…
![Page 8: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/8.jpg)
Estimated Gene Trees can differ from Species Trees
• Biological reasons:
– Deep coalescent events (alleles)
– Gene duplication and loss (gene families)
• Computational reasons:
– Insufficient time
– Poor methods (e.g., UPGMA)
– Poor models (e.g., ML using Jukes-Cantor)
• Data issues:
– Insufficient data (meaning not enough sites)
– Poor alignments
![Page 9: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/9.jpg)
Examples of problems
When true gene trees can differ from species tree:
• Given a collection of gene trees, find a species tree that minimizes the number of “deep coalescent” events
When true gene trees should equal the species tree:
• Given a collection of gene trees, find a species tree that minimizes the total distance to the gene trees
![Page 10: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/10.jpg)
When gene trees can differ from species tree
Software/Algorithms for deep-coalescent (see PhyloNet from Nakhleh’s webpage at Rice)
GLASS (Roch and Mossel) - distance-based
MDC (Than and Nakhleh) - parsimony
STEM (Kubatko) - ML
BEST (Liu et al.) - Bayesian
BUCKy (Ané et al.) - Bayesian
Software/Algorithms for duplication-loss
NOTUNG (Durand)
Duptree (Bansal et al.)
Hallett and Lagergren - algorithms/complexity
![Page 11: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/11.jpg)
When gene trees should equal the species tree
• The problem here is that estimated gene trees can differ from the true gene trees.
• Although the problem is “simple”, it is still interesting -- computationally and mathematically.
• Plus, we can still make novel contributions.
![Page 12: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/12.jpg)
The very simplest problem
Easiest case:
• One species tree, and the true gene trees will agree with the species tree
• Estimated trees are on the full set of taxa
Approaches:
• Consensus methods: return a tree on the entire set S of taxa summarizing the input trees
• Agreement methods: return a tree on a subset of the taxa on which the trees agree
• Clustering, then consensus/agreement
![Page 13: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/13.jpg)
Consensus methods
• These are the most common ways of analyzing datasets of trees
• Examples:
– Strict consensus
– Majority consensus
– Greedy consensus (aka “extended majority”)
– Others less frequently used include: Gordon’s, Adams, the Strict Consensus Supertree, Local Consensus methods, and more.
• See the survey paper by David Bryant for some of these
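A minimal Python sketch of strict and majority consensus, under some assumptions not stated on the slide: every input tree is unrooted, is on the same taxon set, and is given as a list of its non-trivial bipartitions (each written as one side of the split). The function names `normalize` and `consensus_splits` are hypothetical. The sketch returns only the bipartitions of the consensus tree, not a tree object.

```python
from collections import Counter

def normalize(side, taxa):
    """Write a bipartition as the side that excludes the smallest taxon,
    so that A|B and B|A compare equal."""
    side, taxa = frozenset(side), frozenset(taxa)
    return taxa - side if min(taxa) in side else side

def consensus_splits(trees, taxa, strict=False):
    """Return the bipartitions kept by strict or majority consensus.

    trees  : list of trees, each a list of bipartition sides (iterables of taxa)
    strict : keep splits found in every tree; otherwise keep splits found
             in more than half of the trees
    """
    counts = Counter()
    for tree in trees:
        # A set guards against a split being listed twice in one tree.
        counts.update({normalize(side, taxa) for side in tree})
    k = len(trees)
    if strict:
        return {s for s, c in counts.items() if c == k}
    return {s for s, c in counts.items() if 2 * c > k}

taxa = {1, 2, 3, 4, 5}
trees = [
    [{1, 2}, {1, 2, 3}],   # 12|345 and 123|45
    [{1, 2}, {4, 5}],      # 12|345 and 123|45
    [{1, 3}, {4, 5}],      # 13|245 and 123|45
]
print(consensus_splits(trees, taxa))               # majority: 12|345 and 123|45
print(consensus_splits(trees, taxa, strict=True))  # strict: 123|45 only
```

Splits that occur in more than half of the trees are always pairwise compatible, which is why the majority set can be returned without any further check.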
![Page 14: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/14.jpg)
Simplest problems, cont.
• “Agreement” methods return trees on subsets of S, on which the trees are the same (or compatible)
– MAST: maximum agreement subtree (used in practice, sometimes)
– MCST: maximum compatible subtree (Ganapathy et al., not used in practice)
• The difference between these is how polytomies are handled
![Page 15: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/15.jpg)
Soft vs. hard polytomies
• Polytomy: node of high degree (greater than three for an unrooted tree)
• Polytomies arise in estimations when consensus methods are used
• Polytomies also arise when contracting short branches in estimated trees
• Polytomies can be “hard” (representing true radiations) or “soft” (representing lack of information)
![Page 16: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/16.jpg)
Compatible source trees
• Estimated trees can be “compatible” when we interpret polytomies as “soft”
• “Compatible” means that there is a tree which is a common refinement.
• Example: 123|456, 12|3456, 1235|46.
• We can compute the compatibility tree (when it exists) in O(nk) time, where n=|S| and there are k source trees
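The example bipartitions can be checked with a small Python sketch. It relies on the standard fact that a set of bipartitions has a common refinement exactly when every pair is compatible, and that two bipartitions A|A' and B|B' are compatible when at least one of the four intersections A∩B, A∩B', A'∩B, A'∩B' is empty. The function names are hypothetical, and this quadratic pairwise check is only an illustration, not the O(nk) algorithm mentioned above.

```python
from itertools import combinations

def compatible(a, b, taxa):
    """True if bipartitions a|rest and b|rest can occur in one tree."""
    a, b, taxa = set(a), set(b), set(taxa)
    a_rest, b_rest = taxa - a, taxa - b
    # Compatible iff at least one of the four pairwise intersections is empty.
    return any(not (x & y) for x in (a, a_rest) for y in (b, b_rest))

def all_compatible(splits, taxa):
    """True if every pair of bipartitions is compatible."""
    return all(compatible(a, b, taxa) for a, b in combinations(splits, 2))

taxa = {1, 2, 3, 4, 5, 6}
# The slide's example: 123|456, 12|3456, 1235|46
print(all_compatible([{1, 2, 3}, {1, 2}, {1, 2, 3, 5}], taxa))  # True
# 12|3456 and 23|1456 cannot both appear in one tree
print(all_compatible([{1, 2}, {2, 3}], taxa))                   # False
```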
![Page 17: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/17.jpg)
Computational complexity
• Most consensus methods (which return a tree on the entire set S of taxa) are polynomial time.
• Most “agreement methods” (which return a tree on the largest subset of the taxa on which the source trees “agree”) are based upon NP-hard problems. Some (e.g., MAST) have fixed-parameter polynomial time solutions.
![Page 18: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/18.jpg)
Supertree problems
• Realistic complexity: not all the source trees are on the same set of taxa.
• Obvious problems:
– Find the tree on which all the source trees agree (if it exists).
– Find the tree on which a maximum number of the source trees agree.
• Both are NP-hard.
![Page 19: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/19.jpg)
Quartet compatibility
• Simple case: all the source trees are on four taxa.
• We ask: does there exist a tree which agrees with all the source trees?
• NP-hard!
![Page 20: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/20.jpg)
Quartet tree amalgamation
• Given a collection of quartet trees, find a tree which agrees with a maximum number of these quartet trees
• NP-hard, since compatibility is NP-hard
• Hard to approximate, but PTAS if you have a tree on every quartet of taxa (Jiang et al.)
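The objective being maximized can be made concrete with a short Python sketch that scores one candidate tree against a collection of quartet trees. It assumes the candidate is unrooted and given by its bipartitions (one side of each split), and it uses the fact that a tree agrees with the quartet ab|cd exactly when some edge separates {a,b} from {c,d}. The names `displays` and `quartet_score` are hypothetical. This only evaluates a given tree; searching for the best tree is the hard part.

```python
def displays(splits, quartet):
    """True if some bipartition separates {a, b} from {c, d}."""
    (a, b), (c, d) = quartet
    left, right = {a, b}, {c, d}
    for side in splits:
        side = set(side)
        if left <= side and not (right & side):
            return True
        if right <= side and not (left & side):
            return True
    return False

def quartet_score(splits, quartets):
    """Number of input quartet trees that the candidate tree agrees with."""
    return sum(displays(splits, q) for q in quartets)

# Candidate tree on taxa 1..5 with splits 12|345 and 123|45
tree = [{1, 2}, {1, 2, 3}]
quartets = [((1, 2), (3, 4)), ((1, 3), (4, 5)), ((1, 4), (2, 3))]
print(quartet_score(tree, quartets))  # 2: the tree does not display 14|23
```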
![Page 21: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/21.jpg)
Quartet amalgamation algorithms
• Quartet Puzzling (Strimmer and von Haeseler)
• Q* (Berry et al.)
• Quartet Cleaning (Berry et al.)
• Weight Optimization (Ranwez and Gascuel)
• Quartets MaxCut (Snir and Rao)
But see also the paper (St. John et al.) evaluating early quartet methods on the CS 394C webpage
![Page 22: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/22.jpg)
What about rooted trees?
Given a set of rooted source trees, we ask:
• Is there a tree that agrees with all the rooted source trees?
![Page 23: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/23.jpg)
Rooted tree compatibility
• Aho, Sagiv, Szymanski, and Ullman: polynomial time, recursive algorithm:
– If n=1, return the singleton tree.
– If n>1, then compute an equivalence relation on the set of taxa as follows.
  • For each rooted triple ((a,b),c) in the set, put a and b in the same equivalence class.
  • Compute transitive closure.
– If only one equivalence class, reject (set is incompatible). Otherwise, recurse on each subset (using only the rooted triples whose taxa all lie in that subset), and return the tree obtained by making all recursively computed trees sibling subtrees.
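The recursion above can be sketched in Python. This is a minimal illustration, assuming the source trees have already been broken into rooted triples written as `((a, b), c)`. The function name `build`, the nested-tuple output, and the use of union-find for the transitive closure are choices made for this sketch, not details from the slide.

```python
def build(taxa, triples):
    """Return a rooted tree (nested tuples) consistent with all triples,
    or None if the triples are incompatible.

    taxa    : list of taxon labels
    triples : list of ((a, b), c), meaning a and b are closer to each
              other than either is to c
    """
    taxa = list(taxa)
    if len(taxa) == 1:
        return taxa[0]                      # singleton tree

    # Only triples whose three taxa are all present matter at this level.
    present = set(taxa)
    live = [t for t in triples if {t[0][0], t[0][1], t[1]} <= present]

    # Union-find: merging a with b for each triple gives the transitive closure.
    parent = {t: t for t in taxa}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for (a, b), _c in live:
        parent[find(a)] = find(b)

    classes = {}
    for t in taxa:
        classes.setdefault(find(t), []).append(t)

    if len(classes) == 1:
        return None                         # incompatible

    subtrees = []
    for members in classes.values():
        sub = build(members, live)
        if sub is None:
            return None
        subtrees.append(sub)
    return tuple(subtrees)                  # siblings under a new root

print(build(["a", "b", "c", "d"], [(("a", "b"), "c"), (("c", "d"), "a")]))
# (('a', 'b'), ('c', 'd'))
print(build(["a", "b", "c"], [(("a", "b"), "c"), (("b", "c"), "a")]))
# None
```

Each level of the recursion splits the taxon set into at least two classes, so the recursion depth is at most n.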
![Page 24: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/24.jpg)
Subtree compatibility
• If source trees are rooted, then compatibility can be tested in polynomial time. Optimization problems are NP-hard, however.
• If source trees are unrooted, then compatibility is NP-hard. And so optimization problems are also NP-hard.
![Page 25: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/25.jpg)
Supertree problems, in practice
• In practice, the most frequently used supertree method is MRP, for “Matrix Representation with Parsimony”.
• There are, however, many other supertree methods!
![Page 26: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/26.jpg)
Many Supertree Methods
• MRP: Matrix Representation with Parsimony (most commonly used)
• weighted MRP
• Min-Cut
• Modified Min-Cut
• Semi-strict Supertree
• MRF
• MRD
• QILI
• SDM
• Q-imputation
• PhySIC
• Majority-Rule Supertrees
• Maximum Likelihood Supertrees
• and many more ...
![Page 27: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/27.jpg)
MRP
• Idea: take every source tree, and replace it with a matrix of 0,1,?.
• Concatenate the matrices.
• Apply Maximum Parsimony.
If all the source trees are compatible, then an exact solution to MRP will return the compatibility trees.
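The encoding step can be sketched in Python, assuming the usual convention: one column per internal edge of each source tree, with 1 for taxa on one side of the edge, 0 for the other taxa of that source tree, and ? for taxa missing from that source tree. Here each source tree is given as a pair (leaf set, list of clusters), and the name `mrp_matrix` is hypothetical. The sketch builds only the concatenated matrix; the Maximum Parsimony search would be run on it with separate software.

```python
def mrp_matrix(source_trees, all_taxa):
    """Build the concatenated MRP matrix as {taxon: string over '0', '1', '?'}.

    source_trees : list of (leaves, clusters); leaves is the taxon set of one
                   source tree, clusters is a list of taxon sets, one per
                   internal edge of that tree
    all_taxa     : the union of all leaf sets
    """
    rows = {t: [] for t in sorted(all_taxa)}
    for leaves, clusters in source_trees:
        for cluster in clusters:            # one matrix column per internal edge
            for taxon, row in rows.items():
                if taxon not in leaves:
                    row.append("?")         # taxon absent from this source tree
                elif taxon in cluster:
                    row.append("1")
                else:
                    row.append("0")
    return {t: "".join(row) for t, row in rows.items()}

source_trees = [
    ({"a", "b", "c", "d"}, [{"a", "b"}]),   # source tree with split ab|cd
    ({"b", "c", "e"}, [{"b", "c"}]),        # rooted source tree ((b,c),e)
]
for taxon, row in mrp_matrix(source_trees, "abcde").items():
    print(taxon, row)
# a 1?
# b 11
# c 01
# d 0?
# e ?0
```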
![Page 28: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/28.jpg)
Homework, due 9/15
• Read two papers (linked on the webpage):
– St. John et al., about quartet-based methods
– Moret et al., about sequence-length requirements
• Pick one, write a summary, and include questions
![Page 29: Understanding sets of trees CS 394C September 10, 2009.](https://reader035.fdocuments.in/reader035/viewer/2022062805/5697c00b1a28abf838cc86f9/html5/thumbnails/29.jpg)
Question!
• How do you feel about occasionally having class on some Monday or Friday, so we can have guest lectures?