Advanced programming 236512
Algorithms for reconstructing phylogenetic trees, spring 2006
Lecturer: Shlomo Moran, Taub 639, tel 4363
TA: Ilan Gronau, Taub 700, tel 4894
Website: http://webcourse.cs.technion.ac.il/236512/
2
Evolution
Evolution of new organisms is driven by:
Diversity: different individuals carry different variants of the same basic blueprint.
Mutations: the DNA sequence can be changed by single-base changes, deletion/insertion of DNA segments, etc.
Selection bias.
3
The Tree of Life
Source: Alberts et al.
4
Primate evolution
A phylogeny (also called a phylogenetic tree) is a tree that describes the sequence of speciation events that led to the formation of a set of current day species.
5
Theory of Evolution
Basic idea: speciation events lead to the creation of different species. Speciation is caused by physical separation into groups in which different genetic variants become dominant.
Any two species share a (possibly distant) common ancestor.
6
Phylogenetic trees
Leaves - current day species (or taxa – plural of taxon)
Internal vertices - hypothetical common ancestors
Edge lengths - “time” from one speciation to the next
Aardvark Bison Chimp Dog Elephant
7
Types of Trees
A natural model to consider is that of rooted trees
CommonAncestor
8
Types of trees
An unrooted tree represents the same phylogeny without the root node.
Usually, data from current day species does not distinguish between different placements of the root.
9
Rooted versus unrooted trees
[Figure: an unrooted tree with edges a, b, c, which represents the three rooted trees Tree a, Tree b and Tree c.]
10
Positioning Roots in Unrooted Trees
We can estimate the position of the root by introducing an outgroup:
a set of species that are definitely distant from all the species of interest
Aardvark Bison Chimp Dog Elephant
Falcon
Proposed root
11
Morphological vs. Molecular
Classical phylogenetic analysis: morphological features: number of legs, lengths of legs, etc.
Modern biological methods allow the use of molecular features:
Gene sequences
Protein sequences
Analysis based on homologous sequences (e.g., globins) in different species
12
Rat QEPGGLVVPPTDA
Rabbit QEPGGMVVPPTDA
Gorilla QEPGGLVVPPTDA
Cat REPGGLVVPPTEG
From sequences to a phylogenetic tree
There are many possible types of sequences to use (e.g. Mitochondrial vs Nuclear proteins).
13
Type of Data
Distance-based (the project focuses on this method)
Input is a matrix of distances between species.
A distance can be the fraction of residues on which two sequences disagree, an alignment score between them, or …
Character-based
Examine each character (e.g., residue) separately.
Not covered in this project.
14
Constructing trees from distances:
Transform differences between species to numerical distances
Find a weighted tree that realizes/approximates the distances between the species.
The task is: given a set of species (leaves in a supposed tree), and distances between them, construct a phylogeny which best “fits” the distances.
15
Exact solution: Additive sets
Given a set S of n objects with an n×n distance matrix d such that:
d(i,i) = 0, and for i ≠ j, d(i,j) > 0.
d(i,j) = d(j,i).
For all i, j, k it holds that d(i,k) ≤ d(i,j) + d(j,k).
Can we construct a weighted tree which realizes these distances?
16
There is always a tree for 3 objects
For n=3: There is always a (unique) tree with one internal node.
d(i,j) = a + b
d(i,k) = a + c
d(j,k) = b + c
[Figure: leaves i, j, k joined to a single internal vertex v by edges of lengths a, b, c respectively.]
      i     j     k
i     0    a+b   a+c
j           0    b+c
k                 0
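The three equations above can be solved directly for the edge lengths. The following is a minimal sketch; the function name `three_leaf_tree` is an illustrative choice, not something defined in the course material.

```python
def three_leaf_tree(d_ij, d_ik, d_jk):
    """Edge lengths (a, b, c) of the star tree on leaves i, j, k.

    Solves d(i,j) = a + b, d(i,k) = a + c, d(j,k) = b + c.
    The triangle inequality guarantees that all three results are non-negative.
    """
    a = (d_ij + d_ik - d_jk) / 2  # edge from i to the internal vertex v
    b = (d_ij + d_jk - d_ik) / 2  # edge from j to v
    c = (d_ik + d_jk - d_ij) / 2  # edge from k to v
    return a, b, c


a, b, c = three_leaf_tree(5, 6, 7)
print(a, b, c)
```

The solution is unique because the linear system has exactly one solution, which is why the tree for n=3 is unique.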
Distance metrics on 4 objects may not have a tree.
17
The Four Points Condition
Definition: A distance metric on n objects satisfies the four points condition iff any subset of four objects can be labeled i,j,k,l so that:
d(i,k) + d(j,l) = d(i,l) +d(k,j) ≥ d(i,j) + d(k,l)
[Figure: a tree on the four leaves i, j, k, l.]
Theorem: A distance metric is additive ⇔ it satisfies the four points condition.
Note: The four points condition implies an O(n⁴) algorithm for testing additivity, which is not very efficient.
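The O(n⁴) test mentioned in the note can be sketched as follows. For every four objects, the condition is equivalent to saying that the two largest of the three pairwise sums are equal. The function name and the tolerance parameter `tol` are assumptions made for this illustration.

```python
from itertools import combinations


def satisfies_four_points(d, tol=1e-9):
    """Check the four points condition on a symmetric distance matrix d.

    d is a list of lists. Runs over all 4-subsets, hence O(n^4) time.
    """
    for i, j, k, l in combinations(range(len(d)), 4):
        sums = sorted([d[i][j] + d[k][l],
                       d[i][k] + d[j][l],
                       d[i][l] + d[j][k]])
        # The two largest sums must coincide.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


# Distances read off a tree with cherries (A,B) and (C,D): pendant edges
# A=1, B=5, C=1, D=5 and an internal edge of length 1.
additive = [[0, 6, 3, 7],
            [6, 0, 7, 11],
            [3, 7, 0, 6],
            [7, 11, 6, 0]]

# Four points on a cycle with unit edges: a metric, but not additive.
cycle = [[0, 1, 2, 1],
         [1, 0, 1, 2],
         [2, 1, 0, 1],
         [1, 2, 1, 0]]

print(satisfies_four_points(additive), satisfies_four_points(cycle))
```

The second matrix is an instance of a distance metric on 4 objects that has no tree.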
18
Constructing additive trees: The neighbor joining problem
Let i, j be neighboring leaves in a tree, let v be their parent, and let k be any other vertex.
The formula
d(k,v) = ½ [d(k,i) + d(k,j) − d(i,j)]
shows that we can compute the distances of v to all other leaves.
[Figure: leaves i and j attached to their parent v; the path from v to k has length d(k,v).]
19
Constructing additive trees: The neighbor joining problem
This suggests the following method to construct a tree from a distance matrix:
1. Find neighboring leaves i, j in the tree.
2. Replace i, j by their parent v and recursively construct a tree T for the smaller set.
3. Add i, j as children of v in T.
20
Neighbor Finding
How can we find, from distances alone, a pair of neighboring leaves (also called cherries)?
Closest vertices aren’t necessarily neighboring leaves.
[Figure: a tree on leaves A, B, C, D.]
21
Neighbor Finding: Saitou&Nei method
Theorem (Saitou&Nei) Assume all internal edge weights are positive. If Q(i,j) is minimal (among all pairs of leaves), then i and j are neighboring leaves in the tree.
Definitions:
For a leaf i, let r_i = Σ_{u is a leaf} d(i,u).
For leaves i, j: Q(i,j) = (n−2)·d(i,j) − r_i − r_j.
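As a small illustration of the definitions, the sketch below computes Q on a hypothetical four-leaf tree (cherries (A,B) and (C,D), pendant edges A=1, B=5, C=1, D=5, internal edge 1). The data and the function name `q_values` are assumptions chosen for this example. The closest pair of leaves is not a cherry here, while the pairs with minimal Q are.

```python
from itertools import combinations


def q_values(d):
    """Return {(i, j): Q(i,j)} for a dict-of-dicts distance matrix d."""
    n = len(d)
    # r[i] is the sum of distances from leaf i to all other leaves.
    r = {i: sum(d[i][u] for u in d if u != i) for i in d}
    return {(i, j): (n - 2) * d[i][j] - r[i] - r[j]
            for i, j in combinations(d, 2)}


d = {
    "A": {"B": 6, "C": 3, "D": 7},
    "B": {"A": 6, "C": 7, "D": 11},
    "C": {"A": 3, "B": 7, "D": 6},
    "D": {"A": 7, "B": 11, "C": 6},
}

Q = q_values(d)
closest = min(combinations(d, 2), key=lambda p: d[p[0]][p[1]])
best_q = min(Q.values())
cherries = sorted(p for p, q in Q.items() if q == best_q)
print("closest pair:", closest)
print("pairs with minimal Q:", cherries)
```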
22
S&N Neighbor Joining Algorithm
If n = 3, return the tree of three vertices.
Compute Q(i,j) for all i, j.
Choose i, j such that Q(i,j) is minimal.
Create a new vertex v, and set
d(i,v) = ½ [d(i,j) + d(i,r) − d(j,r)] (for some r)
d(j,v) = d(i,j) − d(i,v) // or could be d(i,v) = d(j,v) = 0
for each vertex k, d(v,k) = ½ [d(i,k) + d(j,k) − d(i,j)]
Remove i, j, and add v to the set of objects.
Recursively construct a tree on the smaller set, then add i, j as children of v, at distances d(i,v) and d(j,v).
[Figure: leaves i, j joined at the new vertex v; d(k,v) is the distance from v to another vertex k.]
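The steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not a reference implementation: it is written as a loop rather than a recursion, it uses the first remaining leaf as the "some r" in the formula for d(i,v), it recomputes all r and Q values in every round, and the names `neighbor_joining` and `v0, v1, ...` are hypothetical. The example data is an assumed four-leaf tree with cherries (A,B) and (C,D).

```python
from itertools import combinations


def neighbor_joining(d):
    """Return a list of (parent, child, length) edges for distance dict d."""
    d = {a: dict(row) for a, row in d.items()}  # work on a copy
    edges, fresh = [], 0
    while len(d) > 3:
        n = len(d)
        r = {a: sum(d[a][b] for b in d if b != a) for a in d}
        # Pick the pair with minimal Q(i,j) = (n-2) d(i,j) - r_i - r_j.
        i, j = min(combinations(d, 2),
                   key=lambda p: (n - 2) * d[p[0]][p[1]] - r[p[0]] - r[p[1]])
        others = [k for k in d if k not in (i, j)]
        k0 = others[0]  # assumption: any third leaf plays the role of "some r"
        v = f"v{fresh}"
        fresh += 1
        d_iv = 0.5 * (d[i][j] + d[i][k0] - d[j][k0])
        d_jv = d[i][j] - d_iv
        edges += [(v, i, d_iv), (v, j, d_jv)]
        # Distances from the new vertex v to every remaining vertex k.
        d[v] = {k: 0.5 * (d[i][k] + d[j][k] - d[i][j]) for k in others}
        for k in others:
            d[k][v] = d[v][k]
            del d[k][i], d[k][j]
        del d[i], d[j]
    # Base case: three objects form a star around one internal vertex.
    a, b, c = d
    center = f"v{fresh}"
    edges.append((center, a, 0.5 * (d[a][b] + d[a][c] - d[b][c])))
    edges.append((center, b, 0.5 * (d[a][b] + d[b][c] - d[a][c])))
    edges.append((center, c, 0.5 * (d[a][c] + d[b][c] - d[a][b])))
    return edges


D = {
    "A": {"B": 6, "C": 3, "D": 7},
    "B": {"A": 6, "C": 7, "D": 11},
    "C": {"A": 3, "B": 7, "D": 6},
    "D": {"A": 7, "B": 11, "C": 6},
}
nj_edges = neighbor_joining(D)
for edge in nj_edges:
    print(edge)
```

On additive input the choice of the third leaf does not matter; on noisy input it does, and practical implementations usually average over all remaining leaves instead.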
23
Complexity of S&N Neighbor Joining Algorithm
Initialization: Θ(n²) to compute r(i) and Q(i,j) for all i, j ∈ L.
Each iteration:
O(n²) to find the minimal Q(i,j).
O(n) to compute {D(v,k) : k ∈ L} for the new node v, and to update the matrix.
O(n²) to update the values Q(i,j).
Total of O(n³).
24
Some remarks on S&N Neighbor Joining Algorithm
Applicable to matrices which are not additive.
Known to work well in practice.
The algorithm and its variants are the most widely used distance-based algorithms today.
Next we present a more efficient Neighbor Joining algorithm, which is based on LCA distances.
25
Least Common Ancestor distances
Definition: Given a weighted tree T and a specific vertex r in it:
dT(r;i,j) = distance in T from r to path(i,j).
dT(r;i,i) = distance in T from r to i.
[Figure: a weighted tree T with root r and leaves A, B, C, D, E, with edge weights marked. Example LCA distances: dT(r;A,D) = 3, dT(r;A,A) = 7.]
26
Least Common Ancestor distances
The distances dT(r;i,j) can be presented by a matrix:
    A  B  C  D  E
A   7  0  0  3  5
B      8  5  0  0
C         7  0  0
D            5  3
E               6
[Figure: the tree T with root r and leaves A, B, C, D, E.]
27
LCA Matrices
Definition: A symmetric nonnegative matrix L is an LCA matrix iff
1. For each i: L(i,i) = max_j {L(i,j)}
2. It satisfies the “3 points condition”:
for each 3 distinct indices i, j, k ,
L(i,j) ≥ min {L(i,k), L(j,k)}
“the smallest value appears twice”
j k
i 11 9 6
j 8 6
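The two conditions of the definition can be checked mechanically. The sketch below is an illustration only; the function name `is_lca_matrix` is an assumption, and the test matrix is the one from the "Least Common Ancestor distances" slide, filled in symmetrically.

```python
from itertools import combinations


def is_lca_matrix(L):
    """Check symmetry, nonnegativity, the diagonal condition and the 3 points condition."""
    n = len(L)
    for i in range(n):
        for j in range(n):
            if L[i][j] < 0 or L[i][j] != L[j][i]:
                return False
            if L[i][j] > L[i][i]:  # the diagonal entry must be the row maximum
                return False
    for i, j, k in combinations(range(n), 3):
        smallest, middle, _ = sorted([L[i][j], L[i][k], L[j][k]])
        if smallest != middle:  # "the smallest value appears twice"
            return False
    return True


#            A  B  C  D  E
lca_ok = [[7, 0, 0, 3, 5],
          [0, 8, 5, 0, 0],
          [0, 5, 7, 0, 0],
          [3, 0, 0, 5, 3],
          [5, 0, 0, 3, 6]]

# Same matrix with L(A,D) changed to 4: the triple A, D, E now has values 4, 5, 3.
lca_bad = [row[:] for row in lca_ok]
lca_bad[0][3] = lca_bad[3][0] = 4

print(is_lca_matrix(lca_ok), is_lca_matrix(lca_bad))
```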
28
LCA Matrices
    j  k
i   9  6
j      6
Theorem: The following conditions are equivalent for an (n−1)×(n−1) matrix L:
1. L is an LCA matrix.
2. There is a weighted tree T with n leaves and a leaf r in T such that for each pair of leaves i, j ≠ r:
L(i,j) = dT(r;i,j)
29
LCA distances ⇒ LCA matrix
There is a weighted tree T s.t. L(i,j) = dT(r;i,j) ⇒ L is an LCA matrix: by properties of least common ancestors in trees.
[Figure: a tree rooted at r with leaves i, j, k, illustrating L(k,i) = L(j,i) ≤ L(k,j).]
30
LCA matrix ⇒ LCA distances
Now we are given an LCA matrix L and need to construct a tree. The construction uses “maximal off-diagonal” entries:
L(i,j) is a “maximal off-diagonal” entry in row i if L(i,j) = max_k {L(i,k) : k ≠ i}
1 2 k
1 18 9 8 3 7
Example: L(1,2) is maximal off diagonal entry in row 1
31
Maximal off diagonal entries
Lemma: If L(i,j) is the maximal “off-diagonal” entry in both rows i and j of L, then for all k ≠ i,j: L(i,k) = L(j,k).
Proof: By the 3 points condition on {i,j,k}.
i j k
i 18 9 8 3 7
j 9 14 8 3 7
Example for i=1, j=2
32
LCA matrix ⇒ LCA distances: Proof by induction
We now prove by induction on n: L is an (n−1)×(n−1) LCA matrix ⇒ there is a weighted tree T with a root r as in the theorem.
Basis: n = 2. L = [w]. T is a tree with a single edge of weight w.
[Figure: example with L = [4]: a single edge of weight 4 between r and i.]
34
Induction step
Induction step: n ≥ 3. Let L be an LCA matrix of dimension n−1. We describe an algorithm for constructing the corresponding tree:
1. Find i, j s.t. L(i,j) is the maximal off-diagonal entry in L.
L:
     i   j   k
i   11   9
j    9  14
(In the example i=1 and j=2)
35
Induction step
2. Let L′ be the matrix obtained by removing rows/columns i and j, and inserting row/column v s.t. L′(v,v) = L(i,j), and for k ≠ i,j,
L′(v,k) = L(i,k) (= L(j,k))
L:
     1   2   k
1   11   9   8   3   7
2    9  14   8   3   7
L′:
     v   k
v    9   8   3   7
36
Induction Step
To show that L′ is an LCA matrix we need a definition and a simple observation:
Definition: Let L be an n×n matrix, and let S ⊆ {1,...,n}. L[S] is the submatrix of L consisting of the rows and columns with indices from S.
Observation 1: If L is an LCA matrix then for every S ⊆ {1,...,n}, L[S] is also an LCA matrix.
37
Induction step
Claim: L′ is an LCA matrix of dimension n−2.
Proof: Let S be all leaves except j. Then L′ is obtained from L[S] as follows:
1. Change the index i to v.
2. Set L′(v,v) ← L(i,j).
By Observation 1 and the maximality of L(i,j), L′ is also an LCA matrix.
38
Induction step
3. Construct a tree T′ for L′ (with n−1 leaves).
L′:
     v   k
v    9   8   3   7
[Figure: the tree T′, in which v is a leaf.]
39
Induction step
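4. Add two children to v, for i and j, with appropriate edge lengths.
     i   j   k
i   11   9   8   3   7
j    9  14   8   3   7
[Figure: the tree T′ with i and j attached below v by edges of lengths 2 = L(i,i) − L(i,j) and 5 = L(j,j) − L(i,j).]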
40
Deepest LCA neighbor joining
If n ≤ 3, return a tree of n vertices.
Prepare a list MAX of size n, s.t. MAX(i) = maximal off-diagonal element in row i.
Recursion:
Find i, j s.t. L(i,j) is the maximal off-diagonal entry of L.
Make the reduction to L′ as described.
Update the list MAX (only MAX(v) needs an update!).
Construct T′ for L′.
Add i and j as children of v.
[Figure: the tree T′ with i and j attached below v.]
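The reduction can be sketched in Python as below. This is a simplified illustration under stated assumptions: it searches for the maximal off-diagonal entry naively in every round instead of maintaining the MAX list, so it does not achieve the running time analysed in the slides; it runs as a loop down to a 1×1 matrix (the basis of the induction); and the names `dlca_tree`, `r` and `v0, v1, ...` are hypothetical. The input is the LCA matrix from the "Least Common Ancestor distances" slide.

```python
from itertools import combinations


def dlca_tree(L, root="r"):
    """Return (parent, child, length) edges of a tree realizing LCA matrix L."""
    L = {a: dict(row) for a, row in L.items()}  # work on a copy
    edges, fresh = [], 0
    while len(L) > 1:
        # Maximal off-diagonal entry of the whole matrix.
        i, j = max(combinations(L, 2), key=lambda p: L[p[0]][p[1]])
        depth = L[i][j]  # distance from the root to the new parent v
        v = f"v{fresh}"
        fresh += 1
        edges.append((v, i, L[i][i] - depth))
        edges.append((v, j, L[j][j] - depth))
        others = [k for k in L if k not in (i, j)]
        # Row of v: L'(v,k) = L(i,k) and L'(v,v) = L(i,j).
        L[v] = {k: L[i][k] for k in others}
        L[v][v] = depth
        for k in others:
            L[k][v] = L[v][k]
            del L[k][i], L[k][j]
        del L[i], L[j]
    (last,) = L
    edges.append((root, last, L[last][last]))  # basis: a single edge of weight w
    return edges


labels = "ABCDE"
rows = [[7, 0, 0, 3, 5],
        [0, 8, 5, 0, 0],
        [0, 5, 7, 0, 0],
        [3, 0, 0, 5, 3],
        [5, 0, 0, 3, 6]]
lca = {a: {b: rows[x][y] for y, b in enumerate(labels)}
       for x, a in enumerate(labels)}
dlca_edges = dlca_tree(lca)
for edge in dlca_edges:
    print(edge)
```

The final edge from `r` has length 0 in this example because some off-diagonal entries are 0, which means the root itself is the least common ancestor of those leaves.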
41
Complexity Analysis
Initialization: Constructing MAX - O(n²).
Let Time(n) be the complexity of the algorithm, given the input matrix L and the list MAX. Time(n) is given by:
Reducing L to L′: O(n).
Updating MAX: O(n).
Constructing T′ from L′: Time(n−1).
Constructing T from T′: O(1).
Time(n) ≤ Time(n−1) + O(n)
Hence Time(n) = O(n²).
42
Saitou&Nei vs. DLCA methods
DLCA, like S&N, can be applied to noisy data (in many ways).
On exact data, the DLCA and S&N methods have the same (correct) output. They differ on noisy data (which occurs in practice).
One basic difference: unlike the S&N method, the DLCA algorithm depends on selecting a root. Hence DLCA may produce many different trees on the same input.
Some of the projects will concentrate on this difference.
43
Incremental Reconstruction via Local Queries
Incrementally reconstructing the tree:
[Figure: two drawings of a topology on taxa a–h with internal vertices numbered 1–6.]
When inserting a new taxon x to a given topology T, we need to find out to which
edge x should be attached.
We are allowed to ask the ‘oracle’ local queries LQ(x,v).
(x – taxon, v – internal vertex)
44
Local Queries - Motivation
Asking LQ(x,v) is equivalent to asking the topology of {x, a, b, c},
where v is the center-point of a,b,c in T.
[Figure: a topology on taxa a–h with internal vertices 1–6, and a smaller topology on taxa a–f with internal vertices 1–4.]
Such questions can be asked directly (using likelihood) or through a pairwise distance matrix (which will be discussed later).
45
Balancing Vertices
We’d like to minimize the number of queries required for inserting a new taxon.
Lower bound – log₃(|E_T|) (simple adversarial argument).
Upper bound – log₂(|E_T|).
The algorithm which achieves the upper bound uses ‘balancing vertices’:
A balancing vertex in T is an internal vertex which splits T into 3 subtrees of size at most ceil(|T|/2).
Using balancing vertices in the local queries, the edge to which a new taxon should be attached can be found in ~log₂(|E_T|) queries.
46
Balancing Vertices
Every tree contains either a single balancing vertex or two adjacent
balancing vertices.
Finding a balancing vertex:
Start at some arbitrary vertex v. If v is balancing, stop.
Otherwise, continue to the vertex u, adjacent to v in the ‘heaviest’ subtree.
The algorithm traverses each edge at most once
Time complexity – O(|T|).
[Figure: a tree T with 13 edges; the heaviest subtrees met along the search have 11 edges, 9 edges and 7 edges.]
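The search for a balancing vertex can be sketched as follows. The sketch assumes that subtree size is measured in edges (counting the edge that joins the subtree to the vertex), roots the tree once at the start vertex to precompute subtree sizes, and then walks toward the heaviest subtree. The function name, the adjacency-list representation and the caterpillar example are assumptions made for illustration.

```python
from math import ceil


def find_balancing_vertex(adj, start):
    """Walk from start toward the heaviest subtree until it is small enough."""
    total_edges = sum(len(nbrs) for nbrs in adj.values()) // 2
    # Root the tree at start and record a DFS order.
    parent, order, stack = {start: None}, [], [start]
    while stack:
        v = stack.pop()
        order.append(v)
        for u in adj[v]:
            if u != parent[v]:
                parent[u] = v
                stack.append(u)
    # down[u] = edges below u plus the edge (parent[u], u); unused for start.
    down = {v: 1 for v in adj}
    for v in reversed(order):
        if parent[v] is not None:
            down[parent[v]] += down[v]

    def side(v, u):
        # Edges on u's side of v, counting the edge (v, u) itself.
        return down[u] if parent[u] == v else total_edges - down[v] + 1

    limit = ceil(total_edges / 2)
    v = start
    while True:
        heaviest = max(adj[v], key=lambda u: side(v, u))
        if side(v, heaviest) <= limit:
            return v
        v = heaviest


# A caterpillar with internal path 1-2-3-4-5-6 and leaves a..h (13 edges).
adj = {
    1: ["a", "b", 2], 2: [1, "c", 3], 3: [2, "d", 4],
    4: [3, "e", 5], 5: [4, "f", 6], 6: [5, "g", "h"],
    "a": [1], "b": [1], "c": [2], "d": [3],
    "e": [4], "f": [5], "g": [6], "h": [6],
}
print(find_balancing_vertex(adj, 1), find_balancing_vertex(adj, 6))
```

Starting from vertex 1, the heaviest subtrees along the walk have 11, 9 and 7 edges, and the walk stops at vertex 3. Starting from vertex 6 it stops at vertex 4, so this tree has two adjacent balancing vertices.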
47
A Simple and Efficient Algorithm
Iteratively add taxa 1,2,…,n to the topology.
When adding taxon x to topology T:
If T is trivial (consists of a single edge), attach x to that edge.
Otherwise: Find a balancing vertex v of T.
Ask query LQ(x,v).
Continue recursively on T’, the subtree corresponding to the answer of the query.
Complexity:
Adding taxon 1≤x≤n to T takes O(log(x)) queries and O(x) time.
Total query complexity: O(n·log(n))
Total time complexity: O(n²)
48
Interesting Issues
Two major issues are raised in this area:
Queries do not always have reliable answers
- Use confidence level for answers
- Verify the answers
Reduce running time to O(n·log(n))
- Finding balancing vertices leads to high overhead
- Maybe we don’t have to re-compute the balancing vertices in every stage
49
Robustness to Noise in Data
Answering local queries using a distance matrix D:
We wish to assess the topology spanned by four taxa: x, a, b, c.
Observe the 4×4 submatrix of D over x, a, b, c.
[Figure: taxon x and the taxa a, b, c, with the 4×4 submatrix of D over them.]
If D is additive then there is a labeling of the taxa by i, j, k, l s.t.:
D(i,j) + D(k,l) ≤ D(i,k) + D(j,l) = D(i,l) + D(j,k)
The configuration of the quartet is (ij ; kl), and the path separating the two pairs is of length ½(D(i,k) + D(j,l) − D(i,j) − D(k,l)).
If D is not additive we set the configuration of the quartet to (ij ; kl), where D(i,j) + D(k,l) is the minimal of the three sums.
Confidence of the prediction can be estimated by the difference between the maximal and minimal sums.
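The quartet rule above can be sketched as a small function. It returns the split with the minimal sum together with the difference between the maximal and minimal sums as a confidence score. The function name, the return format and the example distances (a hypothetical tree with cherries (A,B) and (C,D)) are assumptions made for illustration.

```python
def quartet_split(D, x, a, b, c):
    """Return (split, confidence) for the quartet x, a, b, c under distances D."""
    candidates = {
        ((x, a), (b, c)): D[x][a] + D[b][c],
        ((x, b), (a, c)): D[x][b] + D[a][c],
        ((x, c), (a, b)): D[x][c] + D[a][b],
    }
    # The predicted configuration is the pairing with the smallest sum.
    split = min(candidates, key=candidates.get)
    confidence = max(candidates.values()) - min(candidates.values())
    return split, confidence


D = {
    "A": {"B": 6, "C": 3, "D": 7},
    "B": {"A": 6, "C": 7, "D": 11},
    "C": {"A": 3, "B": 7, "D": 6},
    "D": {"A": 7, "B": 11, "C": 6},
}
split, confidence = quartet_split(D, "A", "B", "C", "D")
print(split, confidence)
```

On additive data the confidence equals twice the length of the path separating the two pairs, so short internal edges naturally give low-confidence answers.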
50
Robustness to Noise in Data
Answering local queries using a distance matrix D:
We can check several quartets of type x, a, b, c to answer a single local query.
Example: To answer LQ(g,1) we can check all quartets in
{g} × {a} × {c,f} × {b,d,e}
We can choose a representative set of quartets, and answer the local query according to (weighted) majority.
If the answer is still inconclusive, we can choose to ask another local query.
[Figure: a topology on taxa a–f with internal vertices 1–4, and the new taxon g.]
51
Improving Running Time
Separator Trees: A deterministic algorithm which inserts a new taxon x to a given topology T
can be viewed as a rooted decision tree.
• Each internal node represents a local query (internal vertex in T).
• Each internal node has three outgoing edges corresponding to the three possible
answers to the query.
• Each leaf corresponds to an edge in T.
A special case of decision trees are separator trees.
The time complexity of the algorithm is the depth of the separator tree.
[Figure: a topology T on taxa a–m with internal vertices 1–6, and a separator tree S for T with vertex 4 at its root and the taxa a–m at the bottom.]
52
Improving Running Time
Balanced Separator Trees:
A balanced separator tree uses balancing vertices (of the appropriate subtrees of T).
Can be constructed in O(n·log(n)) time.
Inserting a taxon does not drastically harm the balance.
If we allow some imbalance, we can guarantee that the costly balancing procedure is executed only a few times during construction of the whole topology.
Amortized analysis of total time complexity: O(n·log²(n))
[Figure: the topology T on taxa a–m with internal vertices 1–6, and its separator tree S with vertex 4 at its root.]
53
Improving Running Time
Bottom-up approach: (simple separator trees)
Start with the edge-set of T.
Choose disjoint edge triplets, s.t. each triplet contains at least one leaf.
Contract each triplet to a single edge.
Recursively continue on the reduced topology.
[Figure: the topology T on taxa a–m with internal vertices 1–6, its successive contractions, and the resulting simple separator tree S.]
54
Improving Running Time
Bottom-up approach: (simple separator trees)
By a simple linear traversal of T you can find Θ(|T|) edge-triplets.
The topology size is reduced by a constant factor at each stage:
• Depth of the simple separator tree is O(log(n)).
• Time complexity is O(n).
Insertion of a taxon induces modifications propagating bottom-up through the layers of the separator tree.
[Figure: the topology T and its simple separator tree S, built in layers labeled IS: {1,2,4,6}, IS: {3}, IS: {5}.]
55
Testing Reconstruction Methods on Noisy Data
We’d like to test reconstruction algorithms on actual phylogenetic data.
Problem: Confirmed phylogenetic trees are scarce and small.
Solution: Simulate the data.
Generate an edge-weighted tree T under some probabilistic model (Yule-Harding).
Choose a random DNA string for the root and simulate evolution on the tree to obtain sequences for all leaves (SeqGen), e.g. ATTCG…, ATACG…, ACTGG…
Obtain pairwise distances D from the sequences (DNAdist from PHYLIP).
Run the reconstruction algorithm on D to obtain a tree T’.
Compare the topologies of T and T’.
56
The Projects
Project I: The DLCA algorithm
Implement algorithms:
Saitou&Nei's neighbor joining
DLCA neighbor-joining
mid-point reduction
maximal-value reduction
Simulate data:
Use pre-generated trees to simulate the process of evolution (using the SeqGen program).
For each tree generate several sequence-sets.
Experiments:
Test the various algorithms on the generated data:
Use the DNADIST program (part of the Phylip package) to get a distance matrix corresponding to the sequence-set of the leaves.
Execute the algorithms on the distance matrix.
Check topological accuracy using the RF-score.
57
The Projects
Project II: Fast Algorithms Using Local Queries
Implement algorithms:
Implement advanced data structures which support the various algorithms:
Algorithm using semi-balanced separator trees
Algorithm using simple separator trees
Simulate data:
Use pre-generated trees and/or a uniform random model.
Experiments:
Test the various algorithms on the generated trees:
o Use the generated trees to answer the local queries asked by the algorithms.
o Compare the performance of the different algorithms on this data.
58
The Projects
Project III: Robust Algorithms Using Local Queries
Implement algorithms:
Implement the O(n²) algorithm using O(n·log(n)) queries.
Simulate data:
Use pre-generated trees and distance matrices.
Experiments:
Test various approaches on the generated data:
o Use the distance-matrices to answer the local queries asked by the algorithms.
o Suggest some method of estimating the confidence level of an answer to a query.
o Check for errors in the reconstructed topology. Compare several approaches
59
Grading Scheme
10% - work plan
60% - final report + submitted code
Rough distribution of grade:
40% - meeting project requirements
10% - code organization and documentation
10% - innovation and creativity
30% - final presentation
60
Schedule
21/3 – Introductory meeting
28/3 – Deadline for choosing a project
26-30/3 – Individual 30 minute meetings with each team to discuss the
specification of the project.
23-27/4 – Individual 60 minute meeting with each team to discuss work
plan and design of project
2/5 – Deadline for submitting work plan
21-25/5 – Individual progress meetings
18-22/6 – Concluding 60 minute meetings with each team
27/6, 4/7 – Project presentations and submission of final draft
Final submission deadline – To be announced
61
Homework
Team up in pairs
Choose project
Send me e-mail containing:
The names, id numbers, e-mails of all students in the group
Preferred project + 2nd priority project
Two optional dates for first project meeting (next week)
Go over references of your chosen project
Good Luck !