Multiple Sequence Alignment
S1 = AGGTC
S2 = GTTCG
S3 = TGAAC
Possible alignment
AGGT-C-
-G-TTCG
TG-AAC-
Possible alignment
AGGT-C-
GTT--CG
-TGAAAC
Multiple Sequence Alignment (cont)
Input: Sequences S1, S2, …, Sk over the same alphabet
Output: Gapped sequences S’1, S’2, …, S’k of equal length, such that
1. |S’1| = |S’2| = … = |S’k|
2. Removing the spaces (gaps) from S’i yields Si
Sum-of-pairs (SP) score for a multiple global alignment
is the sum of scores of all pairwise alignments induced
by it.
Multiple Sequence Alignment Example
Consider the following alignment:
AC-CDB-
-C-ADBD
A-BCDAD
Scoring scheme: match = 0, mismatch/indel = -1
SP score: (-3) + (-5) + (-4) = -12
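The SP score of the example can be recomputed mechanically. The sketch below reads the example alignment as the three rows AC-CDB-, -C-ADBD and A-BCDAD and assumes that a gap facing a gap costs 0; the function names `pair_score` and `sp_score` are illustrative, not part of the slides.

```python
from itertools import combinations

def pair_score(a, b, match=0, penalty=-1):
    """Score one induced pairwise alignment, column by column."""
    total = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue  # assumption: a gap facing a gap costs nothing
        total += match if x == y else penalty
    return total

def sp_score(rows):
    """Sum the induced pairwise scores over all pairs of rows."""
    return sum(pair_score(a, b) for a, b in combinations(rows, 2))

rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(rows))  # -12
```

The three pairwise scores are -3 (rows 1 and 2), -4 (rows 1 and 3) and -5 (rows 2 and 3), which sum to the -12 stated in the example.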
Multiple Sequence Alignment Complexity
Given k strings of length n, there is a generalization of the DP algorithm that finds an optimal SP alignment:
• Instead of a 2-dimensional table we have a k-dimensional table
• Each dimension is of length n+1
• Each entry depends on $2^k - 1$ adjacent entries
Complexity: $O(2^k n^k)$
This problem is known to be NP-hard (no polynomial-time algorithm is known)
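To see where the 2^k - 1 dependencies per table entry come from, the following minimal sketch enumerates the predecessor cells of one entry in a k-dimensional table. In each column of the alignment every sequence either contributes a character (its index steps back by one) or a gap (its index stays), and the all-gap column is excluded. The helper name `predecessors` is hypothetical.

```python
from itertools import product

def predecessors(cell):
    """Yield the table cells that the DP entry `cell` may depend on."""
    for step in product((0, 1), repeat=len(cell)):
        # skip the all-zero step (an all-gap column) and moves off the table
        if any(step) and all(c - s >= 0 for c, s in zip(cell, step)):
            yield tuple(c - s for c, s in zip(cell, step))

print(len(list(predecessors((3, 3, 3)))))  # 7 = 2**3 - 1
```

For k = 2 this reduces to the three familiar neighbours of pairwise alignment (diagonal, up, left).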
Multiple Sequence Alignment Approximation Algorithm
We use cost instead of score: find an alignment of minimal cost.
Assumption: the cost function δ is a distance function
• δ(x,x) = 0
• δ(x,y) = δ(y,x) ≥ 0
• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality; e.g. the cost of a mismatch ≤ the cost of two indels)
D(S,T) - cost of a minimum-cost global alignment between S and T
The ‘star’ algorithm:
Input: Γ - set of k strings S1, …, Sk.
1. Find the string S’ ∈ Γ (center) that minimizes $\sum_{S \in \Gamma \setminus \{S'\}} D(S', S)$
2. Denote S1=S’ and the rest of the strings as S2, …, Sk
3. Iteratively add S2, …, Sk to the alignment as follows:
• Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1
• Align Si to S’1 to produce S’i and S’’1 aligned
• Adjust S’2, …, S’i-1 by adding spaces where spaces were added to S’’1
• Replace S’1 by S’’1
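Step 1 of the star algorithm (choosing the center) can be sketched as follows. This is only an illustration under stated assumptions: D(S,T) is taken to be the unit-cost edit distance (match 0, mismatch 1, indel 1), which is one cost function satisfying the distance properties above, and the names `edit_distance` and `choose_center` are made up for the example. The merging of step 3 is not shown.

```python
def edit_distance(s, t):
    """D(S, T) under unit costs: match 0, mismatch 1, indel 1 (an assumption)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j - 1] + (a != b),  # match / mismatch
                            prev[j] + 1,             # character of s against a gap
                            curr[j - 1] + 1))        # character of t against a gap
        prev = curr
    return prev[-1]

def choose_center(strings):
    """Index of the string minimising the sum of distances to all the others."""
    def total(i):
        return sum(edit_distance(strings[i], s)
                   for j, s in enumerate(strings) if j != i)
    return min(range(len(strings)), key=total)

seqs = ["AGGTC", "GTTCG", "TGAAC"]
center = choose_center(seqs)
print(center, seqs[center])
```

Choosing the center needs one pairwise DP per pair of strings, which is where the quadratic number of alignments in the time analysis comes from.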
Multiple Sequence Alignment Approximation Algorithm
Time analysis:
• Choosing S1: execute DP for all sequence pairs - $O(k^2 n^2)$
• Adding Si to the alignment: execute DP for Si, S’1 - $O(i \cdot n^2)$
  (in the i-th stage the length of S’1 can be up to i·n)
Total complexity:
$$\sum_{i=1}^{k-1} O(i \cdot n^2) = O(k^2 n^2)$$
Multiple Sequence Alignment Approximation Algorithm
Approximation ratio:
• M* - optimal alignment
• M - the alignment produced by this algorithm
• d(i,j) - the distance M induces on the pair Si, Sj
• $v(M) = 2 \sum_{i<j} d(i,j) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j)$
For all i: d(1,i) = D(S1,Si)
(we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)
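Multiple Sequence Alignment Approximation Algorithm
Approximation ratio:
Upper bound on v(M), using the triangle inequality:
$$v(M) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j) \le \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} \bigl[ d(1,i) + d(1,j) \bigr] = 2(k-1) \sum_{l=2}^{k} d(1,l) = 2(k-1) \sum_{l=2}^{k} D(S_1, S_l)$$
Lower bound on v(M*), where d*(i,j) is the distance M* induces on the pair Si, Sj, using the definition of S1:
$$v(M^*) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} D(S_i, S_j) \ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1, S_j) = k \sum_{j=2}^{k} D(S_1, S_j)$$
Definition of S1:
$$\forall i: \quad \sum_{j=2}^{k} D(S_1, S_j) \le \sum_{j=1, j \ne i}^{k} D(S_i, S_j)$$
Hence:
$$\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2$$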
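In the approximation-ratio argument, the triangle inequality bounds each d(i,j) by d(1,i) + d(1,j). The counting step that turns the resulting double sum into a single sum is easy to miss, so here it is spelled out as a short derivation, using only d(1,1) = 0:

```latex
\sum_{i=1}^{k} \sum_{j \ne i} \bigl[ d(1,i) + d(1,j) \bigr]
  = \sum_{i=1}^{k} (k-1)\, d(1,i) + \sum_{j=1}^{k} (k-1)\, d(1,j)
  = 2(k-1) \sum_{l=1}^{k} d(1,l)
  = 2(k-1) \sum_{l=2}^{k} d(1,l)
```

Each term d(1,l) is counted k-1 times in the role of d(1,i) and k-1 times in the role of d(1,j). For k = 3, for instance, the resulting bound on the ratio is 2(k-1)/k = 4/3.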
Multiple Sequence Alignment Reminder
The ‘star’ algorithm:
Input: Γ - set of k strings S1, …, Sk.
1. Find the string S1 (center) that minimizes $\sum_{S \in \Gamma \setminus \{S_1\}} D(S_1, S)$
2. Iteratively add S2, …, Sk to the alignment
Finds an MA costing at most twice the optimal cost!
Tree Alignment
Problem: conventional MA does not correctly model evolutionary relationships
Input: X - set of sequences
T - phylogenetic tree on X (leaves labeled by X)
Output: labels on internal vertices of T, s.t. the sum of costs of all edges of T is minimal.
How do we label internal vertices?
• Sequences
• Profiles (multiple alignments)
Profile Alignment
A profile of an MA of length n over alphabet Σ is a (|Σ|+1) × n table.
Column i holds the distribution of Σ (and gap) in that position.
Example alignment:
AGGT-C-
-G-TTCG
TG-AAC-
Profile (counts per column; dividing by 3, the number of rows, gives the distribution):
   1 2 3 4 5 6 7
A  1 0 0 1 1 0 0
T  1 0 0 2 1 0 0
G  0 3 1 0 0 0 1
C  0 0 0 0 0 3 0
-  1 0 2 0 1 0 2
Aligning a sequence to a profile:
• Matching letter to position: weighted average of scores
• Indels: introducing new columns gets special consideration
(same goes for aligning two profiles)
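A profile as defined above, and the weighted-average score of a letter against one profile column, can be sketched as follows. The alignment rows are the example AGGT-C-, -G-TTCG, TG-AAC-. The scoring values (match 0, everything else -1, including a letter against the gap symbol) and the function names `profile` and `column_score` are assumptions made for the illustration; the special handling of indels that introduce new columns is not covered.

```python
from collections import Counter

def profile(rows, alphabet="ATGC"):
    """Return {symbol: [frequency in each column]} for an alignment given as rows."""
    symbols = alphabet + "-"
    columns = [Counter(col) for col in zip(*rows)]
    return {x: [c[x] / len(rows) for c in columns] for x in symbols}

def column_score(letter, prof, pos, match=0, penalty=-1):
    """Weighted average score of placing `letter` against column `pos`."""
    return sum(freq[pos] * (match if x == letter else penalty)
               for x, freq in prof.items())

rows = ["AGGT-C-", "-G-TTCG", "TG-AAC-"]
p = profile(rows)
print([round(3 * f) for f in p["G"]])  # [0, 3, 1, 0, 0, 0, 1]
print(column_score("T", p, 3))         # column 4 holds T, T, A
```

Multiplying the frequencies by 3 recovers the count table shown for this alignment.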
Clustal Algorithm
Iteratively constructs MA for intermediate nodes:
• At each point holds profiles for all leaves
• Chooses the closest pair of neighbors
  - neighbors: have a common father in T
  - distance: cost of the optimal (pairwise) alignment
• Aligns the two profiles to get the ‘father-profile’
• Replaces the two leaves with their father
Analysis:
• Initialization: $O(k^2)$ alignments
• k-1 iterations
• Iteration i involves k-i-1 new pairwise alignments
ClustalW is a more advanced version in which sequences/profiles are weighted.
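The counts in the analysis can be added up explicitly. The derivation below is a sketch under one assumption not stated on the slide, namely that initialization aligns every one of the k(k-1)/2 pairs of leaves:

```latex
\underbrace{\frac{k(k-1)}{2}}_{\text{initialization}}
  + \sum_{i=1}^{k-1} (k - i - 1)
  = \frac{k(k-1)}{2} + \frac{(k-1)(k-2)}{2}
  = (k-1)^2
```

So the total number of pairwise (sequence or profile) alignments stays quadratic in k.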
Lifted Tree Alignments
Lifted tree alignment: each internal node is labeled by one of the labels of its daughters.
Internal nodes are sequences and not profiles.
Example:
[Figure: tree with leaves S1, …, S6 whose internal nodes carry the lifted labels S2, S4, S4, S5]
We’ll show:
1. A DP algorithm for optimal lifted tree alignment
2. An optimal lifted alignment is a 2-approximation of the optimal tree alignment
Lifted Tree Alignments Algorithm
Input: X - set of sequences
T - phylogenetic tree on X (leaves labeled by X)
Output: lifted labels on internal vertices of T, s.t. the sum of costs of all edges of T is minimal.
Basic principle: calculate for every node v in T, and sequence S in X:
d(v,S) - the optimal cost of v’s subtree when v is labeled by S
The cost of the optimal tree is $\min_{S \in X} d(root, S)$
Lifted Tree Alignments Algorithm
d(v,S) - the optimal cost of v’s subtree when v is labeled by S
Initialization: for leaf v labeled Sv -
$$d(v,S) = \begin{cases} 0 & S = S_v \\ \infty & S \ne S_v \end{cases}$$
Recurrence: for internal node v with daughters u1, …, ul -
$$d(v,S) = \sum_{i=1}^{l} \min_{S' \in X} \bigl[ D(S,S') + d(u_i,S') \bigr]$$
Correctness: check the optimal sub-solution (optimal substructure) property.
Complexity:
• $O(k^2)$ pairwise alignments - $O(n^2 k^2)$
• k-1 iterations; for internal node v - $O(k_v^2)$ work, which sums to $O(k^2 \, depth(T)) = O(k^3)$
• Total: $O(k^2 (n^2 + depth(T)))$
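The DP over d(v,S) can be sketched in a few lines. This is a minimal illustration under several assumptions: the tree is given as nested tuples whose leaves are the sequences themselves, leaf sequences are distinct, D is the unit-cost edit distance, and the example topology is hypothetical. Unlike the recurrence as written, the sketch only keeps entries for sequences S that occur below v (the missing entries play the role of ∞) and forces the daughter that contains S to carry S as well, so that every label really is lifted from a daughter.

```python
from math import inf

def edit_distance(s, t):
    """D(S, T) under unit costs: match 0, mismatch 1, indel 1 (an assumption)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j - 1] + (a != b), prev[j] + 1, curr[j - 1] + 1))
        prev = curr
    return prev[-1]

def lifted_table(tree, D):
    """Return {S: d(v, S)} for the root v of `tree`, S ranging over v's leaves."""
    if isinstance(tree, str):              # leaf: only its own label is allowed
        return {tree: 0}
    children = [lifted_table(sub, D) for sub in tree]
    table = {}
    for idx, own in enumerate(children):
        for s, cost in own.items():        # v takes label s, lifted from daughter idx
            for j, other in enumerate(children):
                if j != idx:               # every other daughter picks its best label
                    cost += min(D[s, t] + d for t, d in other.items())
            table[s] = min(table.get(s, inf), cost)
    return table

seqs = ["AGGTC", "GTTCG", "TGAAC"]
D = {(s, t): edit_distance(s, t) for s in seqs for t in seqs}
tree = (("AGGTC", "GTTCG"), "TGAAC")       # hypothetical topology
table = lifted_table(tree, D)
print(min(table.values()))                 # cost of the best lifted labeling
```

All pairwise distances are computed once up front, matching the O(k^2) pairwise alignments in the complexity analysis; the tree traversal itself only does table lookups.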
Lifted Tree Alignments Approximation analysis
Claim: the optimal LTA 2-approximates general tree alignments
• We’ll show the construction of an LTA which costs at most twice the optimal TA with sequence-labeled nodes
(? can be generalized for profile-labeled nodes ?)
Notations:
• T* - optimal TA labels
• Sv* - label of node v in T*
• T^L - our constructed LTA
• Sv^L - label of node v in T^L
Lifted Tree Alignments Approximation analysis
Construction:
• We label the nodes bottom-up.
• For node v with daughters u1, …, ul we choose the label (from Su1^L, …, Sul^L) closest to Sv*
We need to show: D(T^L) ≤ 2D(T*)
Lifted Tree Alignments Approximation analysis
Analysis:
• Some edges in T^L have cost 0
• Observe edges (v,u) of cost > 0:
  • Si - label of the father (v)
  • Sj - label of the daughter (u)
  • P(v,u) - the path in T* from v to the leaf labeled by Sj
D(Si,Sj) ≤ D(Si,Sv*) + D(Sj,Sv*)   (triangle inequality)
         ≤ 2D(Sj,Sv*)               (choice of Si)
         ≤ 2D(P(v,u))               (triangle inequality)
Lifted Tree Alignments Approximation analysis
D(Si,Sj) ≤ 2D(P(v,u))
If (v,u) and (v’,u’) are two different edges with cost > 0 in T^L, then P(v,u) and P(v’,u’) are edge-disjoint.
Summing over all edges of positive cost therefore gives D(T^L) ≤ 2D(T*).
Final Remarks:
• The lifted tree alignment T^L is only conceptual (we don’t have T*)
• The optimal LTA cannot cost more than T^L
• In case of profile-labeled nodes: the construction and analysis still hold as long as the cost is a distance function
Q.E.D.