Multiple Sequence Alignment S 1 = AGGTC S 2 = GTTCG S 3 = TGAAC Possible alignment A-TA-T GGGGGG...

Multiple Sequence Alignment

S1 = AGGTC

S2 = GTTCG

S3 = TGAAC

Possible alignment:

AGGT-C-
-G-TTCG
TG-AAC-

Possible alignment:

AGGT-C-
GTT--CG
-TGAAAC

Multiple Sequence Alignment (cont)

Input: Sequences S1 , S2 ,…, Sk over the same alphabet

Output: Gapped sequences S’1 , S’2 ,…, S’k of equal length

1. |S’1|= |S’2|=…= |S’k|

2. Removing the spaces from S’i yields Si

The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments induced by it.

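Multiple Sequence Alignment Example

Consider the following alignment:

AC-CDB-
-C-ADBD
A-BCDAD

Scoring scheme: match = 0, mismatch/indel = -1 (a column pairing two spaces scores 0)

SP score: (-3) + (-5) + (-4) = -12, from the induced pairwise alignments of rows (1,2), (2,3) and (1,3) respectively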
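The SP score of the example can be checked mechanically. The following is a minimal Python sketch, assuming the scoring scheme of the example (match 0, mismatch/indel -1, space against space 0); the function names pair_score and sp_score are illustrative and not part of the slides.

```python
from itertools import combinations


def pair_score(x, y, match=0, mismatch=-1, indel=-1):
    """Score of one column restricted to two rows."""
    if x == "-" and y == "-":
        return 0  # two spaces: the column vanishes in the induced pairwise alignment
    if x == "-" or y == "-":
        return indel
    return match if x == y else mismatch


def sp_score(rows):
    """Sum of the scores of all pairwise alignments induced by the rows."""
    assert len({len(r) for r in rows}) == 1, "rows must have equal length"
    return sum(
        pair_score(a[c], b[c])
        for a, b in combinations(rows, 2)
        for c in range(len(a))
    )


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(rows))  # -12
```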

Multiple Sequence Alignment Complexity

Given k strings of length n, there is a generalization of the DP algorithm that finds an optimal SP alignment:

• Instead of a 2-dimensional table we have a k-dimensional table

• Each dimension is of length n+1

• Each entry depends on $2^k - 1$ adjacent entries

Complexity: $O(2^k n^k)$

This problem is known to be NP-hard (no polynomial-time algorithm is known, and none exists unless P = NP)
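The k-dimensional DP can be sketched as a memoized recursion over position vectors, where each entry looks at its 2^k - 1 predecessors. This is only an illustration under the scoring scheme of the earlier example (match 0, mismatch/indel -1, space against space 0); the name optimal_sp_score is hypothetical, and the code is practical only for a few short sequences.

```python
from functools import lru_cache
from itertools import combinations, product


def optimal_sp_score(seqs, match=0, mismatch=-1, indel=-1):
    """Optimal SP score of a multiple global alignment of seqs (exponential in k)."""
    k = len(seqs)

    def column_score(col):
        total = 0
        for x, y in combinations(col, 2):
            if x == "-" and y == "-":
                continue  # space against space scores 0
            if x == "-" or y == "-":
                total += indel
            elif x == y:
                total += match
            else:
                total += mismatch
        return total

    # Every non-zero 0/1 vector says which sequences contribute a letter
    # to the last column: 2^k - 1 possibilities.
    moves = [m for m in product((0, 1), repeat=k) if any(m)]

    @lru_cache(maxsize=None)
    def best(pos):
        if not any(pos):
            return 0  # all prefixes are empty
        candidates = []
        for move in moves:
            if any(step > p for step, p in zip(move, pos)):
                continue  # this move would step outside the table
            prev = tuple(p - step for p, step in zip(pos, move))
            col = tuple(
                s[p - 1] if step else "-" for s, p, step in zip(seqs, pos, move)
            )
            candidates.append(best(prev) + column_score(col))
        return max(candidates)

    return best(tuple(len(s) for s in seqs))
```

The table has (n+1)^k entries and each looks at up to 2^k - 1 neighbours, which is where the exponential bound comes from.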

Multiple Sequence Alignment Approximation Algorithm

We use cost instead of score

Find alignment of minimal cost

Assumption: the cost function δ is a distance function

• δ(x,x) = 0

• δ(x,y) = δ(y,x) ≥ 0

• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality; e.g. the cost of a mismatch ≤ the cost of two indels)

D(S,T) - cost of a minimum-cost global alignment between S and T

The ‘star’ algorithm:

Input: Γ - set of k strings S1, …,Sk.

1. Find the string S’ (center) that minimizes $\sum_{S \in \Gamma \setminus \{S'\}} D(S', S)$

2. Denote S1=S’ and the rest of the strings as S2, …,Sk

3. Iteratively add S2, …,Sk to the alignment as follows:

• Suppose S1, …,Si-1 are already aligned as S’1, …,S’i-1

• Align Si to S’1 to produce S’i and S’’1 aligned

• Adjust S’2, …,S’i-1 by adding spaces in the columns where spaces were added to S’1 to form S’’1

• Replace S’1 by S’’1
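The steps of the star algorithm can be sketched in Python as follows. This is an illustration rather than a reference implementation: it assumes unit costs (mismatch 1, indel 1, space against space 0), and the names align and star_alignment as well as the operation labels are hypothetical.

```python
def align(a, b, mismatch=1, indel=1):
    """Min-cost global alignment of a (may already contain '-') with ungapped b.

    Returns (cost, ops). Each op is 'match' (column of a against a letter of b),
    'skip' (column of a against a new space in b) or 'insert' (new space column
    in a against a letter of b).
    """
    def pair(x, y):
        if x == y:
            return 0
        return indel if x == "-" else mismatch

    def skip(x):
        return 0 if x == "-" else indel  # space against space is free

    n, m = len(a), len(b)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + skip(a[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + indel
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j - 1] + pair(a[i - 1], b[j - 1]),
                          D[i - 1][j] + skip(a[i - 1]),
                          D[i][j - 1] + indel)
    ops, i, j = [], n, m
    while i > 0 or j > 0:  # trace one optimal path back to the origin
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + pair(a[i - 1], b[j - 1]):
            ops.append("match"); i -= 1; j -= 1
        elif i > 0 and D[i][j] == D[i - 1][j] + skip(a[i - 1]):
            ops.append("skip"); i -= 1
        else:
            ops.append("insert"); j -= 1
    ops.reverse()
    return D[n][m], ops


def star_alignment(seqs):
    """Return gapped rows, in input order, built around the center string."""
    k = len(seqs)
    center = min(range(k), key=lambda c: sum(
        align(seqs[c], seqs[j])[0] for j in range(k) if j != c))
    order = [center] + [j for j in range(k) if j != center]
    rows = [seqs[center]]  # rows[0] is always the (gapped) center
    for idx in order[1:]:
        _, ops = align(rows[0], seqs[idx])
        old = [[] for _ in rows]
        new, p, q = [], 0, 0
        for op in ops:
            if op == "insert":  # new column: spaces in every aligned row
                for r in old:
                    r.append("-")
                new.append(seqs[idx][q]); q += 1
            else:  # existing column p is kept
                for r, row in zip(old, rows):
                    r.append(row[p])
                if op == "match":
                    new.append(seqs[idx][q]); q += 1
                else:
                    new.append("-")
                p += 1
        rows = ["".join(r) for r in old] + ["".join(new)]
    result = [None] * k
    for row, idx in zip(rows, order):
        result[idx] = row
    return result
```

For example, star_alignment(["AGGTC", "GTTCG", "TGAAC"]) returns three rows of equal length whose letters, once the spaces are removed, are the input strings.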

Multiple Sequence Alignment Approximation Algorithm

Time analysis:

• Choosing S1 – execute DP for all sequence-pairs - $O(k^2 n^2)$

• Adding Si to the alignment - execute DP for Si, S’1 - $O(i \cdot n^2)$ (in the ith stage the length of S’1 can be up to i·n)

Total complexity: $\sum_{i=1}^{k-1} O(i \cdot n^2) = O(k^2 n^2)$

Approximation ratio:

• M* - optimal alignment

• M - the alignment produced by this algorithm

• d(i,j) - the distance M induces on the pair Si,Sj

• $v(M) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j) = 2 \sum_{i<j} d(i,j)$

For all i: d(1,i) = D(S1,Si) (we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)


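Approximation ratio:

Definition of S1: for all i, $\sum_{j=2}^{k} D(S_1,S_j) \le \sum_{j=1, j \ne i}^{k} D(S_i,S_j)$

Upper bound on v(M), using the triangle inequality and d(1,l) = D(S1,Sl):

$$v(M) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j) \le \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} \left[ d(1,i) + d(1,j) \right] = 2(k-1) \sum_{l=2}^{k} d(1,l) = 2(k-1) \sum_{l=2}^{k} D(S_1,S_l)$$

Lower bound on v(M*), where d*(i,j) is the distance M* induces on the pair Si,Sj; the last inequality uses the definition of S1:

$$v(M^*) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} D(S_i,S_j) \ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1,S_j) = k \sum_{j=2}^{k} D(S_1,S_j)$$

Therefore:

$$\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2$$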

Multiple Sequence Alignment: Reminder


The ‘star’ algorithm:

Input: Γ - set of k strings S1, …,Sk.

1. Find the string S1 (center) that minimizes $\sum_{S \in \Gamma \setminus \{S_1\}} D(S_1, S)$

2. Iteratively add S2, …,Sk to the alignment

Finds MA costing at most twice the optimal cost!


Tree Alignment

Problem: Conventional MA does not correctly model evolutionary relationships

Input: X - set of sequences

T – phylogenetic tree on X (leaves labeled by X)

Output: labels on internal vertices of T, s.t. sum of costs of all edges of T is minimal.

How do we label internal vertices?

• Sequences

• Profiles (multiple alignments)

Profile Alignment

A profile of a MA of length n over alphabet Σ is a (|Σ|+1) × n table.

Column i holds the distribution of Σ (and gap) in that position

Example alignment:

AGGT-C-
-G-TTCG
TG-AAC-

Its profile (symbol counts per column):

A  1 0 0 1 1 0 0
T  1 0 0 2 1 0 0
G  0 3 1 0 0 0 1
C  0 0 0 0 0 3 0
-  1 0 2 0 1 0 2

Each column sums to 3, the number of aligned sequences.
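The profile table can be derived directly from the rows of the alignment. A minimal Python sketch, storing raw counts as in the table; the function name profile and the symbol order "ATGC-" are choices made for this illustration.

```python
def profile(rows, alphabet="ATGC-"):
    """Count, for every column, how often each symbol (and the gap) occurs."""
    assert len({len(r) for r in rows}) == 1, "rows must have equal length"
    width = len(rows[0])
    return {
        symbol: [sum(row[c] == symbol for row in rows) for c in range(width)]
        for symbol in alphabet
    }


table = profile(["AGGT-C-", "-G-TTCG", "TG-AAC-"])
for symbol, counts in table.items():
    print(symbol, *counts)
```

Dividing every count by the number of rows turns the counts into the per-column distribution.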

Aligning a sequence to a profile:

• Matching letter to position: weighted average of scores

• Indels: introducing new columns gets special consideration

(same goes for aligning two profiles)
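One way to read "weighted average" when matching a letter to a profile column: average the pairwise cost of the letter against every symbol in the column, weighted by the symbol's frequency. The sketch below assumes unit costs (mismatch 1, indel 1) and uses costs rather than scores; the name letter_vs_column_cost is hypothetical, and the special handling of newly introduced columns is not covered.

```python
from fractions import Fraction


def letter_vs_column_cost(letter, counts, mismatch=1, indel=1):
    """Frequency-weighted average cost of aligning a letter to a profile column.

    counts maps each symbol (including '-') to its count in the column.
    """
    total = sum(counts.values())
    cost = Fraction(0)
    for symbol, count in counts.items():
        if symbol == letter:
            pair = 0
        elif symbol == "-":
            pair = indel
        else:
            pair = mismatch
        cost += Fraction(count, total) * pair
    return cost


# Column 4 of the example profile: A occurs once, T twice.
column4 = {"A": 1, "T": 2, "G": 0, "C": 0, "-": 0}
print(letter_vs_column_cost("T", column4))  # 1/3
```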


Clustal Algorithm

Iteratively constructs MA for intermediate nodes

• At each point holds profiles for all leaves

• Chooses closest pair of neighbors

- neighbors – have common father in T

- distance - cost of optimal (pairwise) alignment

• Aligns the two profiles to get the ‘father-profile’

• Replaces the two leaves with their father

Analysis:

• Initialization – $O(k^2)$ alignments

• k-1 iterations

• Iteration i involves k-i-1 new pairwise alignments

ClustalW – more advanced version. Sequences/profiles are weighted

Lifted Tree Alignments

Lifted tree alignment – each internal node is labeled by one of the labels of its daughters

Internal nodes are sequences and not profiles

Example:

[Figure: tree with leaves S1, S2, S3, S4, S6, S5; the internal nodes carry the lifted labels S2, S4, S4, S5]

We’ll show:

1. DP algorithm for optimal lifted tree alignment

2. Optimal lifted alignment is 2-approximation of optimal tree alignment

Lifted Tree Alignments Algorithm

Input: X - set of sequences

T – phylogenetic tree on X (leaves labeled by X)

Output: lifted labels on internal vertices of T, s.t. sum of costs of all edges of T is minimal.

Basic principle: calculate for every node v in T, and sequence S in X:

d(v,S) - the optimal cost of v’s subtree when it is labeled by S

The cost of optimal tree is $\min_{S \in X} d(\mathrm{root}, S)$

Lifted Tree Alignments Algorithm

d(v,S) - the optimal cost of v’s subtree when it is labeled by S

Initialization: for leaf v labeled Sv -

$$d(v,S) = \begin{cases} 0 & S = S_v \\ \infty & S \ne S_v \end{cases}$$

Recurrence: for internal node v with daughters u1,…,ul -

$$d(v,S) = \sum_{i=1}^{l} \min_{S' \in X} \left[ D(S,S') + d(u_i,S') \right]$$

Correctness: check the optimal-substructure property (an optimal solution is composed of optimal solutions for the subtrees)

Complexity:

• $O(k^2)$ pairwise alignments - $O(n^2 k^2)$

• k-1 iterations

• For internal node v - $O(k_v^2)$ work

• Over all internal nodes: $O(k^2 \cdot \mathrm{depth}(T)) = O(k^3)$

Total: $O(k^2 (n^2 + \mathrm{depth}(T)))$
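The initialization and recurrence can be sketched in Python as below. The sketch follows the recurrence as written, with S’ ranging over all of X. The tree encoding (nested tuples with strings as leaves), the unit-cost edit distance standing in for D, and the names edit_distance, leaves_of and tree_dp_cost are all assumptions made for this illustration.

```python
import math


def edit_distance(s, t):
    """Unit-cost global alignment cost, used here as the distance D."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def leaves_of(tree):
    """All leaf sequences of a tree given as nested tuples."""
    if isinstance(tree, str):
        return [tree]
    return [leaf for child in tree for leaf in leaves_of(child)]


def tree_dp_cost(tree, D=edit_distance):
    """min over S in X of d(root, S), with d computed bottom-up."""
    X = sorted(set(leaves_of(tree)))

    def table(v):
        if isinstance(v, str):  # leaf: only its own label is allowed
            return {S: (0 if S == v else math.inf) for S in X}
        child_tables = [table(u) for u in v]
        return {
            S: sum(min(D(S, T) + tab[T] for T in X) for tab in child_tables)
            for S in X
        }

    return min(table(tree).values())


print(tree_dp_cost((("AA", "AT"), "TT")))  # 2
```

In the small example the cost 2 is reached, for instance, by labeling the root TT and the inner node AT.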

Lifted Tree Alignments Approximation analysis

Claim: Optimal LTA 2-approximates general tree alignments

• We’ll show construction of LTA which costs at most twice the optimal TA with sequence-labeled nodes

(? can be generalized for profile-labeled nodes ?)

Notations:

• T* - optimal TA labels

• Sv* - label of node v in T*

• TL – our constructed LTA

• SvL - label of node v in TL



Construction:

• We label the nodes bottom-up.

• For node v with daughters u1,…,ul – we choose the label (from Su1L, …, SulL) closest to Sv*

We need to show: D(TL) ≤ 2D(T*)



Analysis:

• Some edges in TL have cost 0

• Observe edges (v,u) of cost > 0:

• Si - label of father (v)

• Sj - label of daughter (u)

• P(v,u) – the path in T* from v to the leaf labeled by Sj

D(Si,Sj) ≤ D(Si,Sv*) + D(Sj,Sv*)   (triangle inequality)
         ≤ 2D(Sj,Sv*)              (choice of i: Si is the daughter label closest to Sv*)
         ≤ 2D(P(v,u))              (triangle inequality along the path)

If (v,u) and (v’,u’) are two different edges with cost > 0 in TL, then P(v,u) and P(v’,u’) are mutually disjoint in edges

Summing D(Si,Sj) ≤ 2D(P(v,u)) over all edges of cost > 0 in TL counts each edge of T* at most once, so D(TL) ≤ 2D(T*)

Final Remarks:

• Lifted tree alignment TL is only conceptual (we don’t have T*)

• Optimal LTA cannot cost more than TL

• In case of profile-labeled nodes: construction and analysis OK when cost is still a distance function

Q.E.D.
