
Multiple Sequence Alignment

S1 = AGGTC

S2 = GTTCG

S3 = TGAAC

Possible alignment:

A G G T - C -
- G - T T C G
T G - A A C -

Possible alignment:

A G G T - C -
G T T - - C G
- T G A A A C


Multiple Sequence Alignment (cont)

Input: sequences S1, S2, …, Sk over the same alphabet

Output: gapped sequences S’1, S’2, …, S’k such that

1. |S’1| = |S’2| = … = |S’k|

2. Removing the spaces from S’i yields Si

The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments it induces.
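The two output conditions can be checked mechanically. The sketch below is only an illustration: the function name `is_valid_alignment` and the use of `-` as the space symbol are assumptions, not part of the slides.

```python
def is_valid_alignment(seqs, gapped, gap="-"):
    """Check the two output conditions of a multiple alignment.

    seqs   -- the original sequences S1..Sk
    gapped -- the candidate gapped sequences S'1..S'k
    """
    if len(seqs) != len(gapped):
        return False
    # Condition 1: all gapped sequences have the same length.
    same_length = len({len(g) for g in gapped}) <= 1
    # Condition 2: removing the spaces from S'i gives back Si.
    restores = all(g.replace(gap, "") == s for s, g in zip(seqs, gapped))
    return same_length and restores


seqs = ["AGGTC", "GTTCG", "TGAAC"]
print(is_valid_alignment(seqs, ["AGGT-C-", "-G-TTCG", "TG-AAC-"]))
```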


Multiple Sequence Alignment Example

Consider the following alignment:

A C - C D B -
- C - A D B D
A - B C D A D

Scoring scheme: match = 0, mismatch/indel = -1

SP score: (-3) + (-5) + (-4) = -12
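The SP score of this example can be recomputed with a few lines of Python. This is a minimal sketch: the function name `sp_score` is hypothetical, and it assumes that a column pairing two spaces scores 0.

```python
from itertools import combinations


def sp_score(rows, match=0, mismatch=-1, indel=-1):
    """Sum-of-pairs score: sum of the scores of all induced pairwise alignments."""

    def column_score(x, y):
        if x == y:
            # Two spaces facing each other contribute nothing (assumption).
            return 0 if x == "-" else match
        if x == "-" or y == "-":
            return indel
        return mismatch

    return sum(
        column_score(x, y)
        for a, b in combinations(rows, 2)
        for x, y in zip(a, b)
    )


rows = ["AC-CDB-", "-C-ADBD", "A-BCDAD"]
print(sp_score(rows))  # -12
```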


Multiple Sequence Alignment Complexity

Given k strings of length n, a generalization of the pairwise DP algorithm finds an optimal SP alignment:

• Instead of a 2-dimensional table we have a k-dimensional table

• Each dimension is of length n+1

• Each entry depends on 2^k - 1 adjacent entries

Complexity: O(2^k n^k)

The problem is known to be NP-hard (no polynomial-time algorithm is known)
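To see where the 2^k - 1 factor comes from, the entries that one cell of the k-dimensional table depends on can be enumerated directly: each coordinate either steps back by one (a letter is consumed) or stays (a space is placed), and at least one coordinate must step back. The helper name `predecessors` is an illustrative assumption.

```python
from itertools import product


def predecessors(cell):
    """Table entries that the DP entry `cell` depends on.

    Each coordinate steps back by 0 or 1; the all-zero step is excluded,
    and steps that would leave the table are dropped.
    """
    result = []
    for step in product((0, 1), repeat=len(cell)):
        if any(step) and all(c - s >= 0 for c, s in zip(cell, step)):
            result.append(tuple(c - s for c, s in zip(cell, step)))
    return result


print(len(predecessors((1, 1, 1))))  # 7, i.e. 2^3 - 1 for an interior cell with k = 3
```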


Multiple Sequence Alignment Approximation Algorithm

We use cost instead of score and look for an alignment of minimal cost.

Assumption: the cost function δ is a distance function

• δ(x,x) = 0

• δ(x,y) = δ(y,x) ≥ 0

• δ(x,y) + δ(y,z) ≥ δ(x,z) (triangle inequality; e.g. the cost of a mismatch ≤ the cost of two indels)

D(S,T) - cost of a minimum-cost global alignment between S and T
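Whether a concrete cost scheme satisfies the three conditions can be tested by brute force over the alphabet plus the space symbol. The sketch below is illustrative only: `is_distance` and `make_delta` are hypothetical helper names, and the numeric costs are example values rather than values from the slides.

```python
from itertools import product


def is_distance(delta, symbols):
    """Check identity, symmetry, non-negativity and the triangle inequality."""
    for x, y, z in product(symbols, repeat=3):
        if delta(x, x) != 0:
            return False
        if delta(x, y) != delta(y, x) or delta(x, y) < 0:
            return False
        if delta(x, y) + delta(y, z) < delta(x, z):
            return False
    return True


def make_delta(mismatch, indel):
    """Build a simple cost function with one mismatch cost and one indel cost."""

    def delta(x, y):
        if x == y:
            return 0
        return indel if "-" in (x, y) else mismatch

    return delta


print(is_distance(make_delta(1, 1), "ACGT-"))  # True
print(is_distance(make_delta(3, 1), "ACGT-"))  # False: a mismatch costs more than two indels
```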


Multiple Sequence Alignment Approximation Algorithm

The ‘star’ algorithm:

Input: Γ - set of k strings S1, …, Sk.

1. Find the string S’ ∈ Γ (the center) that minimizes

$\sum_{S \in \Gamma \setminus \{S'\}} D(S, S')$

2. Denote S1 = S’ and the rest of the strings as S2, …, Sk

3. Iteratively add S2, …, Sk to the alignment as follows:

• Suppose S1, …, Si-1 are already aligned as S’1, …, S’i-1

• Align Si to S’1 to produce S’i and S’’1 aligned

• Adjust S’2, …, S’i-1 by adding spaces where spaces were added to S’’1

• Replace S’1 by S’’1
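The steps of the star algorithm can be sketched in Python as follows. This is a minimal illustration, not the lecturer's implementation: it assumes unit costs (0 for equal symbols, including space against space, and 1 otherwise), and the names `delta`, `align` and `star_alignment` are hypothetical.

```python
def delta(x, y):
    """Unit cost (an assumption): 0 for identical symbols, including '-' vs '-', else 1."""
    return 0 if x == y else 1


def align(a, b):
    """Optimal global alignment of a (may already contain '-') with b.

    Returns the gapped b, one flag per column telling whether a new space
    was inserted into a in that column, and the alignment cost.
    """
    n, m = len(a), len(b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = dp[i - 1][0] + delta(a[i - 1], "-")
    for j in range(1, m + 1):
        dp[0][j] = dp[0][j - 1] + delta("-", b[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j - 1] + delta(a[i - 1], b[j - 1]),
                           dp[i - 1][j] + delta(a[i - 1], "-"),
                           dp[i][j - 1] + delta("-", b[j - 1]))
    # Trace back from the last cell to recover one optimal alignment.
    gapped_b, inserted = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + delta(a[i - 1], b[j - 1]):
            gapped_b.append(b[j - 1]); inserted.append(False); i -= 1; j -= 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + delta(a[i - 1], "-"):
            gapped_b.append("-"); inserted.append(False); i -= 1
        else:
            gapped_b.append(b[j - 1]); inserted.append(True); j -= 1
    return "".join(reversed(gapped_b)), inserted[::-1], dp[n][m]


def star_alignment(seqs):
    """Return (index of the center, gapped sequences in the input order)."""
    k = len(seqs)
    dist = [[align(s, t)[2] for t in seqs] for s in seqs]
    center = min(range(k), key=lambda i: sum(dist[i]))   # step 1
    rows = {center: seqs[center]}
    for i in range(k):                                   # step 3
        if i == center:
            continue
        gapped_i, inserted, _ = align(rows[center], seqs[i])
        widened = {}
        for idx, row in rows.items():
            symbols = iter(row)
            # Add a space to every aligned row wherever the center got a new one.
            widened[idx] = "".join("-" if new else next(symbols) for new in inserted)
        widened[i] = gapped_i
        rows = widened
    return center, [rows[i] for i in range(k)]


center, rows = star_alignment(["AGGTC", "GTTCG", "TGAAC"])
print(center)
print("\n".join(rows))
```

Recording the inserted columns explicitly, rather than comparing the old and new center strings, avoids ambiguity when a new space lands next to an existing one.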


Multiple Sequence Alignment Approximation Algorithm

Time analysis:

• Choosing S1 - execute DP for all sequence pairs - O(k^2 n^2)

• Adding Si to the alignment - execute DP for Si, S’1 - O(i·n^2)
(in the i-th stage the length of S’1 can be up to i·n)

Total complexity:

$\sum_{i=1}^{k-1} O(i \cdot n^2) = O(k^2 n^2)$


Multiple Sequence Alignment Approximation Algorithm

Approximation ratio:

• M* - optimal alignment

• M - the alignment produced by this algorithm

• d(i,j) - the distance M induces on the pair Si, Sj

$v(M) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j) = 2 \sum_{i<j} d(i,j)$

For all i: d(1,i) = D(S1,Si)
(we perform an optimal alignment between S’1 and Si, and δ(-,-) = 0)


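Multiple Sequence Alignment Approximation Algorithm

Approximation ratio (cont):

Upper bound on v(M), using the triangle inequality:

$v(M) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d(i,j) \le \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} \left[ d(1,i) + d(1,j) \right] = 2(k-1) \sum_{l=2}^{k} d(1,l) = 2(k-1) \sum_{l=2}^{k} D(S_1,S_l)$

Lower bound on v(M*), where d*(i,j) is the distance M* induces on the pair Si, Sj:

$v(M^*) = \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} d^*(i,j) \ge \sum_{i=1}^{k} \sum_{j=1, j \ne i}^{k} D(S_i,S_j) \ge \sum_{i=1}^{k} \sum_{j=2}^{k} D(S_1,S_j) = k \sum_{j=2}^{k} D(S_1,S_j)$

The second inequality follows from the definition of S1:

$\forall i: \quad \sum_{j=2}^{k} D(S_1,S_j) \le \sum_{j=1, j \ne i}^{k} D(S_i,S_j)$

Hence

$\frac{v(M)}{v(M^*)} \le \frac{2(k-1)}{k} < 2$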
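The two bounds can be evaluated numerically from a matrix of pairwise alignment costs. The sketch below is an illustration under assumptions: `star_bounds` is a hypothetical helper, and the 3×3 matrix is made-up example data, not taken from the slides.

```python
def star_bounds(D):
    """Given a symmetric matrix D[i][j] = D(Si, Sj), return
    (center index, upper bound on v(M), lower bound on v(M*))."""
    k = len(D)
    center = min(range(k), key=lambda i: sum(D[i]))
    center_sum = sum(D[center])          # sum over l of D(S1, Sl)
    upper = 2 * (k - 1) * center_sum     # v(M)  <= 2(k-1) * center_sum
    lower = k * center_sum               # v(M*) >= k * center_sum
    return center, upper, lower


D = [[0, 2, 3],
     [2, 0, 4],
     [3, 4, 0]]
center, upper, lower = star_bounds(D)
print(center, upper, lower, upper / lower)
```

For k = 3 the ratio of the two bounds is 2(k-1)/k = 4/3, in line with the bound of 2.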


Multiple Sequence Alignment Reminder

S1 = AGGTC

S2 = GTTCG

S3 = TGAAC

Possible alignment:

A G G T - C -
- G - T T C G
T G - A A C -

Possible alignment:

A G G T - C -
G T T - - C G
- T G A A A C


Multiple Sequence Alignment Reminder

Input: sequences S1, S2, …, Sk over the same alphabet

Output: gapped sequences S’1, S’2, …, S’k such that

1. |S’1| = |S’2| = … = |S’k|

2. Removing the spaces from S’i yields Si

The sum-of-pairs (SP) score of a multiple global alignment is the sum of the scores of all pairwise alignments it induces.


Multiple Sequence Alignment Reminder

The ‘star’ algorithm:

Input: Γ - set of k strings S1, …, Sk.

1. Find the string S’ (the center, denoted S1) that minimizes

$\sum_{S \in \Gamma \setminus \{S'\}} D(S, S')$

2. Iteratively add S2, …, Sk to the alignment

Finds an MA costing at most twice the optimal cost!

Problem: conventional MA does not correctly model evolutionary relationships


Tree Alignment

Input:

• X - set of sequences

• T - phylogenetic tree on X (leaves labeled by X)

Output: labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.

How do we label internal vertices?

• Sequences

• Profiles (multiple alignments)


Profile Alignment

A profile of an MA of length n over alphabet Σ is a (|Σ|+1)×n table.

Column i holds the distribution of Σ (and the gap) in position i.

Alignment:

A G G T - C -
- G - T T C G
T G - A A C -

Profile:

    1  2  3  4  5  6  7
A   1  0  0  1  1  0  0
T   1  0  0  2  1  0  0
G   0  3  1  0  0  0  1
C   0  0  0  0  0  3  0
-   1  0  2  0  1  0  2

Σ: 3 (each column sums to the number of aligned sequences)
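The profile table can be derived from the alignment by counting symbols column by column. A minimal sketch, assuming the alignment is given as a list of equal-length strings; the name `build_profile` is hypothetical.

```python
def build_profile(rows, alphabet="ATGC-"):
    """Count, for every symbol, how often it occurs in each alignment column."""
    columns = list(zip(*rows))  # transpose: one tuple per alignment column
    return {sym: [col.count(sym) for col in columns] for sym in alphabet}


profile = build_profile(["AGGT-C-", "-G-TTCG", "TG-AAC-"])
for sym, counts in profile.items():
    print(sym, *counts)
```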


Profile Alignment

Aligning a sequence to a profile:

• Matching a letter to a position: weighted average of the scores

• Indels: introducing new columns gets special consideration

(the same goes for aligning two profiles)

    1  2  3  4  5  6  7
A   1  0  0  1  1  0  0
T   1  0  0  2  1  0  0
G   0  3  1  0  0  0  1
C   0  0  0  0  0  3  0
-   1  0  2  0  1  0  2

Σ: 3 (each column sums to the number of aligned sequences)
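One way to read "weighted average of scores" is to weight the cost of the letter against each symbol by that symbol's frequency in the column. The sketch below follows that reading; it is an assumption rather than the slides' definition, it uses unit costs, and `letter_vs_column` is a hypothetical name. Exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def letter_vs_column(letter, column, delta):
    """Weighted average cost of matching `letter` against one profile column.

    column -- mapping symbol -> count in that column
    delta  -- cost function on pairs of symbols
    """
    total = sum(column.values())
    return sum(Fraction(count, total) * delta(letter, sym)
               for sym, count in column.items())


def unit(x, y):
    """Unit cost (an assumption): 0 for equal symbols, 1 otherwise."""
    return 0 if x == y else 1


first_column = {"A": 1, "T": 1, "-": 1}   # column 1 of the profile above
print(letter_vs_column("A", first_column, unit))  # 2/3
```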


Clustal Algorithm

Iteratively constructs an MA for the intermediate nodes:

• At each point holds profiles for all leaves

• Chooses the closest pair of neighbors

- neighbors - leaves that have a common father in T

- distance - cost of the optimal (pairwise) alignment

• Aligns the two profiles to get the ‘father-profile’

• Replaces the two leaves with their father

Analysis:

• Initialization - O(k^2) alignments

• k-1 iterations

• Iteration i involves k-i-1 new pairwise alignments

ClustalW - a more advanced version in which sequences/profiles are weighted


Lifted Tree Alignments

Lifted tree alignment - each internal node is labeled by one of the labels of its daughters.

Internal nodes are sequences and not profiles.

Example:

[Figure: tree with leaves S1, …, S6 whose internal nodes carry the lifted labels S2, S4, S4, S5]

We’ll show:

1. A DP algorithm for the optimal lifted tree alignment

2. The optimal lifted alignment is a 2-approximation of the optimal tree alignment


Lifted Tree Alignments Algorithm

Input:

• X - set of sequences

• T - phylogenetic tree on X (leaves labeled by X)

Output: lifted labels on the internal vertices of T such that the sum of the costs of all edges of T is minimal.

Basic principle: calculate for every node v in T and every sequence S in X:

d(v,S) - the optimal cost of v’s subtree when v is labeled by S

The cost of the optimal tree is

$\min_{S \in X} d(\mathrm{root}, S)$


Lifted Tree Alignments Algorithm

d(v,S) - the optimal cost of v’s subtree when v is labeled by S

Initialization: for a leaf v labeled Sv -

$d(v,S) = \begin{cases} 0 & S = S_v \\ \infty & S \ne S_v \end{cases}$

Recurrence: for an internal node v with daughters u1, …, ul -

$d(v,S) = \sum_{i=1}^{l} \min_{S' \in X} \left[ D(S,S') + d(u_i,S') \right]$

Correctness: check the optimal-substructure property.

Complexity:

• O(k^2) pairwise alignments - O(n^2 k^2)

• k-1 iterations; for an internal node v - O(k_v^2) work, which sums to O(k^2 depth(T)) = O(k^3)

Total: O(k^2 (n^2 + depth(T)))
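The DP over d(v,S) can be sketched in Python as below. This is an illustration under stated assumptions, not the slides' code: the tree is a nested tuple whose leaves are strings, leaf sequences are assumed to be distinct, D is taken to be the unit-cost alignment distance, and the names `edit_cost` and `lifted_alignment` are hypothetical. Instead of storing infinite entries, the sketch only keeps labels that occur below a node, and it forces the daughter that contains S to carry S so that the labeling is really lifted.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def edit_cost(s, t):
    """Unit-cost global alignment cost D(s, t) (an assumed distance function)."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j - 1] + (a != b), prev[j] + 1, cur[j - 1] + 1))
        prev = cur
    return prev[-1]


def lifted_alignment(tree, D=edit_cost):
    """Return (optimal lifted cost, labeled tree as (label, children))."""

    def solve(node):
        # Maps each admissible label S to (d(node, S), labeled subtree).
        if isinstance(node, str):
            return {node: (0, (node, ()))}
        tables = [solve(child) for child in node]
        result = {}
        for table in tables:
            for S in table:  # lifted: S must be the label of some daughter
                total, kids = 0, []
                for t in tables:
                    if S in t:
                        best = S  # the daughter holding S keeps that label
                    else:
                        best = min(t, key=lambda Sp: D(S, Sp) + t[Sp][0])
                    total += D(S, best) + t[best][0]
                    kids.append(t[best][1])
                result[S] = (total, (S, tuple(kids)))
        return result

    table = solve(tree)
    best_label = min(table, key=lambda S: table[S][0])
    return table[best_label]


cost, labeled = lifted_alignment((("A", "AB"), "ABC"))
print(cost, labeled[0])
```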


Lifted Tree Alignments Approximation analysis

Claim: the optimal LTA 2-approximates general tree alignments

• We’ll show a construction of an LTA which costs at most twice the optimal TA with sequence-labeled nodes

(can this be generalized to profile-labeled nodes?)

Notation:

• T* - optimal TA labels

• Sv* - label of node v in T*

• TL - our constructed LTA

• SvL - label of node v in TL


Lifted Tree Alignments Approximation analysis

Construction:

• We label the nodes bottom-up.

• For a node v with daughters u1, …, ul we choose the label (from Su1L, …, SulL) closest to Sv*

We need to show: D(TL) ≤ 2D(T*)


Lifted Tree Alignments Approximation analysis

Analysis:

• Some edges in TL have cost 0

• Consider an edge (v,u) of cost > 0:

• Si - label of the father (v)

• Sj - label of the daughter (u)

• P(v,u) - the path in T* from v to the leaf labeled by Sj

D(Si,Sj) ≤ D(Si,Sv*) + D(Sj,Sv*)   (triangle inequality)

         ≤ 2D(Sj,Sv*)              (choice of i)

         ≤ 2D(P(v,u))              (triangle inequality)


Lifted Tree Alignments Approximation analysis

D(Si,Sj) ≤ 2D(P(v,u))

If (v,u) and (v’,u’) are two different edges with cost > 0 in TL, then P(v,u) and P(v’,u’) are edge-disjoint. Summing over all edges of TL with cost > 0 therefore gives D(TL) ≤ 2D(T*).

Final remarks:

• The lifted tree alignment TL is only conceptual (we don’t have T*)

• The optimal LTA cannot cost more than TL

• In the case of profile-labeled nodes: the construction and analysis hold as long as the cost is still a distance function

Q.E.D.