Pairwise Sequence alignment Basic Algorithmsbccg131/wiki.files/2 pariwise...3 Literature list •...

Pairwise Sequence alignment Basic Algorithms


Agenda

Previous lesson:

- Administration (Minhala)

- Biological story on biomolecular sequences

- General overview of problems in computational biology

Today:

- Reminder: dynamic programming

- Algorithms for global and local sequence alignment, plus variants

- Bioinformatic motivation for sequence alignment


Literature list

• Alberts, B. et al. Essential Cell Biology: An Introduction to the Molecular Biology of the Cell.

• Mount, D.W. Bioinformatics: Sequence and Genome Analysis.

• Jones, N.C. & Pevzner, P.A. An Introduction to Bioinformatics Algorithms.

• R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.

• Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.


• Move to Slides On Dynamic Programming…


Sequence Comparison (cont)

• We look for the following kinds of similarity between sequences:

• Find similar proteins – allows us to predict function and structure

• Locate similar subsequences in DNA – allows us to identify, e.g., regulatory elements

• Locate DNA sequences that might overlap – helps in sequence assembly

Sequence Modifications

• Three types of changes

– Substitution (point mutation), e.g. TCAGT → TCCGT

– Insertion, e.g. TCAGT → TCGAGT

– Deletion, e.g. TCAGT → TCGT

• Insertions and deletions together are called indels (e.g. caused by replication slippage)

Choosing Alignments

There are many possible alignments. For example, compare:

-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A

to

------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--

Which one is better?


Another example

Given two sequences:

X: TGCATAT

Y: ATCCGAT

Question:

How can X be transformed into Y?

Or,

How did Y evolve from X?

One possible transformation (5 operations):

TGCATAT → TGCATA (delete T)
TGCATA → TGCAT (delete A)
TGCAT → ATGCAT (insert A)
ATGCAT → ATCCAT (substitute G → C)
ATCCAT → ATCCGAT (insert G)

Alignment:

-TGC-ATAT
ATCCGAT--

Another possible transformation (4 operations):

TGCATAT → ATGCATAT (insert A)
ATGCATAT → ATGCAAT (delete T)
ATGCAAT → ATGCGAT (substitute A → G)
ATGCGAT → ATCCGAT (substitute G → C)

Alignment:

-TGCATAT
ATCCG-AT

Which one is better?
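As a small illustration of how an alignment encodes a transformation, the sketch below counts the edit operations implied by a gapped alignment: every column in which the two rows differ (a substitution, or a residue facing a gap) counts as one operation. The function name `count_operations` is hypothetical and chosen only for this example.

```python
def count_operations(top: str, bottom: str) -> int:
    """Count the alignment columns that imply an edit operation.

    A column costs one operation unless both rows show the same residue:
    a residue facing '-' is an insertion or deletion, and two different
    residues are a substitution.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    return sum(1 for a, b in zip(top, bottom) if a != b)


# The two alignments of X = TGCATAT and Y = ATCCGAT from the slides.
print(count_operations("-TGC-ATAT", "ATCCGAT--"))  # 5
print(count_operations("-TGCATAT", "ATCCG-AT"))    # 4
```

Counting operations is only one possible measure of quality; the scoring schemes introduced next weight matches, mismatches and gaps differently.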

How do we quantify sequence similarity?

To align two sequences, we need a quantitative model for evaluating the similarity between them.

Scoring Similarity

• Assume independent mutation model

– Each site considered separately

• Score at each site

– Positive if the same

– Negative if different

• Sum to make final score

– Can be positive or negative

– Significance depends on sequence length

Example:

GTAGTC
CTAGCG
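The independent-site model can be sketched in a few lines: each position is scored on its own and the scores are summed. The values +1 for a match and -1 for a mismatch are assumptions made for this illustration; the slide only states that matches score positive and mismatches negative.

```python
def site_score(x: str, y: str, match: int = 1, mismatch: int = -1) -> int:
    """Sum independent per-site scores of two equal-length sequences.

    The default match/mismatch values are illustrative assumptions.
    """
    if len(x) != len(y):
        raise ValueError("sequences must have equal length")
    return sum(match if a == b else mismatch for a, b in zip(x, y))


# Three matching sites (T, A, G) and three mismatching sites cancel out.
print(site_score("GTAGTC", "CTAGCG"))  # 0
```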

Pairwise Alignment - Identity

Human Hemoglobin (HH) vs Sperm Whale Myoglobin (SWM):

(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
      |||      |   | || |     |
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG

• Percent Identity: 36.000 (9/25; | only)

Pairwise Alignment - Similarity

(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
      |||  .   |   | || |     |
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG

• Percent Similarity: 40.000 (| and .)

• Percent Identity: 36.000 (| only)

D and E are similar:

1. Their structures are similar.

2. Both are acidic and hydrophilic.

3. A single mutation can convert one into the other.
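The two percentages for the ungapped HH/SWM comparison can be reproduced with a short sketch. The set `SIMILAR` is an assumption: it contains only the D/E pair named on the slide, whereas real tools derive similarity from a full substitution matrix.

```python
HH = "VLSPADKTNVKAAWGKVGAHAGYEG"
SWM = "VLSEGEWQLVLHVWAKVEADVAGHG"

# Assumption: only the D/E pair from the slide is treated as "similar".
SIMILAR = {frozenset("DE")}


def percent_identity(x: str, y: str) -> float:
    """Percentage of positions holding identical residues."""
    return 100.0 * sum(a == b for a, b in zip(x, y)) / len(x)


def percent_similarity(x: str, y: str, similar=SIMILAR) -> float:
    """Percentage of positions holding identical or similar residues."""
    hits = sum(a == b or frozenset((a, b)) in similar for a, b in zip(x, y))
    return 100.0 * hits / len(x)


print(percent_identity(HH, SWM))    # 36.0
print(percent_similarity(HH, SWM))  # 40.0
```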

Pairwise Alignment – Gap insertion

(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
      |||  .   |   | || |  ||  |
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G

• Gaps: 2

• Percent Similarity: 54.167

• Percent Identity: 45.833 (11/24, counting only the ungapped columns)

Pairwise Alignment - Scoring

• The final score of the alignment is the sum of the positive scores and the penalty scores:

Alignment score = (number of identities) + (number of similarities) − (number of gap insertions)

Pairwise Alignment - Scoring

(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
      |||  .   |   | || |  ||  |
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G

Final score:

(V,V) + (L,L) + (S,S) + (D,E) + … − (penalty for gap insertion) × (number of gaps) − (penalty for gap extension) × (extension length)

Here (a,b) denotes the score for aligning residue a with residue b.

We are interested in both the score and the alignment trace.
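The final-score formula can be sketched as follows. Everything numeric here is an assumption for illustration: `toy_pair_score` (2 for an identity, 1 for the D/E pair, 0 otherwise) stands in for a real substitution matrix, the penalties 3 and 1 are arbitrary, and "extension length" is interpreted as the number of gap positions beyond the first one in each run of gaps.

```python
def toy_pair_score(a: str, b: str) -> int:
    """Hypothetical residue-pair score: identity 2, D/E similarity 1, else 0."""
    if a == b:
        return 2
    if {a, b} == {"D", "E"}:
        return 1
    return 0


def alignment_score(top, bottom, pair_score, gap_open, gap_extend):
    """Score a gapped alignment with separate gap-opening and extension penalties."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    score = 0
    prev_gap_row = None  # row that held a gap in the previous column, if any
    for a, b in zip(top, bottom):
        if a == "-" or b == "-":
            gap_row = "top" if a == "-" else "bottom"
            # Continuing a gap in the same row extends it; otherwise a new gap opens.
            score -= gap_extend if gap_row == prev_gap_row else gap_open
            prev_gap_row = gap_row
        else:
            score += pair_score(a, b)
            prev_gap_row = None
    return score


hh = "VLSPADKTNVKAAWGKVGAH-AGYEG"
swm = "VLSEGEWQLVLHVWAKVEADVAGH-G"
# 11 identities * 2 + 1 similarity * 1 - 2 gap openings * 3 = 17
print(alignment_score(hh, swm, toy_pair_score, gap_open=3, gap_extend=1))  # 17
```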


Optimum Alignment

The score of an alignment is a measure of its quality.

Optimum alignment problem: Given a pair of sequences X and Y, find an alignment (global or local) with maximum score.

The similarity between X and Y, denoted sim(X,Y), is the maximum score of an alignment of X and Y.

Computing Optimal Score

• How can we compute the optimal score?

– If |s| = n and |t| = m, the number A(m,n) of possible “legal” alignments is large! For m of the order of n:

$$A(m,n) \approx A(n,n) = \binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$

– Exercise: derive the last approximation from Stirling’s formula, $x! \approx \sqrt{2\pi}\, x^{x+1/2} e^{-x}$.

• We perform dynamic programming to compute the optimal score efficiently.
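To get a feeling for how fast the number of alignments grows, the following sketch compares the exact binomial coefficient with the Stirling-based estimate $2^{2n}/\sqrt{\pi n}$. The function names are hypothetical; `math.comb` requires Python 3.8 or later.

```python
import math


def num_alignments(n: int) -> int:
    """Exact value of the binomial coefficient C(2n, n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n: int) -> float:
    """Estimate of C(2n, n) obtained from Stirling's formula."""
    return 2 ** (2 * n) / math.sqrt(math.pi * n)


for n in (5, 10, 50):
    exact = num_alignments(n)
    ratio = stirling_estimate(n) / exact
    print(f"n={n:3d}  exact={exact}  estimate/exact={ratio:.4f}")
```

Even for n = 50 the count exceeds 10^29, which is why enumerating all alignments is not an option.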

Manhattan Tourist Problem (MTP)

Imagine seeking a path from source to sink, traveling only eastward and southward, that passes the largest number of attractions (*) in the Manhattan grid.

[Figure: Manhattan street grid with attractions (*) scattered between the source and the sink.]


Manhattan Tourist Problem: Formulation

Goal: Find the longest (highest scoring) path in a weighted grid.

Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink”

Output: A longest path in G from “source” to “sink”

MTP: An Example

[Figure: weighted grid with coordinates i (rows) and j (columns), a source vertex and a sink vertex, annotated with edge weights and path scores.]

MTP: Greedy Algorithm Is Not Optimal

[Figure: weighted grid from source to sink on which a greedy path makes a promising start but is led to bad choices; two path scores, 18 and 22, are marked.]

MTP: Dynamic Programming

• Calculate the optimal path score for each vertex in the graph

• Each vertex’s score is the maximum, over its predecessor vertices, of the predecessor’s score plus the weight of the connecting edge

[Figure: top-left corner of the grid (coordinates i, j) next to the source; the first step gives S1,0 = 5 and S0,1 = 1.]

MTP: Dynamic Programming (cont’d)

[Figure: the next vertices of the grid are filled in: S2,0 = 8, S1,1 = 4, S0,2 = 3.]

MTP: Dynamic Programming (cont’d)

[Figure: the next vertices of the grid are filled in: S3,0 = 8, S2,1 = 9, S1,2 = 13, S0,3 = 8.]

MTP: Dynamic Programming (cont’d)

The greedy algorithm fails!

[Figure: the next vertices of the grid are filled in: S3,1 = 9, S2,2 = 12, S1,3 = 8.]

MTP: Dynamic Programming (cont’d)

[Figure: the next vertices of the grid are filled in: S3,2 = 9, S2,3 = 15.]

MTP: Dynamic Programming (cont’d)

Done! S3,3 = 16

[Figure: the completed grid, showing all back-traces.]

Scores S(i,j) of the completed grid:

      j=0  j=1  j=2  j=3
i=0     0    1    3    8
i=1     5    4   13    8
i=2     8    9   12   15
i=3     8    9    9   16

MTP: Recurrence

The score for a point (i,j) is computed by the recurrence relation:

$$s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\ s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \end{cases}$$

The running time is proportional to n × m, i.e. O(nm), for an n by m grid

(n = # of rows, m = # of columns)
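The recurrence translates almost directly into code. In the sketch below the grid encoding is an assumption made for this example: `down[i][j]` is the weight of the edge from (i, j) to (i+1, j) and `right[i][j]` is the weight of the edge from (i, j) to (i, j+1). The edge weights in the usage example are made up and are not the ones from the slides.

```python
def manhattan_tourist(down, right):
    """Return the table s of best path scores from the source (0, 0).

    down  has n rows of m+1 weights (vertical edges),
    right has n+1 rows of m weights (horizontal edges).
    """
    n = len(down)       # number of vertical steps
    m = len(right[0])   # number of horizontal steps
    s = [[0] * (m + 1) for _ in range(n + 1)]
    # First column and first row have a single predecessor each.
    for i in range(1, n + 1):
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for j in range(1, m + 1):
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    # Every other vertex takes the better of "from above" and "from the left".
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s


# Hypothetical 3 x 3 vertex grid (n = 2, m = 2).
down = [[1, 0, 2],
        [4, 6, 5]]
right = [[3, 2],
         [0, 7],
         [3, 3]]
table = manhattan_tourist(down, right)
print(table[-1][-1])  # 15
```

Each vertex is visited once and does constant work, which matches the stated running time.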

Adding Diagonal Edges to the Grid

What about diagonals?

• The score at point B is given by:

$$s_B = \max \begin{cases} s_{A_1} + \text{weight of the edge } (A_1, B) \\ s_{A_2} + \text{weight of the edge } (A_2, B) \\ s_{A_3} + \text{weight of the edge } (A_3, B) \end{cases}$$

[Figure: vertex B with incoming edges from its three neighbors A1, A2 and A3.]

Adding Diagonal Edges to the Grid

More generally, the score for a point x is given by the recurrence relation:

$$s_x = \max_{y \in \mathrm{Predecessors}(x)} \bigl( s_y + \text{weight of the edge } (y, x) \bigr)$$

• Predecessors(x) – the set of vertices that have edges leading to x

Traveling in the Grid

• The only hitch is that one must decide on the order in which to visit the vertices

• By the time vertex x is analyzed, the values s_y for all its predecessors y must already be computed – otherwise we are in trouble

• We need to traverse the vertices in some order

• Try to find such an order for a directed acyclic grid graph
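For a general directed acyclic graph, one way to obtain a valid visiting order is a topological sort. The sketch below is one possible realization, not the lecture's own code: it uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), and the small graph with vertices S, A1, A2, A3, B and its edge weights are hypothetical.

```python
from graphlib import TopologicalSorter

NEG_INF = float("-inf")


def longest_path_scores(edges, source):
    """Best path score from `source` to every vertex of a weighted DAG.

    edges maps (y, x) to the weight of the edge y -> x.
    Vertices that cannot be reached from the source keep the score -inf.
    """
    preds = {}
    for (y, x), w in edges.items():
        preds.setdefault(x, []).append((y, w))
        preds.setdefault(y, [])
    # static_order() yields every vertex after all of its predecessors.
    order = TopologicalSorter(
        {x: [y for y, _ in ps] for x, ps in preds.items()}
    ).static_order()
    score = {}
    for x in order:
        if x == source:
            score[x] = 0
        else:
            score[x] = max((score[y] + w for y, w in preds[x]), default=NEG_INF)
    return score


# Hypothetical weights: B has the three predecessors A1, A2, A3.
edges = {("S", "A1"): 2, ("S", "A2"): 1, ("S", "A3"): 4,
         ("A1", "B"): 3, ("A2", "B"): 5, ("A3", "B"): 1}
print(longest_path_scores(edges, "S")["B"])  # 6
```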

Traversing the Manhattan Grid

• 3 different strategies:

• a) Column by column

• b) Row by row

• c) Along diagonals

[Figure: the three traversal orders a), b) and c) drawn on the grid.]
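The three strategies can be written down and checked mechanically. This sketch generates each order for a grid with vertices (0..n, 0..m) and verifies that every vertex is visited only after its upper, left and diagonal predecessors; all function names are hypothetical.

```python
def column_by_column(n, m):
    return [(i, j) for j in range(m + 1) for i in range(n + 1)]


def row_by_row(n, m):
    return [(i, j) for i in range(n + 1) for j in range(m + 1)]


def along_diagonals(n, m):
    # All vertices with the same i + j = d lie on one anti-diagonal.
    return [(i, d - i)
            for d in range(n + m + 1)
            for i in range(max(0, d - m), min(n, d) + 1)]


def is_valid_order(order):
    """True if each vertex appears after all of its grid predecessors."""
    seen = set()
    for i, j in order:
        for p in ((i - 1, j), (i, j - 1), (i - 1, j - 1)):
            if min(p) >= 0 and p not in seen:
                return False
        seen.add((i, j))
    return True


for strategy in (column_by_column, row_by_row, along_diagonals):
    print(strategy.__name__, is_valid_order(strategy(3, 4)))  # True for all three
```

The diagonal order has the additional property that vertices on the same anti-diagonal do not depend on each other, so they could be computed in parallel.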

Comparison methods

• Global alignment – finds the best alignment across the entire length of both sequences.

• Local alignment – finds regions of similarity in parts of the sequences.

[Figure: schematic contrasting a global alignment with a local alignment.]

Global Alignment

• Algorithm of Needleman and Wunsch (1970)

• Finds the alignment of two complete sequences:

ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ

• Some global alignment programs “trim ends”

Local Alignment

• Algorithm of Smith and Waterman (1981).

• Makes an optimal alignment of the best segment of similarity between two sequences.

ADLG CDRYFQ

|||| |||| |

ADLG CDRYYQ

• Can return a number of highly aligned segments.
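The slide names the Smith–Waterman algorithm without giving its recurrence. As a hedged sketch of the standard formulation (not taken from these slides), local alignment uses the same three cases as global alignment plus the option to restart with score 0, and the answer is the largest entry anywhere in the table. The scoring values +1, -1, -1 are assumptions for this example.

```python
def smith_waterman_score(s, t, match=1, mismatch=-1, gap=-1):
    """Best local alignment score of s and t (linear gap penalty).

    Only the previous row is kept, so memory is O(len(t)).
    """
    best = 0
    prev = [0] * (len(t) + 1)
    for i in range(1, len(s) + 1):
        cur = [0]
        for j in range(1, len(t) + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            cell = max(0,                  # restart: drop everything before
                       prev[j - 1] + w,    # align s[i-1] with t[j-1]
                       prev[j] + gap,      # s[i-1] against a space
                       cur[j - 1] + gap)   # t[j-1] against a space
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(smith_waterman_score("ACGT", "TTACGTTT"))  # 4
print(smith_waterman_score("AAAA", "TTTT"))      # 0
```

This version returns only the score; recovering the aligned segment would require keeping the full table and tracing back from the maximal cell.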


Global Alignment: Algorithm

$$C(i,j) = \text{cost (score) of an optimum alignment of } S_{1..i} \text{ and } T_{1..j}$$

where $S_{1..i}$ is the prefix of S of length i and $T_{1..j}$ is the prefix of T of length j.

$$w(a,b) = \begin{cases} \text{match score} & \text{if } a = b \\ \text{mismatch score} & \text{if } a \neq b \end{cases}$$

Theorem. C(i,j) satisfies the following relationships:

Initial conditions:

$$C(i,0) = -\delta\, i, \qquad C(0,j) = -\delta\, j$$

Recurrence relation: For $1 \le i \le n$, $1 \le j \le m$:

$$C(i,j) = \max \begin{cases} C(i-1,j-1) + w(S_i,T_j) \\ C(i-1,j) - \delta \\ C(i,j-1) - \delta \end{cases}$$

where $\delta > 0$ denotes the penalty for aligning a symbol with a space.
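The initial conditions and the recurrence can be sketched as a table-filling routine. The default scores (+10 for a match, -2 for a mismatch, -5 for a space) are the ones used in the worked example later in the lecture, with S = CATTCAC and T = CTCGCAGC; the function name is hypothetical.

```python
def global_alignment_table(s, t, match=10, mismatch=-2, space=-5):
    """Fill the table C for global alignment; C[i][j] scores s[:i] against t[:j]."""
    n, m = len(s), len(t)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    # Initial conditions: a prefix aligned entirely against spaces.
    for i in range(1, n + 1):
        c[i][0] = i * space
    for j in range(1, m + 1):
        c[0][j] = j * space
    # Recurrence: three ways to end an alignment of the two prefixes.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            c[i][j] = max(c[i - 1][j - 1] + w,   # S_i lined up with T_j
                          c[i - 1][j] + space,   # S_i lined up with a space
                          c[i][j - 1] + space)   # T_j lined up with a space
    return c


table = global_alignment_table("CATTCAC", "CTCGCAGC")
print(table[1])       # [-5, 10, 5, 0, -5, -10, -15, -20, -25]
print(table[-1][-1])  # 33
```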

Example

Case 1: Line up Si with Tj (the rest aligns S1..Si-1 with T1..Tj-1)

S: C A T T C A C
T: C - T T C A G

Case 2: Line up Si with a space (the rest aligns S1..Si-1 with T1..Tj)

S: C A T T C A - C
T: C - T T C A G -

Case 3: Line up Tj with a space (the rest aligns S1..Si with T1..Tj-1)

S: C A T T C A C -
T: C - T T C A - G

Justification: Optimal Substructure Property Followed

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj-1 Tj

C(i-1,j-1) + w(Si,Tj)

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj —

C(i-1,j)

S1 S2 . . . Si —

T1 T2 . . . Tj-1 Tj

C(i,j-1)


Computation Procedure

[Figure: the table of values from C(0,0) to C(n,m); the entry C(i,j) is computed from its neighbors C(i-1,j-1), C(i-1,j) and C(i,j-1).]

$$C(i,j) = \max\{\, C(i-1,j-1) + w(S_i,T_j),\; C(i-1,j) - \delta,\; C(i,j-1) - \delta \,\}$$

where $\delta > 0$ denotes the penalty for aligning a symbol with a space.

+10 for match, -2 for mismatch, -5 for space

     λ    C    T    C    G    C    A    G    C
λ    0   -5  -10  -15  -20  -25  -30  -35  -40
C   -5   10    5
A  -10
T  -15
T  -20
C  -25
A  -30
C  -35

     λ    C    T    C    G    C    A    G    C
λ    0   -5  -10  -15  -20  -25  -30  -35  -40
C   -5   10    5    0   -5  -10  -15  -20  -25
A  -10    5    8    3   -2   -7    0   -5  -10
T  -15    0   15   10    5    0   -5   -2   -7
T  -20   -5   10   13    8    3   -2   -7   -4
C  -25  -10    5   20   15   18   13    8    3
A  -30  -15    0   15   18   13   28   23   18
C  -35  -20   -5   10   13   28   23   26   33

Traceback can yield both optimum alignments
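A traceback that follows every predecessor attaining the maximum recovers all optimum alignments. The sketch below is one way to do this, assuming the scores of the example (+10 match, -2 mismatch, -5 space) and rows S = CATTCAC, columns T = CTCGCAGC; the function name is hypothetical, and the recursion is only suitable for short sequences because the number of co-optimal alignments can grow quickly.

```python
def all_optimal_alignments(s, t, match=10, mismatch=-2, space=-5):
    """Return (optimal score, list of all optimal global alignments)."""
    n, m = len(s), len(t)

    def w(a, b):
        return match if a == b else mismatch

    # Fill the table exactly as in the forward computation.
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        c[i][0] = i * space
    for j in range(1, m + 1):
        c[0][j] = j * space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c[i][j] = max(c[i - 1][j - 1] + w(s[i - 1], t[j - 1]),
                          c[i - 1][j] + space,
                          c[i][j - 1] + space)

    results = []

    def walk(i, j, top, bottom):
        # Rows are built back to front and reversed on arrival at (0, 0).
        if i == 0 and j == 0:
            results.append((top[::-1], bottom[::-1]))
            return
        if i > 0 and j > 0 and c[i][j] == c[i - 1][j - 1] + w(s[i - 1], t[j - 1]):
            walk(i - 1, j - 1, top + s[i - 1], bottom + t[j - 1])
        if i > 0 and c[i][j] == c[i - 1][j] + space:
            walk(i - 1, j, top + s[i - 1], bottom + "-")
        if j > 0 and c[i][j] == c[i][j - 1] + space:
            walk(i, j - 1, top + "-", bottom + t[j - 1])

    walk(n, m, "", "")
    return c[n][m], results


score, alignments = all_optimal_alignments("CATTCAC", "CTCGCAGC")
print(score)  # 33
for top, bottom in alignments:
    print(top)
    print(bottom)
    print()
```

For this example the traceback branches at a single cell, which gives the two optimum alignments CAT-TCA-C / C-TCGCAGC and CATT-CA-C / C-TCGCAGC.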