Pairwise Sequence alignment Basic Algorithmsbccg131/wiki.files/2 pariwise...3 Literature list •...

Post on 06-Jun-2020

3 views 0 download

Transcript of Pairwise Sequence alignment Basic Algorithmsbccg131/wiki.files/2 pariwise...3 Literature list •...

Pairwise Sequence alignment Basic Algorithms

Agenda - Previous Lesson: Minhala

- + Biological Story on Biomolecular Sequences

- + General Overview of Problems in Computational

Biology

Today:

- Reminder: Dynamic Programming

-Algorithms for Global and Local Sequence Alignment + variants

-Bioinformatic Motivation for Sequence Alignment

3

Literature list

• Alberts, B et al. Essential Cell Biology: An introducton to the Molecular Biology of the Cell.

• Mount, D.W. Bioinformatics: Sequence and Genome Analysis.

• Jones N.C & Pevzner P.A. An introduction to Bioinformatics algorithms.

• R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.

• Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.

• Move to Slides On Dynamic Programming…

5/64

Sequence Comparison (cont)

• We seek the following similarities between sequences :

• Find similar proteins – Allows to predict function & structure

• Locate similar subsequences in DNA – Allows to identify (e.g) regulatory elements

• Locate DNA sequences that might overlap – Helps in sequence assembly

g1

g2

Sequence Modifications

• Three types of changes

– Substitution (point mutation)

– Insertion

– Deletion

6

TCAGT TCGAGT

TCCGT

TCGT

TCAGT

Indel (replication slippage)

7/64

Choosing Alignments

There are many possible alignments For example, compare:

-GCGC-ATGGATTGAGCGA

TGCGCCATTGAT-GACC-A

to ------GCGCATGGATTGAGCGA

TGCGCC----ATTGATGACCA--

Which one is better?

8/64

Another example

Given two sequences:

X: TGCATAT

Y: ATCCGAT

Question:

How can X be transformed into Y?

Or,

How did Y evolve from X?

9/64

TGCATAT

TGCATA

TGCAT

ATGCAT

ATCCAT

ATCCGAT

delete T

delete A

insert A

G C

insert G

One possible transformation

Alignment:

-TGC-ATAT

ATCCGAT--

5 operations

10/64

-TGCATAT

ATCCG-AT

TGCATAT

ATGCATAT

ATGCAAT

ATGCGAT

ATCCGAT

insert A

delete T

A G

Another possible transformation

Alignment:

4 operations

G C

Which one is better?

In order to align two sequences we need a quantitive model to evaluate similarity between sequences.

11

How do we quantitate sequence similarity?

Scoring Similarity

• Assume independent mutation model

– Each site considered separately

• Score at each site

– Positive if the same

– Negative if different

• Sum to make final score

– Can be positive or negative

– Significance depends on sequence length

12

GTAGTC

CTAGCG

Pairwise Alignment - Identity

(HH) VLSPADKTNVKAAWGKVGAHAGYEG

||| | | || | |

(SWM) VLSEGEWQLVLHVWAKVEADVAGHG

• Percent Identity: 36.000 (| only)

Human Hemoglobin (HH) vs Sperm Whale Myoglobin (SWM):

Pairwise Alignment - Similarity

(HH) VLSPADKTNVKAAWGKVGAHAGYEG

||| . | | || | |

(SWM) VLSEGEWQLVLHVWAKVEADVAGHG

• Percent Similarity: 40.000 (| and .)

• Percent Identity: 36.000 (| only)

D and E are similar:

1. structure is similar.

2. both are acidic and hydrophilic

3. one mutation can separate them

from one to the other.

Pairwise Alignment – Gap insertion

(HH) VLSPADKTNVKAAWGKVGAH-AGYEG

.

(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G

• Gaps: 2

• Percent Similarity: 54.167

• Percent Identity: 45.833 (12/26)

Pairwise Alignment - Scoring

• The final score of the alignment is the sum of the positive scores and penalty scores:

+ Number of Identities

+ Number of Similarities

- Number of gap insertions

Alignment score

Pairwise Alignment - Scoring

(HH) VLSPADKTNVKAAWGKVGAH-AGYEG

||| . | | || || |

(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G

Final score:

(V,V) + (L,L) + (S,S) + (D,E) + … - (penalty for gap insertion)*(number of gaps) - (penalty for gap extension)*(extension length)

We are interested in both the score and the alignment trace.

18

Optimum Alignment

The score of an alignment is a measure of its

quality

Optimum alignment problem: Given a pair of

sequences X and Y, find an alignment (global or

local) with maximum score

The similarity between X and Y, denoted

sim(X,Y), is the maximum score of an alignment of X and Y

19/64

Computing Optimal Score • How can we compute the optimal score ?

– If |s| = n and |t| = m, the number A(m,n) of possible “legal” alignments is large!

• we perform dynamic programming to compute the optimal score efficiently.

222( , ) ( , )

n

n

nA m n A n n

n

Stirling’s formula: Exercise 1 2! 2 x xx x e m of the

order of n

Manhattan Tourist Problem (MTP)

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Sink

*

*

*

*

*

* *

* *

*

*

Source

*

Manhattan Tourist Problem (MTP)

Imagine seeking a path (from source to sink) to travel (only eastward and southward) with the most number of attractions (*) in the Manhattan grid Sink

*

*

*

*

*

* *

* *

*

*

Source

*

Manhattan Tourist Problem: Formulation

Goal: Find the longest (highest scoring) path in a weighted grid.

Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink”

Output: A longest path in G from “source” to “sink”

MTP: An Example

3 2 4

0 7 3

3 3 0

1 3 2

4

4

5

6

4

6

5

5

8

2

2

5

0 1 2 3

0

1

2

3

j coordinate i c

oo

rdin

ate

13

source

sink

4

3 2 4 0

1 0 2 4 3

3

1

1

2

2

2

4 19

9 5

15

23

0

20

3

4

MTP: Greedy Algorithm Is Not Optimal 1 2 5

2 1 5

2 3 4

0 0 0

5

3

0

3

5

0

10

3

5

5

1

2 promising start, but leads to bad choices!

source

sink 18

22

1

5

0 1

0

1

i

source

1

5

S1,0 = 5

S0,1 = 1

• Calculate optimal path score for each vertex in the graph

• Each vertex’s score is the maximum of the prior vertices score plus the weight of the respective edge in between

MTP: Dynamic Programming j

MTP: Dynamic Programming

(cont’d)

1 2

5

3

0 1 2

0

1

2

source

1 3

5

8

4

S2,0 = 8

i

S1,1 = 4

S0,2 = 3 3

-5

j

MTP: Dynamic Programming

(cont’d)

1 2

5

3

0 1 2 3

0

1

2

3

i

source

1 3

5

8

8

4

0

5

8

10 3

5

-5

9

13

1 -5

S3,0 = 8

S2,1 = 9

S1,2 = 13

S3,0 = 8

j

MTP: Dynamic Programming (cont’d)

greedy alg. fails!

1 2 5

-5 1 -5

-5 3

0

5

3

0

3

5

0

10

-3

-5

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

9

12

S3,1 = 9

S2,2 = 12

S1,3 = 8

j

MTP: Dynamic Programming

(cont’d)

1 2 5

-5 1 -5

-5 3 3

0 0

5

3

0

3

5

0

10

-3

-5

-5

2

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

12

9

15

9

j

S3,2 = 9

S2,3 = 15

MTP: Dynamic Programming

(cont’d)

1 2 5

-5 1 -5

-5 3 3

0 0

5

3

0

3

5

0

10

-3

-5

-5

2

0 1 2 3

0

1

2

3

i

source

1 3 8

5

8

8

4

9

13 8

12

9

15

9

j

0

1

16 S3,3 = 16

(showing all back-traces)

Done!

MTP: Recurrence

Computing the score for a point (i,j) by the recurrence relation:

si, j = max

si-1, j + weight of the edge between (i-1, j) and (i, j)

si, j-1 + weight of the edge between (i, j-1) and (i, j)

The running time is n x m for a n by m grid

(n = # of rows, m = # of columns)

What about diagonals?

• The score at point B is given by:

sB = max of

sA1 + weight of the edge (A1, B)

sA2 + weight of the edge (A2, B)

sA3 + weight of the edge (A3, B)

B

A3

A1

A2

Adding Diagonal Edges to the Grid

More generally, computing the score for point x is given by the recurrence relation:

sx = max

of

sy + weight of vertex (y, x) where

y є Predecessors(x)

• Predecessors (x) – set of vertices that have edges leading to x

Adding Diagonal Edges to the Grid

Traveling in the Grid •The only hitch is that one must decide on the order in which visit the vertices

•By the time the vertex x is analyzed, the values sy for all its predecessors y should be computed – otherwise we are in trouble.

•We need to traverse the vertices in some order

•Try to find such order for a directed acyclic grid graph

???

Traversing the Manhattan Grid

• 3 different strategies:

• a) Column by column

• b) Row by row

• c) Along diagonals

a) b)

c)

Comparison methods

• Global alignment – Finds the best alignment across the whole two sequences.

• Local alignment – Finds regions of similarity in parts of the sequences. Global Local

_____ _______ __ ____

__ ____ ____ __ ____

Global Alignment

• Algorithm of Needleman and Wunsch (1970)

• Finds the alignment of two complete sequences: ADLGAVFALCDRYFQ

|||| |||| |

ADLGRTQN-CDRYYQ

• Some global alignment programs “trim ends”

Local Alignment

• Algorithm of Smith and Waterman (1981).

• Makes an optimal alignment of the best segment of

similarity between two sequences.

ADLG CDRYFQ

|||| |||| |

ADLG CDRYYQ

• Can return a number of highly aligned segments.

39

Global Alignment: Algorithm

1..j1..i T and S of alignment optimum of Cost),( jiC

T of jlength of Prefix

S of i length of Prefix

..1

..1

j

i

T

S

ba

babaw

if

if),(

40

)1j,i(C

)j,1i(C

)T,S(w)1j,1i(C

max)j,i(Cji

j)j,0(Ci)0,i(C

Initial conditions:

Recurrence relation: For 1 i n, 1 j m:

Theorem. C(i,j) satisfies the following relationships:

41

Example

Case 1: Line up Si with Tj

S: C A T T C A C

T: C - T T C A G

i - 1 i

j j -1

S: C A T T C A - C

T: C - T T C A G -

Case 2: Line up Si with space i - 1 i

j

S: C A T T C A C -

T: C - T T C A - G

Case 3: Line up Tj with space i

j j -1

42

Justification: Optimal Substructure Property Followed

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj-1 Tj

C(i-1,j-1) + w(Si,Tj)

S1 S2 . . . Si-1 Si

T1 T2 . . . Tj —

C(i-1,j)

S1 S2 . . . Si —

T1 T2 . . . Tj-1 Tj

C(i,j-1)

43

Computation Procedure

C(n,m)

C(0,0)

C(i,j)

)1j,i(C,)j,1i(C),T,S(w)1j,1i(Cmax)j,i(C ji

C(i-1,j) C(i-1,j-1)

C(i,j-1)

44

λ C T C G C A G C

A

C

T

T

C

A

C

+10 for match, -2 for mismatch, -5 for space

0 -5 -10 -15 -20 -25 -30 -35 -40

-5

-10

-15

-20

-25

-30

-35

10 5

λ

45

0 -5 -10 -15 -20 -25 -30 -35 -40

-5 10 5 0 -5 -10 -15 -20 -25

-10 5 8 3 -2 -7 0 -5 -10

-15 0 15 10 5 0 -5 -2 -7

-20 -5 10 13 8 3 -2 -7 -4

-25 -10 5 20 15 18 13 8 3

-30 -15 0 15 18 13 28 23 18

-35 -20 -5 10 13 28 23 26 33

λ C T C G C A G C

A

C

T

T

C

A

C

λ

Traceback can yield both optimum alignments

*

*