Pairwise Sequence Alignment: Basic Algorithms
Agenda
Previous Lesson:
- Minhala
- Biological Story on Biomolecular Sequences
- General Overview of Problems in Computational Biology
Today:
- Reminder: Dynamic Programming
- Algorithms for Global and Local Sequence Alignment + variants
- Bioinformatic Motivation for Sequence Alignment
Literature list
• Alberts, B. et al. Essential Cell Biology: An Introduction to the Molecular Biology of the Cell.
• Mount, D.W. Bioinformatics: Sequence and Genome Analysis.
• Jones, N.C. & Pevzner, P.A. An Introduction to Bioinformatics Algorithms.
• R. Durbin, S. Eddy, A. Krogh, and G. Mitchison. Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids.
• Dan Gusfield. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology.
• Move to Slides On Dynamic Programming…
Sequence Comparison (cont)
• We seek the following similarities between sequences:
• Find similar proteins – allows us to predict function & structure
• Locate similar subsequences in DNA – allows us to identify (e.g.) regulatory elements
• Locate DNA sequences that might overlap – helps in sequence assembly
Sequence Modifications
• Three types of changes
– Substitution (point mutation)
– Insertion
– Deletion
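Example, starting from TCAGT:
– Substitution: TCAGT → TCCGT
– Insertion: TCAGT → TCGAGT
– Deletion: TCAGT → TCGT
• Indel: an insertion or a deletion (e.g. from replication slippage)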
Choosing Alignments
There are many possible alignments. For example, compare:
-GCGC-ATGGATTGAGCGA
TGCGCCATTGAT-GACC-A
to
------GCGCATGGATTGAGCGA
TGCGCC----ATTGATGACCA--
Which one is better?
Another example
Given two sequences:
X: TGCATAT
Y: ATCCGAT
Question:
How can X be transformed into Y?
Or,
How did Y evolve from X?
One possible transformation
TGCATAT
TGCATA (delete T)
TGCAT (delete A)
ATGCAT (insert A)
ATCCAT (G → C)
ATCCGAT (insert G)
Alignment:
-TGC-ATAT
ATCCGAT--
5 operations
Another possible transformation
TGCATAT
ATGCATAT (insert A)
ATGCAAT (delete T)
ATGCGAT (A → G)
ATCCGAT (G → C)
Alignment:
-TGCATAT
ATCCG-AT
4 operations
Which one is better?
In order to align two sequences we need a quantitative model to evaluate similarity between sequences.
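The operation counts of the two transformations can be read directly off their alignments. A minimal sketch, assuming each alignment is given as two gapped rows of equal length (the function names are illustrative and not from the slides):

```python
def count_operations(row_x: str, row_y: str) -> dict:
    """Count the edit operations implied by an alignment of X (top row) and Y (bottom row)."""
    if len(row_x) != len(row_y):
        raise ValueError("both alignment rows must have the same length")
    counts = {"insert": 0, "delete": 0, "substitute": 0, "match": 0}
    for a, b in zip(row_x, row_y):
        if a == "-" and b == "-":
            raise ValueError("a column cannot contain two gaps")
        if a == "-":
            counts["insert"] += 1       # character present only in Y
        elif b == "-":
            counts["delete"] += 1       # character present only in X
        elif a != b:
            counts["substitute"] += 1   # point mutation
        else:
            counts["match"] += 1
    return counts


def total_operations(row_x: str, row_y: str) -> int:
    """Number of insertions, deletions and substitutions (matches cost nothing)."""
    c = count_operations(row_x, row_y)
    return c["insert"] + c["delete"] + c["substitute"]


first = total_operations("-TGC-ATAT", "ATCCGAT--")
second = total_operations("-TGCATAT", "ATCCG-AT")
print(first, second)
```

Counting operations is only one possible measure. It treats every operation as equally costly, which is the limitation the scoring model on the next slides addresses.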
How do we quantify sequence similarity?
Scoring Similarity
• Assume independent mutation model
– Each site considered separately
• Score at each site
– Positive if the same
– Negative if different
• Sum to make final score
– Can be positive or negative
– Significance depends on sequence length
GTAGTC
CTAGCG
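Under the independent mutation model the score is just a sum over sites. A minimal sketch for the pair GTAGTC / CTAGCG, assuming +1 for a match and -1 for a mismatch (these values are illustrative; the slide does not fix them):

```python
def site_score(a: str, b: str, match: int = 1, mismatch: int = -1) -> int:
    """Score of a single site: positive if the characters agree, negative otherwise."""
    return match if a == b else mismatch


def ungapped_score(x: str, y: str, match: int = 1, mismatch: int = -1) -> int:
    """Sum the per-site scores of two equal-length sequences (no gaps)."""
    if len(x) != len(y):
        raise ValueError("sequences must have the same length")
    return sum(site_score(a, b, match, mismatch) for a, b in zip(x, y))


score = ungapped_score("GTAGTC", "CTAGCG")  # 3 matching sites, 3 mismatching sites
print(score)
```

With these assumed values the three matches and three mismatches cancel out, which illustrates why the final score can be positive or negative.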
Pairwise Alignment - Identity
Human Hemoglobin (HH) vs Sperm Whale Myoglobin (SWM):
(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
      |||      |   | || |     |
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG
• Percent Identity: 36.000 (| only)
Pairwise Alignment - Similarity
(HH)  VLSPADKTNVKAAWGKVGAHAGYEG
      |||  .   |   | || |     |
(SWM) VLSEGEWQLVLHVWAKVEADVAGHG
• Percent Similarity: 40.000 (| and .)
• Percent Identity: 36.000 (| only)
D and E are similar:
1. Their structures are similar.
2. Both are acidic and hydrophilic.
3. A single mutation can change one into the other.
Pairwise Alignment – Gap insertion
(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
      |||  .   |   | || |  ||  |
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
• Gaps: 2
• Percent Similarity: 54.167
• Percent Identity: 45.833 (11/24, counted over the 24 columns without a gap)
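The identity percentages can be checked with a few lines of code. A minimal sketch, assuming identity is counted over the columns that contain no gap (this convention is inferred from the figures: 9/25 = 36.000 without gaps and 11/24 = 45.833 with the two gaps):

```python
def percent_identity(row_a: str, row_b: str) -> float:
    """Identical columns as a percentage of the columns that contain no gap."""
    if len(row_a) != len(row_b):
        raise ValueError("both alignment rows must have the same length")
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    identical = sum(1 for a, b in pairs if a == b)
    return 100.0 * identical / len(pairs)


# Ungapped and gapped HH / SWM alignments from the slides.
ungapped = percent_identity("VLSPADKTNVKAAWGKVGAHAGYEG",
                            "VLSEGEWQLVLHVWAKVEADVAGHG")
gapped = percent_identity("VLSPADKTNVKAAWGKVGAH-AGYEG",
                          "VLSEGEWQLVLHVWAKVEADVAGH-G")
print(round(ungapped, 3), round(gapped, 3))
```

Percent similarity would additionally need a table of which residue pairs count as similar (such as D and E); that table is not given here, so it is left out of the sketch.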
Pairwise Alignment - Scoring
• The final score of the alignment is the sum of the positive scores and the penalty scores:
Alignment score = (number of identities) + (number of similarities) - (number of gap insertions)
Pairwise Alignment - Scoring
(HH)  VLSPADKTNVKAAWGKVGAH-AGYEG
      |||  .   |   | || |  ||  |
(SWM) VLSEGEWQLVLHVWAKVEADVAGH-G
Final score:
(V,V) + (L,L) + (S,S) + (D,E) + … - (penalty for gap insertion)*(number of gaps) - (penalty for gap extension)*(extension length)
We are interested in both the score and the alignment trace.
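The final-score formula can be evaluated for any given alignment. A minimal sketch, assuming a hypothetical pair score of +1 for identical residues and 0 otherwise, a gap insertion penalty of 2 and a gap extension penalty of 1. It also assumes that a "gap" is a maximal run of '-' in one row and that the "extension length" counts the gap characters after the first one in each run; the slide does not spell these conventions out.

```python
def gap_runs(row: str) -> list:
    """Lengths of the maximal runs of '-' in one alignment row."""
    runs, current = [], 0
    for ch in row:
        if ch == "-":
            current += 1
        else:
            if current:
                runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs


def alignment_score(row_a, row_b, pair_score, gap_open, gap_extend):
    """Sum of pair scores minus gap insertion and gap extension penalties."""
    if len(row_a) != len(row_b):
        raise ValueError("both alignment rows must have the same length")
    total = sum(pair_score(a, b) for a, b in zip(row_a, row_b)
                if a != "-" and b != "-")
    runs = gap_runs(row_a) + gap_runs(row_b)
    number_of_gaps = len(runs)
    extension_length = sum(r - 1 for r in runs)
    return total - gap_open * number_of_gaps - gap_extend * extension_length


def identity_score(a: str, b: str) -> int:
    """Hypothetical pair score: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0


score = alignment_score("VLSPADKTNVKAAWGKVGAH-AGYEG",
                        "VLSEGEWQLVLHVWAKVEADVAGH-G",
                        identity_score, gap_open=2, gap_extend=1)
print(score)  # 11 identities, two gaps of length 1
```

In practice the pair score would come from a substitution matrix, so that similar pairs such as (D,E) also contribute positively.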
Optimum Alignment
The score of an alignment is a measure of its quality.
Optimum alignment problem: Given a pair of sequences X and Y, find an alignment (global or local) with maximum score.
The similarity between X and Y, denoted sim(X,Y), is the maximum score of an alignment of X and Y.
Computing Optimal Score
• How can we compute the optimal score?
– If |s| = n and |t| = m, the number A(m,n) of possible “legal” alignments is large! For m of the order of n:
$$A(m,n) \approx A(n,n) \ge \binom{2n}{n} \approx \frac{2^{2n}}{\sqrt{\pi n}}$$
– Stirling’s formula (Exercise): $x! \approx \sqrt{2\pi}\, x^{x+1/2} e^{-x}$
• We perform dynamic programming to compute the optimal score efficiently.
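To get a feeling for how quickly this count grows, the central binomial coefficient C(2n, n) can be compared with the estimate 2^(2n) / sqrt(pi n) that follows from Stirling's formula. A minimal sketch (the function names are illustrative; treating C(2n, n) as a lower bound on the number of alignments is the assumption here):

```python
import math


def central_binomial(n: int) -> int:
    """Exact value of C(2n, n)."""
    return math.comb(2 * n, n)


def stirling_estimate(n: int) -> float:
    """Estimate of C(2n, n) obtained from Stirling's formula: 4^n / sqrt(pi * n)."""
    return 4.0 ** n / math.sqrt(math.pi * n)


for n in (5, 10, 20, 50):
    exact = central_binomial(n)
    estimate = stirling_estimate(n)
    print(n, exact, f"{estimate:.3e}", f"ratio={exact / estimate:.4f}")
```

Already for two sequences of length 50 the count has about 29 decimal digits, so enumerating alignments one by one is not an option.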
Manhattan Tourist Problem (MTP)
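Imagine seeking a path (from source to sink) that travels only eastward and southward and visits the largest number of attractions (*) in the Manhattan grid.
[Figure: Manhattan grid with attractions marked *, the source at the top-left corner and the sink at the bottom-right corner]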
Manhattan Tourist Problem: Formulation
Goal: Find the longest (highest scoring) path in a weighted grid.
Input: A weighted grid G with two distinct vertices, one labeled “source” and the other labeled “sink”
Output: A longest path in G from “source” to “sink”
MTP: An Example
[Figure: a weighted grid from source to sink, with vertices indexed by an i coordinate (rows) and a j coordinate (columns)]
MTP: Greedy Algorithm Is Not Optimal
Promising start, but leads to bad choices!
[Figure: weighted grid from source to sink comparing a greedy path with a better path; the two path totals shown are 18 and 22]
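The failure of the greedy strategy is easy to reproduce. The sketch below is an illustration under assumptions: the edge weights are not those of the figure above but a reconstruction chosen to agree with the vertex scores listed on the dynamic programming slides that follow (ending in S3,3 = 16), and a few weights that are not legible in the transcript were filled in to match those scores. The greedy walker always takes the heavier of the two outgoing edges; the exhaustive search tries every path.

```python
# RIGHT[i][j]: weight of the edge (i, j) -> (i, j+1)
# DOWN[i][j]:  weight of the edge (i, j) -> (i+1, j)
# Assumed weights, reconstructed to match the scores on the DP slides.
RIGHT = [[1, 2, 5], [-5, 1, -5], [-5, 3, 3], [0, 0, 2]]
DOWN = [[5, 3, 10, 0], [3, 5, -3, -5], [0, 0, -5, 1]]


def greedy_path_score(right, down):
    """Walk from source to sink, always following the heavier outgoing edge."""
    rows, cols = len(right), len(down[0])
    i = j = total = 0
    while i < rows - 1 or j < cols - 1:
        can_right = j < cols - 1
        can_down = i < rows - 1
        if can_right and (not can_down or right[i][j] >= down[i][j]):
            total += right[i][j]
            j += 1
        else:
            total += down[i][j]
            i += 1
    return total


def best_path_score(right, down, i=0, j=0):
    """Exhaustive search over all source-to-sink paths (exponential time)."""
    rows, cols = len(right), len(down[0])
    if i == rows - 1 and j == cols - 1:
        return 0
    options = []
    if j < cols - 1:
        options.append(right[i][j] + best_path_score(right, down, i, j + 1))
    if i < rows - 1:
        options.append(down[i][j] + best_path_score(right, down, i + 1, j))
    return max(options)


print(greedy_path_score(RIGHT, DOWN), best_path_score(RIGHT, DOWN))
```

On this grid the greedy walker is lured down the first column by the edge of weight 5 and finishes well below the optimum. The exhaustive search finds the optimum but does not scale, which motivates dynamic programming.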
MTP: Dynamic Programming
• Calculate the optimal path score for each vertex in the graph
• Each vertex’s score is the maximum, over its prior vertices, of the prior vertex’s score plus the weight of the edge in between
S1,0 = 5
S0,1 = 1
MTP: Dynamic Programming (cont’d)
S2,0 = 8
S1,1 = 4
S0,2 = 3
MTP: Dynamic Programming (cont’d)
S3,0 = 8
S2,1 = 9
S1,2 = 13
S0,3 = 8
MTP: Dynamic Programming (cont’d)
greedy alg. fails!
S3,1 = 9
S2,2 = 12
S1,3 = 8
MTP: Dynamic Programming (cont’d)
S3,2 = 9
S2,3 = 15
MTP: Dynamic Programming (cont’d)
S3,3 = 16
(showing all back-traces)
Done!
MTP: Recurrence
Computing the score for a point (i,j) by the recurrence relation:
$$s_{i,j} = \max \begin{cases} s_{i-1,j} + \text{weight of the edge between } (i-1,j) \text{ and } (i,j) \\ s_{i,j-1} + \text{weight of the edge between } (i,j-1) \text{ and } (i,j) \end{cases}$$
The running time is O(n × m) for an n by m grid
(n = # of rows, m = # of columns)
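The recurrence translates almost line by line into code. A minimal sketch, assuming the grid is given as two weight tables (the names right and down are illustrative). The example weights are a reconstruction chosen so that the computed scores agree with the values S1,0 = 5 through S3,3 = 16 shown on the preceding slides; weights not legible in the transcript were filled in to match.

```python
def manhattan_tourist(right, down):
    """Return the table s, where s[i][j] is the best path score from (0, 0) to (i, j).

    right[i][j] is the weight of the edge (i, j) -> (i, j+1),
    down[i][j]  is the weight of the edge (i, j) -> (i+1, j).
    """
    rows, cols = len(right), len(down[0])
    s = [[0] * cols for _ in range(rows)]
    for j in range(1, cols):            # first row: only eastward moves
        s[0][j] = s[0][j - 1] + right[0][j - 1]
    for i in range(1, rows):            # first column: only southward moves
        s[i][0] = s[i - 1][0] + down[i - 1][0]
    for i in range(1, rows):
        for j in range(1, cols):
            s[i][j] = max(s[i - 1][j] + down[i - 1][j],
                          s[i][j - 1] + right[i][j - 1])
    return s


# Assumed weights, reconstructed to match the scores on the slides.
right = [[1, 2, 5], [-5, 1, -5], [-5, 3, 3], [0, 0, 2]]
down = [[5, 3, 10, 0], [3, 5, -3, -5], [0, 0, -5, 1]]
s = manhattan_tourist(right, down)
for row in s:
    print(row)
```

Every vertex is visited once and looks at two predecessors, which gives the O(n × m) running time stated above.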
What about diagonals?
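• The score at point B is given by:
$$s_B = \max \begin{cases} s_{A_1} + \text{weight of the edge } (A_1, B) \\ s_{A_2} + \text{weight of the edge } (A_2, B) \\ s_{A_3} + \text{weight of the edge } (A_3, B) \end{cases}$$
[Figure: vertex B with incoming edges from its three neighbours A1, A2 and A3]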
Adding Diagonal Edges to the Grid
More generally, computing the score for point x is given by the recurrence relation:
$$s_x = \max_{y \in \mathrm{Predecessors}(x)} \left( s_y + \text{weight of the edge } (y, x) \right)$$
• Predecessors(x) – set of vertices that have edges leading to x
Traveling in the Grid
• The only hitch is that one must decide on the order in which to visit the vertices
• By the time the vertex x is analyzed, the values s_y for all its predecessors y should be computed – otherwise we are in trouble.
• We need to traverse the vertices in some order
• Try to find such an order for a directed acyclic grid graph
Traversing the Manhattan Grid
• 3 different strategies:
• a) Column by column
• b) Row by row
• c) Along diagonals
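Any order in which every vertex comes after all of its predecessors will do, that is, a topological order. A minimal sketch for a general directed acyclic graph, assuming the graph is given as a mapping from each vertex to its weighted predecessors (this input format and the example weights are illustrative). It relies on graphlib.TopologicalSorter from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def longest_path_scores(predecessors, source):
    """Best path score from source to every reachable vertex of a DAG.

    predecessors maps each vertex x to a dict {y: weight of the edge (y, x)}.
    """
    # TopologicalSorter expects {node: set of predecessors} and yields
    # every node only after all of its predecessors.
    graph = {x: set(preds) for x, preds in predecessors.items()}
    score = {}
    for x in TopologicalSorter(graph).static_order():
        if x == source:
            score[x] = 0
            continue
        candidates = [score[y] + w
                      for y, w in predecessors.get(x, {}).items()
                      if y in score]          # ignore unreachable predecessors
        if candidates:
            score[x] = max(candidates)
    return score


# A 2 x 2 grid with one diagonal edge from (0, 0) to (1, 1); weights are made up.
predecessors = {
    (0, 1): {(0, 0): 1},
    (1, 0): {(0, 0): 5},
    (1, 1): {(0, 1): 3, (1, 0): -5, (0, 0): 6},
}
score = longest_path_scores(predecessors, source=(0, 0))
print(score[(1, 1)])
```

For a grid, each of the three strategies listed above (column by column, row by row, along diagonals) is one such order, so no general-purpose sorter is needed there.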
Comparison methods
• Global alignment – Finds the best alignment across the whole two sequences.
• Local alignment – Finds regions of similarity in parts of the sequences.
[Figure: schematic of a global alignment spanning both sequences next to a local alignment covering only the matching segments]
Global Alignment
• Algorithm of Needleman and Wunsch (1970)
• Finds the alignment of two complete sequences:
ADLGAVFALCDRYFQ
||||     |||| |
ADLGRTQN-CDRYYQ
• Some global alignment programs “trim ends”
Local Alignment
• Algorithm of Smith and Waterman (1981).
• Makes an optimal alignment of the best segment of
similarity between two sequences.
ADLG CDRYFQ
|||| |||| |
ADLG CDRYYQ
• Can return a number of highly aligned segments.
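Global Alignment: Algorithm
$C(i,j)$ = cost (score) of an optimum alignment of $S_{1..i}$ and $T_{1..j}$, where
$S_{1..i}$ = prefix of length $i$ of $S$
$T_{1..j}$ = prefix of length $j$ of $T$
$w(a,b)$ = score of aligning character $a$ with character $b$; it takes one value if $a = b$ and another if $a \ne b$
$g$ = score of aligning a character with a space (-5 in the example that follows)
Theorem. $C(i,j)$ satisfies the following relationships:
Initial conditions:
$$C(i,0) = i \cdot g, \qquad C(0,j) = j \cdot g$$
Recurrence relation: For $1 \le i \le n$, $1 \le j \le m$:
$$C(i,j) = \max \begin{cases} C(i-1,j-1) + w(S_i,T_j) \\ C(i-1,j) + g \\ C(i,j-1) + g \end{cases}$$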
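The initial conditions and the recurrence can be turned into a table-filling routine. A minimal sketch, assuming the scoring values of the worked example later in these slides (+10 for a match, -2 for a mismatch, -5 for a space) as defaults; the function name is illustrative.

```python
def global_alignment_table(s, t, match=10, mismatch=-2, space=-5):
    """Fill the table C, where C[i][j] is the best score for s[:i] versus t[:j]."""
    n, m = len(s), len(t)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):           # prefix of s aligned with spaces only
        c[i][0] = i * space
    for j in range(1, m + 1):           # prefix of t aligned with spaces only
        c[0][j] = j * space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            c[i][j] = max(c[i - 1][j - 1] + w,    # line up S_i with T_j
                          c[i - 1][j] + space,    # line up S_i with a space
                          c[i][j - 1] + space)    # line up T_j with a space
    return c


table = global_alignment_table("CATTCAC", "CTCGCAGC")
print(table[-1][-1])  # score of an optimum global alignment
```

The table has (n+1) × (m+1) entries and each one takes constant time, so the score is found in O(n × m) time.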
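Example
Case 1: Line up S_i with T_j (the columns before it align S_1..S_{i-1} with T_1..T_{j-1})
S: C A T T C A C
T: C - T T C A G
Case 2: Line up S_i with space (the columns before it align S_1..S_{i-1} with T_1..T_j)
S: C A T T C A - C
T: C - T T C A G -
Case 3: Line up T_j with space (the columns before it align S_1..S_i with T_1..T_{j-1})
S: C A T T C A C -
T: C - T T C A - G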
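Justification: Optimal Substructure Property
Case 1:
S_1 S_2 . . . S_{i-1} S_i
T_1 T_2 . . . T_{j-1} T_j
C(i-1,j-1) + w(S_i,T_j)
Case 2:
S_1 S_2 . . . S_{i-1} S_i
T_1 T_2 . . . T_j     -
C(i-1,j) + g
Case 3:
S_1 S_2 . . . S_i     -
T_1 T_2 . . . T_{j-1} T_j
C(i,j-1) + g
(g is the score of aligning a character with a space)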
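Computation Procedure
Fill the table from C(0,0) to C(n,m). Each entry C(i,j) depends only on its three neighbours C(i-1,j-1), C(i-1,j) and C(i,j-1):
$$C(i,j) = \max\{\, C(i-1,j-1) + w(S_i,T_j),\; C(i-1,j) + g,\; C(i,j-1) + g \,\}$$
where g is the score of aligning a character with a space.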
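Example: S = CATTCAC (rows), T = CTCGCAGC (columns)
+10 for match, -2 for mismatch, -5 for space
Initial table, with the first two entries of row 1 filled in:
     λ    C    T    C    G    C    A    G    C
λ    0   -5  -10  -15  -20  -25  -30  -35  -40
C   -5   10    5
A  -10
T  -15
T  -20
C  -25
A  -30
C  -35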
Completed table:
     λ    C    T    C    G    C    A    G    C
λ    0   -5  -10  -15  -20  -25  -30  -35  -40
C   -5   10    5    0   -5  -10  -15  -20  -25
A  -10    5    8    3   -2   -7    0   -5  -10
T  -15    0   15   10    5    0   -5   -2   -7
T  -20   -5   10   13    8    3   -2   -7   -4
C  -25  -10    5   20   15   18   13    8    3
A  -30  -15    0   15   18   13   28   23   18
C  -35  -20   -5   10   13   28   23   26   33
Traceback can yield both optimum alignments
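The traceback can be sketched as a recursive walk from C(n,m) back to C(0,0) that follows every predecessor that attains the maximum. This is a minimal illustration, assuming the scoring of this example (+10 match, -2 mismatch, -5 space); the function name is illustrative.

```python
def all_optimal_alignments(s, t, match=10, mismatch=-2, space=-5):
    """Return the optimum score and every alignment that attains it."""
    n, m = len(s), len(t)
    c = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        c[i][0] = i * space
    for j in range(1, m + 1):
        c[0][j] = j * space
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            w = match if s[i - 1] == t[j - 1] else mismatch
            c[i][j] = max(c[i - 1][j - 1] + w,
                          c[i - 1][j] + space,
                          c[i][j - 1] + space)

    results = []

    def walk(i, j, row_s, row_t):
        """Follow every back-pointer; rows are built right to left."""
        if i == 0 and j == 0:
            results.append((row_s[::-1], row_t[::-1]))
            return
        if i > 0 and j > 0:
            w = match if s[i - 1] == t[j - 1] else mismatch
            if c[i][j] == c[i - 1][j - 1] + w:
                walk(i - 1, j - 1, row_s + s[i - 1], row_t + t[j - 1])
        if i > 0 and c[i][j] == c[i - 1][j] + space:
            walk(i - 1, j, row_s + s[i - 1], row_t + "-")
        if j > 0 and c[i][j] == c[i][j - 1] + space:
            walk(i, j - 1, row_s + "-", row_t + t[j - 1])

    walk(n, m, "", "")
    return c[n][m], results


best, alignments = all_optimal_alignments("CATTCAC", "CTCGCAGC")
print(best)
for row_s, row_t in alignments:
    print(row_s)
    print(row_t)
    print()
```

For this example the walk branches once, so two alignments are reported. In general the number of optimal alignments can be very large, and practical tools usually report only one of them.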