DNA, RNA and protein are an alien language ... We try to cryptographically attack this language ... we
want to decipher both its meaning and its history …
We do not have to understand the language to identify patterns:
“klaatu barada nikto”
Fortunately, the genetic code is alphabetic … which makes it amenable to string comparison and
pattern recognition
Pairwise Sequence Alignment
• Principles of pairwise sequence comparison
  • global / local alignments
  • scoring systems
  • gap penalties
• Methods of pairwise sequence alignment
  • window-based methods
  • dynamic programming approaches
Pairwise Sequence Alignment
A T T C A C A T A
T A C A T T A C G T A C
Sequence 1
Sequence 2
Pairwise Sequence Alignment: How to?
Dotplot:
[Figure: dotplot with T A C A T T A C G T A C on the horizontal axis and A T T C A C A T A on the vertical axis]
A dotplot gives an overview of all possible alignments
Dotplot:
[Figure: dotplot of the same two sequences, illustrating one possible alignment]
In a dotplot each diagonal corresponds to a possible (ungapped) alignment
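The dotplot idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `dotplot`, `render` and `diagonal_matches` are made up for this example, and the two sequences are the ones shown on the axes above.

```python
def dotplot(horizontal, vertical):
    """Return rows of booleans: True where a vertical residue equals a horizontal one."""
    return [[h == v for h in horizontal] for v in vertical]


def render(horizontal, vertical):
    """Draw the dotplot as text, '*' for a match and '.' otherwise."""
    lines = ["  " + " ".join(horizontal)]
    for v, row in zip(vertical, dotplot(horizontal, vertical)):
        lines.append(v + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(lines)


def diagonal_matches(horizontal, vertical, offset):
    """Count matches on one diagonal: vertical[k] against horizontal[k + offset]."""
    return sum(
        1
        for k, v in enumerate(vertical)
        if 0 <= k + offset < len(horizontal) and horizontal[k + offset] == v
    )


horizontal = "TACATTACGTAC"
vertical = "ATTCACATA"
print(render(horizontal, vertical))
print(diagonal_matches(horizontal, vertical, offset=3))
```

Each value of `offset` selects one diagonal, that is, one ungapped alignment of the two sequences; diagonals with many matches are the candidates worth looking at.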
Window-based Approaches
• Word Size
• Window / Stringency
Word Size Algorithm
[Figure: dotplot of T A C G G T A T G against A C A G T A T C, built up one word at a time]
Word Size = 3
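A minimal sketch of the word-size method, assuming that a dot is placed wherever two words of the chosen length are exactly identical. The function name `word_matches` and the index-based lookup are illustrative choices, not taken from any particular program.

```python
def word_matches(seq1, seq2, word_size=3):
    """Return (i, j) start positions where seq1[i:i+w] equals seq2[j:j+w] exactly."""
    # Index every word of seq1 by its start position.
    index = {}
    for i in range(len(seq1) - word_size + 1):
        index.setdefault(seq1[i:i + word_size], []).append(i)
    # Look up every word of seq2 in that index.
    hits = []
    for j in range(len(seq2) - word_size + 1):
        for i in index.get(seq2[j:j + word_size], []):
            hits.append((i, j))
    return sorted(hits)


print(word_matches("TACGGTATG", "ACAGTATC", word_size=3))
```

For the two sequences of the slide, the only shared words of length 3 are GTA and TAT, so the sketch reports two neighbouring dots on the same diagonal.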
Window / Stringency
[Figure: dotplot of T A C G G T A T G against T C A G T A T C, built up one window at a time]
Window = 5 / Stringency = 4
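The window/stringency rule can be sketched in the same way, assuming that a dot is drawn at the start of a window whenever at least `stringency` of its `window` positions are identical. The name `window_matches` is hypothetical.

```python
def window_matches(seq1, seq2, window=5, stringency=4):
    """Return (i, j) window starts with at least `stringency` identical positions."""
    hits = []
    for i in range(len(seq1) - window + 1):
        for j in range(len(seq2) - window + 1):
            identical = sum(
                a == b for a, b in zip(seq1[i:i + window], seq2[j:j + window])
            )
            if identical >= stringency:
                hits.append((i, j))
    return hits


print(window_matches("TACGGTATG", "TCAGTATC", window=5, stringency=4))
```

With the slide's sequences this reports three window starts on one diagonal, each window tolerating a single mismatch.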
Considerations
• The window/stringency method is more sensitive than the word-size method (mismatches within a window are permitted).
• The smaller the window, the larger the weight of chance (nonspecific) matches.
• With large windows the sensitivity for short sequences is reduced.
• Insertions/deletions are not treated explicitly.
Insertions / Deletions in a Dotplot
[Figure: dotplot with T A C T G T T C A T on the horizontal axis and T A C T G T C A T on the vertical axis]
T A C T G - T C A T
| | | | |   | | | |
T A C T G T T C A T
[Figure: dotplot comparing hemoglobin chains (Window = 130 / Stringency = 9). Output of the programs Compare and DotPlot]
[Figure: dotplot comparing hemoglobin chains (Window = 18 / Stringency = 10). Output of the programs Compare and DotPlot]
• Principles of pairwise sequence comparison
  • global / local alignments
  • scoring systems
  • gap penalties
• Methods of pairwise sequence alignment
  • window-based approaches
  • dynamic programming approaches
    • Needleman and Wunsch
    • Smith and Waterman
Pairwise Sequence Alignment
Dynamic Programming
An automatic procedure that finds the best alignment, i.e. the one
with the optimal score for the chosen parameters.
Recursive solutions: we solve smaller problems first and
use those solutions to solve larger problems. Intermediate
solutions are stored in a table (matrix).
Basic principles of dynamic programming
- Initialization of alignment matrix: the scoring model
- Stepwise calculation of score values
(creation of an alignment path matrix)
- Backtracking (evaluation of the optimal path)
Initialization of Matrix (BLOSUM 50): a similarity score matrix
      H    E    A    G    A    W    G    H    E    E
P    -2   -1   -1   -2   -1   -4   -2   -2   -1   -1
A    -2   -1    5    0    5   -3    0   -2   -1   -1
W    -3   -3   -3   -3   -3   15   -3   -3   -3   -3
H    10    0   -2   -2   -2   -3   -2   10    0    0
E     0    6   -1   -3   -1   -3   -3    0    6    6
A    -2   -1    5    0    5   -3    0   -2   -1   -1
E     0    6   -1   -3   -1   -3   -3    0    6    6
Needleman and Wunsch (global alignment)
Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E
Scoring parameters: BLOSUM50 matrix
Gap penalty: Linear gap penalty of 8
Creation of an alignment path matrix
Idea: Build up an optimal alignment using previous solutions for
optimal alignments of smaller subsequences
• Construct matrix F indexed by i and j (one index for each sequence)
• F(i,j) is the score of the best alignment between the initial segment x1...xi of x and the initial segment y1...yj of y
• Build F(i,j) recursively beginning with F(0,0) = 0
Optimal global alignment:
H E A G A W G H E - E
- - P - A W - H E A E
           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P    -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A   -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W   -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H   -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E   -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A   -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E   -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Creation of an alignment path matrix
Optimal global alignment:
H E A G A W G H E - E
- - P - A W - H E A E
$$
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$
[Figure: cell F(i, j) is reached from F(i-1, j-1) with score s(xi, yj), and from F(i-1, j) or F(i, j-1) with penalty -d]
Creation of an alignment path matrix
• If F(i-1,j-1), F(i-1,j) and F(i,j-1) are known we can calculate F(i,j)
• Three possibilities:
• xi and yj are aligned, F(i,j) = F(i-1,j-1) + s(xi ,yj)
• xi is aligned to a gap, F(i,j) = F(i-1,j) - d
• yj is aligned to a gap, F(i,j) = F(i,j-1) - d
• The best score up to (i,j) will be the largest of the three options
Creation of an alignment path matrix
Boundary conditions:
F(i, 0) = -i d
F(0, j) = -j d
           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P    -8
A   -16
W   -24
H   -32
E   -40
A   -48
E   -56
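The boundary conditions can be sketched as follows, assuming that i indexes Sequence 1 (x), j indexes Sequence 2 (y), and d = 8 as in the example. The helper name `init_matrix` is made up for this illustration; cells that are not on the border are left as `None`.

```python
def init_matrix(x, y, d):
    """Create F with only the borders filled: F[i][0] = -i*d and F[0][j] = -j*d."""
    F = [[None] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        F[i][0] = -i * d
    for j in range(len(y) + 1):
        F[0][j] = -j * d
    return F


F = init_matrix("HEAGAWGHEE", "PAWHEAE", d=8)
print([F[i][0] for i in range(11)])  # border along Sequence 1
print(F[0])                          # border along Sequence 2
```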
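Creation of an alignment path matrix
Stepwise calculation of score values
           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P    -8   -2   -9
A   -16  -10   -3
W   -24
H   -32
E   -40
A   -48
E   -56
$$
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$
Scores used: s(H,P) = -2, s(E,P) = -1, s(H,A) = -2, s(E,A) = -1
$$
F(1,1) = \max
\begin{cases}
F(0,0) + s(x_1, y_1) = 0 - 2 = -2 \\
F(0,1) - d = -8 - 8 = -16 \\
F(1,0) - d = -8 - 8 = -16
\end{cases}
= -2
$$
$$
F(2,1) = \max
\begin{cases}
F(1,0) + s(x_2, y_1) = -8 - 1 = -9 \\
F(1,1) - d = -2 - 8 = -10 \\
F(2,0) - d = -16 - 8 = -24
\end{cases}
= -9
$$
$$
F(1,2) = \max
\begin{cases}
F(0,1) + s(x_1, y_2) = -8 - 2 = -10 \\
F(0,2) - d = -16 - 8 = -24 \\
F(1,1) - d = -2 - 8 = -10
\end{cases}
= -10
$$
$$
F(2,2) = \max
\begin{cases}
F(1,1) + s(x_2, y_2) = -2 - 1 = -3 \\
F(1,2) - d = -10 - 8 = -18 \\
F(2,1) - d = -9 - 8 = -17
\end{cases}
= -3
$$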
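The four worked cells can be checked with a short script. It assumes d = 8 and uses only the four substitution scores quoted above; the dictionary layout is an illustrative choice.

```python
d = 8
# Substitution scores from the worked example, keyed as (x_i, y_j).
s = {("H", "P"): -2, ("E", "P"): -1, ("H", "A"): -2, ("E", "A"): -1}
x, y = "HE", "PA"

# Border values F(i, 0) = -i*d and F(0, j) = -j*d.
F = {(0, 0): 0, (1, 0): -8, (2, 0): -16, (0, 1): -8, (0, 2): -16}

for j in (1, 2):
    for i in (1, 2):
        F[i, j] = max(
            F[i - 1, j - 1] + s[x[i - 1], y[j - 1]],  # x_i aligned to y_j
            F[i - 1, j] - d,                          # x_i aligned to a gap
            F[i, j - 1] - d,                          # y_j aligned to a gap
        )

print(F[1, 1], F[2, 1], F[1, 2], F[2, 2])
```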
Backtracking
           H    E    A    G    A    W    G    H    E    E
      0   -8  -16  -24  -32  -40  -48  -56  -64  -72  -80
P    -8   -2   -9  -17  -25  -33  -41  -49  -57  -65  -73
A   -16  -10   -3   -4  -12  -20  -28  -36  -44  -52  -60
W   -24  -18  -11   -6   -7  -15   -5  -13  -21  -29  -37
H   -32  -14  -18  -13   -8   -9  -13   -7   -3  -11  -19
E   -40  -22   -8  -16  -16   -9  -12  -15   -7    3   -5
A   -48  -30  -16   -3  -11  -11  -12  -12  -15   -5    2
E   -56  -38  -24  -11   -6  -12  -14  -15  -12   -9    1
Backtracking path, from the final cell back to F(0,0): 1, -5, 3, -3, -13, -5, -20, -25, -17, -16, -8, 0
Optimal global alignment:
H E A G A W G H E - E
- - P - A W - H E A E
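The three steps (initialization, stepwise filling, backtracking) can be put together in a compact sketch. It is written under a few assumptions: the score table holds only the BLOSUM50 values shown earlier for this example, ties during backtracking are resolved in the order diagonal, gap in y, gap in x, and the names `s` and `needleman_wunsch` are illustrative.

```python
# BLOSUM50 values for the residues of this example only (rows: y, columns: x).
COLS = "HEAGW"
ROWS = {"P": (-2, -1, -1, -2, -4), "A": (-2, -1, 5, 0, -3), "W": (-3, -3, -3, -3, 15),
        "H": (10, 0, -2, -2, -3), "E": (0, 6, -1, -3, -3)}


def s(a, b):
    """Score for residue a of x against residue b of y."""
    return ROWS[b][COLS.index(a)]


def needleman_wunsch(x, y, d=8):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # boundary: F(i, 0) = -i*d
        F[i][0] = -i * d
    for j in range(1, m + 1):          # boundary: F(0, j) = -j*d
        F[0][j] = -j * d
    for i in range(1, n + 1):          # stepwise calculation
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
    ax, ay, i, j = [], [], n, m        # backtracking from F(n, m)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("HEAGAWGHEE", "PAWHEAE")
print(score)
print(top)
print(bottom)
```

Several paths can share the optimal score; a different tie-breaking order may return a different alignment with the same score.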
Smith and Waterman (local alignment)
Two differences:
1. A fourth option, 0, is added to the recurrence:
$$
F(i,j) = \max
\begin{cases}
0 \\
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - d \\
F(i,j-1) - d
\end{cases}
$$
2. An alignment can now end anywhere in the matrix
Example:
Sequence 1: H E A G A W G H E E
Sequence 2: P A W H E A E
Scoring parameters: Log-odds ratios
Gap penalty: Linear gap penalty of 8
Smith-Waterman alignment
           H    E    A    G    A    W    G    H    E    E
      0    0    0    0    0    0    0    0    0    0    0
P     0    0    0    0    0    0    0    0    0    0    0
A     0    0    0    5    0    5    0    0    0    0    0
W     0    0    0    0    2    0   20   12    4    0    0
H     0   10    2    0    0    0   12   18   22   14    6
E     0    2   16    8    0    0    4   10   18   28   20
A     0    0    8   21   13    5    0    4   10   20   27
E     0    0    6   13   18   12    4    0    4   16   26
Backtracking path, from the highest score back to the first 0: 28, 22, 12, 20, 5, 0
Optimal local alignment:
A W G H E
A W - H E
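A local-alignment sketch along the same lines, assuming the BLOSUM50 values shown earlier for this example and a linear gap penalty of 8. The names `s` and `smith_waterman` are illustrative; backtracking starts at the highest cell and stops at the first 0.

```python
# BLOSUM50 values for the residues of this example only (rows: y, columns: x).
COLS = "HEAGW"
ROWS = {"P": (-2, -1, -1, -2, -4), "A": (-2, -1, 5, 0, -3), "W": (-3, -3, -3, -3, 15),
        "H": (10, 0, -2, -2, -3), "E": (0, 6, -1, -3, -3)}


def s(a, b):
    """Score for residue a of x against residue b of y."""
    return ROWS[b][COLS.index(a)]


def smith_waterman(x, y, d=8):
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # borders stay 0
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:                  # remember the highest cell
                best, bi, bj = F[i][j], i, j
    ax, ay, i, j = [], [], bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:      # stop at the first 0
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = smith_waterman("HEAGAWGHEE", "PAWHEAE")
print(score)
print(top)
print(bottom)
```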
Extended Smith & Waterman
To get multiple local alignments:
• delete regions around best path
• repeat backtracking
           H    E    A    G    A    W    G    H    E    E
      0    0    0    0    0    0    0    0    0    0    0
P     0    0    0    0    .    0    0    0    0    0    0
A     0    0    0    5    0    .    0    0    0    0    0
W     0    0    0    0    2    0    .    .    .    0    0
H     0   10    2    0    0    0    .    .    .    .    .
E     0    2   16    8    0    0    .    .    .    .    .
A     0    0    8   21   13    5    0    .    .    .    .
E     0    0    6   13   18   12    4    0    .    .    .
(cells marked . belong to the deleted region around the best path)
Backtracking path in the remaining matrix: 21, 16, 10, 0
Second best local alignment:
H E A
H E A
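One way to read "delete regions around best path" is sketched below. It is an assumption, not necessarily the procedure behind the slide: the cells on the best path are forced to 0, the matrix is refilled, and backtracking is repeated. The names `s` and `local_alignment` and the `blocked` parameter are hypothetical.

```python
# BLOSUM50 values for the residues of this example only (rows: y, columns: x).
COLS = "HEAGW"
ROWS = {"P": (-2, -1, -1, -2, -4), "A": (-2, -1, 5, 0, -3), "W": (-3, -3, -3, -3, 15),
        "H": (10, 0, -2, -2, -3), "E": (0, 6, -1, -3, -3)}


def s(a, b):
    """Score for residue a of x against residue b of y."""
    return ROWS[b][COLS.index(a)]


def local_alignment(x, y, d=8, blocked=frozenset()):
    """Smith-Waterman fill and backtrack; cells in `blocked` are kept at 0."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if (i, j) in blocked:
                continue                      # deleted cell stays 0
            F[i][j] = max(0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, bi, bj = F[i][j], i, j
    ax, ay, path, i, j = [], [], set(), bi, bj
    while i > 0 and j > 0 and F[i][j] > 0:
        path.add((i, j))                      # remember the cells used
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay)), path


x, y = "HEAGAWGHEE", "PAWHEAE"
first = local_alignment(x, y)
second = local_alignment(x, y, blocked=first[3])
print(first[:3])
print(second[:3])
```

Refilling the matrix matters: cells that were only reachable through the best path lose their score, so the second pass does not simply return a fragment of the first alignment.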
Further Extensions of Dynamic Programming
• Overlap matches
• Alignment with affine gap scores
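For reference, the two gap-scoring schemes are commonly written as below for a gap of length g. The gap-extension penalty e is a symbol introduced here as an assumption; only the gap penalty d appears in the examples above.

$$
\gamma(g) = -g\,d \quad \text{(linear)}, \qquad
\gamma(g) = -d - (g-1)\,e \quad \text{(affine, usually with } e < d\text{)}
$$

With an affine score, opening a gap costs d and each further position costs only e, so one long gap is penalized less than several short ones.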
• Pairwise sequence comparison
  • global / local alignments
  • parameters
  • scoring systems
  • insertions / deletions
• Methods of pairwise sequence alignment
  • dotplot
  • window-based methods
  • dynamic programming
  • algorithm complexity
Pairwise Sequence Alignment
End.of.pa.irwise..sequence | | | | | align.ment.cours.e
Multiple Alignment
Progressive Alignment: step 1. Methods of Pairwise Comparison
Programs perform global alignments:
• Needleman & Wunsch: (Pileup, Tree, Clustal)
• Word Size Method: (Clustal)
• X. Huang (MAlign) (modified N-W)
Progressive Alignment: step 2. Construction of a Guide Tree
[Figure: 5 × 5 matrix with sequences 1 to 5 along both axes]
Similarity Matrix: displays scores of all sequence pairs.
The similarity matrix is transformed into a distance matrix.
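How the similarity matrix is transformed is not specified here. A minimal sketch, assuming the similarities are fractional identities between 0 and 1 and taking distance = 1 - similarity; both the formula and the numbers below are hypothetical.

```python
def similarity_to_distance(sim):
    """Turn fractional identities (0..1) into distances using d = 1 - s."""
    return [
        [0.0 if i == j else 1.0 - value for j, value in enumerate(row)]
        for i, row in enumerate(sim)
    ]


# Hypothetical similarities for three sequences (not taken from the slides).
similarity = [
    [1.00, 0.75, 0.50],
    [0.75, 1.00, 0.25],
    [0.50, 0.25, 1.00],
]
for row in similarity_to_distance(similarity):
    print(row)
```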
[Figure: the distance matrix is turned into a guide tree with sequences 1 to 5 as leaves]
Neighbour-Joining Method or
UPGMA (unweighted pair group method of arithmetic averages)
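A compact UPGMA sketch, assuming a symmetric distance matrix as input and returning the guide tree as nested tuples. The function name `upgma`, the labels and the distances are hypothetical, and branch lengths are not tracked.

```python
def upgma(dist, labels):
    """Build a guide tree (nested tuples) by repeatedly joining the closest clusters."""
    clusters = {i: (labels[i], 1) for i in range(len(labels))}  # id -> (subtree, size)
    d = {(i, j): dist[i][j] for i in clusters for j in clusters if i < j}
    next_id = len(labels)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)                 # closest pair of clusters
        (ta, na), (tb, nb) = clusters.pop(a), clusters.pop(b)
        for c in clusters:
            dac = d.pop((min(a, c), max(a, c)))
            dbc = d.pop((min(b, c), max(b, c)))
            # Size-weighted mean = arithmetic average over all member pairs.
            d[(c, next_id)] = (na * dac + nb * dbc) / (na + nb)
        del d[(a, b)]
        clusters[next_id] = ((ta, tb), na + nb)
        next_id += 1
    return next(iter(clusters.values()))[0]


# Hypothetical distances for four sequences (not taken from the slides).
labels = ["seq1", "seq2", "seq3", "seq4"]
dist = [
    [0, 2, 6, 10],
    [2, 0, 6, 10],
    [6, 6, 0, 8],
    [10, 10, 8, 0],
]
print(upgma(dist, labels))
```

The nesting of the returned tuples gives the order in which sequences and groups would be aligned in the next step.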
Progressive Alignment: step 3. Multiple Alignment
[Figure: guide tree with sequences 1 to 5 as leaves]
T T A C T T C C A G G
Columns - once aligned - are never changed
T T A C T T C C A G G

G T C C G - C A G G
T T - C G C C - G G

T T A C T T C C A G G
G T C C G - - C A G G
T T - C G C - C - G G
Columns - once aligned - are never changed
. . . . and new gaps are inserted.
T T A C T T C C A G G
G T C C G - - C A G G
T T - C G C - C - G G

A T C T - - C A A T
C T G T C C C T A G

G T C C G - - C A G G
T T - C G C - C - G G
A T C - T - - C A A T
C T G - T C C C T A G
Sub-sequence alignments
A K-means like clustering problem
Clustering resulting model
Clustering predictions
Assignments
• Describe a pairwise alignment with a different gap penalty.
• Provide an example and perform a multiple global alignment. Describe the recipe.
• Provide an example and perform a multiple alignment of subsequences. Describe the recipe.
• Algorithm order (polynomial, exponential, NP)
Algorithmic Complexity
How does an algorithm's performance in CPU time and required memory storage scale with the size of the problem?
Needleman & Wunsch
• Storing (n+1) × (m+1) numbers
• Each number costs a constant number of calculations to compute (three sums and a max)
• Algorithm takes O(nm) memory and O(nm) time
• Since n and m are usually comparable: O(n^2)
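The counts behind these bullets can be made concrete with a tiny helper. `fill_cost` is a made-up name; it just evaluates the two products for sequence lengths n and m.

```python
def fill_cost(n, m):
    """Return (numbers stored, cells computed with the three-way max)."""
    return (n + 1) * (m + 1), n * m


print(fill_cost(10, 7))  # the HEAGAWGHEE / PAWHEAE example
for n in (100, 1000, 10000):
    stored, computed = fill_cost(n, n)
    print(n, stored, computed)
```

Making both sequences ten times longer multiplies storage and work by roughly one hundred, which is the quadratic behaviour stated above.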