CS 5263 Bioinformatics
![Page 1: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/1.jpg)
CS 5263 Bioinformatics
Lecture 8: Multiple Sequence Alignment
![Page 2: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/2.jpg)
Roadmap
• Homework?
• Review of last lecture
• Multiple sequence alignment
![Page 3: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/3.jpg)
Homework
• #1: dsDNA => mRNA => protein
[Figure: dsDNA with its coding strand and template strand; the template strand is transcribed into mRNA, and the mRNA is translated into protein using the genetic code]
![Page 4: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/4.jpg)
Problem #2
• For two strings of lengths m and n, the number of alignments equals the number of paths from (0, 0) to (m, n)
– The number of ways to reach (i, j) depends on the number of ways to reach its preceding neighbors (i-1, j), (i, j-1), and (i-1, j-1)
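The counting argument above can be sketched as a small dynamic program. This is only an illustration, assuming the three usual moves (gap in one string, gap in the other, aligned pair); the function name `count_alignments` is made up for this example.

```python
def count_alignments(m: int, n: int) -> int:
    """Count paths from (0, 0) to (m, n) using steps (1, 0), (0, 1), (1, 1)."""
    # paths[i][j] = number of ways to reach grid point (i, j)
    paths = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 or j == 0:
                # Along an edge of the grid only gap moves are possible: one way.
                paths[i][j] = 1
            else:
                # Sum over the three preceding neighbors.
                paths[i][j] = (paths[i - 1][j]
                               + paths[i][j - 1]
                               + paths[i - 1][j - 1])
    return paths[m][n]


print(count_alignments(1, 1))  # 3
print(count_alignments(2, 2))  # 13
```

Even for short strings the count grows very quickly, which is why alignments are found by optimization rather than enumeration.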
![Page 5: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/5.jpg)
Problem #3
• Similar to problem #2
• But certain paths are not allowed
– (i-1, j-1)→(i-1, j)→(i, j) is illegal
– So is (i-1, j-1)→(i, j-1)→(i, j)
• How many ways are there to get to (i-1, j) without using (i-1, j-1)→(i-1, j)?
• How many ways are there to get to (i, j-1) without using (i-1, j-1)→(i, j-1)?
![Page 6: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/6.jpg)
Problem #4
• Implementation is easy
• Histogram: how you bin it may affect your results
– Use one bin for each discrete value observed in your scores
• Are the scores related to base frequency?
• Do the scores differ between global and local alignments?
• What does the score distribution look like?
![Page 7: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/7.jpg)
BLAST
Main idea:
• Construct a dictionary of all the words in the query
• Alignment is initiated between words with alignment score ≥ T
Alignment:
• Ungapped extensions until the score falls below a statistical threshold
Output:
• All local alignments with score > statistical threshold
[Figure: the query is scanned against the DB]
![Page 8: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/8.jpg)
BLAST
[Figure: dot matrix with the sequence ACGAAGTAAGGTCCAGT on the horizontal axis and a second sequence on the vertical axis]
Example:
k = 4, T = 4
The matching word GGTC initiates an alignment
Extension to the left and right with no gaps until alignment falls < 50%
Output:
GTAAGGTCC
GTTAGGTCC
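A minimal sketch of the seeding step only, assuming exact k-letter word matches (real BLAST also accepts similar words scoring at least T and then extends each seed). The query is the 17-letter sequence from the slide's dot matrix; the `db` string and the function name `find_seeds` are hypothetical.

```python
from collections import defaultdict


def find_seeds(query: str, db: str, k: int = 4):
    """Return (query_pos, db_pos) pairs where the same k-letter word occurs."""
    # Dictionary of all k-letter words in the query -> their start positions.
    index = defaultdict(list)
    for q in range(len(query) - k + 1):
        index[query[q:q + k]].append(q)

    # Scan the database once and look every word up in the dictionary.
    seeds = []
    for d in range(len(db) - k + 1):
        for q in index.get(db[d:d + k], []):
            seeds.append((q, d))
    return seeds


query = "ACGAAGTAAGGTCCAGT"
db = "TTGTTAGGTCCTT"  # hypothetical database fragment containing GTTAGGTCC
for q, d in find_seeds(query, db):
    print(q, d, query[q:q + 4])
```

Each seed, such as the GGTC match, would then be extended to the left and right without gaps.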
![Page 9: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/9.jpg)
Gapped BLAST
[Figure: dot matrix with the sequence ACGAAGTAAGGTCCAGT on the horizontal axis and a second sequence on the vertical axis]
Added features:
• Pairs of words can initiate alignment
• Extensions with gaps in a band around anchor
Output:
GTAAGGTCCAGT
GTTAGGTC-AGT
![Page 10: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/10.jpg)
• Advantages
– Fast!!!!
– A few minutes to search a database of 10^11 bases
• Disadvantages
– Sensitivity may be low
– Often misses weak homologies
![Page 11: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/11.jpg)
New improvement
• Make it even faster
– But even less sensitive
– Mainly for aligning very similar sequences or really long sequences
  • E.g. whole genome vs whole genome
• Make it more sensitive
– PSI-BLAST: iteratively add more homologous sequences
– PatternHunter: discontinuous seeds
![Page 12: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/12.jpg)
Things we’ve covered so far
• Global alignment
– Needleman-Wunsch and variants
– Improvement on space and time
• Local alignment
– Smith-Waterman
• Heuristic algorithms
– BLAST families
• Statistics for sequence alignment
– Extreme value distribution
![Page 13: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/13.jpg)
Commonality
• They all deal with aligning two sequences
– Pair-wise sequence alignment
![Page 14: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/14.jpg)
Today
• Aligning multiple sequences all together
– Multiple sequence alignment
![Page 15: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/15.jpg)
![Page 16: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/16.jpg)
Motivation
• A faint similarity between two sequences becomes very significant if present in many
• Protein domains
• Motifs responsible for gene regulation
![Page 17: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/17.jpg)
Definition
• Given N sequences x_1, x_2, …, x_N:
– Insert gaps (-) in each sequence x_i, such that
  • All sequences have the same length L
  • The score of the global map is maximum
• Pairwise alignment: a hypothesis on the evolutionary relationship between the letters of two sequences
• Same for a multiple alignment!
![Page 18: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/18.jpg)
Scoring Function
• Ideally:
– Find the alignment that maximizes the probability that the sequences evolved from a common ancestor
[Figure: phylogenetic tree (evolution tree) relating sequences x, y, z, w, v to an unknown ancestor "?"]
![Page 19: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/19.jpg)
Scoring Function (cont’d)
• Unfortunately: too many parameters
• Compromises:
– Ignore phylogenetic tree
• Compute from pair-wise scores
– Based on sum of all pair-wise scores
– Based on scores with a consensus sequence
![Page 20: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/20.jpg)
First assumption
• Columns are independent
– Similar to pair-wise alignment
• Therefore, the score of an alignment is the sum of all columns
• Need to decide how to score a single column
![Page 21: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/21.jpg)
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Induces (columns with gaps in both rows are dropped):
(x, y): x: ACGCGG-C; y: ACGC-GAG
(x, z): x: AC-GCGG-C; z: GCCGC-GAG
(y, z): y: AC-GCGAG; z: GCCGCGAG
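As a small illustration of the definition, an induced pairwise alignment can be obtained by keeping two rows and dropping the columns where both rows have a gap. The helper name `induced_pairwise` is invented for this sketch.

```python
def induced_pairwise(row_a: str, row_b: str):
    """Project a multiple alignment onto two of its rows."""
    # Keep every column except those where both rows contain a gap.
    kept = [(a, b) for a, b in zip(row_a, row_b)
            if not (a == "-" and b == "-")]
    return ("".join(a for a, _ in kept),
            "".join(b for _, b in kept))


x, y, z = "AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"
print(induced_pairwise(x, y))
print(induced_pairwise(x, z))
print(induced_pairwise(y, z))
```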
![Page 22: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/22.jpg)
Sum Of Pairs (cont’d)
• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
$S(m) = \sum_{k<l} s(m_k, m_l)$
$s(m_k, m_l)$: score of induced alignment (k, l)
![Page 23: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/23.jpg)
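Example:
x: AC-GCGG-C
y: AC-GC-GAG
z: GCCGC-GAG

Scoring matrix:

|   | A  | C  | G  | T  | -  |
|---|----|----|----|----|----|
| A | 1  | -1 | -1 | -1 | -1 |
| C | -1 | 1  | -1 | -1 | -1 |
| G | -1 | -1 | 1  | -1 | -1 |
| T | -1 | -1 | -1 | 1  | -1 |
| - | -1 | -1 | -1 | -1 | 0  |

Column 1: (A,A) + (A,G) x 2 = -1
Column 2: (C,C) x 3 = 3
Column 8: (-,A) x 2 + (A,A) = -1
Total score = (-1) + 3 + (-2) + 3 + 3 + (-2) + 3 + (-1) + (-1) = 5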
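The worked example can be checked with a short script. This is a sketch that hard-codes the slide's scoring matrix (match 1, mismatch -1, letter against gap -1, gap against gap 0); the function names are illustrative.

```python
from itertools import combinations


def pair_score(a: str, b: str) -> int:
    """Scoring matrix from the example above."""
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def column_scores(rows):
    """Sum-of-pairs score of every column of the alignment."""
    return [sum(pair_score(a, b) for a, b in combinations(col, 2))
            for col in zip(*rows)]


def sum_of_pairs(rows) -> int:
    """Columns are independent, so the total is the sum over columns."""
    return sum(column_scores(rows))


alignment = ["AC-GCGG-C", "AC-GC-GAG", "GCCGC-GAG"]
print(column_scores(alignment))  # [-1, 3, -2, 3, 3, -2, 3, -1, -1]
print(sum_of_pairs(alignment))   # 5
```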
![Page 24: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/24.jpg)
Sum Of Pairs (cont’d)
• Drawback: no evolutionary characterization
– Every sequence is treated as derived from all others
• Heuristic way to incorporate evolution tree
– Weighted Sum of Pairs:
$S(m) = \sum_{k<l} w_{kl} \, s(m_k, m_l)$
$w_{kl}$: weight decreasing with distance
[Figure: evolution tree with leaves Human, Mouse, Chicken, Duck]
![Page 25: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/25.jpg)
Consensus score
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
CAG-CTATCAC--GACCGC----TCGATTTGCTCGAC

Consensus sequence:
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Find the optimal consensus string m* to maximize
$S(m) = \sum_{i} s(m^*, m_i)$
$s(m_k, m_l)$: score of pairwise alignment (k, l)
![Page 26: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/26.jpg)
Multiple Sequence Alignments
Algorithms
![Page 27: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/27.jpg)
Multidimensional Dynamic Programming (MDP)
Generalization of Needleman-Wunsch:
• Find the longest path in a high-dimensional cube
– As opposed to a two-dimensional grid
• Uses an N-dimensional matrix
– As opposed to a two-dimensional array
• Entry F(i_1, …, i_N) represents the score of the optimal alignment for s_1[1..i_1], …, s_N[1..i_N]
$F(i_1, i_2, \ldots, i_N) = \max_{\text{nbr}} \big( F(\text{nbr}) + S(\text{current}) \big)$, where the maximum is taken over all preceding neighbors of the cell
![Page 28: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/28.jpg)
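Multidimensional Dynamic Programming (MDP)
• Example: in 3D (three sequences):
• 2^3 – 1 = 7 neighbors/cell

$$
F(i,j,k) = \max
\begin{cases}
F(i-1,j-1,k-1) + S(x_i, x_j, x_k), \\
F(i-1,j-1,k) + S(x_i, x_j, -), \\
F(i-1,j,k-1) + S(x_i, -, x_k), \\
F(i,j-1,k-1) + S(-, x_j, x_k), \\
F(i-1,j,k) + S(x_i, -, -), \\
F(i,j-1,k) + S(-, x_j, -), \\
F(i,j,k-1) + S(-, -, x_k)
\end{cases}
$$

Here x_i, x_j, x_k denote the letters at position i of the first sequence, position j of the second, and position k of the third.

[Figure: cube with corner (i,j,k) and its seven preceding neighbors (i-1,j,k), (i,j-1,k), (i,j,k-1), (i-1,j-1,k), (i-1,j,k-1), (i,j-1,k-1), (i-1,j-1,k-1)]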
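The 3D recurrence can be sketched directly in code. This is only an illustration, assuming sum-of-pairs column scoring with match 1, mismatch -1, letter against gap -1 and gap against gap 0; it returns the optimal score without a traceback, and the name `msa3_score` is made up here.

```python
from itertools import combinations, product


def pair_score(a: str, b: str) -> int:
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def column_score(col) -> int:
    """Sum-of-pairs score S(.) of one alignment column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def msa3_score(x: str, y: str, z: str):
    """Optimal sum-of-pairs score of a three-way global alignment."""
    n1, n2, n3 = len(x), len(y), len(z)
    neg_inf = float("-inf")
    F = [[[neg_inf] * (n3 + 1) for _ in range(n2 + 1)] for _ in range(n1 + 1)]
    F[0][0][0] = 0
    # The 2^3 - 1 = 7 moves: each sequence either advances (1) or gets a gap (0).
    moves = [mv for mv in product((0, 1), repeat=3) if any(mv)]
    for i in range(n1 + 1):
        for j in range(n2 + 1):
            for k in range(n3 + 1):
                if i == j == k == 0:
                    continue
                best = neg_inf
                for di, dj, dk in moves:
                    pi, pj, pk = i - di, j - dj, k - dk
                    if pi < 0 or pj < 0 or pk < 0:
                        continue  # neighbor lies outside the cube
                    col = (x[i - 1] if di else "-",
                           y[j - 1] if dj else "-",
                           z[k - 1] if dk else "-")
                    best = max(best, F[pi][pj][pk] + column_score(col))
                F[i][j][k] = best
    return F[n1][n2][n3]


print(msa3_score("AC", "AC", "AC"))  # 6
```

The three nested loops over cells and the inner loop over seven moves make the exponential cost in the number of sequences visible.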
![Page 29: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/29.jpg)
Multidimensional Dynamic Programming (MDP)
Running Time:
1. Size of matrix: $L^N$,
where L = length of each sequence
and N = number of sequences
2. Neighbors/cell: $2^N - 1$
Therefore the running time is $O(2^N L^N)$
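To get a feel for this bound, here is a worked instance with illustrative values of N and L (these numbers are assumptions chosen for the arithmetic, not figures from the lecture):

$$
2^N L^N \Big|_{N=3,\ L=100} = 8 \times 10^{6},
\qquad
2^N L^N \Big|_{N=6,\ L=250} = (2 \cdot 250)^6 = 500^6 \approx 1.6 \times 10^{16}
$$

Doubling the number of sequences from three to six raises the operation count by about ten orders of magnitude.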
![Page 30: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/30.jpg)
Faster MDP
• Carrillo & Lipman, 1988
– Branch and bound
– Other heuristics
• Practical for about 6 sequences of length about 200-300
![Page 31: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/31.jpg)
Progressive Alignment
• Multiple alignment is NP-hard
• Most used heuristic: progressive alignment
Algorithm:
1. Align two of the sequences x_i, x_j
2. Fix that alignment
3. Align a third sequence x_k to the alignment x_i, x_j
4. Repeat until all sequences are aligned
Running time: $O(N L^2)$
• Each alignment takes $O(L^2)$
• Repeat N times
![Page 32: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/32.jpg)
Progressive Alignment
• When the evolutionary tree is known:
– Align closest first, in the order of the tree
Example: order of alignments:
1. (x, y)
2. (z, w)
3. (xy, zw)
[Figure: tree with leaves x, y, z, w]
![Page 33: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/33.jpg)
Progressive Alignment: CLUSTALW
CLUSTALW: the most popular multiple protein alignment tool
Algorithm:
1. Find all d_ij: alignment distance (x_i, x_j)
• High alignment score => short distance
2. Construct a tree
(Neighbor-joining hierarchical clustering. Will discuss in future)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics
![Page 34: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/34.jpg)
CLUSTALW example
• S1 ALSK
• S2 TNSD
• S3 NASK
• S4 NTSD
![Page 35: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/35.jpg)
CLUSTALW example
• S1 ALSK
• S2 TNSD
• S3 NASK
• S4 NTSD

Distance matrix:

|    | s1 | s2 | s3 | s4 |
|----|----|----|----|----|
| s1 | 0  | 9  | 4  | 7  |
| s2 |    | 0  | 8  | 3  |
| s3 |    |    | 0  | 7  |
| s4 |    |    |    | 0  |
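A rough sketch of how a pairing order could be read off this distance matrix. It uses a simple greedy closest-pair rule as a stand-in; CLUSTALW itself builds the tree with neighbor-joining, which is not implemented here, and the function name `greedy_pairs` is hypothetical.

```python
def greedy_pairs(dist):
    """Pair up sequences greedily, closest pair first.

    dist maps (name_a, name_b) -> distance for each unordered pair.
    """
    order, used = [], set()
    for (a, b), _ in sorted(dist.items(), key=lambda kv: kv[1]):
        if a not in used and b not in used:
            order.append((a, b))
            used.update((a, b))
    return order


# Distances from the matrix above.
dist = {
    ("s1", "s2"): 9, ("s1", "s3"): 4, ("s1", "s4"): 7,
    ("s2", "s3"): 8, ("s2", "s4"): 3, ("s3", "s4"): 7,
}
print(greedy_pairs(dist))  # [('s2', 's4'), ('s1', 's3')]
```

For this small example the greedy rule groups s2 with s4 and s1 with s3; on larger inputs a proper tree-building method would be needed.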
![Page 39: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/39.jpg)
CLUSTALW example
• S1 ALSK
• S2 TNSD
• S3 NASK
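• S4 NTSD

|    | s1 | s2 | s3 | s4 |
|----|----|----|----|----|
| s1 | 0  | 9  | 4  | 7  |
| s2 |    | 0  | 8  | 3  |
| s3 |    |    | 0  | 7  |
| s4 |    |    |    | 0  |

[Figure: guide tree joining s1 with s3 and s2 with s4]

Alignment of s1 and s3:
-ALSK
NA-SK

Alignment of s2 and s4:
-TNSD
NT-SD

Alignment of the two alignments:
-ALSK
-TNSD
NA-SK
NT-SD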
Question: how do you align two alignments?
![Page 40: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/40.jpg)
Aligning two alignments
• You can treat each column in an alignment as a single letter
– Remember that in the case of the gene finder, we aligned three nucleic acids at a time
• How do we score it?
– Naïve: compute the sum of pairs
• Better: only compute the cross terms
– We already have (K, K) and (D, D)
– Need to add 4 x (K, D), one term for each cross pair between the two columns

-ALSK
NA-SK

-TNSD
NT-SD
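A small sketch of the cross-term idea for one pair of columns. It assumes a toy match/mismatch scoring (1 and -1, with -1 against a gap) instead of a real protein substitution matrix, and the helper names are invented for illustration.

```python
from itertools import combinations, product


def pair_score(a: str, b: str) -> int:
    # Toy scoring, an assumption made for this example.
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def column_sop(col) -> int:
    """Sum-of-pairs score within a single column."""
    return sum(pair_score(a, b) for a, b in combinations(col, 2))


def cross_score(col_a, col_b) -> int:
    """Only the pairs with one letter from each alignment's column."""
    return sum(pair_score(a, b) for a, b in product(col_a, col_b))


left, right = "KK", "DD"  # last columns of the two alignments above
print(cross_score(left, right))
# The merged column's score is the two known parts plus the cross terms.
print(column_sop(left) + column_sop(right) + cross_score(left, right))
print(column_sop(left + right))
```

Because the within-alignment scores are already known, only `cross_score` has to be evaluated in the dynamic program that aligns the two alignments.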
![Page 41: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/41.jpg)
CLUSTALW & the CINEMA viewer
![Page 42: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/42.jpg)
Iterative Refinement
Problems with progressive alignment:
• Depends on pair-wise alignments
• If sequences are very distantly related, there is a much higher likelihood of errors
• Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTT
y: GAC-TT (frozen!)
z: GAACTG
w: GTACTG
Now clear: correct y should be GA-CTT
![Page 43: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/43.jpg)
Iterative Refinement
Algorithm (Barton-Sternberg):
1. Align the most similar x_i, x_j
2. Align the x_k most similar to (x_i x_j)
3. Repeat 2 until (x_1 … x_N) are aligned
4. For j = 1 to N,
   remove x_j, and realign it to x_1 … x_{j-1} x_{j+1} … x_N
5. Repeat 4 until convergence
Note: guaranteed to converge
Running time: $O(k N L^2)$, k: number of iterations
![Page 44: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/44.jpg)
Iterative Refinement (cont’d)
For each sequence y
1. Remove y
2. Realign y (while the rest is fixed)
[Figure: alignment lattice with axes x, y, z; the x,z projection is fixed while y is allowed to vary]
![Page 45: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/45.jpg)
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTA
y: GAC-TTA
z: GAACTGA
w: GTACTGA
After realigning y:
x: GAAGTTA
y: G-ACTTA (+ 3 matches)
z: GAACTGA
w: GTACTGA
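The remove-and-realign step can be sketched as a sequence-to-alignment dynamic program. This is a minimal illustration, assuming sum-of-pairs scoring with match 1, mismatch -1 and -1 per letter against a gap; the function names are invented for the example and only a single refinement step is shown.

```python
def pair_score(a: str, b: str) -> int:
    if a == "-" and b == "-":
        return 0
    if a == "-" or b == "-":
        return -1
    return 1 if a == b else -1


def align_to_alignment(seq, rows):
    """Align an ungapped sequence to a fixed alignment (equal-length rows)."""
    n, m = len(seq), len(rows[0])

    def col_score(ch, j):  # ch (letter or '-') against fixed column j
        return sum(pair_score(ch, r[j]) for r in rows)

    def ins_score(ch):     # ch against a new all-gap column
        return sum(pair_score(ch, "-") for _ in rows)

    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = F[i - 1][0] + ins_score(seq[i - 1])
    for j in range(1, m + 1):
        F[0][j] = F[0][j - 1] + col_score("-", j - 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(F[i - 1][j - 1] + col_score(seq[i - 1], j - 1),
                          F[i][j - 1] + col_score("-", j - 1),
                          F[i - 1][j] + ins_score(seq[i - 1]))

    # Traceback from the bottom-right corner.
    i, j, out_seq, out_cols = n, m, [], []
    while i > 0 or j > 0:
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + col_score(seq[i - 1], j - 1):
            out_seq.append(seq[i - 1]); out_cols.append([r[j - 1] for r in rows])
            i, j = i - 1, j - 1
        elif j > 0 and F[i][j] == F[i][j - 1] + col_score("-", j - 1):
            out_seq.append("-"); out_cols.append([r[j - 1] for r in rows])
            j -= 1
        else:
            out_seq.append(seq[i - 1]); out_cols.append(["-"] * len(rows))
            i -= 1
    out_seq.reverse(); out_cols.reverse()
    new_rows = ["".join(c[k] for c in out_cols) for k in range(len(rows))]
    return new_rows, "".join(out_seq)


def realign_one(alignment, idx):
    """Remove row idx, then realign it to the remaining (fixed) rows."""
    seq = alignment[idx].replace("-", "")
    rest = alignment[:idx] + alignment[idx + 1:]
    # Drop columns that contain only gaps once the row is removed.
    keep = [c for c in zip(*rest) if any(ch != "-" for ch in c)]
    rest = ["".join(c[k] for c in keep) for k in range(len(rest))]
    new_rest, new_seq = align_to_alignment(seq, rest)
    return new_rest[:idx] + [new_seq] + new_rest[idx:]


start = ["GAAGTTA", "GAC-TTA", "GAACTGA", "GTACTGA"]
print(realign_one(start, 1))
```

Under these scoring assumptions, realigning y against the fixed x, z, w moves its gap so that the A and C line up with the other rows, matching the example.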
![Page 46: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/46.jpg)
Iterative Refinement
• Example not handled well:
x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z: GAACTGA
w: GTACTGA
Realigning any single y_i changes nothing
![Page 47: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/47.jpg)
Restricted MDP
• Similar to bounded DP in pair-wise alignment
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m
Running time: $O(2^N R^{N-1} L)$
[Figure: alignment lattice with axes x, y, z]
![Page 48: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/48.jpg)
Restricted MDP
x: GAAGTTA
y1: GAC-TTA
y2: GAC-TTA
y3: GAC-TTA
z: GAACTGA
w: GTACTGA
• Within radius 1 of the optimal
Restricted MDP will fix it.
![Page 49: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/49.jpg)
Other approaches
• Profile Hidden Markov Models
– Statistical learning methods
– Will discuss in future
![Page 50: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/50.jpg)
Multiple alignment tools
• Clustal W (Thompson, 1994)
– Most popular
• PRRP (Gotoh, 1993)
• HMMT (Eddy, 1995)
• DIALIGN (Morgenstern, 1998)
• T-Coffee (Notredame, 2000)
• MUSCLE (Edgar, 2004)
• Align-m (Walle, 2004)
• PROBCONS (Do, 2004)
![Page 51: CS 5263 Bioinformatics](https://reader034.fdocuments.in/reader034/viewer/2022051620/5681443d550346895db0d988/html5/thumbnails/51.jpg)
In summary
• Multiple alignment algorithms:
– MDP (too slow)
  • Branch and bound (B&B) doesn’t solve the problem entirely
– Progressive alignment: CLUSTALW
– Iterative refinement
– Restricted MDP