Large-Scale Global Alignments Multiple Alignments Lecture 10, Thursday May 1, 2003.
Rapid Global Alignments
-
Upload
mansour-jalil -
Category
Documents
-
view
37 -
download
1
description
Transcript of Rapid Global Alignments
Rapid Global Alignments
How to align genomic sequences in (more or less) linear time
Methods to CHAIN Local Alignments
Sparse Dynamic ProgrammingO(N log N)
The Problem: Find a Chain of Local Alignments
(x,y) (x’,y’)
requires
x < x’y < y’
Each local alignment has a weight
FIND the chain with highest total weight
Sparse DP for rectangle chaining
• 1,…, N: rectangles
• (hj, lj): y-coordinates of rectangle j
• w(j): weight of rectangle j
• V(j): optimal score of chain ending in j
• L: list of triplets (lj, V(j), j)
L is sorted by lj L is implemented as a balanced binary tree
y
h
l
Sparse DP for rectangle chaining
Main idea:
• Sweep through x-coordinates
• To the right of b, anything chainable to a is chainable to b
• Therefore, if V(b) > V(a), rectangle a is “useless” – remove it
• In L, keep rectangles j sorted with increasing lj-coordinates sorted with increasing V(j)
V(b)V(a)
Sparse DP for rectangle chaining
Go through rectangle x-coordinates, from left to right:
1. When on the leftmost end of rectangle i, compute V(i)
a. j: rectangle in L, with largest lj < hi
b. V(i) = w(i) + V(j)
2. When on the rightmost end of i, possibly store V(i) in L:
a. j: rectangle in L, with largest lj lib. If V(i) > V(j):
i. INSERT (li, V(i), i) in L
ii. REMOVE all (lk, V(k), k) with V(k) V(i) &
lk li
i
j
Example
x
y
1: 5
3: 3
2: 6
4: 45: 2
2
56
91011
1214
1516
Time Analysis
1. Sorting the x-coords takes O(N log N)
2. Going through x-coords: N steps
3. Each of N steps requires O(log N) time:
• Searching L takes log N• Inserting to L takes log N• All deletions are consecutive, so log N per deletion• Each element is deleted at most once: N log N for all deletions
• Recall that INSERT, DELETE, SUCCESSOR, take O(log N) time in a balanced binary search tree
Putting it All Together:Fast Global Alignment Algorithms
1. FIND local alignments
2. CHAIN local alignments
FIND CHAIN
GLASS: k-mers hierarchical DP
MumMer: Suffix Tree sparse DP
Avid: Suffix Tree hierarchical DP
LAGAN CHAOS sparse DP
LAGAN: Pairwise Alignment
1. FIND local alignments
2. CHAIN local alignments
3. DP restricted around chain
LAGAN
1. Find local alignments
2. Chain -O(NlogN) L.I.S.
3. Restricted DP
LAGAN: recursive call
• What if a box is too large? Recursive application of LAGAN,
more sensitive word search
A trick to save on memory
“necks” have tiny tracebacks
…only store tracebacks
Multiple Sequence Multiple Sequence AlignmentsAlignments
Overview
• Definition
• Scoring Schemes
• Algorithms
Definition
• Given N sequences x1, x2,…, xN: Insert gaps (-) in each sequence xi, such that
• All sequences have the same length L
• Score of the global map is maximum
• A faint similarity between two sequences becomes significant if present in many
• Multiple alignments can help improve the pairwise alignments
Scoring Function
• Ideally: Find alignment that maximizes probability that sequences evolved
from common ancestor, according to some phylogenetic model
• More on phylogenetic models later
x
yz
w
v
?
Scoring Function
• A comprehensive model would have too many parameters, too inefficient to optimize
• Possible simplifications
Ignore phylogenetic tree
Statistically independent columns:
S(m) = G(m) + i S(mi)
m: alignment matrixG: function penalizing gaps
Scoring Function: Sum Of Pairs
Definition: Induced pairwise alignment
A pairwise alignment induced by the multiple alignment
Example:
x: AC-GCGG-C y: AC-GC-GAG z: GCCGC-GAG
Induces:
x: ACGCGG-C; x: AC-GCGG-C; y: AC-GCGAGy: ACGC-GAC; z: GCCGC-GAG; z: GCCGCGAG
Sum Of Pairs (cont’d)
• The sum-of-pairs score of an alignment is the sum of the scores of all induced pairwise alignments
S(m) = k<l s(mk, ml)
s(mk, ml): score of induced alignment (k,l)
Sum Of Pairs (cont’d)
• Heuristic way to incorporate evolution tree:
Human
Mouse
Chicken• Weighted SOP:
S(m) = k<l wkl s(mk, ml)
wkl: weight decreasing with distance
Duck
Consensus
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGACCAG-CTATCAC--GACCGC----TCGATTTGCTCGAC
CAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Find optimal consensus string m* to maximize
S(m) = i s(m*, mi)
s(mk, ml): score of pairwise alignment (k,l)
Multiple Sequence Alignments
Algorithms
1. Multidimensional Dynamic Programming
Generalization of Needleman-Wunsh:
S(m) = i S(mi)
(sum of column scores)
F(i1,i2,…,iN) = max(all neighbors of cube)(F(nbr)+S(nbr))
• Example: in 3D (three sequences):
• 7 neighbors/cell
F(i,j,k) = max{ F(i-1,j-1,k-1)+S(xi, xj, xk),F(i-1,j-1,k )+S(xi, xj, - ),F(i-1,j ,k-1)+S(xi, -, xk),F(i-1,j ,k )+S(xi, -, - ),F(i ,j-1,k-1)+S( -, xj, xk),F(i ,j-1,k )+S( -, xj, xk),F(i ,j ,k-1)+S( -, -, xk) }
1. Multidimensional Dynamic Programming
Running Time:
1. Size of matrix: LN;
Where L = length of each sequence
N = number of sequences
2. Neighbors/cell: 2N – 1
Therefore………………………… O(2N LN)
1. Multidimensional Dynamic Programming
2. Progressive Alignment
• Multiple Alignment is NP-complete
• Most used heuristic: Progressive Alignment
Algorithm:1. Align two of the sequences xi, xj
2. Fix that alignment
3. Align a third sequence xk to the alignment xi,xj
4. Repeat until all sequences are aligned
Running Time: O( N L2 )
2. Progressive Alignment
• When evolutionary tree is known:
Align closest first, in the order of the tree
Example:Order of alignments: 1. (x,y)
2. (z,w)3. (xy, zw)
x
w
y
z
CLUSTALW: progressive alignment
CLUSTALW: most popular multiple protein alignment
Algorithm:
1. Find all dij: alignment dist (xi, xj)
2. Construct a tree
(Neighbor-joining hierarchical clustering)
3. Align nodes in order of decreasing similarity
+ a large number of heuristics
CLUSTALW & the CINEMA viewer
MLAGAN: progressive alignment of DNA
Given N sequences, phylogenetic tree
Align pairwise, in order of the tree (LAGAN)
Human
Baboon
Mouse
Rat
MLAGAN: main steps
Given a collection of sequences, and a phylogenetic tree
1. Find local alignments for every pair of sequences x, y
2. Find anchors between every pair of sequences, similar to LAGAN anchoring
3. Progressive alignment• Multi-Anchoring based on reconciling the pairwise anchors• LAGAN-style limited-area DP
4. Optional refinement steps
MLAGAN: multi-anchoring
XZ
YZ
X/Y
Z
To anchor the (X/Y), and (Z) alignments:
Heuristics to improve multiple alignments
• Iterative refinement schemes
• A*-based search
• Consistency
• Simulated Annealing
• …
Iterative Refinement
One problem of progressive alignment:• Initial alignments are “frozen” even when new evidence comes
Example:
x: GAAGTTy: GAC-TT
z: GAACTGw: GTACTG
Frozen!
Now clear correct y = GA-CTT
Iterative Refinement
Algorithm (Barton-Stenberg):
1. Align most similar xi, xj
2. Align xk most similar to (xixj)3. Repeat 2 until (x1…xN) are aligned
4. For j = 1 to N,Remove xj, and realign to x1…xj-1xj+1…xN
5. Repeat 4 until convergence
Note: Guaranteed to converge
Iterative Refinement
For each sequence y1. Remove y2. Realign y
(while rest fixed)x
y
z
x,z fixed projection
allow y to vary
Iterative Refinement
Example: align (x,y), (z,w), (xy, zw):
x: GAAGTTAy: GAC-TTAz: GAACTGAw: GTACTGA
After realigning y:
x: GAAGTTAy: G-ACTTA + 3 matchesz: GAACTGAw: GTACTGA
Iterative Refinement
Example not handled well:
x: GAAGTTAy1: GAC-TTAy2: GAC-TTAy3: GAC-TTA
z: GAACTGAw: GTACTGA
Realigning any single yi changes nothing
Restricted MDP
Here is another way to improve a multiple alignment:
1. Construct progressive multiple alignment m
2. Run MDP, restricted to radius R from m
Running Time: O(2N RN-1 L)
Restricted MDP
Run MDP, restricted to radius R from m
x
y
z
Running Time: O(2N RN-1 L)
Restricted MDP
x: GAAGTTAy1: GAC-TTAy2: GAC-TTAy3: GAC-TTA
z: GAACTGAw: GTACTGA
• Within radius 1 of the optimal
Restricted MDP will fix it.
Optional refinement steps in MLAGAN
• Limited-area iterative refinement
• Radius-r 3-sequence refinement on each node of the tree