Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in...
![Page 1: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/1.jpg)
Aligning Alignments Exactly
By John Kececioglu, Dean Starrett
CS Dept., Univ. of Arizona
Appeared in 8th ACM RECOMB 2004
Presented by Jie Meng
![Page 2: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/2.jpg)
Background
Definition
Hardness
An exponential-time algorithm
![Page 3: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/3.jpg)
Alignments
Given two (DNA or protein) sequences, an alignment places them against each other such that the similar parts are aligned as closely as possible, for example:
A T - C - T C G C T
- T G - A T G - A T
There are four kinds of aligned columns:
Match
Insertion
Deletion
Mismatch
![Page 4: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/4.jpg)
Scoring Alignments
There are four types of aligned columns:
– Match: Score_match = 0.
– Mismatch: Score_mismatch ≥ 0.
– Insertion: Score_insertion ≥ 0.
– Deletion: Score_deletion ≥ 0.
The score of an alignment is defined to be the sum of the scores of its aligned columns.
The goal is to minimize the score.
![Page 5: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/5.jpg)
Gap-cost
We can refine the indel score into a gap-open cost (open) and a gap-extension cost (extension); a gap of size x then costs open + x * extension instead of x * indel.
AT----CGCTTCAT
-TGCAT--AT-----
Cost of the length-4 gap in the first row: open + 4 * extension
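As a small illustration of the two gap-cost models on this slide, the sketch below finds the runs of spaces in a row and prices one run under each model. The function names and the numeric costs (open = 2, extension = 1, indel = 1) are assumptions chosen for the example, not values from the paper.

```python
from itertools import groupby

def gap_runs(row, gap="-"):
    """Return the lengths of the maximal runs of gap characters in a row."""
    return [len(list(run)) for ch, run in groupby(row) if ch == gap]

def linear_gap_cost(x, indel):
    """Cost of a gap of size x when every space costs indel."""
    return x * indel

def affine_gap_cost(x, gap_open, gap_extension):
    """Cost of a gap of size x: one opening charge plus x extension charges."""
    return gap_open + x * gap_extension

row = "AT----CGCTTCAT"
for x in gap_runs(row):
    # Assumed costs: open = 2, extension = 1, indel = 1.
    print(x, linear_gap_cost(x, 1), affine_gap_cost(x, 2, 1))
```

With these assumed costs, a single long gap is charged the opening cost only once, which is what distinguishes it from several short gaps of the same total size.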
![Page 6: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/6.jpg)
Multiple Alignments
In general we also need to compare multiple sequences and find their similarities.
Multiple alignment generalizes the alignment idea to handle many sequences.
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT
![Page 7: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/7.jpg)
Sum-of-Pairs (SP) Score
Given a multiple alignment, the sum-of-pairs (SP) score is given by the sum of the induced pairwise alignment scores of each pair in the alignment.
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGCT
The three induced pairwise alignments whose scores are summed:
AT-C-TCGAT     -TGCAT--AT     AT-C-TCGAT
-TGCAT--AT  +  ATCCA-CGCT  +  ATCCA-CGCT
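The SP score of the three-row example can be computed with a short sketch like the one below. It assumes simple per-column costs (match 0, mismatch 1, indel 1, no gap-open cost); these unit costs and the function names are illustrative assumptions.

```python
from itertools import combinations

def pair_score(row_a, row_b, mismatch=1, indel=1, gap="-"):
    """Score the pairwise alignment induced by two rows (a match scores 0)."""
    total = 0
    for x, y in zip(row_a, row_b):
        if x == gap and y == gap:
            continue  # a column with a space in both rows is dropped
        if x == gap or y == gap:
            total += indel
        elif x != y:
            total += mismatch
    return total

def sp_score(rows, mismatch=1, indel=1):
    """Sum the induced pairwise scores over every pair of rows."""
    return sum(pair_score(a, b, mismatch, indel)
               for a, b in combinations(rows, 2))

rows = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
print(sp_score(rows))  # 5 + 4 + 6 = 15 under the assumed unit costs
```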
![Page 8: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/8.jpg)
BAD NEWS
Multiple alignment is NP-hard.
One method is to approximate the optimal value;
Progressive alignments
A problem arises naturally: Aligning Alignments
![Page 9: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/9.jpg)
Aligning Alignments
Let S be a collection of strings s1, s2, s3, …, sk over alphabet Σ;
An alignment of S is a matrix A with k rows such that:
i) Each entry is either a letter or a space;
ii) No column is all spaces;
iii) Reading across row i and removing the spaces, we get string si;
As before, we have three types of column scores: match, mismatch, and indel (insertion or deletion);
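The three conditions in the definition can be checked mechanically. The sketch below is one possible check, assuming rows are Python strings and the space is written as "-"; the function name `is_alignment` is made up for this example.

```python
def is_alignment(matrix, strings, gap="-"):
    """Check that matrix is an alignment of strings per conditions i) to iii)."""
    if len(matrix) != len(strings):
        return False  # one row per string
    if len({len(row) for row in matrix}) > 1:
        return False  # rows of a matrix must have equal length
    if any(all(ch == gap for ch in col) for col in zip(*matrix)):
        return False  # ii) no column may be all spaces
    # iii) removing the spaces from row i must give string s_i
    return all(row.replace(gap, "") == s for row, s in zip(matrix, strings))

A = ["AT-C-TCGAT", "-TGCAT--AT", "ATCCA-CGCT"]
S = ["ATCTCGAT", "TGCATAT", "ATCCACGCT"]
print(is_alignment(A, S))
```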
![Page 10: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/10.jpg)
Aligning Alignments
Given two alignments, A with k sequences of length N and B with l sequences of length M, we want to align the columns of A and B;
A:
AT-C-TCGAT
-TGCAT--AT
ATCCA-CGAT
B:
CT-ATTGGAT
-TTAT-G--T
CTTA-GGGAT
![Page 11: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/11.jpg)
Aligning Alignments
In other words, we treat the columns of A and B as single letters, just as when aligning two sequences.
CTGT-T
AT-TGT
C-TG-T--T
-AT--T-GT
![Page 12: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/12.jpg)
Aligning Alignments
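The score function is still sum-of-pairs, namely
$$\sum_{i=1}^{k} \sum_{j=1}^{l} D(A'_i, B'_j)$$
where $A'_i$ and $B'_j$ are row i of A and row j of B as they appear in the combined alignment, and D is the score of the pairwise alignment they induce.
We note that the alignment of $A'_i$ and $B'_j$ may contain columns with a space in both sequences, so we just remove those columns here:
Ai': a----aa-a
Bj': aaa-a-a-a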
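Removing the columns that hold a space in both rows can be sketched as follows, using the Ai' and Bj' rows from this slide. The helper name `induced_pair` is an assumption for illustration.

```python
def induced_pair(row_a, row_b, gap="-"):
    """Drop every column in which both rows hold a space."""
    kept = [(x, y) for x, y in zip(row_a, row_b)
            if not (x == gap and y == gap)]
    return ("".join(x for x, _ in kept),
            "".join(y for _, y in kept))

a, b = induced_pair("a----aa-a", "aaa-a-a-a")
print(a)  # a---aaa
print(b)  # aaaa-aa
```

Columns 4 and 8 of the example hold a space in both rows, so the induced pairwise alignment has 7 columns instead of 9.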
![Page 13: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/13.jpg)
Aligning Alignments
Without gap costs, aligning alignments is solvable in polynomial time. We can apply dynamic programming as we did for aligning two sequences; the only difference is that we align columns.
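A minimal sketch of this column-wise dynamic program is given below. It assumes every space costs a fixed indel (no gap-open cost), scores only the pairs formed by one row of A and one row of B, and uses unit costs (match 0, mismatch 1, indel 1); these choices and all function names are assumptions for illustration, not the paper's implementation.

```python
def pair_cost(x, y, mismatch=1, indel=1, gap="-"):
    """Cost of one entry of A against one entry of B."""
    if x == gap and y == gap:
        return 0
    if x == gap or y == gap:
        return indel
    return 0 if x == y else mismatch

def align_alignments_linear(A, B, mismatch=1, indel=1, gap="-"):
    """Minimum cross sum-of-pairs cost of aligning the columns of A and B."""
    k, l = len(A), len(B)
    cols_a, cols_b = list(zip(*A)), list(zip(*B))
    n, m = len(cols_a), len(cols_b)

    def letters(col):
        return sum(ch != gap for ch in col)

    C = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):  # column of A against an all-space column of B
        C[i][0] = C[i - 1][0] + indel * letters(cols_a[i - 1]) * l
    for j in range(1, m + 1):  # column of B against an all-space column of A
        C[0][j] = C[0][j - 1] + indel * letters(cols_b[j - 1]) * k
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            both = sum(pair_cost(x, y, mismatch, indel, gap)
                       for x in cols_a[i - 1] for y in cols_b[j - 1])
            C[i][j] = min(
                C[i - 1][j - 1] + both,
                C[i - 1][j] + indel * letters(cols_a[i - 1]) * l,
                C[i][j - 1] + indel * letters(cols_b[j - 1]) * k,
            )
    return C[n][m]

print(align_alignments_linear(["AT", "A-"], ["AT"]))
```

The table has (n+1)(m+1) cells and each cell compares one column of A with one column of B, so under these assumptions the running time is polynomial in the sizes of A and B.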
![Page 14: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/14.jpg)
Aligning Alignments
With gap cost, this problem is NP-complete.
We can use a reduction from the MAX-CUT problem.
MAX-CUT: Given a graph G=(V, E) and an integer c, ask whether there is a partition of V, V = L ∪ R with L ∩ R = ∅, such that the size of the cut is no less than c;
By cut, we mean the set of edges that have one end vertex in L and the other in R;
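To make the MAX-CUT decision problem concrete, here is a brute-force sketch that tries every subset L of V. It takes exponential time and is only meant to illustrate the definition of a cut; the function names and the triangle example are assumptions for this illustration.

```python
from itertools import combinations

def cut_size(edges, left):
    """Number of edges with exactly one end vertex in left."""
    return sum((u in left) != (v in left) for u, v in edges)

def has_cut_of_size(vertices, edges, c):
    """Decide MAX-CUT by trying every subset L of the vertices."""
    vertices = list(vertices)
    for r in range(len(vertices) + 1):
        for left in combinations(vertices, r):
            if cut_size(edges, set(left)) >= c:
                return True
    return False

triangle = [(1, 2), (2, 3), (1, 3)]
print(has_cut_of_size([1, 2, 3], triangle, 2))  # True
print(has_cut_of_size([1, 2, 3], triangle, 3))  # False
```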
![Page 15: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/15.jpg)
NP-hardness
• Given an instance of MAX-CUT, G=(V,E) with V={v1, v2, …, vn} and E={e1, e2, …, em}, and an integer c;
• we construct two multiple alignments A and B over alphabet {0,1}: both A and B have m edge rows and k dummy rows, each edge row corresponding to an edge; A has 2n columns, every two consecutive columns corresponding to a vertex; B has 3n columns, every three consecutive columns corresponding to a vertex;
![Page 16: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/16.jpg)
NP-hardness
• The dummy rows in A are (0-)^n; the dummy rows in B are (0--)^n;
• For the edge rows in A: in the row for edge e=(vi, vj), the columns for vi and for vj each contain the substring "-1", with spaces elsewhere;
• For the edge rows in B: in the row for edge e=(vi, vj), i<j, the columns for vi contain the substring "010" and the columns for vj contain the substring "-10";
![Page 17: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/17.jpg)
NP-hardness
• For simplicity, we let the score for a match be 0,
the score for a mismatch be 1,
the gap open cost be 2, and the gap extension cost be 1,
and ask whether there is an alignment whose score is less than d-c;
So we have an instance of Aligning Alignments.
![Page 18: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/18.jpg)
HOMEWORK4
• Given a set of multiple alignments {A1, A2, … An}, where each Ai is a multiple alignment with ki sequences, and no gap cost: is the problem of multiple alignment on those alignments {A1, A2, … An} hard or easy? Use the method in this paper to align multiple alignments, i.e. align columns. If it is hard, prove it; otherwise, give an efficient algorithm and prove its complexity and correctness.
![Page 19: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/19.jpg)
Exact Algorithm
The basic idea is still dynamic programming; we have to remember extra information in a set S of so-called shapes: for each row in a multiple alignment, a shape records the column of the right-most letter.
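Following only the description on this slide, the shape of an alignment can be sketched as the tuple of right-most letter columns, one entry per row. The 1-based column numbering, the value 0 for a row with no letters, and the function name are assumptions of this sketch; the paper's own definition of a shape may differ in detail.

```python
def shape(rows, gap="-"):
    """For each row, the 1-based column of its right-most letter (0 if none)."""
    return tuple(
        max((c for c, ch in enumerate(row, start=1) if ch != gap), default=0)
        for row in rows
    )

print(shape(["AT-C-", "-TG--"]))  # (4, 3)
```

Two alignments with the same shape end in the same pattern of trailing spaces, which is the information needed to tell whether an appended column opens a new gap or extends an existing one.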
![Page 20: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/20.jpg)
Exact Algorithm
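$$S(i,j) = \begin{cases}
\big(S(i-1,j-1) \oplus (A[i],B[j])\big) \cup \big(S(i,j-1) \oplus (-,B[j])\big) \cup \big(S(i-1,j) \oplus (A[i],-)\big) & \text{otherwise} \\
\{\emptyset\} & i = 0 \text{ and } j = 0 \\
\{\} & i < 0 \text{ or } j < 0
\end{cases}$$
Here $S \oplus c$ denotes the set of shapes obtained by appending the column pair c to an alignment ending in each shape of S.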
![Page 21: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/21.jpg)
Exact Algorithm
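$$C(i,j,t) = \min \begin{cases}
\min\limits_{s \in S(i-1,j-1),\ s \oplus (A[i],B[j]) = t} \Big\{ C(i-1,j-1,s) + g(A[i],B[j],s) \cdot \mathrm{open} + \sum\limits_{p \in A,\ q \in B} D(A[p,i], B[q,j]) \Big\} \\
\min\limits_{s \in S(i,j-1),\ s \oplus (-,B[j]) = t} \Big\{ C(i,j-1,s) + g(-,B[j],s) \cdot \mathrm{open} + k \cdot |B[j]| \cdot \mathrm{extension} \Big\} \\
\min\limits_{s \in S(i-1,j),\ s \oplus (A[i],-) = t} \Big\{ C(i-1,j,s) + g(A[i],-,s) \cdot \mathrm{open} + l \cdot |A[i]| \cdot \mathrm{extension} \Big\}
\end{cases}$$
Where g(A[i], B[j], s) means the total number of gaps initiated by appending column A[i] and B[j] onto an alignment that ends in shape s, and $s \oplus c$ is the shape that results from that append;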
![Page 22: Aligning Alignments Exactly By John Kececioglu, Dean Starrett CS Dept. Univ. of Arizona Appeared in 8 th ACM RECOME 2004, Presented by Jie Meng.](https://reader030.fdocuments.in/reader030/viewer/2022032522/56649d635503460f94a45e30/html5/thumbnails/22.jpg)
Exact Algorithm
The optimum value is
$$\min_{s \in S(m,n)} C(m,n,s)$$
where (m, n) is the final cell of the table, i.e. the last columns of A and B.
The problem here is that the number of shapes may be very large, so in the worst case the time and space complexity are exponential.