Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του...

11
Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press

Transcript of Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του...

Page 1: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

Multiple Alignment –

Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University Press

Page 2: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

Three cοmmοn representations

•There are three common kinds of family representations that come from multiple string comparison: ▫Profile representations ▫Consensus sequence representations▫Signature representations.

Page 3: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

Family representations and alignments with profiles

•Definition: Given a multiple alignment of a set of strings, a profile for that multiple alignment specifies for each column the frequency that each character appears in the column. A profile is sometimes also called a weight matrix in the biological literature.

Page 4: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

How to optimally align a string to a profile

•Definition: For a character y and column j, let p(y,j) be the frequency that character y appears in column j of the profile, and let S(x,j) denote the score for aligning x with column j.

•Let V(i,j) denote the value of the optimal alignment of substring S[1..i] with the first j columns of C

, ,ys x y xp y j

Page 5: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

Signature representations οf families

• The major collections of signatures in protein are the ΡROSΙTE database and the BLOCKS database derived from it.

• Helicases are proteins that help unwind double-stranded DNΑ so that the DNA can be read for duplication, transcription, recombination, οr repair.

• Α large fraction of the available information on the structure and possible functions of the helicases has been obtained by computer- assisted comparative analysis of their amino acid sequences. This approach has led to the delineation of motifs and patterns that are conserved in different subsets of the helicases.

Page 6: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

Introduction to computing multiple string alignments

•Definition: Given a set of k>2 strings S={S1,S2,..,Sk}, a local multiple alignment of S is obtained by selecting one substring Si’ from each string and then globally aligning those substrings

iS S

Page 7: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

How to score multiple alignments

• Definition: Given a multiple alignment M, the induced pairwise alignment of two strings Si and Sj is obtained from M by removing all rows except the two rows for Si and Sj. That is, the induced alignment is the multiple alignment M restricted to Si and Sj. Any two opposing spaces in that induced alignment can be removed if desired.

• Definition: The score of an induced pairwise alignment is determined using any chosen scoring scheme for two-string alignment in the standard manner.

Page 8: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

Multiple alignment with the sum-of-pairs (SP) objective function

•Definition: The sum of pairs (SP) score of a multiple alignment M is the sum of the scores of pairwise global alignments induced by M.

•The SΡ alignment problem Compute a global multiple alignment M with minimum sιm-of-pairs score.

Page 9: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

An exact solution to the SP alignment problem

•Definition: Let S1, S2 and S3 denote three strings of lengths n1, n2 and n3, respectively, and let D(i,j,k) be the optimal SP score for aligning S1[1..i], S2[1..j] and S3[1..k]. The score for a match, mismatch, or space is specified by the variables smatch, smis, and sspace, respectively.

Page 10: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

Recurrences fοr a nonbοundary cell(i, j)

For i=1 to n1 do

For j=l to n2 do

For k=l to n3 do

beginif (S1(i) = S2(j)) then cij = smatch

else cij = smis;

if (S1(i) = S3(k)) then cik= smatch

else cik = smis;

if (S2(j) = S3(k)) then cjk= smatch

else ιjk := smis;

d1 = D(i-1, j-1, k-1) + cij + cik + cjk;

d2 = D(i-1, j-1,k) + cij + 2*sspace;

d3 = D(i- 1, j, k- 1) + cik + 2xsspace;

d4 = D(i, j- 1,k-1) + cjk + 2*sspace;

d5 = D(i-1, j, k) + 2*sspace;

d6 = D(i, j- 1, k) + 2*sspace;

d7 = D(i, j, k- 1) + 2*sspace;

D(i, j, k) :: Min[d1, d2, d3, d4, d5, d6, d7];end;

Page 11: Multiple Alignment – Υλικό βασισμένο στο κεφάλαιο 14 του βιβλίου: Dan Gusfield, Algorithms on Strings, Trees and Sequences, Cambridge University

A speedup for the exact solution

• Definition: Let d1,2(i,j) be the edit distance between suffixes S1[l..n] and S2[j..n] of strings S1 and S2. Define d1,3(i,k) and d2,3(j,k) analogously.

• Key idea Recall that D(i, j,k) is the optimal SP score for aligning S1[1..i], S2[1.. j],and S3[1 ..k). If D(i, j, k) + d1,2(i, j) + d1,3(i, k) + d2,3( j, k) is greater than z then node (i, j, k) cannot be on any optimal path and so (in a forward computation) D(i, j, k) need not be sent forward to any cell.