Sparse Normalized Local Alignment Nadav Efraty Gad M. Landau



Transcript of Sparse Normalized Local Alignment Nadav Efraty Gad M. Landau

Page 1: Sparse Normalized Local Alignment Nadav Efraty Gad M. Landau

Sparse Normalized Local Alignment

Nadav Efraty, Gad M. Landau

Page 2

      B  E  D  A  B  A  C
   0  0  0  0  0  0  0  0
C  0  0  0  0  0  0  0  1
B  0  1  1  1  1  1  1  1
A  0  1  1  1  2  2  2  2
D  0  1  1  2  2  2  2  2
A  0  1  1  2  3  3  3  3
C  0  1  1  2  3  3  3  4
B  0  1  1  2  3  4  4  4
D  0  1  1  2  3  4  4  4

Background - Global similarity
LCS - Longest Common Subsequence
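The table above is the standard LCS dynamic-programming table; a minimal sketch of its computation (the function name is ours, the strings are the slide's):

```python
def lcs_table(x, y):
    """Fill the (|y|+1) x (|x|+1) LCS table:
    T[i][j] = T[i-1][j-1] + 1 when the characters match,
    otherwise max(T[i-1][j], T[i][j-1])."""
    t = [[0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        for j in range(1, len(x) + 1):
            if y[i - 1] == x[j - 1]:
                t[i][j] = t[i - 1][j - 1] + 1
            else:
                t[i][j] = max(t[i - 1][j], t[i][j - 1])
    return t

# Columns B E D A B A C, rows C B A D A C B D, as in the table above.
t = lcs_table("BEDABAC", "CBADACBD")
print(t[-1][-1])  # LCS length: 4
```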

Page 3

(1977) Hirschberg - Algorithms for the longest common subsequence problem.

(1977) Hunt, Szymanski - A fast algorithm for computing longest common subsequences.

(1987) Apostolico, Guerra - The longest common subsequence problem revisited.

(1992) Eppstein, Galil, Giancarlo, Italiano - Sparse dynamic programming I: linear cost functions.

Background - LCS milestones

Page 4

Global alignment algorithms compute the similarity grade of the entire input strings by computing the best path from the first to the last entry of the table.

Background - Global vs. Local

Local alignment algorithms report the most similar substring pair according to their scoring scheme.

Page 5

T(i,0) = T(0,j) = 0 for all i, j (1 ≤ i ≤ m; 1 ≤ j ≤ n)
T(i,j) = max{0, T(i-1,j-1) + S(Yi,Xj), T(i-1,j) + D(Yi), T(i,j-1) + I(Xj)}

      A    E    C    E    D    A    B    B    C
   0  0    0    0    0    0    0    0    0    0
A  0  1    0.6  0.2  0    0    1    0.6  0.2  0
B  0  0.6  0.7  0.3  0    0    0.6  2    1.6  1.2
C  0  0.2  0.3  1.7  1.3  0.9  0.5  1.6  1.7  2.6
D  0  0    0    1.3  1.4  2.3  1.9  1.5  1.3  2.2

Background – Smith Waterman algorithm (1981)

D(Yi) = I(Xj) = -0.4
S(Yi,Xj) = 1 if Yi = Xj, -0.3 if Yi ≠ Xj
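The recurrence and scoring scheme above can be coded directly; a minimal sketch (the function name is ours, the scoring constants are the slide's):

```python
def smith_waterman(x, y, match=1.0, mismatch=-0.3, indel=-0.4):
    """Smith-Waterman local-alignment table with the slide's scoring:
    S = +1 / -0.3, D = I = -0.4, and every entry clamped at 0."""
    t = [[0.0] * (len(x) + 1) for _ in range(len(y) + 1)]
    for i in range(1, len(y) + 1):
        for j in range(1, len(x) + 1):
            s = match if y[i - 1] == x[j - 1] else mismatch
            t[i][j] = max(0.0,
                          t[i - 1][j - 1] + s,   # substitution / match
                          t[i - 1][j] + indel,   # deletion
                          t[i][j - 1] + indel)   # insertion
    return t

# The strings from the table above: X = AECEDABBC, Y = ABCD.
t = smith_waterman("AECEDABBC", "ABCD")
best = max(v for row in t for v in row)  # highest local score (2.6 in the table)
```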

Page 6

Maximal score vs. maximal degree of similarity: what would reflect a higher similarity level, 71 (score) / 10,000 (symbols) or 70/200?
Mosaic effect - lack of ability to discard poorly conserved intermediate segments.
Shadow effect - short alignments may not be detected because they are overlapped by longer alignments.

70/10,000 vs. 40/100
The sparsity of the essential data is not exploited.

      B  E  D  A  B  A  C
   0  0  0  0  0  0  0  0
C  0  0  0  0  0  0  0  1
B  0  1  1  1  1  1  1  1
A  0  1  1  1  2  2  2  2
D  0  1  1  2  2  2  2  2
A  0  1  1  2  3  3  3  3
C  0  1  1  2  3  3  3  4
B  0  1  1  2  3  4  4  4
D  0  1  1  2  3  4  4  4


The weaknesses of the Smith Waterman algorithm (according to Arslan, Eğecioğlu and Pevzner):

This cannot be fixed by post-processing.

Background – The Smith Waterman algorithm

Page 7

The statistical significance of the local alignment depends on both its score and length.

Thus, the solution for these weaknesses is: normalization.

Instead of maximizing S(X',Y'), maximize S(X',Y')/(|X'|+|Y'|).

Under that scoring scheme, one match is always an optimal alignment. Thus, a minimal length or a minimal score constraint is needed.

Background – Normalized local alignment

Page 8

The algorithm of Arslan, Eğecioğlu and Pevzner (2001) converges to the optimal normalized alignment value through iterations of the Smith Waterman algorithm.

They solve the problem SCORE(X',Y')/(|X'|+|Y'|+L), where L is a constant that controls the amount of normalization.

The ratio between L and |X'|+|Y'| determines the influence of L on the value of the alignment.

The time complexity of their algorithm is O(n² log n).

Background – Normalized sequence alignment

Page 9

Maximize LCS(X',Y')/(|X'|+|Y'|). It can be viewed as a measure of the density of the matches.

A minimal length or score constraint, M, must be enforced; we chose the score constraint (the value of LCS(X',Y')). The value of M is problem related.

Our approach
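This objective can be written down as a brute-force specification (ours, not the paper's algorithm): enumerate every substring pair, score each by LCS over total length, and keep the best value that meets the score constraint M. All names here are illustrative.

```python
def lcs_len(a, b):
    """Plain quadratic LCS length."""
    t = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            t[i][j] = (t[i - 1][j - 1] + 1 if a[i - 1] == b[j - 1]
                       else max(t[i - 1][j], t[i][j - 1]))
    return t[-1][-1]

def normalized_local_lcs(x, y, m):
    """Maximize LCS(x', y') / (|x'| + |y'|) over all substring pairs,
    subject to the score constraint LCS(x', y') >= m.
    A specification only: O(n^4) substring pairs, no sparsity."""
    best = 0.0
    for i in range(len(x)):
        for j in range(i + 1, len(x) + 1):
            for k in range(len(y)):
                for l in range(k + 1, len(y) + 1):
                    s = lcs_len(x[i:j], y[k:l])
                    if s >= m:
                        best = max(best, s / ((j - i) + (l - k)))
    return best
```

For two identical substrings the value reaches 1/2, the "perfect alignment" value defined later in the talk: `normalized_local_lcs("XAB", "AB", 2)` picks x' = y' = "AB" and returns 0.5.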

Page 10

The naïve O(rL loglog n) normalized local LCS algorithm

Page 11

Definitions

A chain is a sequence of matches that is strictly increasing in both components.

The length of a chain from match (i,j) to match (i',j') is i'-i+j'-j.

[Figure: a match (i,j) and a match (i',j') below-right of it in the m-by-n table of Y against X]

A k-chain(i,j) is the shortest chain of k matches starting from (i,j).

The normalized value of k-chain(i,j) is k divided by its length.
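These two definitions translate to one-liners; a small sketch with an invented example chain (the helper names are ours):

```python
def chain_length(chain):
    """Length of a chain of matches, per the definition above:
    i' - i + j' - j between the first and last match."""
    (i, j), (i2, j2) = chain[0], chain[-1]
    return (i2 - i) + (j2 - j)

def normalized_value(chain):
    """k matches divided by the chain's length (meaningful for k >= 2)."""
    return len(chain) / chain_length(chain)

# A hypothetical 3-chain: matches strictly increasing in both components.
c = [(1, 1), (2, 3), (4, 4)]
print(chain_length(c), normalized_value(c))  # 6 0.5
```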

Page 12

The naïve algorithm
For each match (i,j), construct k-chain(i,j) for 1 ≤ k ≤ L (L = LCS(X,Y)).
Computing the best chains starting from each match guarantees that the optimal chain will not be missed.

Examine all the k-chains with k ≥ M of all matches and report either:
- the k-chains with the highest normalized value, or
- the k-chains whose normalized value exceeds a predefined threshold.
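The naïve scheme above can be sketched without any sparsity machinery (names are ours; this scans all match pairs, so it runs in O(r²L) time rather than the O(rL loglog n) claimed later):

```python
def naive_best_chain(x, y, m):
    """For every match (i, j), compute the length of the shortest chain of
    k matches starting there by extending chains of matches strictly
    below-right of it, then report the best k/length with k >= m."""
    matches = [(i, j) for i in range(len(y)) for j in range(len(x))
               if y[i] == x[j]]
    matches.sort(reverse=True)   # process bottom-up, right-to-left
    shortest = {}                # (i, j) -> {k: shortest length of a k-chain}
    for (i, j) in matches:
        best = {1: 0}            # a single match is a 1-chain of length 0
        for (i2, j2), d in shortest.items():
            if i2 > i and j2 > j:
                for k, l in d.items():
                    cand = (i2 - i) + (j2 - j) + l
                    if cand < best.get(k + 1, float("inf")):
                        best[k + 1] = cand
        shortest[(i, j)] = best
    return max((k / l for d in shortest.values()
                for k, l in d.items() if k >= m and l > 0), default=0.0)
```

For example, `naive_best_chain("AXB", "AB", 2)` has matches (0,0) and (1,2) only, so the single 2-chain has length 3 and value 2/3.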

Page 13

Problem: k-chain(i,j) is not necessarily the prefix of (k+1)-chain(i,j).

[Example: X = a b c a d e c f h c against Y = g b f h e c g g g f d e f]

Page 14

Solution: construct (k+1)-chain(i,j) by concatenating (i,j) to k-chain(i',j').


Page 15

Question: How to find the proper k-chain(i',j')?

What if there are two candidates ((i,j) is in the mutual range of two matches)?

If there is only one candidate ((i,j) is in the range of a single match (i',j')), it is clear.


Page 16

Lemma: A mutual range of two matches is owned completely by one of them.


Page 17

We use the lemma in order to maintain L data structures. In the k-th data structure:

All the matches are the heads of k-chains. Each match owns the range to its left.

Computing the (k+1)-chain of a match is done by concatenating it to the owner of the range it is in.


Page 18

Preprocessing: create the list of matches of each row.

Process the matches row by row, from bottom up. For the matches of row i:
Stage 1: Construct k-chains, 1 ≤ k ≤ L.
Stage 2: Update the data structures with the matches of row i and their k-chains. They will be used for the computation of the next rows.

Examine all k-chains of all matches and report the ones with the highest normalized value.

The algorithm

Page 19

Complexity analysis
Preprocessing: O(n log ΣY).

Stage 1: For each of the r matches we construct at most L k-chains, with total complexity of O(rL loglog n), when Johnson trees are used by our data structures.

Stage 2: Each of the r matches is inserted into and extracted from each of the data structures at most once, and the total complexity is again O(rL loglog n).

Page 20

Complexity analysis
Checking all k-chains of all matches and reporting the best alignments consumes O(rL) time.

Total time complexity of this algorithm is O(n log ΣY + rL loglog n).

Space complexity is O(rL + nL): r matches with (at most) L records each, plus the space of L Johnson trees of size n.

Page 21

The O(rM loglog n) normalized local LCS algorithm

Page 22

The O(rM loglog n) normalized local LCS algorithm

The algorithm reports the best possible local alignment (value and substrings).

This section is divided into:
1. Computing the highest normalized value.
2. Constructing the longest optimal alignment.

Page 23

Computing the highest normalized value

Definition: A sub-chain of a k-chain is a path that contains a sequence of x ≤ k consecutive matches of the k-chain. It does not have to start or end at a match.

[Example: a sub-chain within X = a b c a d e c f h c against Y = g b f h e c g]

Page 24

Computing the highest normalized value

Claim: When a k-chain is split into a number of non-overlapping consecutive sub-chains, the value of the k-chain is at most equal to the value of the best sub-chain.

Example: 10/40 = (3+2+3+2)/(14+5+12+9); the best sub-chain, 2/5, is at least the chain's 1/4.
[Second example from the slide: 10/40 split as (5+2+3+1)/(20+8+12+4)]
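The claim is an instance of the mediant inequality: the sum of numerators over the sum of denominators never exceeds the largest individual fraction. Checking it on the slide's first split with exact arithmetic:

```python
from fractions import Fraction

# (matches, length) for each sub-chain of the 10-chain of length 40.
subs = [(3, 14), (2, 5), (3, 12), (2, 9)]
whole = Fraction(sum(k for k, _ in subs), sum(l for _, l in subs))
best = max(Fraction(k, l) for k, l in subs)
print(whole, best)   # 1/4 2/5 -- the whole chain is no better than its best part
assert whole <= best
```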

Page 25

Computing the highest normalized value
Result: Any k-chain with k ≥ M may be split into non-overlapping consecutive sub-chains of M matches, followed by a last sub-chain of up to 2M-1 matches. The normalized value of the best sub-chain will be at least equal to that of the k-chain.

Example (assume M = 3): a 10-chain of value 10/40 split as (3+3+4)/(12+14+14).

Page 26

Computing the highest normalized value
A sub-chain of fewer than M matches may not be reported.

Sub-chains of 2M matches or more can be split into shorter sub-chains of M to 2M-1 matches.

Question: Is it sufficient to construct all the sub-chains of exactly M matches?

4/10 vs. 3/8

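The answer illustrated above is no, and the counterexample is plain arithmetic: a sub-chain of M+1 = 4 matches and length 10 beats the 3-match sub-chains of length 8. A quick exact-arithmetic check:

```python
from fractions import Fraction

# Slide's counterexample: 4 matches over length 10 vs. 3 matches over length 8,
# so sub-chains of exactly M matches do not suffice; up to 2M-1 are needed.
assert Fraction(4, 10) > Fraction(3, 8)   # 2/5 = 0.4 > 0.375
```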

Page 27

Computing the highest normalized value
The algorithm: For each match, construct all the k-chains for k ≤ 2M-1.

The algorithm constructs all these chains, which are, in fact, the sub-chains of all the longer k-chains. A longer chain cannot be better than its best sub-chain.

This algorithm reports the highest normalized value of a sub-chain, which is equal to the highest normalized value of a chain.

Page 28

Constructing the longest optimal alignment

Definition: A perfect alignment is an alignment of two identical strings. Its normalized value is ½.

Lemma: Unless the optimal alignment is perfect, the longest optimal alignment has no more than 2M-1 matches.

Page 29

Constructing the longest optimal alignment
Proof: Assume there is a chain with more than 2M-1 matches whose normalized value is optimal, denoted LO.
LO may be split into a number of sub-chains of M matches, followed by a single sub-chain of between M and 2M-1 matches.
The normalized value of each such sub-chain must be equal to that of LO; otherwise, LO is not optimal.
Each such sub-chain must start and end at a match; otherwise, the chain comprised of the same matches would have a higher normalized value than LO.

Example: trimming the non-match ends gives 10/30, whereas keeping them gives 10/(30+2+3) = 10/35 < 10/30.

Page 30

Constructing the longest optimal alignment

Take a sub-chain of M matches and length S and extend it with the adjacent match of the next sub-chain: its number of matches is M+1 and its length is S+2.

Since M/S < 1/2, M/S < (M+1)/(S+2). Thus, we found a chain of M+1 matches whose normalized value is higher than that of LO, in contradiction to the optimality of LO.

M/S < 1/2  ⇒  M/S < (M+1)/(S+2)

The tails and heads of the sub-chains from which LO is comprised must be next to each other.

[Figure: consecutive sub-chains of value M/S each; joined, 2M/2S = M/S]
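The contradiction rests on the inequality M/S < (M+1)/(S+2), which holds exactly when M/S < 1/2 (cross-multiply: M(S+2) < S(M+1) iff 2M < S). A check with illustrative values of our choosing:

```python
from fractions import Fraction

M, S = 4, 10   # any non-perfect sub-chain: M/S < 1/2 means 2M < S
assert Fraction(M, S) < Fraction(1, 2)
# Extending by the adjacent match (one more match, length +2) improves the value.
assert Fraction(M, S) < Fraction(M + 1, S + 2)
```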

Page 31

Closing remarks

Page 32

The advantages of the new algorithm

Ideal for textual local comparison as well as for screening bio-sequences.

Normalized, and thus does not suffer from the shadow and mosaic effects.

A straightforward approach to the minimal constraint.

Page 33

The advantages of the new algorithm

The minimal constraint is problem related rather than input related.

If we refer to it as a constant, the complexity of the algorithm is O(r loglog n).

Since for textual comparison we can expect r << n², the complexity may be even better than that of the non-normalized local similarity algorithms.

Page 34

The advantages of the new algorithm

The O(rM loglog n) algorithm computes the optimal normalized alignments.

The advantage of the O(rL loglog n) algorithm is that it can report all the long alignments that exceed a predefined value, not only the short optimal alignments.

Page 35

Questions

Page 36