Edit Distance. Γιώργος Πιερράκος, ADS – NTUA, 4 June 2007.

Overview

What is edit distance? Why do we care?

A few interesting properties

Some algorithms from the past for finding exact edit distance

More (complicated) algorithms from today for approximate edit distance

Embedding edit distance into l1

What is that?

Suppose we have two strings x,y

e.g. x = kitten

y = sitting

and we want to transform x into y.

We use edit operations: 1. insertions

2. deletions

3. substitutions

What is that?

A closer look:

k i t t e n

s i t t i n g

1st step: kitten → sitten (substitution)

2nd step: sitten → sittin (substitution)

3rd step: sittin → sitting (insertion)

What is that?

Can we do better?

Answer here is no (obviously)

What about:

x = darladidirladada

y = marmelladara

Tough…

Why do we care?

A lot of applications depend on the similarity of two strings

Computational Biology:

…ATGCATACGATCGATT…

…TGCAATGGCTTAGCTA…

Animal species from the same family are bound to have more similar DNAs

What about evolutionary biology?

Why do we care?

Searching for keywords on the web: by “mtallica” we usually mean “metallica”.

Definitions

We care about bit-strings only: Σ = {0,1}, and our strings x, y belong to {0,1}^n

For 1 ≤ i ≤ j ≤ n we denote by x[i..j] the substring of x from position i to position j; by x_i we denote the i-th bit of x

Associate operations with string positions:

deleting x_i ↔ i

substituting x_i ↔ i

inserting y ↔ position of y after the insertion

Alignment τ of x,y is a sequence of edit operations transforming x into y

Definitions

Length of an alignment is the number of its edit operations

Optimal alignment is one that uses a minimum number of edit operations

Edit Distance of two strings x,y is the length of their optimal alignment: ED(x,y)

e.g. ED(kitten, sitting) = 3

Hamming Distance of two equal length strings x,y is the number of positions for which the corresponding symbols are different (x_i ≠ y_i)

e.g. HD(kitten, sittin) = 2
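The Hamming Distance definition above translates almost literally into code. The following is a minimal Python sketch; the function name `hamming_distance` is chosen here for illustration and does not come from the slides.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions i where x[i] != y[i] (equal-length strings only)."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("kitten", "sittin"))  # 2
```

Unlike edit distance, this compares position by position, so a single insertion at the front of a string can make the Hamming Distance as large as the string length.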

Some interesting properties

1. Triangle Inequality: for any three strings x,y,z of arbitrary lengths:

ED(x,y) ≤ ED(x,z) + ED(z,y)

2. Splitting Inequality: let x, y be strings of lengths n and m respectively. For all i,j:

ED(x,y) ≤ ED(x[1..i],y[1..j])+ED(x[i+1..n],y[j+1..m])


3. Some simple upper and lower bounds: Let x, y be strings of lengths n, m (n ≤ m). Then:

ED(x,y) ≤ m

ED(x,y) ≥ m-n

ED(x,y) = 0 iff x = y

if m = n, ED(x,y) ≤ HD(x,y)

ED(x,y) ≥ number of characters (not counting duplicates) found in x but not in y


4. Induced alignment property:

insτ(i..j) = number of insertions in interval [i..j]

delτ(i..j) = number of deletions in interval [i..j]

subτ(i..j) = number of substitutions in interval [i..j]

Caution: i, j refer to the initial positions of the string (see the string as an indexed array)

shτ(i..j) = insτ(i..j) - delτ(i..j)

Intuitively, shτ(i..j) is the induced shift on x[i..j]

Define shτ(i) = shτ(1..i) and shτ(0) = 0

edτ(i..j) is the size of the induced alignment of τ, i.e. the number of edit operations in τ that belong in [i..j]


Consider the strings x = savvato, y = eviva and three alignments τ1, τ2, τ3:

edτ(4..5) = 1 because we insert “i” in pos. 4

             τ1         τ2          τ3
x            savvato    sav-vato    sav-vato
y            eviva--    -eviva--    e-viva--
edτ(4..5)    1          1           1


Induced alignment property: For any alignment τ of x and y, for all i ≤ j,

edτ(i..j) ≥ ED (x[i..j], y[i+shτ(i-1)..j+ shτ(j)])

                                      τ1     τ2      τ3
shτ(3), shτ(5)                        0,0    -1,0    -1,0
x[4..5]                               va     va      va
y[4+shτ(3)..5+shτ(5)]                 va     iva     iva
edτ(4..5)                             1      1       1
ED(x[4..5], y[4+shτ(3)..5+shτ(5)])    0      1       1

A Dynamic Programming Algorithm

Levenshtein, ~1965 (Levenshtein distance)

induced alignment property: a principle of optimality (not exactly)

use dynamic programming to solve the problem (quite similar to the subset sum problem)

Key ideas:

input: strings s, t with lengths n, m respectively

use an (n+1)x(m+1) matrix d

the invariant maintained throughout the algorithm is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations

use a traceback algorithm to find the optimal alignment


int LevenshteinDistance(char s[1..n], char t[1..m])
    int d[0..n, 0..m]
    for i from 0 to n
        d[i, 0] := i
    for j from 1 to m
        d[0, j] := j
    for i from 1 to n
        for j from 1 to m
            if s[i] = t[j] then cost := 0 else cost := 1
            d[i, j] := minimum(
                d[i-1, j] + 1,       // deletion
                d[i, j-1] + 1,       // insertion
                d[i-1, j-1] + cost   // substitution
            )
    return d[n, m]
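The pseudocode above, together with the traceback step mentioned in the key ideas, can be sketched in Python as follows. The function names `levenshtein_matrix` and `traceback` and the textual form of the reported operations are choices made for this illustration; ties between optimal alignments are broken in favour of the diagonal move, which is only one of several valid conventions.

```python
def levenshtein_matrix(s: str, t: str) -> list[list[int]]:
    """d[i][j] = minimum number of operations turning s[:i] into t[:j]."""
    n, m = len(s), len(t)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(m + 1):
        d[0][j] = j                      # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d


def traceback(s: str, t: str, d: list[list[int]]) -> list[str]:
    """Walk back from d[n][m] to d[0][0] and report one optimal alignment."""
    ops = []
    i, j = len(s), len(t)
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            if s[i - 1] != t[j - 1]:
                ops.append(f"substitute {s[i - 1]}->{t[j - 1]} at {i}")
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            ops.append(f"delete {s[i - 1]} at {i}")
            i -= 1
        else:
            ops.append(f"insert {t[j - 1]} at {j}")
            j -= 1
    return ops[::-1]


d = levenshtein_matrix("kitten", "sitting")
print(d[-1][-1])                          # 3
print(traceback("kitten", "sitting", d))
```

Insertions are reported with their position in the target string, matching the convention "position of y after the insertion" used in the definitions.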


branches correspond to different optimal alignments

the length of the optimal alignment is 5

the algorithm runs in O(n^2) time, and it can be improved so as to use only O(n) space (it suffices to keep two rows of the matrix)
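The space improvement can be illustrated with a short Python sketch that keeps only the previous and the current row of the matrix. The name `edit_distance` is an illustrative choice; note that this variant returns the distance only, since the traceback needs the full matrix.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance using two rows instead of the full matrix."""
    if len(s) < len(t):
        s, t = t, s                       # keep the rows as short as possible
    prev = list(range(len(t) + 1))        # row for the empty prefix of s
    for i, a in enumerate(s, 1):
        cur = [i]                         # turning s[:i] into the empty string
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (a != b)))   # substitution or match
        prev = cur
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```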


the above algorithm returns the exact value of the edit distance of two strings

there exists a (minor) improvement in the algorithm so as to run in time O(n^2/log n) by Masek and Paterson (1980): only theoretical interest

O(n^2) is too much! We want linear time.

Even developing subquadratic time algorithms for approximating edit distance within a modest factor has proved quite challenging

Approximating edit distance efficiently

Suppose we know that our strings are either very similar or very dissimilar, namely that their edit distance is either at most k, or greater than l(n,k). Then we can develop a gap algorithm that decides which of the two holds.

We are going to present three linear algorithms (Bar-Yossef, Jayram et al., FOCS 2004):

The first one has l = O((kn)^{2/3}) and works for all strings

The second one has l = O(k^2) and works only for strings that are non-repetitive

The third one is an n^{3/7}-approximation algorithm, which improves to n^{1/3} if the strings are non-repetitive


Why do we care about an efficient gap algorithm?

1. Such algorithms yield approximation algorithms that are as efficient, with the approximation ratio directly correlated with the gap

2. They are useful on their own: there exist problems where two strings can be either too similar or too dissimilar


Our model for the first two algorithms: the sketching model

two-party public-coin simultaneous messages communication complexity protocol

persons: Alice, Bob and the Referee

goal: to jointly compute f: A×B→C, when Alice has the input a ∈ A and Bob has the input b ∈ B

Alice uses her input a and shared random coins to compute a sketch sA(a) and then sends it to the referee. Bob does the same and sends a sketch sB(b)

the referee uses the sketches and the shared coins to compute the value of the function f(a,b), with a small error probability (constant)

main measure of complexity: the sketches’ size

usually desirable that Alice and Bob are efficient too


Why the sketching model? Because there already exist sketching algorithms for the Hamming Distance

Idea 1: map the Edit Distance into some form of Hamming Distance and then use the known results for the Hamming Distance

Idea 2: if two strings share a lot of identical substrings in near positions they cannot differ too much.

Gap algorithm for arbitrary strings

Algorithm for arbitrary strings Technique:

1. Map each string to the multiset of all its (overlapping) substrings. Annotate each substring with an “encoding” of its position in the original string.

2. Take the characteristic vectors of the two multisets and run a gap algorithm for their Hamming Distances.

3. Map the results for HD into results for ED


1st step:

two inputs: the size of the input n and the gap parameter k

define a suitable substring length B(n,k) and a suitable window parameter D(n,k)

map: x→Tx and y→Ty. These sets consist of pairs (γ,i), where γ is a length-B substring and i is the “encoding” of the substring's position


Encoding method: round down the starting position of the substring to the nearest multiple of D(n,k)

Example: Let x = savvato, B = 2 and D = 3. Then

Tx = {(sa,0),(av,0),(vv,1),(va,1),(at,1),(to,2)}

Formally:

Tx = {(x[i..i+B-1], i div D) | i = 1,…,n-B+1}

B = n^{2/3}/(2k^{1/3}) and D = n/B
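The formal definition of Tx can be checked against the savvato example with a few lines of Python. This is a minimal sketch: the name `substring_map` is chosen for illustration, and the multiset is represented as a plain list so that repeated pairs would be kept.

```python
def substring_map(x: str, B: int, D: int) -> list[tuple[str, int]]:
    """All pairs (x[i..i+B-1], i div D) for the 1-based start positions i."""
    n = len(x)
    return [(x[i - 1:i - 1 + B], i // D) for i in range(1, n - B + 2)]


print(substring_map("savvato", B=2, D=3))
# [('sa', 0), ('av', 0), ('vv', 1), ('va', 1), ('at', 1), ('to', 2)]
```

The start positions are 1-based as in the slides, which is why the slice subtracts 1 while the encoding uses i itself.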


2nd step:

Take the characteristic vectors u, v of multisets Tx and Ty respectively. To do that impose a (lexicographical) order on the elements of Tx ∪ Ty

Example: If Tx = {(γ1, i1), (γ2, i2), (γ1, i3)} and Ty = {(γ2, i1), (γ1, i3), (γ4, i5)} then

Tx ∪ Ty = {(γ1, i1), (γ2, i2), (γ1, i3), (γ2, i1), (γ4, i5)}

order: (γ1, i1), (γ1, i3), (γ2, i1), (γ2, i2), (γ4, i5)

Then u = 11010 and v = 01101

But we do not want to sort things (it costs…). So instead of the union Tx ∪ Ty consider the set of all (γj, ij)


a pair (γ,i) ∈ Tx ∩ Ty is indicative of substrings of x and y that match (i.e. they are identical in terms of contents and appear at nearby positions in x and y) and corresponds to a j such that uj = vj

a pair (γ,i) ∈ (Tx \ Ty) ∪ (Ty \ Tx) is indicative of substrings that do not match and corresponds to a j such that uj ≠ vj

We now have to estimate the Hamming Distance between u and v
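The characteristic vectors from the 2nd step, and the fact that their Hamming Distance equals the number of non-matching pairs, can be reproduced in Python. In this sketch the labels `"g1"`, `"g2"`, `"g4"` stand in for the substrings γ1, γ2, γ4 and the integers for the encodings i1, …, i5; Tx and Ty are treated as plain sets, which is a simplifying assumption when the multisets contain repeated pairs.

```python
def characteristic_vectors(Tx: set, Ty: set) -> tuple[str, str]:
    """Order Tx ∪ Ty lexicographically and mark membership in Tx and in Ty."""
    universe = sorted(Tx | Ty)
    u = "".join("1" if pair in Tx else "0" for pair in universe)
    v = "".join("1" if pair in Ty else "0" for pair in universe)
    return u, v


def hamming_distance(u: str, v: str) -> int:
    return sum(a != b for a, b in zip(u, v))


Tx = {("g1", 1), ("g2", 2), ("g1", 3)}
Ty = {("g2", 1), ("g1", 3), ("g4", 5)}
u, v = characteristic_vectors(Tx, Ty)
print(u, v)                                    # 11010 01101
print(hamming_distance(u, v), len(Tx ^ Ty))    # 4 4
```

The last line shows why no sorting is needed in practice: the Hamming Distance only depends on the size of the symmetric difference of the two sets.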


Hamming Distance can be approximated using constant-size sketches as shown by Kushilevitz, Ostrovsky, Rabani (2000)

Problem: the Hamming Distance instance our problem is reduced to is exponentially long (set of all (γj, ij)). As a result the time to produce the sketches is prohibitive (Alice and Bob are very slow…)

Solution: new improved sketching method which

1. produces constant size sketches

2. runs in time proportional to the HD of the instances

3. solves the k vs (1+ε)k gap HD problem for any ε > 0

Notice that HD(u,v) = O(n): no more than 2n distinct substrings


3rd step:

We tune the sketching algorithm for HD to determine whether HD(u,v) ≤ 4kB or HD(u,v) ≥ 8kB with constant probability of error. The referee, upon receiving the sketches from Alice and Bob, decides that ED(x,y) ≤ k if he finds HD(u,v) ≤ 4kB and ED(x,y) ≥ 13(kn)^{2/3} if he finds HD(u,v) ≥ 8kB.

The algorithm’s correctness follows from:

Lemma 1: If ED(x,y) ≤ k then HD(u,v) ≤ 4kB

Lemma 2: If ED(x,y) ≥ 13(kn)^{2/3} then HD(u,v) ≥ 8kB


set map(string x, int B, int D)
    set T := empty
    for i = 1 to n-B+1                  //every starting position
        T ← (x[i..i+B-1], i div D)      //substring and its encoded position
    return T

int algorithm_1(string x, string y, int k)
    int B = n^{2/3}/(2k^{1/3})
    int D = n/B
    set Tx = map(x, B, D)
    set Ty = map(y, B, D)
    (u,v) = characteristic_vectors(Tx, Ty)
    if HD(u,v) ≤ 4kB then return (ED(x,y) ≤ k)
    if HD(u,v) ≥ 8kB then return (ED(x,y) ≥ 13(kn)^{2/3})


Remarks

In the above algorithm Alice and Bob run the procedure map and the referee runs the procedure HD, which is the sketching algorithm for Hamming Distance

Notice that Alice and Bob do not use their random coins. Only the referee uses randomness and has bounded error probability

The algorithm solves the k vs Ω((kn)^{2/3}) gap problem for all k ≤ n^{1/2}, in quasi-linear time, using sketches of size O(1)

The idea of independently encoding the position of each substring, though simple, has problems. Consider: x = paramana, y = mana, with B = 2, D = 3
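The paramana / mana example can be worked out with a small Python sketch. It assumes the mapping (x[i..i+B-1], i div D) over 1-based positions given in the 1st step; the helper name `substring_map` is illustrative and the multisets are simplified to sets.

```python
def substring_map(x: str, B: int, D: int) -> set[tuple[str, int]]:
    """Pairs (length-B substring, start position div D), 1-based positions."""
    return {(x[i - 1:i - 1 + B], i // D) for i in range(1, len(x) - B + 2)}


Tx = substring_map("paramana", B=2, D=3)
Ty = substring_map("mana", B=2, D=3)

print(sorted(Tx & Ty))   # [] : no pair matches
print(len(Tx ^ Ty))      # 10 : every pair counts as a mismatch
```

Although mana is a suffix of paramana, the four-position shift moves every shared substring (ma, an, na) into a different window, so the two sets have nothing in common.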

Gap algorithm for nonrepetitive strings

We saw that “encoding” the substrings’ position independently, by choosing an appropriate integer D, can lead to problems: in fact we fail to identify many matches even in the presence of just one edit operation (consider the case of the strings x and 1x)

We overcome this handicap by resorting to a method where the “encodings” are correlated.


Idea: we scan the input from left to right, trying to find “anchor” substrings, i.e. identical substrings that occur in x, y at very near positions. All we need to change is the “encoding” of the position: we now map each substring to the region between successive “anchors”

Example: x = ma … ro ... as, y = *ma****ro***as**

the substrings starting at a region between successive anchors have the same encoding


Why does this idea work? Remember that we are dealing with a gap algorithm. Hence, if the input strings are very similar, we expect that a sufficiently short substring, chosen from a sufficiently long window, is unlikely to contain edit operations and thus has to be matched with a corresponding substring in y in the same window

And how do Alice and Bob choose identical substrings? Alice and Bob cannot communicate with each other, so they pick the “anchors” randomly. In fact they use some (shared) random permutations (remember the shared coins)


Isn’t this too good to be true? Yes, it is: in order for the random permutations to ensure that “anchors” are detected with high probability we must demand that the input strings are non-repetitive.

A string is called (t,l)-non-repetitive if for any window of size l, the l substrings of length t, which start inside the window are distinct
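The definition of a (t,l)-non-repetitive string can be tested directly by brute force. The Python sketch below uses the illustrative name `is_non_repetitive`; how windows at the very end of the string are treated is not specified on the slide, so truncating them there is an assumption of this sketch.

```python
def is_non_repetitive(x: str, t: int, l: int) -> bool:
    """True if, in every window of l consecutive start positions,
    the length-t substrings starting there are pairwise distinct."""
    starts = len(x) - t + 1                   # number of valid start positions
    for w in range(max(starts - l + 1, 1)):   # left end of each window
        subs = [x[p:p + t] for p in range(w, min(w + l, starts))]
        if len(set(subs)) != len(subs):
            return False
    return True


print(is_non_repetitive("abcabc", t=2, l=3))   # True
print(is_non_repetitive("abcabc", t=2, l=4))   # False: "ab" starts twice
```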


set map(string x, randomness r, int t) //Alice & Bob now use their coins r
    set T := empty
    int c := 1
    int i := 1
    table of anchors a
    table of regions reg
    //pick a sliding window of length W and place it at c
    while the window [c+W..c+2W-1] fits inside x
        for all length-t substrings starting in the interval [c+W..c+2W-1]
            produce a Karp-Rabin fingerprint using r
        pick a random permutation of the fingerprints using r
        a[i] = the substring with the minimal fingerprint permutation
        i := i+1
        c := the first character after the anchor
    for j = 1 to i-1
        reg[j] = substring starting after last char of a[j-1] till last char of a[j]
        T ← (reg[j], j)
    return T
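The anchor selection step can be sketched in Python under several simplifying assumptions: the shared coins are modelled by a common seed, the Karp-Rabin fingerprint uses a random base modulo the prime 2^61 - 1, and the "random permutation" is approximated by a random linear map a·f + b modulo the same prime. All names (`shared_coins`, `fingerprint`, `pick_anchor`) and the window placement are illustrative and not taken from the slides or the paper.

```python
import random

P = (1 << 61) - 1  # Mersenne prime used as fingerprint modulus (assumed choice)


def shared_coins(seed: int) -> tuple[int, int, int]:
    """Alice and Bob derive the same (base, a, b) from the shared seed."""
    rng = random.Random(seed)
    return rng.randrange(256, P), rng.randrange(1, P), rng.randrange(P)


def fingerprint(s: str, base: int) -> int:
    """Karp-Rabin style polynomial fingerprint of s modulo P."""
    h = 0
    for ch in s:
        h = (h * base + ord(ch)) % P
    return h


def pick_anchor(x: str, start: int, W: int, t: int, coins) -> int:
    """Start position (0-based) of the length-t substring in the window
    [start, start+W) whose permuted fingerprint is minimal."""
    base, a, b = coins
    best_key, best_pos = None, None
    for p in range(start, min(start + W, len(x) - t + 1)):
        key = (a * fingerprint(x[p:p + t], base) + b) % P
        if best_key is None or key < best_key:
            best_key, best_pos = key, p
    return best_pos


coins = shared_coins(2007)
x = "the quick brown fox jumps"
y = "X" + x                                  # one insertion at the front
ax = pick_anchor(x, 4, 8, 3, coins)          # Alice's window
ay = pick_anchor(y, 5, 8, 3, coins)          # window with the same contents
print(x[ax:ax + 3] == y[ay:ay + 3])          # True
```

When the two windows contain the same substrings, both parties pick the same anchor without communicating. In the actual protocol Bob does not know the shift, so the windows only overlap, and the choice agrees only with some probability; this is where non-repetitiveness is needed.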


int algorithm_2(string x, string y, int k, int t) //t is the non-repetition parameter
    randomness r
    set Tx = map(x,r,t)
    set Ty = map(y,r,t)
    (u,v) = characteristic_vectors(Tx, Ty)
    if HD(u,v) ≤ 3k then return (ED(x,y) ≤ k)
    if HD(u,v) ≥ 6k then return (ED(x,y) ≥ Ω((tk)^2))


The algorithm’s correctness follows from:

Lemma 1: If ED(x,y) ≤ k then Pr[HD(u,v) ≤ 3k] ≥ 5/6

Lemma 2: If HD(u,v) ≤ 6k then ED(x,y) ≤ O((tk)^2)

For any 1 ≤ t < n, the algorithm solves the k vs Ω((tk)^2) gap problem for all k ≤ (n/t)^{1/2}, for all (t,tk)-non-repetitive strings, in polynomial time, using sketches of size O(1).

Approximation algorithm

Key idea: use of graphs

Given two strings x,y the edit graph GE is a representation of ED(x,y), by means of a directed graph.

The vertices correspond to the edit distances of x[1..i] and y[1..j] for all i,j ≤ n.

An edge between vertices corresponds to a single edit operation transforming one substring to the other.


We define the graph G(B) as a (lossy) compression of GE: each vertex corresponds to a pair (i,s), where i=jB, for j=0..n/B and s=-k..k. The bigger parameter B is, the lossier the compression.

Each vertex is closely related with the edit distance of x[1..i] and y[1..i+s] (s denotes the amount by which we shift y with respect to x)

We have two types of edges:

a. a-type edges from (i,s’) to (i,s) where |s’-s| = 1

b. b-type edges from (i-B,s) to (i,s) with weight w(e)

w(a-type edges) = 1, w(b-type edges) depends on approximation factor c


In GE the weight of the shortest source-sink path corresponds to the optimal alignment of x, y.

Theorem 1: Given two strings x, y and their corresponding graph G(B), let the shortest path P from (0,0) to (n,0) have weight T. Then ED(x,y) ≤ T and T ≤ (2c+2)k, if ED(x,y) ≤k, where c is the approximation factor affiliated with b-type edge weights.


Only problem: finding the shortest path.

Dijkstra is too slow (i.e. not linear)

If we could figure out the weights of all b-type edges for a given i simultaneously, perhaps we could solve the problem.

But computing b-type edges is the same as finding the approximate edit distances between x[i+1..i+B] and every substring of y[i+1-k..i+B+k] of length B


c(p,t)-edit pattern matching problem: given a pattern string P of length p and a text string T of length t ≥ p, produce numbers d_1, d_2, …, d_{t-p+1}, such that

d_i/c ≤ ED(P, T[i..i+p-1]) ≤ d_i for all i

Theorem 2: Suppose there is an algorithm that can solve the c(p,t)-edit pattern matching problem in time TIME(p,t). Let x, y be two strings and G(B) their corresponding graph. Set p = B, t = B+2k and c = c(p,t). Then the shortest path in G(B) can be used to solve the k vs (2c+2)k edit distance gap problem in time O((k+TIME(p,t))n/B) (i.e. the shortest path can be efficiently computed)
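To make the edit pattern matching problem concrete, here is a naive Python baseline that solves it exactly, i.e. with c = 1, by running the dynamic program once per window. The names `edit_distance` and `exact_pattern_distances` are illustrative. Its running time is roughly p^2 per window, so it is not one of the fast algorithms that Theorem 2 relies on; it only shows what the numbers d_i are.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def exact_pattern_distances(P: str, T: str) -> list[int]:
    """d_i = ED(P, T[i..i+p-1]) for every window of length p = |P| in T."""
    p = len(P)
    return [edit_distance(P, T[i:i + p]) for i in range(len(T) - p + 1)]


print(exact_pattern_distances("ab", "xabab"))  # [2, 0, 2, 0]
```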


graph make_graph(string x, string y, int k, int B) //bigger B → faster algorithm, bigger gap
    vertices V = empty
    edges EA = empty, EB = empty
    for j = 0 to n div B //vertices
        i = j*B
        for s = -k to k
            V ← (i,s)
    for j = 0 to n div B //a-type edges
        i = j*B
        for s = -k to k
            if s < k then EA ← ((i,s),(i,s+1),1) //(source, sink, weight)
            if s > -k then EA ← ((i,s),(i,s-1),1)
    for j = 1 to n div B //b-type edges
        i = j*B
        for s = -k to k
            EB ← ((i-B,s),(i,s),w)
    return (V, EA ∪ EB)


(d_1, d_2, …, d_{t-p+1}) epm_algorithm(string P, string T)
    //length(T) ≥ length(P)
    //returns approximate ED(P,S) for all S: length(P)-substrings of T

int fix_weights(string x, string y, int B, int k, graph G, int n) //why n as input…?
    int d[-k..k] //table of b-type edge weights
    d = epm_algorithm(x[n-B+1..n], y[n-B+1-k..n+k])
    if n-B = 0 return d
    for s = -k to k
        EB ← ((n-B,s),(n,s),d[s])
    G’ = update(G, EB)
    return compose(d, fix_weights(x,y,B,k,G’,n-B)) //what is compose…?

Let T(i,s) denote the shortest path from (0,0) to (i,s)

Notice that fix_weights is recursive: in each call the T(i,s) is simultaneously (fast!) computed for all s.

The idea is to compute the shortest path T(n,0) by recursively computing the shortest paths T(i,s). T(i,s) uses the result T(i-B,s).

No more than n/B recursive calls of fix_weights are needed


int shortest_path(string x, string y, int k, int B)
    graph G = make_graph(x,y,k,B)
    int T = fix_weights(x,y,B,k,G,n)
    return T

Correctness of the gap algorithm (follows from Theorem 1):

1. If ED ≤ k then T ≤ (2c+2)k

2. If ED ≥ (2c+2)k then T ≥ ED ≥ (2c+2)k
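The shortest path in G(B) can be computed layer by layer, which the following Python sketch illustrates. It rests on assumptions that the slides leave open: both strings have the same length n, n is a multiple of B, each b-type edge from (i-B,s) to (i,s) is weighted with the exact edit distance between x[i-B+1..i] and the block of y shifted by s (so c = 1), and blocks of y are clipped at the ends of the string. The name `gap_graph_distance` is illustrative. Because the weights are computed exactly and naively, this sketch is quadratic per block and only demonstrates the structure, not the running time of Theorem 2.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def relax_a_edges(dist: list) -> list:
    """a-type edges (weight 1) link neighbouring shifts in one layer, so a
    left-to-right and a right-to-left pass settle all distances in the layer."""
    for idx in range(1, len(dist)):
        dist[idx] = min(dist[idx], dist[idx - 1] + 1)
    for idx in range(len(dist) - 2, -1, -1):
        dist[idx] = min(dist[idx], dist[idx + 1] + 1)
    return dist


def gap_graph_distance(x: str, y: str, k: int, B: int) -> int:
    """Weight T of the shortest path from (0,0) to (n,0) in G(B)."""
    n = len(x)
    if len(y) != n or n % B != 0:
        raise ValueError("sketch assumes |x| = |y| = n and B dividing n")
    dist = [float("inf")] * (2 * k + 1)
    dist[k] = 0                                   # source vertex (0, 0)
    dist = relax_a_edges(dist)
    for i in range(B, n + 1, B):                  # layers i = B, 2B, ..., n
        nxt = []
        for idx, s in enumerate(range(-k, k + 1)):
            lo = min(n, max(0, i - B + s))        # clipped block of y
            hi = min(n, max(0, i + s))
            w = edit_distance(x[i - B:i], y[lo:hi])   # b-type edge weight
            nxt.append(dist[idx] + w)
        dist = relax_a_edges(nxt)
    return dist[k]                                # sink vertex (n, 0)


x, y = "abcdefgh", "bcdefgha"
print(edit_distance(x, y), gap_graph_distance(x, y, k=1, B=4))  # 2 3
```

In this example ED(x,y) = 2 and the path weight is T = 3, found by shifting to s = -1, following both b-type edges, and shifting back, so ED(x,y) ≤ T holds as in Theorem 1.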


Some results which use algorithms for the edit pattern matching problem:

1. quasi-linear time algorithm for k vs O(k^2)

2. quasi-linear time algorithm for k vs O(k^{7/4})

3. quasi-linear time algorithm for k vs O(k^{3/2}) for (k, O(k^{1/2}))-non-repetitive strings

4. 2 & 3 imply approximation algorithms with factors n^{3/7} and n^{1/3} respectively

Embedding edit distance into l1

Question: can {0,1}^d endowed with edit distance be embedded into l1 (R^d endowed with the L1 metric) with low distortion?

Known results up to 2005:

1. Edit distance cannot be embedded into l1 with distortion less than 3/2

2. Edit distance can be trivially embedded into l1 with distortion d

3. Edit distance can be embedded into l1 with distortion 2^{O(\sqrt{\log d \log\log d})} (Ostrovsky, Rabani, STOC 2005)

Why do we care?

1. Because embeddings allow us to use efficient algorithms of one metric space on instances of another metric space

2. Because embeddings help us learn more about the structure of a metric space


Let x be a string. Then for any integer s we denote by shifts(x,s) the set

{x[1,|x|-s+1], x[2,|x|-s+2], …, x[s,|x|]}

which consists of the s substrings of x that have length |x|-s+1.

Theorem: There exists a universal constant c > 0 such that for every integer d > 0 there exists an embedding f: ({0,1}^d, ED) → l1 with distortion at most 2^{c\sqrt{\log d \log\log d}}


Key idea of the embedding: for sufficiently small d the distortion is indeed that small. So it suffices to break down the string into 2^{\sqrt{\log d \log\log d}} substrings of approximately the same length (±1), the blocks, and recursively embed into l1 some metric spaces of lower dimension. The spaces embedded must be chosen in such a way that the concatenation of their scaled embeddings results in the embedding of the original string.


Choosing the metric in the recursive step:

Let x1, x2, …, xn be the blocks. For all i consider the sets shifts(xi, s) where s ranges over the non-negative powers of log d that are below the block length.

Now define the distance between the sets shifts(xi, s) and shifts(xj, s) to be the minimum cost perfect matching, where the cost of an edge between 2 elements is their edit distance.

This defines a metric that satisfies the above requirements. Ideally we want to embed it in l1
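The distance between two shift sets can be illustrated on tiny inputs with a brute-force Python sketch. It reads shifts(x,s) as the listed set {x[1,|x|-s+1], …, x[s,|x|]}, i.e. s substrings of length |x|-s+1, and finds the minimum cost perfect matching by trying every permutation, which is only feasible for very small s. The names `shifts` and `matching_distance` and the example strings are illustrative; the choice of s as a power of log d is not modelled.

```python
from itertools import permutations


def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance, two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, 1):
        cur = [i]
        for j, b in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]


def shifts(x: str, s: int) -> list[str]:
    """The s substrings x[1,|x|-s+1], ..., x[s,|x|] (1-based, inclusive)."""
    m = len(x) - s + 1
    return [x[i:i + m] for i in range(s)]


def matching_distance(xi: str, xj: str, s: int) -> int:
    """Minimum cost perfect matching between shifts(xi,s) and shifts(xj,s),
    with edit distance as edge cost (brute force over all matchings)."""
    A, B = shifts(xi, s), shifts(xj, s)
    cost = [[edit_distance(a, b) for b in B] for a in A]
    return min(sum(cost[i][p[i]] for i in range(s))
               for p in permutations(range(s)))


print(shifts("abcd", 2))                      # ['abc', 'bcd']
print(matching_distance("abcd", "bcda", 2))   # 3
```

In the example the cheapest matching pairs bcd with bcd (cost 0) and abc with cda (cost 3), which beats the position-by-position pairing of cost 4.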


However a good embedding for the metric shifts(xi ,s) is too strong an inductive hypothesis.

Therefore we inductively embed the strings in shifts(xi ,s) into l1 and redefine the edge costs for the minimum cost perfect matching, to be the l1 distances of the embedded strings.

This “inductive” embedding is not one of low distortion, but it also does not need to be one. It just needs to satisfy a weaker property (omitted here) which guarantees the upper bound on the total distortion.

The End…
