Edit Distance. Γιώργος Πιερράκος, ADS – NTUA, 4 June 2007.

Page 1

Edit Distance

Γιώργος Πιερράκος

ADS – NTUA

4 June 2007

Page 2

Overview

What is edit distance?

Why do we care?

A few interesting properties

Some algorithms from the past for finding exact edit distance

More (complicated) algorithms from today for approximate edit distance

Embedding edit distance into l1

Page 3

What is that?

Suppose we have two strings x,y

e.g. x = kitten

y = sitting

and we want to transform x into y.

We use edit operations:

1. insertions

2. deletions

3. substitutions

Page 4

What is that?

A closer look:

k i t t e n

s i t t i n g

1st step: kitten → sitten (substitution)

2nd step: sitten → sittin (substitution)

3rd step: sittin → sitting (insertion)

Page 5

What is that?

Can we do better?

Answer here is no (obviously)

What about:

x = darladidirladada

y = marmelladara

Tough…

Page 6

Why do we care?

A lot of applications depend on the similarity of two strings

Computational Biology:

…ATGCATACGATCGATT…

…TGCAATGGCTTAGCTA…

Animal species from the same family are bound to have more similar DNA

What about evolutionary biology?

Page 7

Why do we care?

Searching for keywords on the web: when we type “mtallica” we usually mean “metallica”.

Page 8

Definitions

We care about bit-strings only: Σ = {0,1}, so x, y ∈ {0,1}^n

for 1 ≤ i ≤ j ≤ n, x[i..j] denotes the substring of x from position i to position j

x_i denotes the i-th bit of x

Associate operations with string positions:

deleting x_i ↔ i

substituting x_i ↔ i

inserting y_j ↔ position of y_j after the insertion

An alignment τ of x, y is a sequence of edit operations transforming x into y

Page 9

Definitions

Length of an alignment is the number of its edit operations

Optimal alignment is one that uses a minimum number of edit operations

Edit Distance of two strings x, y is the length of their optimal alignment: ED(x,y)

e.g. ED(kitten, sitting) = 3

Hamming Distance of two equal-length strings x, y is the number of positions for which the corresponding symbols are different (x_i ≠ y_i)

e.g. HD(kitten, sittin) = 2
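As a small illustration of the Hamming Distance definition above, here is a minimal Python sketch. The function name hamming_distance is an assumption made for this example and does not come from the slides.

```python
def hamming_distance(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    # Compare the strings position by position and count the mismatches.
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("kitten", "sittin"))  # prints 2
```

The two mismatches are k/s in position 1 and e/i in position 5, matching the slide's example.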

Page 10

Some interesting properties

1. Triangle Inequality: for any three strings x,y,z of arbitrary lengths:

ED(x,y) ≤ ED(x,z) + ED(z,y)

2. Splitting Inequality: let x, y be strings of lengths n and m respectively. For all i,j:

ED(x,y) ≤ ED(x[1..i],y[1..j])+ED(x[i+1..n],y[j+1..m])
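The two inequalities above can be sanity-checked on small inputs. The sketch below is only an illustration: edit_distance is a plain dynamic-programming helper written for this example (an assumption, not code from the slides), and the function names triangle_holds and splitting_holds are hypothetical.

```python
def edit_distance(x, y):
    """Exact edit distance, keeping one row of the DP table at a time."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def triangle_holds(x, y, z):
    """Check ED(x,y) <= ED(x,z) + ED(z,y) for one triple of strings."""
    return edit_distance(x, y) <= edit_distance(x, z) + edit_distance(z, y)


def splitting_holds(x, y):
    """Check the splitting inequality for every split point (i, j)."""
    total = edit_distance(x, y)
    return all(
        total <= edit_distance(x[:i], y[:j]) + edit_distance(x[i:], y[j:])
        for i in range(len(x) + 1)
        for j in range(len(y) + 1)
    )


print(triangle_holds("kitten", "sitting", "mitten"))  # prints True
print(splitting_holds("kitten", "sitting"))           # prints True
```

A check like this cannot prove the inequalities, but it makes the index conventions of the splitting inequality concrete: x[:i] plays the role of x[1..i] and x[i:] the role of x[i+1..n].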

Page 11

Some interesting properties

3. Some simple upper and lower bounds: let x, y be strings of lengths n, m (n ≤ m). Then:

ED(x,y) ≤ m

ED(x,y) ≥ m-n

ED(x,y) = 0 iff x = y

if m = n, ED(x,y) ≤ HD(x,y)

ED(x,y) ≥ number of characters (not counting duplicates) found in x but not in y

Page 12

Some interesting properties

4. Induced alignment property:

ins_τ(i..j) = number of insertions in interval [i..j]

del_τ(i..j) = number of deletions in interval [i..j]

sub_τ(i..j) = number of substitutions in interval [i..j]

Caution: i, j refer to the initial positions of the string (see the string as an indexed array)

sh_τ(i..j) = ins_τ(i..j) - del_τ(i..j)

Intuitively, sh_τ(i..j) is the induced shift on x[i..j]

Define sh_τ(i) = sh_τ(1..i) and sh_τ(0) = 0

ed_τ(i..j) is the size of the induced alignment of τ, i.e. the number of edit operations in τ that belong in [i..j]

Page 13

Some interesting properties

Consider the strings: x = savvato, y = eviva

ed_τ(4..5) = 1 because we insert “i” in pos. 4

             τ1        τ2         τ3
x            savvato   sav-vato   sav-vato
y            eviva--   -eviva--   e-viva--
ed_τ(4..5)   1         1          1

Page 14

Some interesting properties

Induced alignment property: For any alignment τ of x and y, for all i ≤ j,

ed_τ(i..j) ≥ ED(x[i..j], y[i+sh_τ(i-1)..j+sh_τ(j)])

                                         τ1     τ2     τ3
sh_τ(3), sh_τ(5)                         0,0    -1,0   -1,0
x[4..5]                                  va     va     va
y[4+sh_τ(3)..5+sh_τ(5)]                  va     iva    iva
ed_τ(4..5)                               1      1      1
ED(x[4..5], y[4+sh_τ(3)..5+sh_τ(5)])     0      1      1

Page 15

A Dynamic Programming Algorithm

Levenshtein, ~1965 (Levenshtein distance)

the induced alignment property gives a principle of optimality (not exactly)

use dynamic programming to solve the problem (quite similar to the subset sum problem)

Key ideas:

input: strings s, t with lengths n, m respectively

use an (n+1)×(m+1) matrix d

the invariant maintained throughout the algorithm is that we can transform the initial segment s[1..i] into t[1..j] using a minimum of d[i,j] operations

use a traceback algorithm to find the optimal alignment

Page 16

A Dynamic Programming Algorithm

int LevenshteinDistance(char s[1..n], char t[1..m])
    // d[i, j] = minimum number of operations turning s[1..i] into t[1..j]
    int d[0..n, 0..m]
    for i from 0 to n
        d[i, 0] := i
    for j from 1 to m
        d[0, j] := j
    for i from 1 to n
        for j from 1 to m
            if s[i] = t[j] then cost := 0 else cost := 1
            d[i, j] := minimum(
                d[i-1, j] + 1,      // deletion
                d[i, j-1] + 1,      // insertion
                d[i-1, j-1] + cost  // substitution
            )
    return d[n, m]
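For readers who want to run the pseudocode above, here is a direct Python translation. It is a sketch under the obvious assumptions: Python strings are 0-indexed, so s[i] in the pseudocode becomes s[i - 1], and the function name levenshtein_distance is chosen for this example.

```python
def levenshtein_distance(s, t):
    """Edit distance between s and t using the full (n+1) x (m+1) table."""
    n, m = len(s), len(t)
    # d[i][j] = minimum number of operations turning s[:i] into t[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters of s[:i]
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[n][m]


print(levenshtein_distance("kitten", "sitting"))  # prints 3
```

Keeping the whole table is what makes a traceback possible: starting from d[n][m] and repeatedly moving to a neighbour that produced the minimum recovers one optimal alignment.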

Page 17

A Dynamic Programming Algorithm

branches correspond to different optimal alignments

the length of the optimal alignment is 5

the algorithm runs in O(n^2) time (O(nm) for strings of lengths n and m) and it can be improved so as to use O(n) space
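One common way to obtain the linear-space bound mentioned above is to keep only two rows of the table at a time. The sketch below illustrates that idea; whether it is the exact improvement the slide has in mind is an assumption, and the function name levenshtein_two_rows is hypothetical.

```python
def levenshtein_two_rows(s, t):
    """Edit distance using only two rows of the DP table."""
    # Put the shorter string on the row axis so the rows are as short as possible.
    if len(t) > len(s):
        s, t = t, s
    prev = list(range(len(t) + 1))  # row for the empty prefix of s
    for i, a in enumerate(s, start=1):
        curr = [i]  # turning s[:i] into the empty string costs i deletions
        for j, b in enumerate(t, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


print(levenshtein_two_rows("kitten", "sitting"))  # prints 3
```

The trade-off is that the traceback is no longer available, because the rows needed to reconstruct the alignment have been discarded; only the distance itself is returned.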

Page 18

A Dynamic Programming Algorithm

the above algorithm returns the exact value of the edit distance of two strings

there exists a (minor) improvement by Masek and Paterson (1980) that runs in time O(n^2/log n); it is of theoretical interest only

O(n^2) is too much! We want linear time.

Even developing subquadratic-time algorithms for approximating edit distance within a modest factor has proved quite challenging

Page 19

Approximating edit distance efficiently

Suppose we know that our strings are either very similar or very dissimilar, namely that their edit distance is either at most k or greater than l(n,k). Then we can develop a gap algorithm that decides which of the two holds.

We are going to present three linear-time algorithms (Bar-Yossef, Jayram et al., FOCS 2004):

The first one has l = O((kn)^{2/3}) and works for all strings

The second one has l = O(k^2) and works only for strings that are non-repetitive

The third one is an n^{3/7}-approximation algorithm, which improves to n^{1/3} if the strings are non-repetitive

Page 20

Approximating edit distance efficiently

Why do we care about an efficient gap algorithm?

1. Such algorithms yield approximation algorithms that are as efficient, with the approximation ratio directly correlated with the gap

2. They are useful on their own: there exist problems where two strings can be either too similar or too dissimilar

Page 21

Approximating edit distance efficiently

Our model for the first two algorithms is the sketching model:

a two-party public-coin simultaneous-messages communication complexity protocol

parties: Alice, Bob and the Referee

goal: to jointly compute f: A×B→C, when Alice has the input a ∈ A and Bob has the input b ∈ B

Alice uses her input a and the shared random coins to compute a sketch s_A(a) and then sends it to the referee. Bob does the same and sends a sketch s_B(b)

the referee uses the sketches and the shared coins to compute the value of the function f(a,b), with a small (constant) error probability

main measure of complexity: the size of the sketches

usually it is desirable that Alice and Bob are efficient too

Page 22

Approximating edit distance efficiently

Why the sketching model?

Because there already exist sketching algorithms for the Hamming Distance

Idea 1: map the Edit Distance into some form of Hamming Distance and then use the known results for the Hamming Distance

Idea 2: if two strings share a lot of identical substrings at nearby positions they cannot differ too much.

Page 23

Gap algorithm for arbitrary strings

Technique:

1. Map each string to the multiset of all its (overlapping) substrings. Annotate each substring with an “encoding” of its position in the original string.

2. Take the characteristic vectors of the two multisets and run a gap algorithm for their Hamming Distances.

3. Map the results for HD into results for ED

Page 24

Gap algorithm for arbitrary strings

1st step:

two inputs: the size of the input n and the gap parameter k

define a suitable substring length B(n,k) and a suitable window parameter D(n,k)

map: x→T_x and y→T_y. These sets consist of pairs (γ,i), where γ is a length-B substring and i is the “encoding” of the substring's position

Page 25

Gap algorithm for arbitrary strings

Encoding method: round down the starting position of the substring to the nearest multiple of D(n,k)

Example: let x = savvato, B = 2 and D = 3. Then

T_x = {(sa,0),(av,0),(vv,1),(va,1),(at,1),(to,2)}

formally:

T_x = {(x[i..i+B-1], i div D) | i = 1,…,n-B+1}

B = n^{2/3}/(2k^{1/3}) and D = n/B
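The formal definition of the map x→T_x above translates almost literally into code. The following Python sketch reproduces the savvato example; the function name encode_substrings is an assumption made for illustration, and positions are kept 1-indexed as in the slides.

```python
def encode_substrings(x, B, D):
    """Pairs (substring, i div D) for every length-B substring x[i..i+B-1]."""
    n = len(x)
    pairs = []
    for i in range(1, n - B + 2):       # i = 1, ..., n-B+1 (1-indexed)
        gamma = x[i - 1:i - 1 + B]      # the substring x[i..i+B-1]
        pairs.append((gamma, i // D))   # encoding of the position: i div D
    return pairs


print(encode_substrings("savvato", 2, 3))
# prints [('sa', 0), ('av', 0), ('vv', 1), ('va', 1), ('at', 1), ('to', 2)]
```

A list is returned instead of a set so that repeated pairs are preserved, since the slides speak of multisets.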

Page 26

Gap algorithm for arbitrary strings

2nd step:

Take the characteristic vectors u, v of multisets T_x and T_y respectively. To do that, impose a (lexicographical) order on the elements of T_x ∪ T_y

Example: if T_x = {(γ1,i1), (γ2,i2), (γ1,i3)} and T_y = {(γ2,i1), (γ1,i3), (γ4,i5)} then

T_x ∪ T_y = {(γ1,i1), (γ2,i2), (γ1,i3), (γ2,i1), (γ4,i5)}

order: (γ1,i1), (γ1,i3), (γ2,i1), (γ2,i2), (γ4,i5)

Then u = 11010 and v = 01101

But we do not want to sort things (it costs…). So instead of the union T_x ∪ T_y consider the set of all (γj,ij)
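The example with u = 11010 and v = 01101 can be reproduced with a few lines of Python. This is only a sketch: it uses the sorted union (the variant the slide then discards as too costly), it treats the multisets as plain sets, and the labels "g1", "g2", "g4" and the function name characteristic_vectors are stand-ins chosen for this illustration.

```python
def characteristic_vectors(tx, ty):
    """Characteristic vectors of tx and ty over the sorted union of both sets."""
    tx, ty = set(tx), set(ty)
    universe = sorted(tx | ty)  # lexicographic order on (substring, index) pairs
    u = [1 if pair in tx else 0 for pair in universe]
    v = [1 if pair in ty else 0 for pair in universe]
    return u, v


tx = [("g1", 1), ("g2", 2), ("g1", 3)]
ty = [("g2", 1), ("g1", 3), ("g4", 5)]
u, v = characteristic_vectors(tx, ty)
print(u)  # prints [1, 1, 0, 1, 0]
print(v)  # prints [0, 1, 1, 0, 1]
print(sum(a != b for a, b in zip(u, v)))  # prints 4
```

The Hamming Distance of u and v equals the size of the symmetric difference of the two sets, which is why it can be computed without ever materialising the vectors.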

Page 27

Gap algorithm for arbitrary strings

a pair (γ,i) ∈ T_x ∩ T_y is indicative of substrings of x and y that match (i.e. they are identical in terms of contents and appear at nearby positions in x and y) and corresponds to a j such that u_j = v_j

a pair (γ,i) ∈ (T_x \ T_y) ∪ (T_y \ T_x) is indicative of substrings that do not match and corresponds to a j such that u_j ≠ v_j

We now have to estimate the Hamming Distance between u and v

Page 28

Gap algorithm for arbitrary strings

Hamming Distance can be approximated using constant-size sketches as shown by Kushilevitz, Ostrovsky, Rabani (2000)

Problem: the Hamming Distance instance our problem is reduced to is exponentially long (set of all (γj , ij )). As a result the time to produce the sketches is prohibitive (Alice and Bob are very slow…)

Solution: a new, improved sketching method which

1. produces constant-size sketches

2. runs in time proportional to the HD of the instances

3. solves the k vs (1+ε)k gap HD problem for any ε>0

Notice that HD(u,v) = O(n): there are no more than 2n distinct substrings

Page 29

Gap algorithm for arbitrary strings

3rd step:

We tune the sketching algorithm for HD to determine whether HD(u,v) ≤ 4kB or HD(u,v) ≥ 8kB with constant probability of error. The referee, upon receiving the sketches from Alice and Bob, decides that ED(x,y) ≤ k if he finds HD(u,v) ≤ 4kB and ED(x,y) ≥ 13(kn)^{2/3} if he finds HD(u,v) ≥ 8kB.

The algorithm’s correctness follows from:

Lemma 1: If ED(x,y) ≤ k then HD(u,v) ≤ 4kB

Lemma 2: If ED(x,y) ≥ 13(kn)^{2/3} then HD(u,v) ≥ 8kB

Page 30

Gap algorithm for arbitrary strings

set map(string x, int B, int D)
    set T := empty
    // one pair per length-B substring, following T_x = {(x[i..i+B-1], i div D)}
    for i = 1 to n-B+1
        T ← (x[i..i+B-1], i div D)
    return T

int algorithm_1(string x, string y, int k)
    int B = n^{2/3}/(2k^{1/3})
    int D = n/B
    set Tx = map(x, B, D)
    set Ty = map(y, B, D)
    (u,v) = characteristic_vectors(Tx, Ty)
    if HD(u,v) ≤ 4kB then return (ED(x,y) ≤ k)
    if HD(u,v) ≥ 8kB then return (ED(x,y) ≥ 13(kn)^{2/3})

Page 31

Gap algorithm for arbitrary strings

Remarks

In the above algorithm Alice and Bob run the procedure map and the referee runs the procedure HD, which is the sketching algorithm for Hamming Distance

Notice that Alice and Bob do not use their random coins. Only the referee uses randomness and has bounded error probability

The algorithm solves the k vs Ω((kn)^{2/3}) gap problem for all k ≤ n^{1/2}, in quasi-linear time, using sketches of size O(1)

The idea of independently encoding the position of each substring, though simple, has problems. Consider: x = paramana, y = mana, with B = 2, D = 3
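Working the paramana/mana example through makes the problem visible. The sketch below applies the encoding rule i div D from the earlier formal definition of T_x; the helper name encode_substrings is an assumption made for this illustration.

```python
def encode_substrings(x, B, D):
    """Pairs (x[i..i+B-1], i div D) for i = 1..n-B+1, with 1-indexed positions."""
    return [(x[i - 1:i - 1 + B], i // D) for i in range(1, len(x) - B + 2)]


tx = set(encode_substrings("paramana", 2, 3))
ty = set(encode_substrings("mana", 2, 3))

matches = tx & ty      # pairs that agree in both contents and encoding
mismatches = tx ^ ty   # symmetric difference, i.e. HD(u, v)

print(sorted(ty))       # prints [('an', 0), ('ma', 0), ('na', 1)]
print(sorted(matches))  # prints []
print(len(mismatches))  # prints 10
```

Although y is obtained from x by deleting only the four leading characters, the substrings ma, an and na land in different windows in the two strings, so not a single pair matches.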

Page 32

Gap algorithm for nonrepetitive strings

We saw that “encoding” the substrings’ position independently, by choosing an appropriate integer D, can lead to problems: in fact we fail to identify many matches even in the presence of just one edit operation (consider the case of the strings x and 1x)

We overcome this handicap by resorting to a method where the “encodings” are correlated.

Page 33

Gap algorithm for nonrepetitive strings

Idea: we scan the input from left to right, trying to find “anchor” substrings, i.e. identical substrings that occur in x, y at very near positions. All we need to change is the “encoding” of the position: we now map each substring to the region between successive “anchors”

Example: x = ma … ro … as, y = *ma****ro***as**

the substrings starting at a region between successive anchors have the same encoding

Page 34

Gap algorithm for nonrepetitive strings

Why does this idea work?

Remember that we are dealing with a gap algorithm. Hence, if the input strings are very similar, we expect that a sufficiently short substring, chosen from a sufficiently long window, is unlikely to contain edit operations and thus has to be matched with a corresponding substring in y in the same window

And how do Alice and Bob choose identical substrings?

Alice and Bob cannot communicate with each other, so they pick the “anchors” randomly. In fact they use some (shared) random permutations (remember the shared coins)

Page 35

Gap algorithm for nonrepetitive strings

Isn’t this too good to be true?

Yes, it is: in order for the random permutations to ensure that “anchors” are detected with high probability we must demand that the input strings are non-repetitive.

A string is called (t,l)-non-repetitive if for any window of size l, the l substrings of length t, which start inside the window are distinct
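The definition of (t,l)-non-repetitive strings can be checked by brute force. The Python sketch below is an illustration only: the function name is_non_repetitive is hypothetical, and truncating the last windows at the end of the string is an assumption about a boundary case the slide does not spell out.

```python
def is_non_repetitive(x, t, l):
    """True if, in every window of size l, the length-t substrings starting there are distinct."""
    starts = len(x) - t + 1  # number of positions where a length-t substring can start
    # Slide the window over all start positions (at least one window is examined).
    for w in range(max(starts - l + 1, 1)):
        window = [x[p:p + t] for p in range(w, min(w + l, starts))]
        if len(set(window)) != len(window):
            return False  # two substrings in this window coincide
    return True


print(is_non_repetitive("abcdefgh", 2, 4))  # prints True
print(is_non_repetitive("abababab", 2, 4))  # prints False
print(is_non_repetitive("abababab", 2, 2))  # prints True
```

The last two calls show how the property depends on the window size: the periodic string repeats "ab" within any window of size 4, but never within a window of size 2.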

Page 36

Gap algorithm for nonrepetitive strings

set map(string x, randomness r, int t)   // Alice & Bob now use their coins r
    set T := empty
    int c := 1
    int i := 1
    table of anchors a
    table of regions reg
    // pick a sliding window of length W and place it at c
    // (the steps below are repeated as the window slides along x)
    for all length-t substrings starting in the interval [c+W..c+2W-1]
        produce a Karp-Rabin fingerprint using r
    pick a random permutation of the fingerprints using r
    a[i] = the substring with the minimal fingerprint under the permutation
    i := i+1
    c := the first character after the anchor
    for j = 1 to i
        reg[j] = substring starting after the last char of a[j-1] till the last char of a[j]
        T ← (reg[j], j)
    return T

Page 37

Gap algorithm for nonrepetitive strings

int algorithm_2(string x, string y, int k, int t)   // t is the non-repetition parameter
    randomness r
    set Tx = map(x, r, t)
    set Ty = map(y, r, t)
    (u,v) = characteristic_vectors(Tx, Ty)
    if HD(u,v) ≤ 3k then return (ED(x,y) ≤ k)
    if HD(u,v) ≥ 6k then return (ED(x,y) ≥ Ω((tk)^2))

Page 38

Gap algorithm for nonrepetitive strings

The algorithm’s correctness follows from:

Lemma 1: If ED(x,y) ≤ k then Pr[HD(u,v) ≤ 3k] ≥ 5/6

Lemma 2: If HD(u,v) ≤ 6k then ED(x,y) ≤ O((tk)^2)

For any 1 ≤ t < n, the algorithm solves the k vs Ω((tk)^2) gap problem for all k ≤ (n/t)^{1/2}, for all (t,tk)-non-repetitive strings, in polynomial time, using sketches of size O(1).

Page 39

Approximation algorithm

Key idea: use of graphs

Given two strings x, y, the edit graph G_E is a representation of ED(x,y) by means of a directed graph.

The vertices correspond to the edit distances of x[1..i] and y[1..j] for all i, j ≤ n.

An edge between vertices corresponds to a single edit operation transforming one substring to the other.

Page 40

Approximation algorithm

We define the graph G(B) as a (lossy) compression of G_E: each vertex corresponds to a pair (i,s), where i = jB for j = 0..n/B and s = -k..k. The bigger the parameter B is, the lossier the compression.

Each vertex is closely related to the edit distance of x[1..i] and y[1..i+s] (s denotes the amount by which we shift y with respect to x)

We have two types of edges:

a. a-type edges from (i,s’) to (i,s) where |s’-s| = 1

b. b-type edges from (i-B,s) to (i,s) with weight w(e)

w(a-type edges) = 1; w(b-type edges) depends on the approximation factor c

Page 41

Approximation algorithm

In G_E the weight of the shortest source-sink path corresponds to the length of the optimal alignment of x, y.

Theorem 1: Given two strings x, y and their corresponding graph G(B), let the shortest path P from (0,0) to (n,0) have weight T. Then ED(x,y) ≤ T and, if ED(x,y) ≤ k, T ≤ (2c+2)k, where c is the approximation factor associated with the b-type edge weights.

Page 42

Approximation algorithm

Only problem: finding the shortest path.

Dijkstra is too slow (i.e. not linear)

If we could figure out the weights of all b-type edges for a given i simultaneously, perhaps we could solve the problem.

But computing b-type edges is the same as finding the approximate edit distances between x[i+1..i+B] and every substring of y[i+1-k..i+B+k] of length B

Page 43

Approximation algorithm

c(p,t)-edit pattern matching problem: given a pattern string P of length p and a text string T of length t ≥ p, produce numbers d_1, d_2, …, d_{t-p+1}, such that

d_i/c ≤ ED(P, T[i..i+p-1]) ≤ d_i for all i

Theorem 2: Suppose there is an algorithm that can solve the c(p,t)-edit pattern matching problem in time TIME(p,t). Let x, y be two strings and G(B) their corresponding graph. Set p = B, t = B+2k and c = c(p,t). Then the shortest path in G(B) can be used to solve the k vs (2c+2)k edit distance gap problem in time O((k+TIME(p,t))n/B) (i.e. the shortest path can be efficiently computed)
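To make the output format of the edit pattern matching problem concrete, here is a brute-force Python baseline that returns the exact distances, i.e. it solves the problem with c = 1. It is not the fast algorithm Theorem 2 asks for (it runs one quadratic DP per window); the names edit_distance and edit_pattern_matching_exact are assumptions made for this sketch.

```python
def edit_distance(x, y):
    """Exact edit distance, keeping one row of the DP table at a time."""
    prev = list(range(len(y) + 1))
    for i, a in enumerate(x, start=1):
        curr = [i]
        for j, b in enumerate(y, start=1):
            cost = 0 if a == b else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def edit_pattern_matching_exact(P, T):
    """d_i = ED(P, T[i..i+p-1]) for every window of length p = len(P) in T."""
    p = len(P)
    return [edit_distance(P, T[i:i + p]) for i in range(len(T) - p + 1)]


print(edit_pattern_matching_exact("ab", "abab"))  # prints [0, 2, 0]
```

The returned list has t-p+1 entries, one per window, which is exactly the shape d_1, …, d_{t-p+1} in the problem statement.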

Page 44

Approximation algorithm

graph make_graph(string x, string y, int k, int B)   // bigger B → faster algorithm, bigger gap
    vertices V = empty
    edges EA = empty, EB = empty
    for j = 0 to n div B              // vertices
        i = j*B
        for s = -k to k
            V ← (i,s)
    for j = 0 to n div B              // a-type edges
        i = j*B
        for s = -k to k
            if s < k then EA ← ((i,s),(i,s+1),1)    // (source, sink, weight)
            if s > -k then EA ← ((i,s),(i,s-1),1)
    for j = 1 to n div B              // b-type edges
        i = j*B
        for s = -k to k
            EB ← ((i-B,s),(i,s),w)
    return (V, EA ∪ EB)
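The construction of G(B) above can be sketched in Python as follows. Since the slides leave the b-type weight w unspecified at this point, the sketch takes it as a caller-supplied function b_weight(i, s); that parameter, like the function name make_graph taking n directly instead of the strings, is an assumption made for illustration.

```python
def make_graph(n, k, B, b_weight):
    """Vertices (i, s) and weighted edges (source, sink, weight) of G(B)."""
    vertices = [(j * B, s) for j in range(n // B + 1) for s in range(-k, k + 1)]
    edges = []
    for i, s in vertices:
        # a-type edges: change the shift by one within the same column i
        for s2 in (s - 1, s + 1):
            if -k <= s2 <= k:
                edges.append(((i, s), (i, s2), 1))
        # b-type edges: advance one block of length B, keeping the shift s
        if i >= B:
            edges.append(((i - B, s), (i, s), b_weight(i, s)))
    return vertices, edges


vertices, edges = make_graph(4, 1, 2, lambda i, s: 0)
print(len(vertices))  # prints 9
print(len(edges))     # prints 18
```

With n = 4, B = 2 and k = 1 there are three columns of three vertices each, four a-type edges per column and three b-type edges into each of the two later columns, hence 12 + 6 = 18 edges.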

Page 45

Approximation algorithm

(d_1, d_2, …, d_{t-p+1}) epm_algorithm(string P, string T)
    // length(T) ≥ length(P)
    // returns approximate ED(P,S) for all S: length(P)-substrings of T

int fix_weights(string x, string y, int B, int k, graph G, int n)   // why n as input…?
    int d[-k..k]                      // table of b-type edge weights
    d = epm_algorithm(x[n-B+1..n], y[n-B+1-k..n+k])
    if n-B = 0 return d
    for s = -k to k
        EB ← ((n-B,s),(n,s),d[s])
    G’ = update(G, EB)
    return compose(d, fix_weights(x,y,B,k,G’,n-B))   // what is compose…?

Let T(i,s) denote the weight of the shortest path from (0,0) to (i,s)

Notice that fix_weights is recursive: in each call T(i,s) is computed simultaneously (fast!) for all s. The idea is to compute the shortest path T(n,0) by recursively computing the shortest paths T(i,s). T(i,s) uses the result T(i-B,s).

No more than n/B recursive calls of fix_weights are needed

Page 46

Approximation algorithm

int shortest_path(string x, string y, int k, int B)
    graph G = make_graph(x,y,k,B)
    int T = fix_weights(x,y,B,k,G,n)   // argument order follows the signature of fix_weights
    return T

Correctness of the gap algorithm (follows from Theorem 1):

1. If ED ≤ k then T ≤ (2c+2)k

2. If ED ≥ (2c+2)k then T ≥ ED ≥ (2c+2)k

Page 47

Approximation algorithm

Some results which use algorithms for the edit pattern matching problem:

1. quasi-linear time algorithm for k vs O(k^2)

2. quasi-linear time algorithm for k vs O(k^{7/4})

3. quasi-linear time algorithm for k vs O(k^{3/2}) for (k, O(k^{1/2}))-non-repetitive strings

4. results 2 and 3 imply approximation algorithms with factors n^{3/7} and n^{1/3} respectively

Page 48

Embedding edit distance into l1

Question: can {0,1}^d endowed with edit distance be embedded into l1 (R^d endowed with the L1-metric) with low distortion?

Known results up to 2005:

1. Edit distance cannot be embedded into l1 with distortion less than 3/2

2. Edit distance can be trivially embedded into l1 with distortion d

3. Edit distance can be embedded into l1 with distortion 2^{O(√(log d · log log d))} (Ostrovsky, Rabani, STOC 2005)

Why do we care?

1. Because embeddings allow us to use efficient algorithms of one metric space on instances of another metric space

2. Because embeddings help us learn more about the structure of a metric space

Page 49

Embedding edit distance into l1

Let x be a string. Then for any integer s we denote by shifts(x,s) the set

{x[1,|x|-s+1], x[2,|x|-s+2], …, x[s,|x|]}

which consists of the s substrings of x of length |x|-s+1.

Theorem: There exists a universal constant c>0 such that for every integer d>0 there exists an embedding f: ({0,1}^d, ED)→l1 with distortion at most 2^{c·√(log d · log log d)}
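The set shifts(x,s) defined above is easy to compute directly from its definition. The Python sketch below does so; the 0-indexed slicing is the only translation step, and the function name shifts simply mirrors the notation of the slide.

```python
def shifts(x, s):
    """The set {x[1,|x|-s+1], x[2,|x|-s+2], ..., x[s,|x|]} with 1-indexed endpoints."""
    length = len(x) - s + 1  # every element of the set has this length
    # The i-th element starts at 1-indexed position i+1, i.e. 0-indexed position i.
    return {x[i:i + length] for i in range(s)}


print(sorted(shifts("abcde", 2)))  # prints ['abcd', 'bcde']
print(sorted(shifts("abcde", 1)))  # prints ['abcde']
```

Each element is a copy of x with a total of s-1 characters trimmed from its two ends, so the set contains at most s strings, all of the same length.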

Page 50

Embedding edit distance into l1

Key idea of the embedding: for sufficiently small d the distortion is indeed that small, i.e. 2^{√(log d · log log d)}. So it suffices to break down the string into substrings of approximately the same length (±1), the blocks, and recursively embed into l1 some metric spaces of lower dimension. The spaces embedded must be chosen in such a way that the concatenation of their scaled embeddings results in the embedding of the original string.

Page 51

Embedding edit distance into l1

Choosing the metric in the recursive step:

Let x1, x2, …, xn be the blocks. For all i consider the sets shifts(xi,s), where s ranges over the non-negative powers of log d that are below the block length.

Now define the distance between the sets shifts(xi,s) and shifts(xj,s) to be the cost of the minimum cost perfect matching, where the cost of an edge between two elements is their edit distance.

This defines a metric that satisfies the above requirements. Ideally we want to embed it in l1

Page 52

Embedding edit distance into l1

However a good embedding for the metric shifts(xi ,s) is too strong an inductive hypothesis.

Therefore we inductively embed the strings in shifts(xi ,s) into l1 and redefine the edge costs for the minimum cost perfect matching, to be the l1 distances of the embedded strings.

This “inductive” embedding is not one of low distortion, but it also does not need to be one. It just needs to satisfy a weaker property (omitted here) which guarantees the upper bound on the total distortion.

Page 53

The End…