Approximating Edit Distance in Near-Linear Time
Approximating Edit Distance in Near-Linear Time
Alexandr Andoni (MIT)
Joint work with Krzysztof Onak (MIT)
Edit Distance
For two strings x, y ∈ Σ^n:
ed(x,y) = minimum number of edit operations to transform x into y
Edit operations = insertion / deletion / substitution
Important in: computational biology, text processing, etc.
Example:
ed(0101010, 1010101) = 2
Computing Edit Distance
Problem: compute ed(x,y) for given x, y ∈ {0,1}^n
Exactly: O(n^2) [Levenshtein’65]; O(n^2 / log^2 n) for |Σ| = O(1) [Masek-Paterson’80]
Approximately in n^{1+o(1)} time: n^{1/3+o(1)} approximation [Batu-Ergun-Sahinalp’06], improving over [Sahinalp-Vishkin’96, Cole-Hariharan’02, BarYossef-Jayram-Krauthgamer-Kumar’04]
Sublinear time: distinguish ed ≤ n^{1-ε} vs ed ≥ n/100 in n^{1-2ε} time [Batu-Ergun-Kilian-Magen-Raskhodnikova-Rubinfeld-Sami’03]
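For reference, the exact O(n^2) algorithm cited above is the classic dynamic program; a standard implementation (ours, not from the talk) using two rolling rows:

```python
# Standard O(n^2) dynamic program for edit distance [Levenshtein'65],
# kept to two rolling rows of the DP table.
def edit_distance(x: str, y: str) -> int:
    prev = list(range(len(y) + 1))          # prev[j] = ed("", y[:j]) = j
    for i in range(1, len(x) + 1):
        cur = [i] + [0] * len(y)            # cur[0] = ed(x[:i], "") = i
        for j in range(1, len(y) + 1):
            cur[j] = min(
                prev[j] + 1,                            # deletion
                cur[j - 1] + 1,                         # insertion
                prev[j - 1] + (x[i - 1] != y[j - 1]),   # substitution/match
            )
        prev = cur
    return prev[len(y)]
```

On the example from the previous slide, edit_distance("0101010", "1010101") returns 2.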
Computing via embedding into ℓ1
Embedding: f: {0,1}^n → ℓ1
such that ed(x,y) ≈ ||f(x) − f(y)||_1 up to some distortion (= approximation factor)
Can then compute ed(x,y) in the time needed to compute f(x), f(y)
Best embedding, by [Ostrovsky-Rabani’05]: distortion = 2^{O(√log n)}
Computation time: ~n^2, randomized (and similar dimension)
Helps for nearest neighbor search and sketching, but not for fast computation…
Our result
Theorem: can compute ed(x,y) in n·2^{O(√log n)} time with 2^{O(√log n)} approximation
While it uses some ideas of the [OR’05] embedding, our algorithm does not compute the [OR’05] embedding itself
Review of Ostrovsky-Rabani embedding
φ_m = embedding of strings of length m; δ(m) = distortion of φ_m
The embedding is recursive: partition the string into b blocks (b is later chosen to be exp(√log m)); use embeddings φ_k for k ≤ m/b
Embed each block separately as follows…
(figure: string X of length m split into b blocks of length m/b)
Ostrovsky-Rabani embedding (II)
E_i^s = recursive embedding of the length-s substrings of block i (for each s ∈ S; figure shows E_1^s, E_2^s, E_3^s, …, E_b^s over the blocks of X)
Want to approximate ed(x,y) by
∑_{i=1..b} ∑_{s∈S} TEMD_s(E_i^s(x), E_i^s(y))
where EMD(A,B) = min-cost bipartite matching, and TEMD = thresholded EMD
Finish by embedding TEMD into ℓ1 with small distortion
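To make the objects concrete, EMD and its thresholded variant can be computed by brute force for tiny sets (an illustrative sketch, ours; the threshold parameter T and function names are assumptions, not the talk's notation):

```python
from itertools import permutations

def l1(a, b):
    return sum(abs(u - v) for u, v in zip(a, b))

def emd(A, B):
    # Min-cost perfect matching between two equal-size lists of vectors,
    # by brute force over all matchings (fine only for tiny examples).
    assert len(A) == len(B)
    return min(sum(l1(a, b) for a, b in zip(A, perm))
               for perm in permutations(B))

def temd(A, B, T):
    # Thresholded EMD: every pairwise cost is capped at T before matching.
    assert len(A) == len(B)
    return min(sum(min(l1(a, b), T) for a, b in zip(A, perm))
               for perm in permutations(B))
```

For example, emd([(0,), (2,)], [(1,), (3,)]) matches 0↔1 and 2↔3 for total cost 2.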
Distortion of [OR] embedding
Suppose we can embed TEMD into ℓ1 with distortion (log m)^{O(1)}
Then [Ostrovsky-Rabani’05] show that the distortion of φ_m satisfies δ(m) ≤ (log m)^{O(1)} · [δ(m/b) + b]
For b = exp[√log m], this gives δ(m) ≤ exp[O(√log m)]
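A quick numeric sanity check (ours, using natural logarithms) of why b = exp(√log m) keeps the recursion shallow: each level shrinks log m by √log m, so only ~√log n levels occur, and the per-level polylog factors accumulate to exp[O(√log n)].

```python
import math

def recursion_depth(n: float) -> int:
    # One recursion level replaces length m by m/b with b = exp(sqrt(log m)),
    # i.e. log m decreases by sqrt(log m); count levels until m = O(1).
    L = math.log(n)
    depth = 0
    while L > 1:
        L -= math.sqrt(L)
        depth += 1
    return depth
```

For n = 2^64 this yields a depth of roughly 2·√log n ≈ 13, matching the ~√log n behavior.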
Why it is expensive to compute [OR] embedding
In the first step, need to compute the recursive embedding for ~n/b substrings of length ~n/b each
The dimension blows up
Our Algorithm
For each length m in some fixed set L ⊆ [n], compute vectors v_i^m ∈ ℓ1 such that
||v_i^m − v_j^m||_1 ≈ ed( z[i:i+m], z[j:j+m] )
up to distortion δ(m)
Dimension of v_i^m is only O(log^2 n)
Vectors v_i^m are computed inductively from the v_i^k for k ≤ m/b (k ∈ L)
Output: ed(x,y) ≈ ||v_1^{n/2} − v_{n/2+1}^{n/2}||_1 (i.e., for m = n/2 = |x| = |y|)
(figure: z = concatenation of x and y; z[i:i+m] is the length-m substring of z starting at position i)
Idea: intuition
For each m ∈ L, compute φ_m(z[i:i+m]) as in the O-R recursive step, except we use the vectors v_i^k, k ≤ m/b & k ∈ L, in place of recursive embeddings of shorter substrings (the sets E_i^s)
The resulting φ_m(z[i:i+m]) have high dimension, > m/b…
So apply Bourgain’s Lemma to the vectors φ_m(z[i:i+m]), i = 1..n−m
[Bourgain]: given n vectors q_i, can construct n vectors q̃_i of O(log^2 n) dimension such that ||q̃_i − q̃_j||_1 ≈ ||q_i − q_j||_1 up to O(log n) distortion
Applying this to the vectors φ_m(z[i:i+m]) gives vectors v_i^m of polylogarithmic dimension
This incurs O(log n) distortion at each step of the recursion, but that is OK: there are only ~√log n steps, giving an additional distortion of only exp[O(√log n)]
||v_i^m − v_j^m||_1 ≈ ed( z[i:i+m], z[j:j+m] )
Idea: implementation
The essential step is:
Main Lemma: fix n vectors v_i ∈ ℓ1, of dimension p = O(log^2 n). Let s < n. Define A_i = {v_i, v_{i+1}, …, v_{i+s-1}}. Then we can compute vectors q_i ∈ ℓ1^k for k = O(log^2 n) such that ||q_i − q_j||_1 ≈ TEMD(A_i, A_j) up to distortion log^{O(1)} n
Computing the q_i’s takes O(n) time
Proof of Main Lemma
Graph-metric: shortest-path metric on a weighted graph
Sparse: O(n) edges; “low” = log^{O(1)} n
min_k M is a semi-metric on M^k with “distance”
d_{min,M}(x,y) = min_{i=1..k} d_M(x_i, y_i)
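The min-product “distance” just defined is a one-liner (illustrative sketch, ours):

```python
def d_min(x, y, d):
    # Min-product semi-metric on M^k: the "distance" between tuples x, y is
    # the smallest coordinate-wise distance under the base metric d.
    # Not a true metric: the triangle inequality can fail, which is
    # tolerated here as long as it stays close to a genuine metric.
    return min(d(xi, yi) for xi, yi in zip(x, y))
```

E.g. with M = (ℝ, |·|), d_min((1, 10), (5, 11)) = min(4, 1) = 1.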
The proof is a chain of embeddings:
TEMD over n sets A_i
→ min_low ℓ1^high   (distortion O(log^2 n))
→ min_low ℓ1^low    (distortion O(1))
→ min_low tree-metric   (distortion O(log n))
→ sparse graph-metric   (distortion preserved; accumulated D = O(log^3 n))
→ ℓ1^low   (distortion O(log n), [Bourgain], efficient)
Step 1
Lemma 1: can embed TEMD over n sets in ({0..M}^p, ℓ1) into min_{O(log n)} ℓ1^{M^p} with O(log^2 n) distortion, w.h.p.
Use [A-Indyk-Krauthgamer’08] (similar to the Ostrovsky-Rabani embedding)
Embedding: for each Δ = power of 2, impose a randomly-shifted grid; one coordinate per cell, equal to the # of points in the cell
Theorem [AIK]: no contraction w.h.p.; expected expansion = O(log^2 n)
Just repeat O(log n) times
TEMD over n sets A_i → min_low ℓ1^high   (distortion O(log^2 n))
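A minimal sketch (ours) of the randomly-shifted-grid idea above: one coordinate per (scale, cell) counting the set's points in that cell. (The function names are assumptions; real constructions such as [AIK] also weight each scale's coordinates, e.g. proportionally to Δ.)

```python
import random
from collections import Counter

def make_grids(dim, max_coord, rng):
    # One randomly shifted grid per scale Delta = 1, 2, 4, ...; the same
    # shifts must be reused for every set being embedded.
    grids, delta = [], 1
    while delta <= 2 * max_coord:
        grids.append((delta, [rng.uniform(0, delta) for _ in range(dim)]))
        delta *= 2
    return grids

def grid_embedding(points, grids):
    # One coordinate per (scale, cell): the number of the set's points
    # falling in that cell, stored sparsely as a Counter.
    emb = Counter()
    for delta, shift in grids:
        for p in points:
            cell = tuple(int((c + s) // delta) for c, s in zip(p, shift))
            emb[(delta, cell)] += 1
    return emb

def l1_sparse(u, v):
    return sum(abs(u.get(k, 0) - v.get(k, 0)) for k in set(u) | set(v))
```

Two identical sets embed identically, and the ℓ1 distance between the count vectors grows as the sets drift apart across scales.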
Step 2
Lemma 2: can embed an n-point set from ℓ1^M into min_{O(log n)} ℓ1^k, for k = O(log^3 n), with O(1) distortion
Use (weak) dimensionality reduction in ℓ1
Thm [Indyk’06]: let A be a matrix of size M by k = O(log^3 n) with each entry chosen from the Cauchy distribution. Then for x̃ = Ax, ỹ = Ay:
no contraction: ||x̃ − ỹ||_1 ≥ ||x − y||_1 (w.h.p.)
5-expansion: ||x̃ − ỹ||_1 ≤ 5·||x − y||_1 (with 0.01 probability)
Just use O(log n) such embeddings: the min-product keeps the one with small expansion
min_low ℓ1^high → min_low ℓ1^low   (distortion O(1))
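A sketch (ours) of the Cauchy-projection idea: Cauchy entries are drawn via the standard inverse-CDF formula tan(π(U − 1/2)), and the minimum over independent sketches plays the role of the min-product.

```python
import math
import random

def cauchy_matrix(rows, cols, rng):
    # Standard Cauchy samples via the inverse CDF: tan(pi * (U - 1/2)).
    return [[math.tan(math.pi * (rng.random() - 0.5)) for _ in range(cols)]
            for _ in range(rows)]

def sketch(x, A):
    # Linear sketch x -> Ax (A has `rows` output coordinates).
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

def min_l1(sketches_x, sketches_y):
    # Min-product over independent sketches: the minimum discards the
    # heavy-tailed overestimates, while no-contraction holds w.h.p. for all.
    return min(sum(abs(u - v) for u, v in zip(sx, sy))
               for sx, sy in zip(sketches_x, sketches_y))
```

Since each repetition achieves 5-expansion with constant probability, O(log n) repetitions make the minimum well-behaved with high probability.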
Efficiency of Step 1+2
From Steps 1+2, we get some embedding f() of the sets A_i = {v_i, v_{i+1}, …, v_{i+s-1}} into min_low ℓ1^low
Naively it would take Ω(n·s) = Ω(n^2) time to compute all f(A_i)
More efficiently: note that f() is linear: f(A) = ∑_{a∈A} f(a)
Then f(A_i) = f(A_{i-1}) − f(v_{i-1}) + f(v_{i+s-1})
Compute the f(A_i) in order, for a total of O(n) time
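The incremental computation above can be written directly (illustrative sketch, ours; f is any set-linear map given by its action on single vectors):

```python
def sliding_window_images(vs, s, f):
    # f is linear over sets, f(A) = sum of f(a) for a in A.  Compute f(A_i)
    # for all windows A_i = {v_i, ..., v_{i+s-1}} with O(n) evaluations of f
    # instead of O(n*s), via f(A_i) = f(A_{i-1}) - f(v_{i-1}) + f(v_{i+s-1}).
    images = [f(v) for v in vs]
    cur = [0] * len(images[0])
    out = []
    for i in range(len(vs) - s + 1):
        if i == 0:
            for img in images[:s]:
                cur = [c + g for c, g in zip(cur, img)]
        else:
            cur = [c - a + b for c, a, b in
                   zip(cur, images[i - 1], images[i + s - 1])]
        out.append(list(cur))
    return out
```

With f = identity on 1-dimensional vectors and s = 2, the windows of [1, 2, 3, 4] yield [3], [5], [7].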
Step 3
Lemma 3: can embed ℓ1 over {0..M}^p into min_{O(log^2 n)} tree-metric, with O(log n) distortion
For each Δ = a power of 2, take O(log n) random grids; each grid gives one min-coordinate
min_low ℓ1^low → min_low tree-metric   (distortion O(log n))
Step 4
Lemma 4: suppose we have n points in min_low tree-metric which approximate a metric up to distortion D. Then we can embed them into a graph-metric of size O(n) with distortion D
min_low tree-metric → sparse graph-metric   (accumulated distortion so far: D = O(log^3 n))
Step 5
Lemma 5: given a graph with m edges, can embed the graph-metric into ℓ1^low with O(log n) distortion, in O(m) time
Just implement Bourgain’s embedding:
Choose O(log^2 n) sets B_i
Need the distance from each node to each B_i
For each B_i, compute its distance to every node using Dijkstra’s algorithm in O(m) time
sparse graph-metric → ℓ1^low   (distortion O(log n))
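Step 5 can be sketched as follows (ours; set sizes and repetition counts are simplified relative to Bourgain's actual O(log^2 n) sets, and distances to each B_i come from one multi-source Dijkstra pass):

```python
import heapq
import random

def multi_source_dijkstra(adj, sources):
    # Distance from every node to the nearest node of `sources`, on a
    # weighted graph given as adjacency lists {v: [(neighbor, weight), ...]}.
    dist = {v: float("inf") for v in adj}
    pq = [(0, s) for s in sources]
    heapq.heapify(pq)
    for s in sources:
        dist[s] = 0
    while pq:
        d, v = heapq.heappop(pq)
        if d > dist[v]:
            continue
        for w, weight in adj[v]:
            if d + weight < dist[w]:
                dist[w] = d + weight
                heapq.heappush(pq, (d + weight, w))
    return dist

def bourgain_embedding(adj, rng, reps=2):
    # Bourgain-style embedding: coordinates are distances to random node
    # sets of sizes 1, 2, 4, ...; each set costs one Dijkstra pass over
    # the sparse graph, so the whole embedding is near-linear in edges.
    nodes = list(adj)
    emb = {v: [] for v in nodes}
    size = 1
    while size <= len(nodes):
        for _ in range(reps):
            dist = multi_source_dijkstra(adj, rng.sample(nodes, size))
            for v in nodes:
                emb[v].append(dist[v])
        size *= 2
    return emb
```

Each coordinate v ↦ d(v, B_i) is 1-Lipschitz, so the embedding never stretches distances beyond the number of coordinates; the O(log n) lower-bound side is where Bourgain's choice of set sizes matters.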
Summary of Main Lemma
The min-product helps to get low dimension (~ a small-size sketch): it bypasses the impossibility of dimensionality reduction in ℓ1
It is OK that it is not a metric, as long as it is close to a metric
TEMD over n sets A_i
→ min_low ℓ1^high   (distortion O(log^2 n); oblivious)
→ min_low ℓ1^low    (distortion O(1); oblivious)
→ min_low tree-metric   (distortion O(log n); oblivious)
→ sparse graph-metric   (accumulated D = O(log^3 n); non-oblivious)
→ ℓ1^low   (distortion O(log n); non-oblivious)
Conclusion + a question
Theorem: can compute ed(x,y) in n·2^{O(√log n)} time with 2^{O(√log n)} approximation
Question: can we do the following “oblivious” dimensionality reduction in ℓ1?
Given n, construct a randomized embedding φ: ℓ1^M → ℓ1^{polylog n} such that for any v_1…v_n ∈ ℓ1^M, with high probability, φ has distortion log^{O(1)} n on these vectors
If such a φ exists, it cannot be linear [Charikar-Sahai’02]