Post on 14-Dec-2015
Overcoming the L1 Non-Embeddability Barrier
Robert Krauthgamer (Weizmann Institute)
Joint work with Alexandr Andoni and Piotr Indyk (MIT)
Algorithms on Metric Spaces
Fix a metric M.
Fix a computational problem.
Solve the problem under M.
Metrics: Hamming distance, Ulam metric, Earthmover distance, …
Edit distance: ED(x,y) = minimum number of edit operations that transform x into y.
Edit operation = insert/delete/substitute a character.
ED(0101010, 1010101) = 2
Problems:
Nearest Neighbor Search: preprocess n strings so that, given a query string, we can find the closest string to it.
Compute distance between x, y.
…
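The definition of ED above can be checked with the textbook dynamic program. This is a minimal sketch, not taken from the slides; the function name `edit_distance` is a hypothetical choice for illustration.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    # prev[j] holds the distance between the current prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


print(edit_distance("0101010", "1010101"))
```

The quadratic running time of this program is one reason the talk looks for embeddings and faster approximate algorithms.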
Motivation for Nearest Neighbor
Many applications:
Image search (Euclidean distance, Earth-mover distance)
Processing of genetic information, text processing (edit distance)
Many others…
[Figure: generic search engine]
A General Tool: Embeddings
An embedding of M into a host metric (H, d_H) is a map f : M → H that preserves distances approximately.
It has distortion A ≥ 1 if for all x, y: $d_M(x,y) \le d_H(f(x),f(y)) \le A \cdot d_M(x,y)$
Why? If H is “easy” (= can efficiently solve computational problems like NNS), then we get good algorithms for the original space M!
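To make the distortion definition concrete, the sketch below measures the distortion of a map on a finite point set. It assumes the map may be rescaled, so it reports the ratio between the largest and smallest expansion factor; the names `distortion`, `hamming` and `l1` are hypothetical helpers, not from the talk.

```python
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Empirical distortion of f on a finite set: max expansion / min expansion."""
    ratios = [
        d_H(f(x), f(y)) / d_M(x, y)
        for x, y in combinations(points, 2)
        if d_M(x, y) > 0
    ]
    if min(ratios) == 0:
        return float("inf")  # two distinct points were collapsed by f
    return max(ratios) / min(ratios)


def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))


def l1(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


# Mapping a bit string to its 0/1 vector embeds Hamming distance into l1 exactly.
strings = ["000", "011", "101"]
print(distortion(strings, hamming, l1, lambda s: [int(c) for c in s]))
```

A map that collapses two distinct points has unbounded distortion, which the function reports as infinity.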
Host space?
Popular target metric: ℓ1 = real space with $d_1(x,y) = \sum_i |x_i - y_i|$
Have efficient algorithms:
Distance estimation: O(d) for d-dimensional space (often less)
NNS: c-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space [IM98]
Powerful enough for some things…

Metric | References | Upper bound | Lower bound
Edit distance over $\{0,1\}^d$ | [OR05]; [KN05, KR06, AK07] | $2^{O(\sqrt{\log d})}$ | $\Omega(\log d)$
Ulam (= edit distance over permutations) | [CK06]; [AK07] | $O(\log d)$ | $\Omega(\log d)$
Block edit distance over $\{0,1\}^d$ | [MS00, CM07]; [Cor03] | $O(\log d)$ | 4/3
Earthmover distance in $\mathbb{R}^2$ (sets of size s) | [Cha02, IT03]; [NS07] | $O(\log s)$ | $\Omega(\log^{1/2} s)$
Earthmover distance in $\{0,1\}^d$ (sets of size s) | [AIK08]; [KN05] | $O(\log s \cdot \log d)$ | $\Omega(\log s)$
Below logarithmic?
Cannot work with ℓ1
Other possibilities?
$(\ell_2)^p$ is bigger and algorithmically tractable, but not rich enough (often same lower bounds)
  $(\ell_2)^p$ = real space with $\mathrm{dist}_2^p(x,y) = \|x-y\|_2^p$
ℓ∞ is rich (includes all metrics), but usually not computationally efficient (high dimension)
  ℓ∞ = real space with $\mathrm{dist}_\infty(x,y) = \max_i |x_i - y_i|$
And that’s roughly it… (at least for efficient NNS)
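Meet our new host
Iterated product space: $\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$
Inner level, $\ell_1^{\alpha}$: $x = (x_1, \ldots, x_\alpha) \in \mathbb{R}^{\alpha}$, with $d_1(x,y) = \sum_{i=1}^{\alpha} |x_i - y_i|$
Middle level, $\bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$: $x = (x_1, \ldots, x_\beta) \in \ell_1^{\alpha} \times \ell_1^{\alpha} \times \cdots \times \ell_1^{\alpha}$, with $d_{\infty,1}(x,y) = \max_{i=1}^{\beta} d_1(x_i, y_i)$
Outer level: $x = (x_1, \ldots, x_\gamma) \in \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha} \times \cdots \times \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$, with $d_{2^2,\infty,1}(x,y) = \sum_{i=1}^{\gamma} \left(d_{\infty,1}(x_i, y_i)\right)^2$
[Figure: γ blocks combined under $d_{2^2,\infty,1}$, each consisting of β blocks under $d_{\infty,1}$, each an α-dimensional vector under $d_1$]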
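The three nested distances can be written down directly. The sketch below assumes a point is stored as a nested list of shape γ × β × α; the function names are hypothetical.

```python
def d1(x, y):
    """l1 distance between two vectors of length alpha."""
    return sum(abs(a - b) for a, b in zip(x, y))


def d_inf_1(x, y):
    """l_infinity product: maximum of d1 over the beta inner blocks."""
    return max(d1(xi, yi) for xi, yi in zip(x, y))


def d_22_inf_1(x, y):
    """(l2)^2 product: sum of squared d_inf_1 over the gamma outer blocks."""
    return sum(d_inf_1(xi, yi) ** 2 for xi, yi in zip(x, y))


# gamma = 2 outer blocks, beta = 2 inner blocks, alpha = 2 coordinates each
x = [[[0, 0], [1, 1]], [[2, 0], [0, 0]]]
y = [[[1, 0], [1, 3]], [[2, 0], [0, 1]]]
print(d_22_inf_1(x, y))  # first block: max(1, 2)^2 = 4, second: max(0, 1)^2 = 1
```

Evaluating the distance takes time linear in αβγ, so the host space is cheap to compute in even though it is not a normed space.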
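Why $\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$?
Because we can…
Rich: Theorem 1. Ulam embeds into $\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$ with O(1) distortion. Dimensions (γ,β,α) = (d, log d, d).
Algorithmically tractable: Theorem 2. $\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$ admits NNS on n points with O(log log n) approximation, $O(n^{\varepsilon})$ query time and $O(n^{1+\varepsilon})$ space.
In fact, there is more for Ulam…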
Our Algorithms for Ulam
Ulam = edit distance on strings where each symbol appears at most once
  A classical distance between rankings
  Exhibits hardness of misalignments (as in general edit)
  Example: ED(1234567, 7123456) = 2
  All lower bounds same as for general edit (up to Θ(·))
  Distortion of embedding into ℓ1 (and $(\ell_2)^p$, etc.): Θ(log d)
Our approach implies new algorithms for Ulam:
1. NNS with O(log log n) approx, $O(n^{\varepsilon})$ query time
   Can improve to O(log log d) approx
2. Sketching with O(1)-approx in $\log^{O(1)} d$ space
3. Distance estimation with O(1)-approx in time
   [BEKMRRS03]: when ED ≈ d, approx $d^{\varepsilon}$ in $O(d^{1-2\varepsilon})$ time
If we ever hope for approximation ≪ log d for NNS under general edit, first we have to get it under Ulam!
Theorem 1
Theorem 1. Can embed Ulam into $\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$ with O(1) distortion. Dimensions (γ,β,α) = (d, log d, d).
Proof: “Geometrization” of Ulam characterizations
Previously studied in the context of testing monotonicity (sortedness):
  Sublinear algorithms [EKKRV98, ACCL04]
  Data-stream algorithms [GJKK07, GG07, EH08]
Thm 1: Characterizing Ulam
Consider permutations x, y over [d]
Assume for now: x = identity permutation
Idea: count the number of chars in y to delete to obtain an increasing sequence (≈ Ulam(x,y)). Call them faulty characters.
Issues: Ambiguity… How do we count them?
Examples:
x = 123456789, y = 234657891
x = 123456789, y = 341256789
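The count described in the idea above (characters to delete from y so that the rest is increasing) equals d minus the length of a longest increasing subsequence. The sketch below computes it with the standard patience-sorting method; `min_deletions_to_sort` is a hypothetical name, and the method is a textbook technique rather than something stated on the slide.

```python
from bisect import bisect_left


def min_deletions_to_sort(y):
    """Number of symbols to delete from y so the remainder is increasing."""
    tails = []  # tails[i] = smallest last element of an increasing subsequence of length i + 1
    for c in y:
        pos = bisect_left(tails, c)
        if pos == len(tails):
            tails.append(c)  # c extends the longest subsequence found so far
        else:
            tails[pos] = c   # c gives a smaller tail for length pos + 1
    return len(y) - len(tails)


for y in ("234657891", "341256789"):
    print(y, min_deletions_to_sort([int(c) for c in y]))
```

In both slide examples two deletions suffice, but the set of deleted characters is not unique (for the second string, either {3, 4} or {1, 2} works), which is the ambiguity the slide points to.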
Thm 1: Characterization – inversions
Definition: chars a < b form an inversion if b precedes a in y
How to identify a faulty char?
  Has an inversion? Doesn’t work: all chars might have an inversion
  Has many inversions? Still can miss “faulty” chars
  Has many inversions locally? Same problem
  Check if either is true!
Examples (x = 123456789 in each):
y = 234567891
y = 213456798
y = 567981234
Thm 1: Characterization – faulty chars
Definition 1: a is faulty if there exists K > 0 such that a is inverted w.r.t. a majority of the K symbols preceding a in y (ok to consider only $K = 2^k$)
Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y)).
Example: x = 123456789, y = 234567891. The 4 characters preceding 1 in y all form inversions with 1.
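Definition 1 translates almost directly into code when x is the identity. The sketch below is one possible reading, under two assumptions that the slide does not spell out: windows near the start of y are truncated to the symbols that exist, and "majority" means strictly more than half of that window. The name `faulty_chars` is hypothetical.

```python
def faulty_chars(y):
    """Symbols of y that are faulty per Definition 1 (x is the identity)."""
    faulty = set()
    for j in range(1, len(y)):
        a = y[j]
        K = 1
        while True:
            window = y[max(0, j - K):j]  # the (up to) K symbols preceding a
            inverted = sum(1 for b in window if b > a)  # b > a precedes a: inversion
            if 2 * inverted > len(window):
                faulty.add(a)
                break
            if K >= j:  # the window already covers everything before a
                break
            K *= 2      # only powers of two are tried, as the slide allows
    return faulty


print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))
```

On the slide's example only the symbol 1 is reported, matching the observation that moving 1 to the front sorts the string.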
Thm 1: Characterization → Embedding
To get an embedding, need:
1. Symmetrization (neither string is the identity)
2. Deal with “exists”, “majority”…?
To resolve (1), use instead X[a;K] …
Definition 2: a is faulty if there exists $K = 2^k$ such that $|X[a;2^k] \,\Delta\, Y[a;2^k]| > 2^k$ (symmetric difference), or equivalently $\|1_{X[a;2^k]} - 1_{Y[a;2^k]}\|_1 > 2^k$
Example: x = 123456789, y = 123467895, with X[5;4] and Y[5;4] the 4 symbols preceding 5 in x and in y
E.g., $1_{X[5;2^2]} = (1,1,1,1,0,0,0,0,0)$
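A small sketch of Definition 2, reading X[a;K] as the set of (up to) K symbols preceding a in x, as the example suggests. The truncation near the start of the string and the choice to start at K = 1 are assumptions; `preceding_set` and `faulty_chars_symmetric` are hypothetical names.

```python
def preceding_set(x, a, K):
    """X[a;K]: the set of (up to) K symbols that precede a in x."""
    pos = x.index(a)
    return set(x[max(0, pos - K):pos])


def faulty_chars_symmetric(x, y):
    """Symbols a with |X[a;K] symmetric-difference Y[a;K]| > K for some K = 2^k."""
    d = len(x)
    faulty = set()
    for a in x:
        K = 1
        while True:
            diff = preceding_set(x, a, K) ^ preceding_set(y, a, K)
            if len(diff) > K:
                faulty.add(a)
                break
            if K >= d:  # larger windows would not change either set
                break
            K *= 2
    return faulty


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(preceding_set(x, 5, 4), preceding_set(y, 5, 4))
print(5 in faulty_chars_symmetric(x, y))
```

The condition is symmetric in x and y, so neither string needs to be the identity.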
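Thm 1: Embedding – final step
We have
$\mathrm{Ulam}(x,y) \approx \sum_{a=1}^{d} \max_{k=1\ldots\log d} \chi\left[ \left\|1_{X[a;2^k]} - 1_{Y[a;2^k]}\right\|_1 > 2^k \right]$
where $\chi[\cdot]$ equals 1 iff the condition is true.
Replace by weight?
$\mathrm{Ulam}(x,y) \approx \sum_{a=1}^{d} \max_{k=1\ldots\log d} \left( \frac{\left\|1_{X[a;2^k]} - 1_{Y[a;2^k]}\right\|_1}{2 \cdot 2^k} \right)^2$
Final embedding:
$f(x) = \left( \left( \frac{1}{2 \cdot 2^k}\, 1_{X[a;2^k]} \right)_{k=1\ldots\log d} \right)_{a=1\ldots d} \in \bigoplus_{(\ell_2)^2}^{d} \bigoplus_{\ell_\infty}^{\log d} \ell_1^{d}$
Example: x = 123456789, y = 123467895, with the sets $X[5;2^2]$ and $Y[5;2^2]$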
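The final embedding can be sketched in a few lines. Assumptions made here for illustration: symbols are the integers 1..d with d ≥ 2, k ranges over 1..⌈log2 d⌉, and windows near the start of a string are truncated. The names `preceding_indicator`, `embed` and `product_distance` are hypothetical.

```python
import math


def preceding_indicator(x, a, K):
    """0/1 vector of length d marking the (up to) K symbols preceding a in x."""
    pos = x.index(a)
    vec = [0.0] * len(x)
    for s in x[max(0, pos - K):pos]:
        vec[s - 1] = 1.0  # symbols are assumed to be 1..d
    return vec


def embed(x):
    """f(x): for each symbol a and each level k, the scaled indicator of X[a;2^k]."""
    d = len(x)
    levels = range(1, math.ceil(math.log2(d)) + 1)
    return [
        [[v / (2 * 2 ** k) for v in preceding_indicator(x, a, 2 ** k)] for k in levels]
        for a in range(1, d + 1)
    ]


def product_distance(fx, fy):
    """Sum over symbols a of (max over levels k of the l1 distance) squared."""
    total = 0.0
    for block_x, block_y in zip(fx, fy):
        worst = max(
            sum(abs(u - v) for u, v in zip(vx, vy))
            for vx, vy in zip(block_x, block_y)
        )
        total += worst ** 2
    return total


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(product_distance(embed(x), embed(y)))
```

The image has d · ⌈log2 d⌉ · d coordinates, matching the dimensions (γ,β,α) = (d, log d, d) of Theorem 1.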
Theorem 2
Theorem 2. $\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$ admits NNS on n points with O(log log n) approximation, $O(n^{\varepsilon})$ query time and $O(n^{1+\varepsilon})$ space, for any small ε (ignoring $(\alpha\beta\gamma)^{O(1)}$ factors)
A rather general approach: “LSH” on ℓ1-products of general metric spaces
Of course, cannot do, but can reduce to ℓ∞-products
Thm 2: Proof
Let’s start from basics: $\ell_1^{\alpha}$
  [IM98]: c-approx with $O(n^{1/c})$ query time and $O(n^{1+1/c})$ space (ignoring $\alpha^{O(1)}$)
Ok, what about $\bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$?
  [I02]: Suppose there is NNS for M with
    $c_M$-approx, $Q_M$ query time, $S_M$ space.
  Then there is NNS for $\bigoplus_{\ell_\infty}^{\beta} M$ with
    $O(c_M \cdot \log\log n)$-approx, $O(Q_M)$ query time, $O(S_M \cdot n^{1+\varepsilon})$ space.
Thm 2: What about (ℓ2)2-product?
Enough to consider $\bigoplus_{\ell_1}^{\gamma} M$ (for us, M is the ℓ∞-product)
Off-the-shelf? [I04]: gives space ~n or >log n approximation
We reduce to multiple NNS queries under $\bigoplus_{\ell_\infty}^{\gamma} M$
Instructive to first look at NNS for standard ℓ1 …
Thm 2: Review of NNS for ℓ1
LSH family: a collection H of hash functions such that, for random h ∈ H (with a side-length parameter > 0):
  Pr[h(q) = h(p)] ≈ 1 − ||q − p||_1 / (side length)
Query just uses the primitive:
  “return all points p such that h(q) = h(p)”
Can obtain H by imposing a randomly-shifted grid of the given side length
Then, for h defined by r_i ∈ [0, side length] chosen at random, the primitive becomes:
  “return all p s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$”
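A randomly-shifted grid hash of the kind described above can be sketched as follows. The function name `make_grid_hash` and the parameter name `side` are hypothetical; the shift is drawn independently per coordinate, which is the usual construction and is assumed here.

```python
import math
import random


def make_grid_hash(dim, side, rng):
    """Hash a point to the cell of a randomly shifted grid with the given side length."""
    shifts = [rng.uniform(0.0, side) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((p_i + s_i) / side) for p_i, s_i in zip(p, shifts))

    return h


rng = random.Random(0)
h = make_grid_hash(dim=2, side=10.0, rng=rng)
print(h((1.0, 2.0)) == h((1.0, 2.0)))   # a point always collides with itself
print(h((0.0, 0.0)) == h((10.0, 0.0)))  # a full side length apart: never the same cell
```

Two points share a cell in coordinate i with probability 1 − |q_i − p_i| / side when that gap is at most the side length, so for nearby points the overall collision probability is close to 1 − ||q − p||_1 / side.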
Thm 2: LSH for ℓ1-product
Intuition: abstract LSH!
Recall that for ℓ1 we had: for r_i random from [0, side length of the grid], point p is returned if for all i: $|q_i - p_i| < r_i$
  “return all p s.t. $|q_i - p_i| < r_i$ for all $i \in [d]$”
Equivalently: $\max_i \frac{1}{r_i} |q_i - p_i| < 1$
  This is an ℓ∞ product of ℝ!
For $\bigoplus_{\ell_1}^{\gamma} M$:
  “return all points p such that $\max_i d_M(q_i, p_i)/r_i < 1$”
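The abstract primitive is easy to state as a brute-force filter, which makes the rescaled ℓ∞ structure visible. This is only an illustration of the condition, not the data structure from the talk; `lsh_primitive` is a hypothetical name and all r_i are assumed positive.

```python
def lsh_primitive(points, q, r, d_M):
    """Return all p with max_i d_M(q_i, p_i) / r_i < 1 (linear scan)."""
    return [
        p for p in points
        if max(d_M(q_i, p_i) / r_i for q_i, p_i, r_i in zip(q, p, r)) < 1
    ]


def abs_diff(a, b):
    return abs(a - b)


# With M the real line, this is the l1 primitive |q_i - p_i| < r_i for all i.
points = [(0, 0), (1, 5), (3, 1)]
print(lsh_primitive(points, q=(0, 0), r=(2, 2), d_M=abs_diff))
print(lsh_primitive(points, q=(0, 0), r=(4, 2), d_M=abs_diff))
```

Replacing `abs_diff` by any metric on M gives the condition for the ℓ1-product of M, which is a near-neighbor query with radius 1 in an ℓ∞-product of M with coordinate i scaled by 1/r_i.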
Thm 2: Final
Thus, for $\bigoplus_{\ell_1}^{\gamma} M$ it is sufficient to solve the primitive:
  “return all points p such that $\max_i d_M(q_i, p_i)/r_i < 1$” (in fact, for k independent choices of $(r_1, \ldots, r_d)$)
We reduced NNS over $\bigoplus_{\ell_1}^{\gamma} M$ to several instances of NNS over $\bigoplus_{\ell_\infty}^{\gamma k} M$ (with appropriately scaled coordinates)
Approximation is O(1)·O(log log n)
Done!
Take-home message:
Can embed combinatorial metrics into iterated product spaces such as $\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$
Works for Ulam (= edit on non-repetitive strings)
Approach bypasses non-embeddability results into usual-suspect spaces like ℓ1, $(\ell_2)^2$, …
Open:
Embeddings for edit over $\{0,1\}^d$, EMD, other metrics?
Understanding product spaces? [Jayram-Woodruff]: sketching