
Overcoming the L1 Non-Embeddability Barrier

Robert Krauthgamer (Weizmann Institute)

Joint work with Alexandr Andoni and Piotr Indyk (MIT)


Algorithms on Metric Spaces
Fix a metric M.
Fix a computational problem.
Solve the problem under M.

Metrics:
Hamming distance
Ulam metric
Earthmover distance
Edit distance: ED(x,y) = minimum number of edit operations that transform x into y, where an edit operation inserts, deletes, or substitutes a character. Example: ED(0101010, 1010101) = 2.

Problems:
Nearest Neighbor Search: preprocess n strings so that, given a query string, we can find the closest string to it.
Distance computation: compute the distance between x and y.
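The edit-distance definition can be checked with the textbook dynamic program. The sketch below is only an illustration of the definition (quadratic time, not one of the fast algorithms discussed in the talk); the function names are hypothetical.

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning x into y."""
    prev = list(range(len(y) + 1))          # distances from the empty prefix of x
    for i, cx in enumerate(x, start=1):
        curr = [i]                          # turning x[:i] into "" takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,                # delete cx
                curr[j - 1] + 1,            # insert cy
                prev[j - 1] + (cx != cy),   # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


def hamming_distance(x: str, y: str) -> int:
    """Number of positions where two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(a != b for a, b in zip(x, y))


print(edit_distance("0101010", "1010101"))     # 2: delete the leading 0, append a 1
print(hamming_distance("0101010", "1010101"))  # 7: every position differs
```

The two strings differ in every position under Hamming distance but are only two edits apart, which is the misalignment effect that makes edit distance hard to embed.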

Motivation for Nearest Neighbor
Many applications:
Image search (Euclidean distance, Earthmover distance)
Processing of genetic information, text processing (edit distance)
Many others…
(Figure: generic search engine)

A General Tool: Embeddings
An embedding of M into a host metric (H, d_H) is a map f : M → H that preserves distances approximately.
It has distortion A ≥ 1 if for all x, y: d_M(x,y) ≤ d_H(f(x), f(y)) ≤ A·d_M(x,y).
Why? If H is “easy” (= we can efficiently solve computational problems like NNS in it), then we get good algorithms for the original space M!
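To make the distortion definition concrete, the following sketch measures the distortion of a map on a finite point set. It assumes the map may be rescaled freely, so the distortion is the ratio between the largest and smallest expansion over all pairs; the 4-cycle example and all names are illustrative assumptions, not taken from the talk.

```python
from itertools import combinations


def distortion(points, d_M, d_H, f):
    """Smallest A >= 1 such that some rescaling of f satisfies
    d_M(x, y) <= d_H(f(x), f(y)) <= A * d_M(x, y) on all pairs.
    Assumes the points are distinct and f is injective on them."""
    ratios = [d_H(f(x), f(y)) / d_M(x, y) for x, y in combinations(points, 2)]
    return max(ratios) / min(ratios)


# Illustrative example: the shortest-path metric of a 4-cycle, mapped onto the line.
def cycle_metric(i, j):
    return min(abs(i - j), 4 - abs(i - j))


def line_metric(u, v):
    return abs(u - v)


# Nodes 0 and 3 are adjacent on the cycle but 3 apart on the line.
print(distortion([0, 1, 2, 3], cycle_metric, line_metric, lambda i: i))  # 3.0
```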

Host space?
Popular target metric: ℓ1
Have efficient algorithms:
Distance estimation: O(d) for d-dimensional space (often less)
NNS: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space [IM98]
Powerful enough for some things…

Distortion of embedding into ℓ1 (references: upper bound; lower bound):

Metric | References | Upper bound | Lower bound
Edit distance over {0,1}^d | [OR05]; [KN05, KR06, AK07] | 2^{O(√log d)} | Ω(log d)
Ulam (= edit distance over permutations) | [CK06]; [AK07] | O(log d) | Ω̃(log d)
Block edit distance over {0,1}^d | [MS00, CM07]; [Cor03] | O(log d) | 4/3
Earthmover distance in ℝ² (sets of size s) | [Cha02, IT03]; [NS07] | O(log s) | Ω(log^{1/2} s)
Earthmover distance in {0,1}^d (sets of size s) | [AIK08]; [KN05] | O(log s·log d) | Ω(log s)

ℓ1 = real space with d_1(x,y) = ∑_i |x_i − y_i|

Below logarithmic?
Cannot work with ℓ1.
Other possibilities?
(ℓ2)^p is bigger and algorithmically tractable, but not rich enough (often the same lower bounds).
ℓ∞ is rich (includes all metrics), but usually not computationally efficient (high dimension).
And that’s roughly it… (at least for efficient NNS)

(ℓ2)^p = real space with dist_{2^p}(x,y) = ||x−y||_2^p
ℓ∞ = real space with dist_∞(x,y) = max_i |x_i − y_i|

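Meet our new host

Iterated product space: Ρ_{2²,∞,1} = $\bigoplus_{(\ell_2)^2}^{\gamma} \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$

$x = (x_1, \ldots, x_\alpha) \in \mathbb{R}^{\alpha}$, with $d_1(x,y) = \sum_{i=1}^{\alpha} |x_i - y_i|$

$x = (x_1, \ldots, x_\beta) \in \ell_1^{\alpha} \times \ell_1^{\alpha} \times \cdots \times \ell_1^{\alpha}$, with $d_{\infty,1}(x,y) = \max_{i=1}^{\beta} d_1(x_i, y_i)$

$x = (x_1, \ldots, x_\gamma) \in \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha} \times \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha} \times \cdots \times \bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$, with $d_{2^2,\infty,1}(x,y) = \sum_{i=1}^{\gamma} \big(d_{\infty,1}(x_i, y_i)\big)^2$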
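The three nested distances can be evaluated directly from their definitions. The sketch below assumes a point of the iterated product space is stored as a nested Python list of shape γ × β × α; the function names are hypothetical.

```python
def d_1(x, y):
    """l_1 distance between two vectors of length alpha."""
    return sum(abs(a - b) for a, b in zip(x, y))


def d_inf_1(x, y):
    """l_infinity product: maximum over beta blocks of the l_1 distance."""
    return max(d_1(xi, yi) for xi, yi in zip(x, y))


def d_22_inf_1(x, y):
    """(l_2)^2 product: sum over gamma blocks of the squared d_inf_1 distance."""
    return sum(d_inf_1(xi, yi) ** 2 for xi, yi in zip(x, y))


# Tiny example with (gamma, beta, alpha) = (2, 2, 2).
x = [[[0, 0], [1, 1]], [[2, 0], [0, 0]]]
y = [[[1, 0], [1, 3]], [[0, 0], [0, 1]]]
print(d_22_inf_1(x, y))  # max(1, 2)^2 + max(2, 1)^2 = 8
```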

Why Ρ_{2²,∞,1}?
Because we can…
Theorem 1. Ulam embeds into Ρ_{2²,∞,1} with O(1) distortion.
Dimensions (γ,β,α) = (d, log d, d)
Theorem 2. Ρ_{2²,∞,1} admits NNS on n points with O(log log n) approximation, O(n^ε) query time and O(n^{1+ε}) space (algorithmically tractable).
In fact, there is more for Ulam…

Our Algorithms for Ulam
Ulam = edit distance on strings where each symbol appears at most once. Example: ED(1234567, 7123456) = 2.
A classical distance between rankings
Exhibits hardness of misalignments (as in general edit)
All lower bounds same as for general edit (up to Θ̃(·))
Distortion of embedding into ℓ1 (and (ℓ2)^p, etc.): Θ(log d)

Our approach implies new algorithms for Ulam:
1. NNS with O(log log n) approx, O(n^ε) query time
   Can improve to O(log log d) approx
2. Sketching with O(1)-approx in log^{O(1)} d space
3. Distance estimation with O(1)-approx in time
   [BEKMRRS03]: when ED ≈ d, approx d^ε in O(d^{1−2ε}) time

If we ever hope for approximation << log d for NNS under general edit, first we have to get it under Ulam!

Theorem 1
Theorem 1. Can embed Ulam into Ρ_{2²,∞,1} with O(1) distortion.
Dimensions (γ,β,α) = (d, log d, d)
Proof: “geometrization” of Ulam characterizations
Previously studied in the context of testing monotonicity (sortedness):
Sublinear algorithms [EKKRV98, ACCL04]
Data-stream algorithms [GJKK07, GG07, EH08]

Thm 1: Characterizing Ulam
Consider permutations x, y over [d].
Assume for now: x = identity permutation.
Idea: count the number of characters in y to delete to obtain an increasing sequence (≈ Ulam(x,y)). Call them faulty characters.
Issues: ambiguity… How do we count them?

Examples (x on top, y below):
123456789
234657891

123456789
341256789
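When x is the identity, the minimum number of characters to delete from y equals d minus the length of a longest increasing subsequence of y. The sketch below computes that count with the standard patience-sorting technique; this is a well-known exact method brought in for illustration, not the characterization developed in the talk.

```python
from bisect import bisect_left


def lis_length(seq):
    """Length of a longest strictly increasing subsequence, in O(d log d) time."""
    tails = []  # tails[i] = smallest possible last element of an increasing run of length i + 1
    for v in seq:
        i = bisect_left(tails, v)
        if i == len(tails):
            tails.append(v)
        else:
            tails[i] = v
    return len(tails)


def deletions_to_sort(y):
    """Minimum number of characters to delete from y to leave an increasing sequence."""
    return len(y) - lis_length(y)


print(deletions_to_sort([2, 3, 4, 6, 5, 7, 8, 9, 1]))  # 2
print(deletions_to_sort([3, 4, 1, 2, 5, 6, 7, 8, 9]))  # 2
```

In the second example either {1, 2} or {3, 4} can be deleted, which is the ambiguity the slide points at: the count is well defined, but the set of faulty characters is not.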

Thm 1: Characterization – inversions
Definition: chars a<b form an inversion if b precedes a in y.
How to identify a faulty char?
Has an inversion? Doesn’t work: all chars might have an inversion.
Has many inversions? Still can miss “faulty” chars.
Has many inversions locally? Same problem.
Check if either is true!

Examples (x on top, y below):
123456789
234567891

123456789
213456798

123456789
567981234

Thm 1: Characterization – faulty chars
Definition 1: a is faulty if there exists K>0 such that a is inverted w.r.t. a majority of the K symbols preceding a in y (ok to consider K=2^k).
Lemma [ACCL04, GJKK07]: # faulty chars = Θ(Ulam(x,y)).

Example: x = 123456789, y = 234567891. Take the 4 characters preceding 1 in y: all of them form inversions with 1.
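Definition 1 translates almost literally into code. The sketch below assumes x is the identity, y is a list of distinct integers, K ranges over all window sizes that fit before a, and "majority" means strictly more than half; these conventions and the function name are assumptions made for illustration.

```python
def faulty_chars(y):
    """Characters of y that are faulty under Definition 1 (x = identity)."""
    faulty = []
    for i, a in enumerate(y):
        inverted = 0
        for K in range(1, i + 1):
            # Extend the window by one symbol to the left; it is inverted
            # with a if it is larger than a but precedes a in y.
            if y[i - K] > a:
                inverted += 1
            if 2 * inverted > K:  # strict majority of the K preceding symbols
                faulty.append(a)
                break
    return faulty


print(faulty_chars([2, 3, 4, 5, 6, 7, 8, 9, 1]))  # [1]
print(faulty_chars([2, 1, 3, 4, 5, 6, 7, 9, 8]))  # [1, 8]
```

In the first example only the character 1 is reported, even though it forms an inversion with every other character; the majority test is what keeps the other eight from being blamed.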

Thm 1: Characterization → Embedding
To get an embedding, need:
1. Symmetrization (neither string is identity)
2. Deal with “exists”, “majority”…?
To resolve (1), use instead X[a;K] …
Definition 2: a is faulty if there exists K=2^k such that |X[a;2^k] Δ Y[a;2^k]| > 2^k (symmetric difference).

Example: x = 123456789, y = 123467895, with the sets X[5;4] and Y[5;4] marked.
E.g.: $\mathbf{1}_{X[5;2^2]} = (1,1,1,1,0,0,0,0,0)$
In terms of indicator vectors, the size of the symmetric difference is $\| \mathbf{1}_{X[a;2^k]} - \mathbf{1}_{Y[a;2^k]} \|_1$.
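Definition 2 can be sketched in the same style. The slide elides the definition of X[a;K]; the code assumes it is the set of (up to) K symbols immediately preceding a in x, which matches the indicator vector shown for X[5;4]. It also assumes k runs from 1 to ⌈log₂ d⌉. All names are hypothetical.

```python
import math


def preceding(x, a, K):
    """Assumed meaning of X[a;K]: the set of up to K symbols just before a in x."""
    i = x.index(a)
    return set(x[max(0, i - K):i])


def faulty_symmetric(x, y):
    """Characters a with |X[a;2^k] symmetric-difference Y[a;2^k]| > 2^k for some k."""
    levels = max(1, math.ceil(math.log2(len(x))))
    faulty = []
    for a in x:
        for k in range(1, levels + 1):
            K = 2 ** k
            if len(preceding(x, a, K) ^ preceding(y, a, K)) > K:
                faulty.append(a)
                break
    return faulty


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(preceding(x, 5, 4))      # {1, 2, 3, 4}
print(preceding(y, 5, 4))      # {6, 7, 8, 9}
print(faulty_symmetric(x, y))  # [5]
```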

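Thm 1: Embedding – final step
We have:
$\mathrm{Ulam}(x,y) \approx \sum_{a=1}^{d} \max_{k=1,\ldots,\log d} \chi\big[\, \|\mathbf{1}_{X[a;2^k]} - \mathbf{1}_{Y[a;2^k]}\|_1 > 2^k \,\big]$, where $\chi[\cdot]$ equals 1 iff the condition is true.

Replace by weight?
$\mathrm{Ulam}(x,y) \approx \sum_{a=1}^{d} \max_{k=1,\ldots,\log d} \left( \frac{\|\mathbf{1}_{X[a;2^k]} - \mathbf{1}_{Y[a;2^k]}\|_1}{2 \cdot 2^k} \right)^2$

Final embedding:
$f(x) = \Big( \big( \tfrac{1}{2 \cdot 2^k} \mathbf{1}_{X[a;2^k]} \big)_{k=1,\ldots,\log d} \Big)_{a=1,\ldots,d} \in \bigoplus_{(\ell_2)^2}^{d} \bigoplus_{\ell_\infty}^{\log d} \ell_1^{d}$

Example: x = 123456789, y = 123467895, with the sets X[5;2^2] and Y[5;2^2] marked.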
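The final embedding and the product distance it is measured in can be written out for small permutations. The sketch below assumes the symbols are the integers 1..d, that X[a;K] is the set of up to K symbols just before a in x, and that k runs from 1 to ⌈log₂ d⌉; it only illustrates the shape of the construction and makes no claim about the constants in the O(1) distortion.

```python
import math


def embed(x):
    """Map a permutation x of 1..d to a dict: symbol a -> list over k of scaled indicator vectors."""
    d = len(x)
    levels = max(1, math.ceil(math.log2(d)))
    image = {}
    for i, a in enumerate(x):
        blocks = []
        for k in range(1, levels + 1):
            K = 2 ** k
            window = set(x[max(0, i - K):i])   # assumed meaning of X[a; 2^k]
            weight = 1.0 / (2 * K)             # the 1 / (2 * 2^k) scaling
            blocks.append([weight if s in window else 0.0 for s in range(1, d + 1)])
        image[a] = blocks
    return image


def product_distance(fx, fy):
    """Sum over symbols a of (max over k of the l_1 distance between blocks) squared."""
    total = 0.0
    for a in sorted(fx):
        inner = max(
            sum(abs(u - v) for u, v in zip(bx, by))
            for bx, by in zip(fx[a], fy[a])
        )
        total += inner ** 2
    return total


x = [1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [1, 2, 3, 4, 6, 7, 8, 9, 5]
print(product_distance(embed(x), embed(y)))
```

For the symbol 5, the windows of size 2 are {3, 4} in x and {8, 9} in y, so that coordinate alone contributes (4 / 4)² = 1 to the sum.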

Theorem 2
Theorem 2. Ρ_{2²,∞,1} admits NNS on n points with O(log log n) approximation, O(n^ε) query time and O(n^{1+ε}) space, for any small ε (ignoring (αβγ)^{O(1)} factors).
A rather general approach: “LSH” on ℓ1-products of general metric spaces.
Of course, cannot do, but can reduce to ℓ∞-products.

Thm 2: Proof
Let’s start from basics: ℓ_1^α
[IM98]: c-approx with O(n^{1/c}) query time and O(n^{1+1/c}) space (ignoring α^{O(1)})
Ok, what about $\bigoplus_{\ell_\infty}^{\beta} \ell_1^{\alpha}$?
Suppose: NNS for M with
• c_M-approx
• Q_M query time
• S_M space.
Then: NNS for $\bigoplus_{\ell_\infty}^{\beta} M$ with
• O(c_M · log log n)-approx
• O(Q_M) query time
• O(S_M · n^{1+ε}) space.

Thm 2: What about the (ℓ2)²-product?
Enough to consider the ℓ1-product $\bigoplus_{\ell_1}^{\gamma} M$ (for us, M is the ℓ∞-product).
Off-the-shelf? [I04]: gives space ~n or >log n approximation.
We reduce to multiple NNS queries under the ℓ∞-product $\bigoplus_{\ell_\infty}^{\gamma} M$.
Instructive to first look at NNS for standard ℓ1 …

Thm 2: Review of NNS for ℓ1
LSH family: a collection H of hash functions such that, for a random h ∈ H (parameter w > 0):
Pr[h(q)=h(p)] ≈ 1 − ||q−p||_1 / w
Query just uses the primitive: “return all points p such that h(q)=h(p)”
Can obtain H by imposing a randomly shifted grid of side length w.
Then, for h defined by r_i ∈ [0, w] chosen at random, the primitive becomes: “return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]”
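The randomly shifted grid can be simulated to see the collision probability. In the sketch below the side length is called `w` (the symbol is lost in the transcript, so the name is an assumption), and each hash function shifts every coordinate by an independent uniform offset before rounding down to a grid cell. For two points the exact collision probability is the product of (1 − |q_i − p_i|/w) over the coordinates, which is close to 1 − ||q − p||_1/w when the points are near each other.

```python
import math
import random


def make_grid_hash(dim, w, rng):
    """One LSH function: a grid of side length w with a random shift per coordinate."""
    shifts = [rng.uniform(0, w) for _ in range(dim)]

    def h(p):
        return tuple(math.floor((p[i] + shifts[i]) / w) for i in range(dim))

    return h


def collision_rate(p, q, w, trials=5000, seed=0):
    """Empirical Pr[h(p) == h(q)] over independently drawn grid hashes."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_grid_hash(len(p), w, rng)
        hits += h(p) == h(q)
    return hits / trials


p, q, w = (0.0, 0.0), (1.0, 0.5), 10.0
print(collision_rate(p, q, w))                 # empirically close to 0.9 * 0.95 = 0.855
print(1 - (abs(1.0) + abs(0.5)) / w)           # first-order estimate: 0.85
```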

Thm 2: LSH for ℓ1-product
Intuition: abstract LSH!
Recall we had: for r_i random from [0, w], where w > 0 is the grid side length, a point p is returned if for all i: |q_i − p_i| < r_i.
Equivalently: $\max_i \frac{1}{r_i} |q_i - p_i| < 1$, an ℓ∞ product of ℝ!
For ℓ1: “return all p s.t. |q_i − p_i| < r_i for all i ∈ [d]”
For $\bigoplus_{\ell_1}^{\gamma} M$: “return all points p such that max_i d_M(q_i, p_i)/r_i < 1”
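The abstracted primitive can be stated as a brute-force filter, which makes clear that it only needs black-box access to the base metric d_M. The sketch below scans all points linearly, so it shows what the primitive returns rather than how the talk answers it efficiently; the function name and the example data are assumptions.

```python
def primitive(points, q, r, d_M):
    """Return every point p with max_i d_M(q_i, p_i) / r_i < 1.
    Each point and the query are sequences of coordinates from the base metric M;
    r holds one positive threshold per coordinate."""
    return [
        p for p in points
        if max(d_M(qi, pi) / ri for qi, pi, ri in zip(q, p, r)) < 1
    ]


# Base metric M = the real line: this recovers the l_1 primitive |q_i - p_i| < r_i.
def line_metric(u, v):
    return abs(u - v)


points = [(0.0, 0.0), (1.0, 5.0), (3.0, 1.0)]
print(primitive(points, (0.5, 0.5), (2.0, 2.0), line_metric))  # [(0.0, 0.0)]


# Base metric M = l_1 on pairs: each coordinate is itself a vector.
def l1_metric(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))


vector_points = [((0, 0), (1, 1)), ((4, 4), (1, 1))]
print(primitive(vector_points, ((0, 1), (1, 1)), (2.0, 2.0), l1_metric))  # [((0, 0), (1, 1))]
```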

Thm 2: Final
Thus, sufficient to solve the primitive: “return all points p such that max_i d_M(q_i, p_i)/r_i < 1” (in fact, for k independent choices of (r_1, …, r_d)).
We reduced NNS over $\bigoplus_{\ell_1}^{\gamma} M$ to several instances of NNS over $\bigoplus_{\ell_\infty}^{\gamma} M$ (with appropriately scaled coordinates).
Approximation is O(1)·O(log log n).
Done!

Take-home message:
Can embed combinatorial metrics into iterated product spaces.
Works for Ulam (= edit on non-repetitive strings).
Approach bypasses non-embeddability results into usual-suspect spaces like ℓ1, (ℓ2)², …

Open:
Embeddings for edit over {0,1}^d, EMD, other metrics?
Understanding product spaces? [Jayram-Woodruff]: sketching
