Transcript of: Optimal Data-Dependent Hashing for Nearest Neighbor Search

Page 1

Optimal Data-Dependent Hashing for Nearest Neighbor Search

Alex Andoni (Columbia University)

Joint work with: Ilya Razenshteyn

Page 2

Nearest Neighbor Search (NNS)

Preprocess: a set 𝑃 of 𝑛 points

Query: given a query point 𝑞, report a point 𝑝∗ ∈ 𝑃 with the smallest distance to 𝑞
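As a baseline for everything that follows, here is a minimal exact-NNS routine via linear scan (an illustrative sketch, not code from the talk); the goal of the algorithms below is to beat its O(𝑛·𝑑) query time.

    import numpy as np

    def preprocess(points):
        # "Preprocessing" for a linear scan just stores the n x d array.
        return np.asarray(points, dtype=float)

    def nearest_neighbor(index, q):
        # Report the point p* with the smallest Euclidean distance to q.
        dists = np.linalg.norm(index - q, axis=1)  # O(n * d) per query
        return index[np.argmin(dists)]

    # Usage: n = 1000 points in d = 64 dimensions.
    P = preprocess(np.random.randn(1000, 64))
    p_star = nearest_neighbor(P, np.random.randn(64))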

Page 3

Motivation

Generic setup:
  points model objects (e.g. images)
  distance models (dis)similarity measure
Application areas:
  machine learning: k-NN rule
  speech/image/video/music recognition, vector quantization, bioinformatics, etc.
Distances: Hamming, Euclidean, edit distance, earthmover distance, etc.
Core primitive for: closest pair, clustering, ...

[Figure: two near-identical bit strings, 𝑞 = 000000011100010100000100010100011111 and 𝑝∗ = 000000001100000100000100110100111111, illustrating Hamming distance]

Page 4

Curse of Dimensionality

All exact algorithms degrade rapidly with the dimension 𝑑:

Algorithm                    Query time       Space
Full indexing                𝑂(𝑑 · log 𝑛)     𝑛^{𝑂(𝑑)} (Voronoi diagram size)
No indexing (linear scan)    𝑂(𝑛 · 𝑑)         𝑂(𝑛 · 𝑑)

Page 5

Approximate NNS

𝑐-approximate 𝑟-near neighbor: given a query point 𝑞, report a point 𝑝′ ∈ 𝑃 with ‖𝑝′ − 𝑞‖ ≤ 𝑐𝑟, as long as there is some point 𝑝∗ within distance 𝑟 of 𝑞

In practice: used even for exact NNS
Filtering: gives a set of candidates (hopefully small), as in the sketch below
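A minimal sketch of the filtering step (illustrative code, not from the talk): the approximate structure hands back a candidate set, and an exact pass over it extracts a valid answer.

    import numpy as np

    def filter_then_verify(candidates, q, r, c):
        # Scan the (hopefully small) candidate set; any point within cr of q
        # is a valid c-approximate answer when some point lies within r.
        best, best_dist = None, c * r
        for p in candidates:
            d = np.linalg.norm(np.asarray(p) - q)
            if d <= best_dist:
                best, best_dist = p, d
        return best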

Page 6

NNS algorithms

Exponential dependence on dimension: [Arya-Mount'93], [Clarkson'94], [Arya-Mount-Netanyahu-Silverman-Wu'98], [Kleinberg'97], [Har-Peled'02], [Arya-Fonseca-Mount'11], ...

Linear dependence on dimension: [Kushilevitz-Ostrovsky-Rabani'98], [Indyk-Motwani'98], [Indyk'98, '01], [Gionis-Indyk-Motwani'99], [Charikar'02], [Datar-Immorlica-Indyk-Mirrokni'04], [Chakrabarti-Regev'04], [Panigrahy'06], [Ailon-Chazelle'06], [A.-Indyk'06], [A.-Indyk-Nguyen-Razenshteyn'14], [A.-Razenshteyn'15], [Pagh'16], ...

Page 7

Locality-Sensitive Hashing [Indyk-Motwani'98]

Random hash function 𝑔 satisfying:
  for a close pair 𝑞, 𝑝 (‖𝑞 − 𝑝‖ ≤ 𝑟): Pr[𝑔(𝑞) = 𝑔(𝑝)] = 𝑃₁ is "not-so-small"
  for a far pair 𝑞, 𝑝′ (‖𝑞 − 𝑝′‖ ≥ 𝑐𝑟): Pr[𝑔(𝑞) = 𝑔(𝑝′)] = 𝑃₂ is "small"

Use several hash tables: 𝑛^𝜌 of them, where

  𝜌 = log(1/𝑃₁) / log(1/𝑃₂)

A toy instantiation for Hamming space follows below.
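For concreteness, a minimal sketch of the classic bit-sampling LSH family for Hamming space: a single sampled coordinate collides with probability 1 − dist/𝑑, so 𝑔 (a tuple of 𝑘 sampled coordinates) collides with probability (1 − dist/𝑑)^𝑘. The parameters 𝑘 and 𝐿 below are illustrative.

    import random
    from collections import defaultdict

    def make_g(d, k, rng):
        # g projects a bit vector (a tuple of 0/1) onto k random coordinates.
        coords = [rng.randrange(d) for _ in range(k)]
        return lambda p: tuple(p[i] for i in coords)

    def build_tables(points, d, k=8, L=10, seed=0):
        rng = random.Random(seed)
        tables = []
        for _ in range(L):                  # L ~ n^rho hash tables
            g = make_g(d, k, rng)
            table = defaultdict(list)
            for p in points:
                table[g(p)].append(p)
            tables.append((g, table))
        return tables

    def query(tables, q):
        # Candidates = union of the L buckets the query falls into.
        candidates = set()
        for g, table in tables:
            candidates.update(table[g(q)])
        return candidates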

Page 8

LSH Algorithms

Space: 𝑛^{1+𝜌}, query time: 𝑛^𝜌

Hamming space:
  𝜌 = 1/𝑐             [IM'98]
  𝜌 ≥ 1/𝑐             [MNP'06, OWZ'11]
Euclidean space:
  𝜌 ≈ 1/𝑐             [IM'98, DIIM'04]
  𝜌 = 1/𝑐²            [AI'06]
  𝜌 ≥ 1/𝑐²            [MNP'06, OWZ'11]

E.g., for 𝑐 = 2: query time 𝑛^{1/2} in Hamming space vs. 𝑛^{1/4} in Euclidean space.

LSH is tight. Are we really done with basic NNS algorithms?

Page 9

Detour: Closest Pair Problem

Closest Pair Problem (bichromatic): given sets 𝐴, 𝐵 of 𝑛 points each, find a pair at distance ≤ 𝑐𝑟, assuming there is a pair at distance ≤ 𝑟

Via LSH: preprocess 𝐴, and then query each point of 𝐵; gives time ≈ 𝑛^{1+𝜌} (sketch below)

[Alman-Williams'15]: exact algorithm in subquadratic time for dimension 𝑑 = 𝑂(log 𝑛)

[Valiant'12, Karppa-Kaski-Kohonen'16], for 𝑐 = 1 + 𝜖:
  average case (points random at distance ≈ 𝑐𝑟, close pair random at distance 𝑟): strongly subquadratic time
  worst case: much weaker bounds

Much more efficient algorithms for large 𝑐 or for the average case
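A self-contained toy version of the LSH route (bit vectors as tuples; parameters 𝑘, 𝐿 illustrative): index 𝐴, query each point of 𝐵, and track the best colliding pair.

    import random
    from collections import defaultdict

    def closest_pair_lsh(A, B, d, k=8, L=10, seed=0):
        rng = random.Random(seed)
        best = (d + 1, None, None)               # (distance, a, b)
        for _ in range(L):
            coords = [rng.randrange(d) for _ in range(k)]
            g = lambda p: tuple(p[i] for i in coords)
            table = defaultdict(list)
            for a in A:                          # preprocess A
                table[g(a)].append(a)
            for b in B:                          # then query each point of B
                for a in table[g(b)]:
                    dist = sum(x != y for x, y in zip(a, b))
                    if dist < best[0]:
                        best = (dist, a, b)
        return best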

Page 10

Beyond Locality Sensitive Hashing

Space: 𝑛^{1+𝜌}, query time: 𝑛^𝜌

Hamming space:
  𝜌 = 1/𝑐 (LSH)               [IM'98]
  𝜌 ≥ 1/𝑐 (for any LSH)       [MNP'06, OWZ'11]
  complicated (beats LSH)      [AINR'14]
  𝜌 = 1/(2𝑐 − 1)              [AR'15]
Euclidean space:
  𝜌 = 1/𝑐² (LSH)              [AI'06]
  𝜌 ≥ 1/𝑐² (for any LSH)      [MNP'06, OWZ'11]
  complicated (beats LSH)      [AINR'14]
  𝜌 = 1/(2𝑐² − 1)             [AR'15]

Page 11

New approach?

Data-dependent hashing:
  a random hash function, chosen after seeing the given dataset
  efficiently computable

Page 12

A look at LSH lower bounds

LSH lower bounds in Hamming space:
  Fourier analytic [Motwani-Naor-Panigrahy'06]
  [O'Donnell-Wu-Zhou'11]: 𝜌 ≥ 1/𝑐

Holds for any distribution over hash functions, on the hard input distribution:
  far pair: random points at distance 𝛿𝑑
  close pair: random pair at distance 𝛿𝑑/𝑐
  get 𝜌 ≥ 1/𝑐 (as 𝛿 → 0)

Page 13

Why not an NNS lower bound?

Suppose we try to apply [OWZ'11] to NNS:
  pick a random dataset as above
  all the "far points" are concentrated in a ball of small radius
  easy to see at preprocessing: the actual near neighbor is close to the center of the minimum enclosing ball

Page 14

Construction of the hash function

Two components:
  average case: a nicer structure admits better (data-dependent) LSH
  reduction of the worst case to such structure
Like a (very weak) "regularity lemma" for a set of points

Page 15

Average-case: nicer structure

Think: a random dataset on a sphere of radius 𝑐𝑟/√2
  vectors (nearly) perpendicular to each other
  so that two random points are at distance ≈ 𝑐𝑟

Lemma [A.-Indyk-Nguyen-Razenshteyn'14]: 𝜌 = 1/(2𝑐² − 1), via Cap Carving (sketch below)
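A sketch of cap carving on the unit sphere (simplified from the [A.-Indyk-Nguyen-Razenshteyn'14] scheme; the threshold 𝜂 is an illustrative knob): repeatedly draw a random Gaussian direction and peel off the spherical cap of points correlated with it.

    import numpy as np

    def cap_carve(points, eta=1.0, seed=0, max_caps=100000):
        # points: rows assumed to lie on the unit sphere.
        rng = np.random.default_rng(seed)
        points = np.asarray(points, dtype=float)
        labels = np.full(len(points), -1)
        for cap in range(max_caps):
            if (labels >= 0).all():
                break
            g = rng.standard_normal(points.shape[1])
            hit = (labels < 0) & (points @ g >= eta)   # cap: <g, p> >= eta
            labels[hit] = cap
        # Bucket index per point; nearby points tend to land in the same cap.
        return labels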

Page 16

Reduction to nice structure

Idea: iteratively decrease the radius of the minimum enclosing ball

Algorithm (sketch after this list):
  find dense clusters: smaller radius, containing a large fraction of the points
  recurse on the dense clusters
  apply cap carving on the rest
  recurse on each "cap" (dense clusters might reappear there)

Why is the rest OK?
  no dense clusters
  like a "random dataset" of the current radius
  or even better!

(*picture not to scale & dimension)
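A high-level sketch of this recursion (the thresholds 𝜏, 𝜖 and the single-cap carving step are illustrative simplifications, not the paper's exact parameters):

    import numpy as np

    def build_tree(P, R, tau=0.1, eps=0.2, depth=0, max_depth=8, rng=None):
        rng = np.random.default_rng(0) if rng is None else rng
        if len(P) <= 1 or depth >= max_depth:
            return {"points": P}                     # leaf
        node = {"clusters": [], "caps": [], "g": None}
        shrunk = (np.sqrt(2) - eps) * R
        alive = np.ones(len(P), dtype=bool)
        # Dense cluster: >= tau * |P| points in a ball of radius (sqrt(2)-eps)*R
        # around a data point; recurse on it with the smaller radius.
        for i in range(len(P)):
            if alive[i]:
                member = alive & (np.linalg.norm(P - P[i], axis=1) <= shrunk)
                if member.sum() >= tau * len(P):
                    node["clusters"].append(
                        build_tree(P[member], shrunk, tau, eps, depth + 1, max_depth, rng))
                    alive &= ~member
        # Cap-carve the rest with one random direction (the real construction
        # uses many caps); recurse on each side, where clusters may reappear.
        rest = P[alive]
        node["g"] = rng.standard_normal(P.shape[1])
        inside = rest @ node["g"] >= 1.0
        node["caps"] = [build_tree(rest[inside], R, tau, eps, depth + 1, max_depth, rng),
                        build_tree(rest[~inside], R, tau, eps, depth + 1, max_depth, rng)]
        return node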

Page 17

Hash function

Described by a tree (like a hash table)

(*picture not to scale & dimension)

Page 18

Dense clusters

Current dataset: radius 𝑅
A dense cluster (writing 𝜏 for the density threshold):
  contains ≥ 𝜏 · 𝑛 points
  smaller radius: (√2 − 𝜖)𝑅

After we remove all clusters: for any point on the surface, there are at most 𝜏 · 𝑛 points within distance (√2 − 𝜖)𝑅
  the other points are essentially orthogonal!

When applying Cap Carving with parameters (𝑃₁, 𝑃₂):
  empirical number of far points colliding with the query: ≈ 𝑃₂ · 𝑛 + 𝜏 · 𝑛
  as long as 𝜏 ≲ 𝑃₂, the "impurity" doesn't matter!

(𝜏 vs. 𝜖: a trade-off)

Page 19

Tree recap

During query:
  recurse in all clusters
  just one bucket in CapCarving
Will look in more than one leaf! How much branching?

Claim: at most 𝑛^{𝑜(1)}
  each time we branch: at most 1/𝜏 clusters (each contains a 𝜏 fraction of the points)
  a cluster reduces the radius, so the cluster-depth of the tree is bounded

Progress in 2 ways:
  clusters reduce the radius
  CapCarving nodes reduce the number of far points (the empirical 𝑃₂)

A tree succeeds with probability ≈ 𝑛^{−𝜌}; querying the tree is sketched below

(𝜏 vs. 𝜖: a trade-off)
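A sketch of querying the tree from the construction sketch above (same assumed node layout; 𝑞 is a NumPy vector): descend into every dense-cluster child, but follow only the one cap bucket that 𝑞 hashes to.

    def query_tree(node, q):
        # Returns candidate points; verify them exactly afterwards.
        if "points" in node:                     # leaf
            return list(node["points"])
        out = []
        for child in node["clusters"]:           # recurse in ALL clusters
            out += query_tree(child, q)
        side = 0 if q @ node["g"] >= 1.0 else 1  # just ONE CapCarving bucket
        out += query_tree(node["caps"][side], q)
        return out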

Page 20

Fast preprocessing

How to find the dense clusters fast?

Step 1: reduce to quadratic time
  enough to consider centers that are data points
Step 2: reduce to near-linear time: down-sample!
  OK because we only look for clusters of size ≥ 𝜏 · 𝑛
  after down-sampling, a dense cluster is still somewhat heavy (sketch below)
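A sketch of the down-sampling idea (the sample rate and the slack factor 1/2 are illustrative): a cluster with ≥ 𝜏 · 𝑛 points keeps roughly a 𝜏 fraction of any random sample, so it can already be detected in the sample.

    import numpy as np

    def find_cluster_centers(P, R, tau, sample_rate=0.05, seed=0):
        rng = np.random.default_rng(seed)
        S = P[rng.random(len(P)) < sample_rate]    # down-sampled dataset
        centers = []
        for s in S:                                # centers are data points (Step 1)
            # A cluster with >= tau * n points keeps ~ tau * |S| sampled
            # points, so a relaxed threshold on the sample suffices.
            if (np.linalg.norm(S - s, axis=1) <= R).sum() >= 0.5 * tau * len(S):
                centers.append(s)
        return centers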

Page 21

Even more details

Analysis: instead of working with the "probability of collision with a far point" 𝑃₂, work with an "empirical estimate" (the actual number of far points colliding)
  a little delicate: interplay with the "probability of collision with the close point" 𝑃₁
  the empirical count matters only for the bucket the query falls into
  need to condition on collision with the close point: a 3-point collision analysis for LSH

In dense clusters, points may appear anywhere inside the balls, whereas CapCarving works for points on the sphere
  need to partition the balls into thin shells (introduces more branching; sketch below)
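A sketch of the thin-shell bucketing (the shell width 𝛿 is an illustrative parameter): points of a cluster ball are first bucketed by distance from the center, so that each shell is approximately a sphere on which CapCarving applies.

    import numpy as np

    def shell_index(p, center, delta):
        # Shell k holds points at distance [k*delta, (k+1)*delta) from center.
        return int(np.linalg.norm(np.asarray(p) - np.asarray(center)) // delta)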

Page 22

Data-dependent hashing: beyond LSH

A reduction from worst-case to average-case NNS, with query time 𝑛^𝜌 and space 𝑛^{1+𝜌}, where 𝜌 = 1/(2𝑐² − 1) for Euclidean space

Better bounds?
  For dimension 𝑑 = 𝑂(log 𝑛), can get better 𝜌! [Laarhoven'16], like it is the case for closest pair [Alman-Williams'15]
  For larger 𝑑: the above 𝜌 is optimal even for data-dependent hashing! [A.-Razenshteyn'??]: in the right formalization (to rule out the Voronoi diagram), where the description complexity of the hash function is bounded

Better reduction? NNS for ℓ∞:
  [Indyk'98] gets 𝑂(log log 𝑑) approximation (poly space, sublinear query time)
  matching lower bounds in the relevant model [ACP'08, KP'12]
  can be thought of as data-dependent hashing

NNS for any norm (e.g., matrix norms, EMD)?

Lower bounds: an opportunity to discover new algorithmic techniques