
PAPERS

Piotr Indyk, Rajeev Motwani: Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality. STOC 1998.

Eyal Kushilevitz, Rafail Ostrovsky, Yuval Rabani: Efficient Search for Approximate Nearest Neighbor in High Dimensional Spaces. SIAM J. Comput., 2000.

Mayur Datar, Nicole Immorlica, Piotr Indyk, Vahab S. Mirrokni: Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. Symposium on Computational Geometry 2004.

Alexandr Andoni, Mayur Datar, Nicole Immorlica, Vahab S. Mirrokni, Piotr Indyk: Locality-Sensitive Hashing Using Stable Distributions. In: Nearest Neighbor Methods in Learning and Vision: Theory and Practice, 2006.

Aneesh Sharma, Michael Wand
CS 468 | Geometric Algorithms

Approximate Nearest Neighbors Search in High Dimensions and Locality-Sensitive Hashing

2

Overview

• Introduction

• Locality Sensitive Hashing (Aneesh)

• Hash Functions Based on p-Stable Distributions (Michael)

3

Overview

• Introduction

• Nearest neighbor search problems

• Higher dimensions

• Johnson-Lindenstrauss lemma

• Locality Sensitive Hashing (Aneesh)

• Hash Functions Based on p-Stable Distributions (Michael)

4

Problem

5

Problem Statement

Today’s Talks: NN search in high-dimensional spaces

• Given

• Point set P = {p1, …, pn}

• a query point q

• Find

• [ε-approximate] nearest neighbor to q from P

• Goal:

• Sublinear query time

• “Reasonable” preprocessing time & space

• “Reasonable” growth in d (exponential not acceptable)
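As a baseline (a minimal sketch, not from the slides), here is the exact linear-scan search these goals are measured against; its Θ(dn) per-query cost is what “sublinear query time” must beat. All names and parameters are illustrative.

```python
import numpy as np

def nearest_neighbor(P: np.ndarray, q: np.ndarray) -> int:
    """Exact NN by linear scan: Theta(d*n) work per query.
    P: (n, d) array of points, q: (d,) query point."""
    dists = np.linalg.norm(P - q, axis=1)  # all n distances at once
    return int(np.argmin(dists))           # index of the closest point

# Example: n = 1000 points in d = 64 dimensions.
rng = np.random.default_rng(0)
P = rng.standard_normal((1000, 64))
q = rng.standard_normal(64)
print(nearest_neighbor(P, q))
```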

6

Example Application: Feature spaces

• Vectors x ∈ R^d represent characteristic features of objects

• There are often many features

• Use the nearest neighbor rule for classification / recognition

[Figure: car feature space with axes “mileage” and “top speed”; clusters for SUV, sports car, and sedan; “?” marks a query point]

7

Applications

“Real World” Example: Image Completion

8

Applications

“Real World” Example: Image Completion

• Iteratively fill in pixels with best match (+ multi-scale)

• Typically 5 × 5 … 9 × 9 neighborhoods, i.e. dimension 25 … 81

• Performance limited by nearest neighbor search

• 3D version: dimension 81 … 729

9

Higher Dimensions

10

Higher Dimensions are Weird

Issues with High-Dimensional Spaces:

• d-dimensional space: d independent neighboring directions to each point

• Volume-distance ratio explodes: vol(r) ∈ Θ(r^d)

[Figure: balls for d = 1, 2, 3, d → ∞]

11

No Grid Tricks

Regular Subdivision Techniques Fail

• Regular k-grids contain k^d cells

• The “grid trick” does not work

• Adaptive grids usually do not help either

• Conventional integration becomes infeasible (⇒ Monte-Carlo approximation)

• Finite element function representations become infeasible

12

Higher Dimensions are Weird

More Weird Effects:

• Dart-throwing anomaly

• Normal distributions gather probability mass in thin shells [Bishop 95]

• Nearest neighbor ~ farthest neighbor
  • For unstructured points (e.g. i.i.d. random)
  • Not true for certain classes of structured data [Beyer et al. 99]

[Plots: distance distributions for d = 1 … 200]

13

Johnson-Lindenstrauss Lemma

14

Johnson-Lindenstrauss Lemma

JL-Lemma: [Dasgupta et al. 99]

• Point set P in R^d, n := #P

• There is f : R^d → R^k, k ∈ O(ε^(-2) ln n)
  (k ≥ 4(ε²/2 − ε³/3)^(-1) ln n)

• …that preserves all inter-point distances up to a factor of (1 + ε)

A random orthogonal linear projection works with probability ≥ 1 − 1/n
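A quick empirical sketch of the lemma (illustrative, not from the slides): project with a scaled random Gaussian matrix, a standard stand-in for a random orthogonal projection, and check the distortion of all pairwise distances.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, eps = 200, 10_000, 0.5
k = int(np.ceil(4 * np.log(n) / (eps**2 / 2 - eps**3 / 3)))  # JL bound on k

P = rng.standard_normal((n, d))
f = rng.standard_normal((d, k)) / np.sqrt(k)  # random projection R^d -> R^k
Q = P @ f

def pdists(X):
    """All pairwise Euclidean distances, computed via the Gram matrix."""
    G = X @ X.T
    sq = np.diag(G)
    D2 = sq[:, None] + sq[None, :] - 2 * G
    iu = np.triu_indices(len(X), k=1)
    return np.sqrt(np.maximum(D2[iu], 0))

ratio = pdists(Q) / pdists(P)
print(k, ratio.min(), ratio.max())  # ratios lie in [1-eps, 1+eps] w.h.p.
```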

15

This means…

What Does the JL-Lemma Imply?

Pairwise distances in a small point set P (sub-exponential in d) can be well preserved in a low-dimensional embedding

What does it not say?

Does not imply that the points themselves are well-represented (just the pairwise distances)

16

Experiment

17

Intuition

Difference Vectors

• Normalize (relative error)

• A pole yields a bad approximation

• The non-pole area is much larger (high dimension)

• Need a large number of poles (exponential in d)

[Figure: difference vector on the unit sphere; good vs. bad projection directions, with a “no-go area” near the pole]

18

Overview

• Introduction

• Locality Sensitive Hashing

• Approximate Nearest Neighbors

• Big picture

• LSH on unit hypercube

- Setup

- Main idea

- Analysis

- Results

• Hash Functions Based on p-Stable Distributions

19

Approximate Nearest Neighbors

20

ANN: Decision version

Input: P, q, r. Output:

• If there is a point within distance r of q, return yes and output an ε-approximate NN (any point within distance r(1+ε))

• If there is no point within distance r(1+ε), return no

• Otherwise, either answer is acceptable


23

ANN: Decision version

General ANN reduces to PLEB (Point Location in Equal Balls):
the decision version plus binary search over the radius r

24

ANN: previous results

            Preprocessing time    Space used      Query time
Kd-tree     O(n log n)            O(n)            O(2^d log n)
Voronoi     O(n^(d/2))            O(n^(d/2))      O(2^d log n)
LSH         O(n^(1+ρ) log n)      O(n^(1+ρ))      O(n^ρ log n)

25

LSH: Big picture

26

Locality Sensitive Hashing

• Remember: solving decision ANN

• Input:

• No. of points: n

• Number of dimensions: d

• Point set: P

• Query point: q

27

LSH: Big Picture

• Family of hash functions:

• Close points to same buckets

• Faraway points to different buckets

• Choose a random function and hash P

• Only store non-empty buckets

28

LSH: Big Picture

• Hash q into the table

• Test every point in q’s bucket for ANN

• Problem:

• q’s bucket may be empty

29

LSH: Big Picture

• Solution:

• Use a number of hash tables!

• We are done if any ANN is found

30

LSH: Big Picture

• Problem:

• Poor resolution → too many candidates!

• Stop after reaching a limit; the failure probability is small

31

LSH: Big Picture

• Want to find a hash function:

• h is randomly picked from a family

• Choose α > β and R = r(1 + ε) such that:

  If u ∈ B(q, r) then Pr[h(u) = h(q)] ≥ α
  If u ∉ B(q, R) then Pr[h(u) = h(q)] ≤ β

32

LSH on unit Hypercube

33

Setup: unit hypercube

• Points lie on hypercube: Hd = {0,1}d

• Every point is a binary string

• Hamming distance (r): number of different coordinates

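A tiny illustration of this setup (not from the slides):

```python
def hamming(a: str, b: str) -> int:
    """Hamming distance: number of coordinates where two binary strings differ."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

print(hamming("10110", "11100"))  # -> 2
```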

35

Main idea

36

Hash functions for hypercube

• Define family F:

  Given: hypercube H^d = {0,1}^d, point b = b_1 … b_d ∈ H^d
  F: h_i(b) = b_i for i ∈ {1, …, d}
  α = 1 − r/d,  β = 1 − r(1+ε)/d

• Intuition: compare a random coordinate

• Called an (r, r(1+ε), α, β)-sensitive family

37

Hash functions for hypercube

• Define family G:

  Given: H^d and F
  G: g : {0,1}^d → {0,1}^k,  g(b) = (h_1(b), …, h_k(b)) for h_i ∈ F, b ∈ H^d
  α′ = α^k = (1 − r/d)^k,  β′ = β^k = (1 − r(1+ε)/d)^k

• Intuition: compare k random coordinates

• Choose k later – logarithmic in n ← J-L lemma
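A minimal sketch of the two families in Python (illustrative, not from the slides): h_i samples a single coordinate, and g concatenates k independently chosen coordinate samplers.

```python
import random

def make_h(i: int):
    """h_i in F: return coordinate i of a binary string b."""
    return lambda b: b[i]

def make_g(d: int, k: int, rng: random.Random):
    """g in G: concatenation of k coordinate samplers h_i with random i."""
    hs = [make_h(rng.randrange(d)) for _ in range(k)]
    return lambda b: "".join(h(b) for h in hs)

rng = random.Random(0)
g = make_g(d=16, k=4, rng=rng)
print(g("1011001110001011"))  # a 4-bit bucket key
```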

38

Constructing hash tables

• Choose g_1, …, g_τ uniformly at random from G

• Construct τ hash tables and hash P into each

• Will choose τ later

39

LSH: ANN algorithm

• Hash q into each of g_1, …, g_τ

• Check colliding points for ANN

• Stop if there are more than 4τ collisions, return fail
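Combining the last two slides, a self-contained sketch of preprocessing and query (illustrative; the parameter values below are arbitrary, the principled choices of k and τ follow on the next slides):

```python
import random
from collections import defaultdict

def build_tables(P, d, k, tau, rng):
    """Hash every point of P into tau independent tables, one per g_j."""
    gs = [[rng.randrange(d) for _ in range(k)] for _ in range(tau)]
    tables = [defaultdict(list) for _ in range(tau)]
    for p in P:
        for g, table in zip(gs, tables):
            table[tuple(p[i] for i in g)].append(p)
    return gs, tables

def query(q, gs, tables, r, eps, tau):
    """Return a point within Hamming distance r(1+eps) of q, or None
    after more than 4*tau collisions (the fail case)."""
    collisions = 0
    for g, table in zip(gs, tables):
        for p in table.get(tuple(q[i] for i in g), []):
            if sum(x != y for x, y in zip(p, q)) <= r * (1 + eps):
                return p
            collisions += 1
            if collisions > 4 * tau:
                return None  # fail; happens only with small probability
    return None

rng = random.Random(0)
d, k, tau, r, eps = 32, 8, 20, 4, 1.0
P = [tuple(rng.randrange(2) for _ in range(d)) for _ in range(1000)]
q = tuple(rng.randrange(2) for _ in range(d))
gs, tables = build_tables(P, d, k, tau, rng)
print(query(q, gs, tables, r, eps, tau))
```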

40

Details…

41

Choosing parameters

• Choose k and τ to ensure a constant probability of:
  • Finding an ANN if there is a NN
  • Few collisions (< 4τ) when there is no ANN

Define:  ρ = ln(1/α) / ln(1/β)
Choose:  k = ln n / ln(1/β),  τ = 2n^ρ

42

Analysis of LSH

• Probability of finding an ANN if there is a NN

• Consider a point p ∈ B(q, r) and a hash function g_i ∈ G:

  Pr[g_i(p) = g_i(q)] ≥ α^k = α^(ln n / ln(1/β)) = n^(−ln(1/α)/ln(1/β)) = n^(−ρ)

43

Analysis of LSH

• Probability of finding an ANN if there is a NN

• Consider a point p ∈ B(q, r) and the hash functions g_i ∈ G:

  Pr[g_i hashes p and q to different locations] ≤ 1 − n^(−ρ)
  Pr[p and q collide at least once in the τ tables] ≥ 1 − (1 − n^(−ρ))^(2n^ρ) ≥ 1 − 1/e² > 4/5

44

Analysis of LSH

• Probability of collision if there is no ANN

• Consider a point p ∉ B(q, r(1+ε)) and a hash function g ∈ G:

  Pr[g(p) = g(q)] ≤ β^k = exp(ln β · ln n / ln(1/β)) = exp(−ln n) = 1/n

45

Analysis of LSH

• Probability of collision if there is no ANN

• Consider a point p ∉ B(q, r(1+ε)) and a hash function g ∈ G:

  E[collisions with q in a single table] ≤ n · 1/n = 1
  E[collisions with q in τ tables] ≤ τ
  Pr[≥ 4τ collisions] ≤ τ/(4τ) = 1/4   (Markov’s inequality)
  Pr[< 4τ collisions] ≥ 3/4
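Consolidating the last four slides (the union-bound step is my addition; the slides state the two bounds separately), the overall success probability is a constant:

```latex
\[
\Pr[\text{success}]
\;\ge\; \underbrace{\bigl(1 - (1 - n^{-\rho})^{2n^{\rho}}\bigr)}_{\text{some table finds } p \in B(q,r)}
\;-\; \underbrace{\tfrac{1}{4}}_{\Pr[\ge 4\tau \text{ collisions}]}
\;\ge\; 1 - e^{-2} - \tfrac{1}{4} \;>\; \tfrac{3}{5}.
\]
```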

46

Results

47

Complexity of LSH

• Given: an (r, r(1+ε), α, β)-sensitive family for the Hypercube

• Can answer Decision-ANN with:
  O(dn + n^(1+ρ)) space, O(dn^ρ) query time

• Show:
  ρ = ln(1/α) / ln(1/β) = ln(1 − r/d) / ln(1 − r(1+ε)/d) ≤ 1/(1+ε)
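The bound ρ ≤ 1/(1+ε) follows from Bernoulli's inequality; a short derivation (my addition, the slide states only the claim):

```latex
\[
\rho \le \frac{1}{1+\varepsilon}
\iff (1+\varepsilon)\ln\frac{1}{\alpha} \le \ln\frac{1}{\beta}
\iff \beta \le \alpha^{1+\varepsilon}
\iff 1 - \frac{(1+\varepsilon)r}{d} \le \Bigl(1 - \frac{r}{d}\Bigr)^{1+\varepsilon}
\]
```

and the last statement is Bernoulli's inequality (1 − x)^(1+ε) ≥ 1 − (1+ε)x for x ∈ [0, 1], ε ≥ 0.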

48

Complexity of LSH

• Given: an (r, r(1+ε), α, β)-sensitive family for the Hypercube

• Can answer Decision-ANN with:
  O(dn + n^(1+1/(1+ε))) space, O(dn^(1/(1+ε))) query time

49

Complexity of LSH

• Can amplify the success probability

• Build O(log n) structures

• Can answer Decision-ANN with:
  O((dn + n^(1+1/(1+ε))) log n) space, O(dn^(1/(1+ε)) log n) query time

50

Complexity of LSH

• Can answer (general) ANN on the Hypercube:

• Build O(ε^(-1) log n) structures with radii r_i = (1+ε)^i

• O((dn + n^(1+1/(1+ε))) ε^(-1) log² n) space,
  O(dn^(1/(1+ε)) log n · log(ε^(-1) log n)) query time

51

LSH - Summary

• Randomized Monte-Carlo algorithm for ANN

• First truly sub-linear query time for ANN

• Need to examine only logarithmic number of coordinates

• Can be extended to any metric space if we can find a hash function for it!

• Easy to update dynamically

• Can reduce ANN in R^d to ANN on the hypercube

52

Overview

• Introduction

• Locality Sensitive Hashing

• Hash Functions Based on p-Stable Distributions

• The basic idea

• The details (more formal)

• Analysis, experimental results

53

LSH by Random Projections

Idea:

• Hash function is a projection to a line of random orientation

• One composite hash function is a random grid

• Hashing buckets are grid cells

• Multiple grids are used for probability amplification

• Jitter the grid offset randomly (check only one cell)

• Double hashing: do not store empty grid cells

54

LSH by Random Projections

Basic Idea: [Figure: points projected onto a random line that is partitioned into buckets]

55

LSH by Random Projections

Questions:

• What distribution should be used for the projection vectors?

• What is a good bucket size?

• Locality sensitivity:
  • How many lines per grid?
  • How many hash grids overall?
  • Depends on sensitivity (as explained before)

• How efficient is this scheme?

56

The Details

57

p-Stable Distributions

Distribution for the Projection Vectors:

• Need to make the projection process formally accessible

• Mathematical tool: p-stable distributions

58

p-Stable Distributions

p-Stable Distributions:

A prob. distribution D is called p-stable :⇔• For any v1, … ,vn ∈ R• And i.i.d. random variables X1, … ,Xn ~ D

ΣviXi has the same distribution as Σ |vi |p

1/pX

where X ~ D

i i
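An empirical sanity check of the definition for p = 2 (illustrative, not from the slides): for Gaussian X_i, Σ v_i X_i should match (Σ v_i²)^(1/2) · X in distribution.

```python
import numpy as np

rng = np.random.default_rng(2)
v = np.array([3.0, -1.0, 2.0])         # arbitrary coefficients v_i
X = rng.standard_normal((100_000, 3))  # i.i.d. X_i ~ N(0, 1)

lhs = X @ v                                             # sum_i v_i X_i
rhs = np.linalg.norm(v) * rng.standard_normal(100_000)  # (sum v_i^2)^(1/2) X

print(lhs.std(), rhs.std())  # both ~ sqrt(14) = 3.74...: 2-stability
```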

59

Gaussian Distribution

Gaussian Normal Distributions are 2-stable

[Figure: rotationally symmetric 2D Gaussian over axes x_1, x_2]

60

More General Distributions

Other distributions:

• The Cauchy distribution, with density 1/(π(1 + x²)), is 1-stable
  (it must have infinite variance, so that the central limit theorem is not violated)

• Distributions exist for p ∈ (0, 2]

• No closed form, but they can be sampled

• Sampling is sufficient for the LSH algorithm

61

Projection

Projection Algorithm:

• Choose p according to the l_p metric of the space

• Compute a vector v_i with entries drawn from a p-stable distribution
  [for example: Gaussian noise entries]

• Each vector v_i yields a hash function h_i

• Compute:

  h_i(x) = ⌊(⟨v_i, x⟩ + b) / r⌋

  with a random offset b ∈ [0, r] and bucket size r
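A direct transcription of this hash function into Python for p = 2 (illustrative; the parameter values are arbitrary):

```python
import numpy as np

def make_hash(d: int, r: float, rng: np.random.Generator):
    """One p-stable LSH function h(x) = floor((<v, x> + b) / r) for p = 2."""
    v = rng.standard_normal(d)  # 2-stable (Gaussian) projection vector
    b = rng.uniform(0, r)       # random offset in [0, r]
    return lambda x: int(np.floor((v @ x + b) / r))

rng = np.random.default_rng(3)
h = make_hash(d=64, r=4.0, rng=rng)
x = rng.standard_normal(64)
print(h(x), h(x + 0.01))  # nearby points usually land in the same bucket
```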

62

Locality Sensitive HF

Locality Sensitive Hash Functions

H = {h : S → U} is (r_1, r_2, α, β)-sensitive :⇔
• v ∈ B(q, r_1) ⇒ Pr[collision(v, q)] ≥ α
• v ∉ B(q, r_2) ⇒ Pr[collision(v, q)] ≤ β

Performance: with ρ = ln(1/α) / ln(1/β),
O(dn + n^(1+ρ)) space, O(dn^ρ) query time

63

Locality Sensitivity

Computing the Locality “Sensitivity”

• Distance c = ||v_1 − v_2||_p

• The projected distance is then cX-distributed, with X from the p-stable distribution

• With h(x) = ⌊(⟨v, x⟩ + b) / r⌋, the collision probability is

  p(c) := Pr(collision) = ∫₀^r (1/c) f_p(t/c) · (1 − t/r) dt

  (first factor: density of the absolute projected distance; second factor: probability of hitting the same bucket)

• The constructed family of hash functions is (r_1, r_2, α, β)-sensitive for
  α = p(1),  β = p(c),  r_2/r_1 = c
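A numeric evaluation of this integral for the Gaussian case p = 2 (illustrative; f_2 is the density of |N(0,1)|, i.e. the folded standard normal):

```python
import numpy as np

def collision_prob(c: float, r: float, n: int = 100_000) -> float:
    """p(c) = integral_0^r (1/c) f_2(t/c) (1 - t/r) dt, by Riemann sum."""
    t = (np.arange(n) + 0.5) * (r / n)                   # midpoints in [0, r]
    f2 = np.sqrt(2 / np.pi) * np.exp(-(t / c) ** 2 / 2)  # f_2 at t/c
    return float(np.sum((1 / c) * f2 * (1 - t / r)) * (r / n))

r = 4.0
alpha, beta = collision_prob(1.0, r), collision_prob(2.0, r)
rho = np.log(1 / alpha) / np.log(1 / beta)
print(alpha, beta, rho)  # rho ~ 0.45, below 1/c = 0.5 [Datar et al. 04]
```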

64

Numerical Computation

ρ = ln(1/α) / ln(1/β);  O(dn + n^(1+ρ)) space, O(dn^ρ) query time

Numerical result: ρ ~ 1/c = 1/(1+ε) for l_1 and l_2 [Datar et al. 04]

65

Numerical Computation

Width Parameter r  (numerical plots for l_1 and l_2 in [Datar et al. 04])

• Intuitively: in the range of the ball radius

• Numerical result: not too small (too large increases k)

• Practice: factor 4 (E2LSH manual)

66

Experimental Results

67

LSH vs. ANN

Comparison with ANN (Mount, Arya; kd-/BBD-trees)

MNIST handwritten digits: 60,000 images of 28×28 pixels (d = 784) [Datar et al. 04]

68

LSH vs. ANN

Remarks:

• ANN with c = 10 is comparably fast and 65% correct,

but there are no guarantees [Indyk]

• LSH needs more memory:

1.2GB vs. 360MB [Indyk]

• Empirically, LSH shows linear performance when

forced to use linear memory [Goldstein et al. 05]

• Benchmark searches only for points in the data set,

LSH is much slower for negative results

[Goldstein et al. 05, report ~1.5 ord. of mag.]