Summer School on Hashing '14
Locality Sensitive Hashing
Alex Andoni (Microsoft Research)
Nearest Neighbor Search (NNS)
• Preprocess: a set of points P
• Query: given a query point q, report a point p ∈ P with the smallest distance to q
Approximate NNS
• r-near neighbor: given a new point q, report a point p ∈ P s.t. dist(p, q) ≤ cr
  (c-approximate) if there exists a point at distance ≤ r
• Randomized: such a point is returned with 90% probability
[Figure: query q, a near neighbor p within radius r, and the approximation radius cr]
Heuristic for Exact NNS
• r-near neighbor: given a new point q, report a set of points with:
  • all points p s.t. dist(p, q) ≤ r (each with 90% probability)
  • the set may also contain some approximate neighbors p s.t. dist(p, q) ≤ cr
• Can filter out bad answers
[Figure: query q, near neighbor p within radius r, approximation radius cr]
Locality-Sensitive Hashing [Indyk-Motwani'98]
• Random hash function g s.t. for any points p, q:
  • Close: when dist(p, q) ≤ r, the collision probability Pr[g(p) = g(q)] = P1 is "not-so-small"
  • Far: when dist(p, q) > cr, the collision probability Pr[g(p) = g(q)] = P2 is "small"
• Use several hash tables: n^ρ of them, where
      ρ = log(1/P1) / log(1/P2)
Locality sensitive hash functions
• Hash function g is usually a concatenation of "primitive" functions:
  g(p) = ⟨h1(p), h2(p), …, hk(p)⟩
• Example: Hamming space
  • h(p) = p_i, i.e., choose bit i for a random coordinate i
  • g chooses k bits at random
  • Pr[h(p) = h(q)] = 1 − dist(p, q)/d
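As an illustrative sketch (the function names and parameters are mine, not from the slides), the Hamming-space construction can be written as:

```python
import random

def make_hamming_g(d, k, seed=None):
    """Sample g = <h_1, ..., h_k>: each primitive h_i reads one
    randomly chosen bit, so g reads k random coordinates of p."""
    rng = random.Random(seed)
    coords = [rng.randrange(d) for _ in range(k)]
    def g(p):                      # p: sequence of d bits
        return tuple(p[i] for i in coords)
    return g

# One primitive collides with probability 1 - dist(p, q)/d, so the
# concatenation g collides with probability (1 - dist(p, q)/d)**k.
```

For p, q at Hamming distance 1 out of d = 8 bits, a single coordinate agrees with probability 7/8, so a k = 3 concatenation collides about (7/8)³ ≈ 67% of the time.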
Full algorithm
• Data structure is just L hash tables:
  • Each hash table uses a fresh random function g_i(p) = ⟨h_{i,1}(p), …, h_{i,k}(p)⟩
  • Hash all dataset points into the table
• Query:
  • Check for collisions in each of the L hash tables
  • until we encounter a point within distance cr
• Guarantees:
  • Space: O(nL), plus space to store the points
  • Query time: O(L · (k + d)) (in expectation)
  • 50% probability of success.
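A minimal end-to-end sketch of this data structure for Hamming space (function and variable names are mine, chosen for illustration):

```python
import random
from collections import defaultdict

def build(points, d, k, L, seed=0):
    """Build L hash tables; table i uses a fresh random g_i
    (here: k random bit positions, as in Hamming-space LSH)."""
    rng = random.Random(seed)
    tables = []
    for _ in range(L):
        coords = [rng.randrange(d) for _ in range(k)]
        buckets = defaultdict(list)
        for idx, p in enumerate(points):
            buckets[tuple(p[i] for i in coords)].append(idx)
        tables.append((coords, buckets))
    return tables

def query(q, points, tables, cr):
    """Scan colliding points table by table; stop at the first
    point within Hamming distance cr of q."""
    for coords, buckets in tables:
        for idx in buckets.get(tuple(q[i] for i in coords), []):
            if sum(a != b for a, b in zip(points[idx], q)) <= cr:
                return idx
    return None                    # no cr-near point found
```

Note the distance check inside the loop: collisions only propose candidates, and a candidate is reported only after verifying it is within cr.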
Analysis of LSH Scheme
• Choice of parameters k, L?
  • L hash tables, each with g(p) = ⟨h1(p), …, hk(p)⟩
  • set k s.t. Pr[collision of far pair] = P2^k = 1/n
  • then Pr[collision of close pair] = P1^k = (P2^k)^ρ = n^(−ρ)
  • Hence L = n^ρ "repetitions" (tables) suffice!
[Figure: collision probability as a function of distance; at k = 1 the probabilities at r and cr are P1 and P2, at k = 2 they drop to P1² and P2², so concatenation sharpens the gap]
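Concretely, the parameter choice can be sketched as follows (a toy calculation; the values of n, P1, P2 are picked arbitrarily for illustration):

```python
import math

def lsh_params(n, P1, P2):
    """Pick k so a far pair collides with prob. P2**k <= 1/n;
    rho = log(1/P1)/log(1/P2) then gives L = n**rho tables."""
    k = math.ceil(math.log(n) / math.log(1 / P2))
    rho = math.log(1 / P1) / math.log(1 / P2)
    L = math.ceil(n ** rho)
    return k, rho, L

k, rho, L = lsh_params(n=10**6, P1=0.9, P2=0.5)
# n = 10^6, P1 = 0.9, P2 = 0.5 gives k = 20 and only L = 9 tables
```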
Analysis: Correctness
• Let p* be an r-near neighbor of q
  • If p* does not exist, the algorithm can output anything
• Algorithm fails when: the near neighbor p* is not in the searched buckets
• Probability of failure:
  • Probability q and p* do not collide in one hash table: ≤ 1 − P1^k = 1 − n^(−ρ)
  • Probability they do not collide in any of the L = n^ρ hash tables: at most
    (1 − n^(−ρ))^(n^ρ) ≤ 1/e
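The last step uses the standard fact (1 − 1/L)^L ≤ 1/e; a quick numeric check (my own, not from the slides):

```python
import math

# Missing all L = n**rho tables happens with probability
# (1 - n**(-rho))**(n**rho), which is at most 1/e for every L >= 1.
for L in (1, 10, 100, 10_000):
    miss = (1.0 - 1.0 / L) ** L
    assert miss <= 1.0 / math.e
# Repeating the whole structure O(1) times then boosts the success
# probability to any desired constant.
```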
Analysis: Runtime
• Runtime dominated by:
  • Hash function evaluation: O(L · k) time
  • Distance computations to points in buckets
• Distance computations:
  • Care only about far points, at distance > cr
  • In one hash table, the probability a far point collides with q is at most P2^k = 1/n
  • Expected number of far points in a bucket: n · (1/n) = 1
  • Over L hash tables, the expected number of far points is L
• Total: O(L · (k + d)) in expectation
NNS for Euclidean space [Datar-Immorlica-Indyk-Mirrokni'04]
• Hash function g is a concatenation of "primitive" functions:
  g(p) = ⟨h1(p), …, hk(p)⟩
• LSH function h:
  • pick a random line ℓ, project point p onto ℓ, and quantize:
    h(p) = ⌊(⟨a, p⟩ + b) / w⌋
  • a is a random Gaussian vector
  • b random in [0, w)
  • w is a parameter (e.g., 4)
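A minimal sketch of one such primitive function (names are mine; this is an illustration of the project-shift-quantize idea, not the paper's code):

```python
import math
import random

def make_euclidean_h(d, w=4.0, seed=None):
    """One primitive h(p) = floor((<a, p> + b) / w): project p onto a
    random Gaussian direction a, shift by b ~ U[0, w), and quantize
    into cells of width w."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, 1.0) for _ in range(d)]
    b = rng.uniform(0.0, w)
    def h(p):
        return math.floor((sum(ai * pi for ai, pi in zip(a, p)) + b) / w)
    return h
```

Because ⟨a, p − q⟩ is Gaussian with standard deviation ‖p − q‖, points much closer than w land in the same cell far more often than points much farther than w.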
Optimal* LSH [A-Indyk'06]
• Regular grid → grid of balls
  • p can hit empty space, so take more such grids until p is in a ball
• Need (too) many grids of balls
  • Start by projecting into dimension t
• Analysis gives ρ = 1/c² + o(1)
• Choice of reduced dimension t?
  • Tradeoff between
    • # hash tables, n^ρ, and
    • time to hash, t^O(t)
• Total query time: dn^(1/c² + o(1))
[Figure: a 2D grid of balls, and the point p projected into R^t]
Proof idea
• Claim: ρ = 1/c², i.e., P(r) ≈ exp(−A·r²)
  • P(r) = probability of collision when ‖p − q‖ = r
• Intuitive proof:
  • Projection approximately preserves distances [JL]
  • P(r) = intersection / union
  • P(r) ≈ probability a random point u falls beyond the dashed line
  • Fact (high dimensions): the x-coordinate of u has a nearly Gaussian distribution
    → P(r) ≈ exp(−A·r²)
[Figure: two balls centered at p and q at distance r; P(r) is the volume of their intersection divided by the volume of their union]
LSH Zoo
• Hamming distance:
  • h: pick a random coordinate (or several) [IM98]
• Manhattan distance:
  • h: random grid [AI'06]
• Jaccard distance between sets: J(A, B) = 1 − |A ∩ B| / |A ∪ B|
  • h: pick a random permutation π on the universe;
    min-wise hashing: h(A) = min_{a ∈ A} π(a) [Bro'97, BGMZ'97]
• Angle: sim-hash [Cha'02, …]
[Figure: min-wise hashing example on the shingle sets {be, not, or, to} of "To be or not to be" and {not, or, to, Simons} of "To Simons or not to Simons", under the permutation π = (be, to, Simons, or, not)]
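The min-wise hashing line can be made concrete with the slide's own example (a toy sketch; the helper name is mine):

```python
import random

def minhash(s, perm):
    """Min-wise hash of set s: its earliest element under the
    permutation perm (a list giving the universe in permuted order)."""
    return min(s, key=perm.index)

A = {"be", "not", "or", "to"}                 # shingles of "To be or not to be"
B = {"not", "or", "to", "Simons"}             # shingles of "To Simons or not to Simons"
perm = ["be", "to", "Simons", "or", "not"]    # the permutation from the figure

# Over a uniformly random permutation, Pr[minhash(A) == minhash(B)]
# equals the Jaccard similarity |A ∩ B| / |A ∪ B| = 3/5,
# i.e. Jaccard distance 2/5.
```

Under this particular permutation minhash(A) is "be" while minhash(B) is "to", so these two sets do not collide here; averaged over random permutations they collide about 60% of the time.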
LSH in the wild
• If we want exact NNS, what is c?
  • Can choose any parameters L, k
  • Correct as long as the near neighbor still collides in some table
• Performance:
  • trade-off between # of tables and # of false positives
  • will depend on dataset "quality"
  • Can tune L, k to optimize for a given dataset
[Figure: increasing k gives fewer false positives, decreasing L gives fewer tables; "safety not guaranteed"]
Time-Space Trade-offs

  Space   | Time                 | Comment                          | Reference
  high    | low: 1 mem lookup    |                                  | [KOR'98, IM'98, Pan'06]
          |                      | lower bound: ω(1) memory lookups | [AIP'06]
          |                      | lower bound: ω(1) memory lookups | [PTW'08, PTW'10]
  medium  | medium               | ρ = 1/c (Hamming)                | [IM'98]
          |                      | ρ = 1/c² (Euclidean)             | [DIIM'04, AI'06]
          |                      | lower bound: LSH exponent tight  | [MNP'06, OWZ'11]
  low     | high                 |                                  | [Ind'01, Pan'06]
LSH is tight…
leave the rest to cell-probe lower bounds?
Data-dependent Hashing! [A-Indyk-Nguyen-Razenshteyn'14]
• NNS in Hamming space (ℓ1) with query time n^ρ, space and preprocessing n^(1+ρ), for
  ρ = 7/(8c) + O(1/c^(3/2)) + o(1)
  • optimal LSH: ρ = 1/c [IM'98]
• NNS in Euclidean space (ℓ2) with:
  ρ = 7/(8c²) + O(1/c³) + o(1)
  • optimal LSH: ρ = 1/c² [AI'06]
A look at LSH lower bounds
• LSH lower bounds in Hamming space
  • Fourier analytic [Motwani-Naor-Panigrahy'06]
• [O'Donnell-Wu-Zhou'11]:
  • fix any distribution over hash functions
  • Far pair: random points at distance ≈ εd
  • Close pair: random pair at distance ≈ εd/c
  • Get ρ ≥ 1/c
Why not an NNS lower bound?
• Suppose we try to generalize [OWZ'11] to NNS
• Pick a random dataset
  • All the "far points" are concentrated in a small ball
  • Easy to see at preprocessing: the actual near neighbor is close to the center of the minimum enclosing ball
Intuition
• Data-dependent LSH:
  • hashing that depends on the entire given dataset!
• Two components:
  • a "nice" geometric configuration with small ρ
  • a reduction from a general dataset to this "nice" geometric configuration
Nice Configuration: "sparsity"
• All points are on a sphere of radius R
  • random pairs of points are at distance ≈ R√2
• Lemma 1: spherical LSH achieves ρ = 1/(2c² − 1) in this configuration
• "Proof":
  • obtained via "cap carving"
  • similar to "ball carving" [KMS'98, AI'06]
• Lemma 1′: the same for radius ≈ cr/√2
Reduction: into spherical LSH
• Idea: apply a few rounds of "regular" LSH first
  • ball carving [AI'06], with parameters set as if the target approximation were 5c, so the query-time cost of this stage is negligible
• Intuitively:
  • far points are unlikely to collide
  • this partitions the data into buckets of small diameter
  • find the minimum enclosing ball of each bucket
  • finally apply spherical LSH on this ball!
Two-level algorithm
• L hash tables, each with:
  • hash function g(p) = ⟨h1(p), …, hl(p), s1(p), …, sm(p)⟩
  • the h's are "ball carving LSH" (data independent)
  • the s's are "spherical LSH" (data dependent)
• The final ρ is an "average" of the exponents from levels 1 and 2
Details
• Inside a bucket, need to ensure the "sparse" case:
  • 1) drop all "far pairs"
  • 2) find the minimum enclosing ball (MEB)
  • 3) partition by "sparsity" (distance from the center)
1) Far points
• In a level-1 bucket:
  • set parameters as if looking for a larger approximation
  • the probability a pair survives the level-1 hashing decreases with its distance: pairs at distance r survive with much higher probability than pairs at distance cr or beyond
  • expected number of surviving collisions between far pairs is constant
  • throw out all pairs that violate this
2) Minimum Enclosing Ball
• Use Jung's theorem:
  • a pointset with diameter Δ has an MEB of radius at most Δ/√2 (in high dimensions)
3) Partition by "sparsity"
• Inside a bucket, points are at bounded distance from each other (far pairs were dropped)
• "Sparse" LSH does not work directly in this case
• Need to partition by the distance to the center
• Partition into spherical shells of width 1
Practice of NNS
• Data-dependent partitions…
• Practice:
  • trees: kd-trees, quad-trees, ball-trees, rp-trees, PCA-trees, sp-trees…
  • often no guarantees
• Theory?
  • assuming more about the data: PCA-like algorithms "work" [Abdullah-A-Kannan-Krauthgamer'14]
Finale
• LSH: geometric hashing
• Data-dependent hashing can do better!
• Better upper bound?
  • multi-level improves a bit, but not too much
  • what is the best possible ρ?
• Best partitions in theory and practice?
• LSH for other metrics:
  • edit distance, Earth-Mover Distance, etc.
• Lower bounds (cell probe?)
Open question:
• Practical variant of ball-carving hashing?
• Design a space partition of R^t that is
  • efficient: point location in poly(t) time
  • qualitative: regions are "sphere-like":
    Pr[needle of length 1 is not cut] ≥ Pr[needle of length c is not cut]^(1/c²)