Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presented by:
Fatih Uzun
Outline
• Introduction
• Problem Description
• Key Idea
• Experiments and Results
• Conclusions
Introduction
• Similarity Search over High-Dimensional Data
– Image databases, document collections, etc.
• Curse of Dimensionality
– All space-partitioning techniques degrade to linear search in high dimensions
• Exact vs. Approximate Answer
– An approximate answer may be good enough and much faster
– Time-quality trade-off
Problem Description
• ε-Nearest Neighbor Search (ε-NNS)
– Given a set P of points in a normed space, preprocess P so as to efficiently return a point p ∈ P for any given query point q, such that
– dist(q, p) ≤ (1 + ε) · min_{r ∈ P} dist(q, r)
• Generalizes to K-nearest neighbor search (K > 1)
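The ε-NNS acceptance condition can be checked directly against a brute-force scan. The following is a minimal sketch, assuming Euclidean distance (the slides only say "normed space"); the helper name `is_eps_nearest` and the sample points are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance (an assumption; any norm could be substituted)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_eps_nearest(p, q, points, eps):
    """True if dist(q, p) <= (1 + eps) * min over r in points of dist(q, r)."""
    best = min(dist(q, r) for r in points)
    return dist(q, p) <= (1 + eps) * best

# Hypothetical toy data: the exact nearest neighbor of q is (0, 0) at distance 0.45,
# while (1, 0) lies at distance 0.55.
P = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
q = (0.45, 0.0)

print(is_eps_nearest((1.0, 0.0), q, P, eps=0.25))  # 0.55 vs. bound 1.25 * 0.45 = 0.5625
print(is_eps_nearest((1.0, 0.0), q, P, eps=0.10))  # 0.55 vs. bound 1.10 * 0.45 = 0.495
```

A larger ε admits more points as valid answers, which is the time-quality trade-off mentioned in the introduction.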
Key Idea
• Locality-Sensitive Hashing (LSH) to get sub-linear dependence on the data size for high-dimensional data
• Preprocessing
– Hash each data point using several LSH functions so that the probability of collision is higher for closer objects
Algorithm: Preprocessing
• Input
– Set of N points {p1, …, pN}
– L (number of hash tables)
• Output
– Hash tables Ti, i = 1, 2, …, L
• Foreach i = 1, 2, …, L
– Initialize Ti with a random hash function gi(.)
• Foreach i = 1, 2, …, L
– Foreach j = 1, 2, …, N
   Store point pj in bucket gi(pj) of hash table Ti
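The preprocessing loop can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' implementation: points are taken to be binary vectors, each gi samples k random coordinates (bit sampling), and the names `make_g` and `preprocess` are hypothetical.

```python
import random
from collections import defaultdict

def make_g(d, k, rng):
    """Return g(p) = (p[c1], ..., p[ck]) for k randomly chosen coordinates.

    Bit sampling over binary vectors is an assumption made for this sketch.
    """
    coords = [rng.randrange(d) for _ in range(k)]
    return lambda p: tuple(p[c] for c in coords)

def preprocess(points, L, k, seed=0):
    """Build L hash tables; table i stores index j in bucket g_i(p_j)."""
    rng = random.Random(seed)
    d = len(points[0])
    gs = [make_g(d, k, rng) for _ in range(L)]   # one random g_i per table
    tables = [defaultdict(list) for _ in range(L)]
    for g, table in zip(gs, tables):             # for i = 1..L
        for j, p in enumerate(points):           # for j = 1..N
            table[g(p)].append(j)                # store p_j in bucket g_i(p_j)
    return gs, tables

# Hypothetical toy data: three 8-bit points.
points = [
    (0, 0, 1, 1, 0, 1, 0, 0),
    (0, 0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 0, 1, 0, 1, 1),
]
gs, tables = preprocess(points, L=4, k=3)
# Every table holds each of the N points exactly once.
print(len(tables), [sum(len(b) for b in t.values()) for t in tables])
```

Storing point indices rather than the points themselves keeps the L tables small, at the cost of one extra lookup per candidate at query time.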
LSH - Algorithm
[Figure: each point pi in P is hashed by g1(pi), g2(pi), …, gL(pi) into hash tables T1, T2, …, TL]
Algorithm: ε-NNS Query
• Input
– Query point q
– K (number of approx. nearest neighbors)
• Access
– Hash tables Ti, i = 1, 2, …, L
• Output
– Set S of K (or fewer) approx. nearest neighbors
• S ← ∅
• Foreach i = 1, 2, …, L
– S ← S ∪ {points found in bucket gi(q) of hash table Ti}
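The query loop can be sketched as follows. This is a hedged illustration: it assumes binary vectors with bit-sampling hash functions and Hamming distance, and it adds a final ranking step (sort the candidates by true distance, keep at most K) that the slide does not spell out. The helper names `build` and `query` are hypothetical.

```python
import random
from collections import defaultdict

def hamming(a, b):
    """Number of coordinates in which a and b differ."""
    return sum(x != y for x, y in zip(a, b))

def build(points, L, k, seed=0):
    """Hypothetical helper: L bit-sampling hash tables over binary vectors."""
    rng = random.Random(seed)
    d = len(points[0])
    coord_sets = [[rng.randrange(d) for _ in range(k)] for _ in range(L)]
    tables = [defaultdict(list) for _ in range(L)]
    for coords, table in zip(coord_sets, tables):
        for j, p in enumerate(points):
            table[tuple(p[c] for c in coords)].append(j)
    return coord_sets, tables

def query(q, K, points, coord_sets, tables):
    """Return indices of at most K candidate neighbors of q."""
    S = set()                                       # S <- empty set
    for coords, table in zip(coord_sets, tables):   # for i = 1..L
        key = tuple(q[c] for c in coords)           # g_i(q)
        S.update(table.get(key, []))                # S <- S ∪ bucket g_i(q) of T_i
    # Assumed final step: rank candidates by true distance, keep at most K.
    return sorted(S, key=lambda j: hamming(q, points[j]))[:K]

# Hypothetical toy data: three 8-bit points.
points = [
    (0, 0, 1, 1, 0, 1, 0, 0),
    (0, 0, 1, 1, 0, 1, 0, 1),
    (1, 1, 0, 0, 1, 0, 1, 1),
]
coord_sets, tables = build(points, L=4, k=3)
q = (0, 0, 1, 1, 0, 1, 0, 0)   # identical to points[0], so it shares every bucket with it
result = query(q, 2, points, coord_sets, tables)
print(result)
```

Only the candidates in S are compared against q, so the cost depends on bucket sizes rather than on N. The result may contain fewer than K indices when too few points collide with q.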
LSH - Analysis
• Family H of (r1, r2, p1, p2)-sensitive functions {hi(.)}
– dist(p, q) < r1 ⇒ ProbH[h(q) = h(p)] ≥ p1
– dist(p, q) ≥ r2 ⇒ ProbH[h(q) = h(p)] ≤ p2
– p1 > p2 and r1 < r2
• LSH functions: gi(.) = {h1(.), …, hk(.)}
• For a proper choice of k and L, a simpler problem, the (r, ε)-Neighbor problem, and hence the actual problem, can be solved
• Query time: O(d · n^(1/(1+ε)))
– d: dimensions, n: data size
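The effect of k and L on the sensitivity gap can be explored numerically. The sketch below assumes the bit-sampling family on d-dimensional Hamming space, where a single sampled bit collides with probability 1 − dist/d; the slides do not spell out this family, so treat it as an assumption. Under it, a k-bit function g collides with probability p^k, and a point is retrieved from at least one of L tables with probability 1 − (1 − p^k)^L. The exponent ρ = ln(1/p1) / ln(1/p2) is the quantity used in the standard LSH analysis; the values of d, r1, and r2 are hypothetical.

```python
import math

def collision_prob(dist, d):
    """Pr[h(p) = h(q)] for one sampled bit when p and q differ in `dist` of d bits."""
    return 1.0 - dist / d

def retrieval_prob(p, k, L):
    """Pr[point collides with the query in at least one of L tables of k-bit keys]."""
    return 1.0 - (1.0 - p ** k) ** L

d, r1, r2 = 64, 4, 16          # hypothetical values; r2 = (1 + eps) * r1
eps = r2 / r1 - 1
p1 = collision_prob(r1, d)     # 0.9375
p2 = collision_prob(r2, d)     # 0.75

# Larger k widens the gap between near and far points; larger L restores recall.
for k, L in [(1, 1), (10, 5), (20, 10)]:
    near = retrieval_prob(p1, k, L)
    far = retrieval_prob(p2, k, L)
    print(f"k={k:2d} L={L:2d} near={near:.3f} far={far:.3f}")

rho = math.log(1 / p1) / math.log(1 / p2)
print(f"rho={rho:.4f}  1/(1+eps)={1 / (1 + eps):.4f}")
```

For these values ρ stays below 1/(1+ε), which matches the exponent in the query-time bound.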
Experiments
• Data Sets
– Color images from the COREL Draw library (20,000 points, dimensions up to 64)
– Texture information of aerial photographs (270,000 points, 60 dimensions)
• Evaluation
– Speed, Miss Ratio, and Error (%) for various data sizes, dimensions, and K values
– Compare performance with the SR-Tree (a spatial data structure)
Performance Measures
• Speed
– Number of disk block accesses needed to answer the query (# hash tables)
• Miss Ratio
– Fraction of cases in which fewer than K points are found for K-NNS
• Error
– Fractional error in the distance to the point found by LSH, relative to the true nearest-neighbor distance, averaged over the entire set of queries
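The miss ratio and error measures can be computed as in the sketch below. It covers the 1-NNS case, assumes every true nearest-neighbor distance is positive, and skips missed queries when averaging the error; that last choice is an assumption, since the slide does not say how misses are treated. The function names and sample numbers are hypothetical.

```python
def miss_ratio(results, K):
    """Fraction of queries for which fewer than K points were returned."""
    return sum(len(found) < K for found in results) / len(results)

def average_error(lsh_dists, true_dists):
    """Mean of d_lsh / d_true - 1 over queries that returned a point.

    A miss is encoded as None and skipped (an assumption of this sketch).
    True distances are assumed to be strictly positive.
    """
    errors = [
        dl / dt - 1.0
        for dl, dt in zip(lsh_dists, true_dists)
        if dl is not None
    ]
    return sum(errors) / len(errors)

# Hypothetical results for four 1-NNS queries; the third query found nothing.
results = [[3], [7], [], [2]]
lsh_dists = [1.1, 2.0, None, 3.0]
true_dists = [1.0, 2.0, 1.5, 2.5]

print(miss_ratio(results, K=1))
print(round(average_error(lsh_dists, true_dists), 3))
```

An error of 0 means LSH returned a true nearest neighbor for every answered query; an error of 0.1 means the returned points were on average 10% farther away than the true nearest neighbors.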
Speed vs. Data Size (Approximate 1-NNS)
[Figure: disk accesses (y-axis, 0 to 20) vs. number of database points (x-axis, 0 to 20,000); series: LSH with error = 0.2, 0.1, 0.05, 0.02, and SR-Tree]
Speed vs. Dimension (Approximate 1-NNS)
[Figure: disk accesses (y-axis, 0 to 20) vs. dimensions (x-axis, 0 to 80); series: LSH with error = 0.2, 0.1, 0.05, 0.02, and SR-Tree]
Speed vs. Nearest Neighbors (Approximate K-NNS)
[Figure: disk accesses (y-axis, 0 to 16) vs. number of nearest neighbors (x-axis, 0 to 120); series: LSH with error = 0.2, 0.1, 0.05]
Speed vs. Error
[Figure: disk accesses (y-axis, 0 to 450) vs. error in % (x-axis, 10 to 50); series: SR-Tree and LSH]
Miss Ratio vs. Data Size (Approximate 1-NNS)
[Figure: miss ratio (y-axis, 0 to 0.25) vs. number of database points (x-axis, 0 to 20,000); series: error = 0.1 and error = 0.05]
Conclusion
• Better query time than spatial data structures
• Scales well to higher dimensions and larger data sizes (sub-linear dependence)
• Predictable running time
• Extra storage overhead
• Inefficient for data with distances concentrated around the average
Future Work
• Investigate hybrid data structures obtained by merging tree-based and hash-based structures
• Make use of the structure of the data set to systematically obtain LSH functions
• Explore other applications of LSH-type techniques to data mining
Questions?