Similarity Search in High Dimensions via Hashing
![Page 1: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/1.jpg)
Advanced Topics in Artificial Intelligence
Similarity Search in High Dimensions via Hashing
Aristides Gionis, Piotr Indyk, Rajeev Motwani
Presenter
Maruf Aytekin, PhD Student
Computer Engineering Department, Bahcesehir University
Apr 21, 2015
![Page 2: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/2.jpg)
Outline
• LSH
• Locality-Sensitive Functions
• Banding Technique
• LSH Families for Cosine
• Applications of LSH
• Conclusion
![Page 3: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/3.jpg)
LSH
One general approach to LSH:
• “Hash” items several times, in such a way that similar items are more likely to be hashed to the same bucket than dissimilar items are.
• We then consider any pair that hashed to the same bucket for any of the hashings to be a candidate pair.
• We check only the candidate pairs for similarity.
![Page 4: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/4.jpg)
LSH
• Most of the dissimilar pairs will never hash to the same bucket, and therefore will never be checked.
• Those dissimilar pairs that do hash to the same bucket are false positives; we hope these are only a small fraction of all pairs.
• We also hope that most of the truly similar pairs will hash to the same bucket under at least one of the hash functions.
• Those that do not are false negatives; we hope these are only a small fraction of the truly similar pairs.
![Page 5: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/5.jpg)
Locality-Sensitive Functions
In many cases, the function f will “hash” items, and the decision will be based on whether or not the results are equal. We write:
• f(x) = f(y) to mean that f(x, y) is “yes; make x and y a candidate pair.”
• f(x) ≠ f(y) to mean “do not make x and y a candidate pair.”
A collection of functions of this form is called a family of functions.
![Page 6: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/6.jpg)
Locality-Sensitive Functions
Let d1 < d2 be two distances according to some distance measure d. A family F of functions is said to be (d1, d2, p1, p2)-sensitive if for every f in F:
1. If d(x, y) ≤ d1, then the probability that f(x) = f(y) is at least p1.
2. If d(x, y) ≥ d2, then the probability that f(x) = f(y) is at most p2.
![Page 7: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/7.jpg)
Locality-Sensitive Functions
Behavior of a (d1, d2, p1, p2)-sensitive function
• d1 and d2 can be made as close as we wish.
• The penalty is that p1 and p2 become close as well.
![Page 8: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/8.jpg)
Banding Technique
An effective way to choose the hashings is to divide the signature matrix into b bands consisting of r rows each.
Dividing a signature matrix into four bands of three rows per band
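The banding step can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code: the function name `candidate_pairs` and the three toy signatures are assumptions made up for the example, sized to match the four bands of three rows in the figure.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures, b, r):
    """Return the pairs of items that share a bucket in at least one band.

    signatures maps an item id to a signature of length b * r.
    """
    candidates = set()
    for band in range(b):
        buckets = defaultdict(list)
        for item, sig in signatures.items():
            # The r rows of this band, taken together, form the bucket key.
            key = tuple(sig[band * r:(band + 1) * r])
            buckets[key].append(item)
        for items in buckets.values():
            # Every pair inside one bucket becomes a candidate pair.
            for pair in combinations(sorted(items), 2):
                candidates.add(pair)
    return candidates


# Hypothetical signatures of length 12 (b = 4 bands, r = 3 rows per band).
signatures = {
    "A": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1, 2, 3],
    "B": [1, 2, 3, 9, 9, 9, 0, 0, 0, 5, 5, 5],  # agrees with A in the first band only
    "C": [7, 7, 7, 6, 6, 6, 5, 5, 5, 4, 4, 4],  # agrees with nobody in any band
}
pairs = candidate_pairs(signatures, b=4, r=3)
print(pairs)  # {('A', 'B')}
```

Each band uses its own bucket table, so agreement in band 1 of one item and band 2 of another never produces a candidate.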
![Page 9: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/9.jpg)
Analysis of the Banding Technique
The probability that two signatures with similarity s agree in all rows of at least one band, and thus become a candidate pair, is $1 - (1 - s^r)^b$.
This function has the form of an S-curve:
The threshold (the value of similarity s at which the probability of becoming a candidate is 1/2) is a function of b and r. A common approximation is $(1/b)^{1/r}$, which gives 1/2 for b = 16 and r = 4.
![Page 10: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/10.jpg)
Analysis of the Banding Technique
Values of the S-curve for b = 20 and r = 5
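The values behind this S-curve can be recomputed directly from the candidate-pair probability 1 − (1 − s^r)^b. The sketch below assumes that formula; the helper name `candidate_probability` and the choice of sample similarities are illustrative, not taken from the slide.

```python
def candidate_probability(s, b, r):
    """Probability that two items with similarity s share at least one band."""
    return 1.0 - (1.0 - s ** r) ** b


# Tabulate the curve for b = 20 bands of r = 5 rows.
for s in (0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8):
    print(f"s = {s:.1f}  ->  {candidate_probability(s, b=20, r=5):.4f}")
```

The probability stays close to 0 for low similarities and rises steeply toward 1 in a narrow range of s, which is what gives the curve its S shape.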
![Page 11: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/11.jpg)
Analysis of the Banding Technique
• Choose a threshold t that defines how similar items have to be in order for them to be a “candidate pair.”
• Pick b and r such that br = n, where n is the length of each signature, and the threshold t is approximately $(1/b)^{1/r}$.
• If avoiding false negatives is important, select b and r to produce a threshold lower than t.
• If speed is important and you wish to limit false positives, select b and r to produce a higher threshold.
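One way to apply these rules is to enumerate every factorization br = n and compare the approximate threshold (1/b)^(1/r) against the target t. The following is a sketch under that assumption; `banding_options` and `pick_banding` are hypothetical helper names, and picking the closest threshold is only one possible policy.

```python
def banding_options(n):
    """Yield (b, r, approximate threshold) for every factorization b * r = n."""
    for r in range(1, n + 1):
        if n % r == 0:
            b = n // r
            yield b, r, (1.0 / b) ** (1.0 / r)


def pick_banding(n, t):
    """Pick the factorization whose approximate threshold is closest to t."""
    return min(banding_options(n), key=lambda option: abs(option[2] - t))


b, r, approx_t = pick_banding(n=100, t=0.8)
print(b, r, round(approx_t, 3))
```

To favor fewer false negatives, one could instead restrict the choice to options whose threshold is below t; to favor speed, to options above t.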
![Page 12: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/12.jpg)
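LSH for Cosine
Let u be user u's rating vector, v be user v's rating vector, and r be a randomly generated vector. The family of hash functions H is
$H = \{ h_r \}$, with $h_r(u) = 1$ if $r \cdot u \ge 0$ and $h_r(u) = 0$ otherwise,
where
$\Pr[h_r(u) = h_r(v)] = 1 - \theta(u, v)/\pi$, with $\theta(u, v) = \arccos\left(\frac{u \cdot v}{\|u\| \, \|v\|}\right)$,
which shows the probability of u and v being declared as a candidate pair.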
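The collision probability of this family can be checked empirically. The sketch below assumes the standard random-hyperplane hash (h_r(u) = 1 when r · u ≥ 0, else 0) and its known collision probability 1 − θ(u, v)/π, and it draws r from a Gaussian so that its direction is uniform. The function names, the trial count, and the two sample rating vectors are illustrative choices.

```python
import math
import random


def hyperplane_hash(r, u):
    """Return 1 if u is on the non-negative side of the hyperplane with normal r."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0


def angle(u, v):
    """Angle between u and v in radians."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v))))


def estimate_collision_probability(u, v, trials=20000, seed=0):
    """Fraction of random hyperplanes that put u and v on the same side."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        r = [rng.gauss(0.0, 1.0) for _ in u]
        hits += hyperplane_hash(r, u) == hyperplane_hash(r, v)
    return hits / trials


u = [5, 4, 0, 4, 1]
v = [4, 3, 0, 5, 2]
predicted = 1.0 - angle(u, v) / math.pi
estimated = estimate_collision_probability(u, v)
print(predicted, estimated)
```

The two printed numbers should agree up to sampling noise; the smaller the angle between the rating vectors, the more often they hash to the same bit.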
![Page 13: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/13.jpg)
LSH for Cosine
A new family G of hash functions g is defined, where each function g is obtained by concatenating (AND) r functions h1, h2, ..., hr from the family H:
g(t) = [h1(t), ..., hr(t)].
We then draw a random function g for each band (hash table) and construct b hash tables.
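The construction of b hash tables, each keyed by its own concatenated function g, might look like the sketch below. It is an illustration under assumptions: the helper names (`make_g`, `build_tables`, `query`), the ±1 random vectors, the seed, and the sample rating vectors are all choices made for the example rather than details confirmed by the slides.

```python
import random
from collections import defaultdict


def make_g(dim, r, rng):
    """Draw r random hyperplanes and return g(t) = (h1(t), ..., hr(t))."""
    planes = [[rng.choice((-1, 1)) for _ in range(dim)] for _ in range(r)]

    def g(t):
        # AND construction: the bucket key is the tuple of all r hash bits.
        return tuple(
            1 if sum(p_i * t_i for p_i, t_i in zip(plane, t)) >= 0 else 0
            for plane in planes
        )

    return g


def build_tables(vectors, b, r, seed=0):
    """Build b hash tables, each with its own randomly drawn g."""
    rng = random.Random(seed)
    dim = len(next(iter(vectors.values())))
    tables = []
    for _ in range(b):
        g = make_g(dim, r, rng)
        buckets = defaultdict(set)
        for name, vec in vectors.items():
            buckets[g(vec)].add(name)
        tables.append((g, buckets))
    return tables


def query(tables, t):
    """OR construction: union of the buckets t falls into across all tables."""
    found = set()
    for g, buckets in tables:
        found |= buckets.get(g(t), set())
    return found


vectors = {
    "u1": [5, 4, 0, 4, 1],
    "u2": [2, 1, 1, 1, 4],
    "u3": [4, 3, 0, 5, 2],
    "u4": [2, 1, 2, 1, 4],
}
tables = build_tables(vectors, b=3, r=4)
print(query(tables, vectors["u1"]))
```

Concatenating r bits makes a single table stricter (fewer false positives), while taking the union over b tables recovers similar pairs that any one table misses (fewer false negatives).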
![Page 14: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/14.jpg)
LSH for Cosine
Example:
u1 = [5, 4, 0, 4, 1]
u2 = [2, 1, 1, 1, 4]
u3 = [4, 3, 0, 5, 2]
u4 = [2, 1, 2, 1, 4]
r1 = [-1, 1, 1, -1, -1]
r2 = [1, 1, 1, -1, -1]
r3 = [-1, -1, 1, -1, 1]
r4 = [-1, 1, -1, 1, -1]
u1.r1 = -6 => hr1(u1) = 0
u1.r2 = 4 => hr2(u1) = 1
u1.r3 = -12 => hr3(u1) = 0
u1.r4 = 2 => hr4(u1) = 1
g(u1) = 0101
g(u2) = 0010
g(u3) = 0101
g(u4) = 0110
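The example can be reproduced with a short script. The sketch assumes a hash bit of 1 when the dot product is greater than or equal to zero; that convention is inferred from the listed values (u3 · r2 and u4 · r2 are both exactly 0 and map to 1), and the function names are illustrative.

```python
def h(r, u):
    """One hash bit: 1 if r . u >= 0, else 0."""
    return 1 if sum(ri * ui for ri, ui in zip(r, u)) >= 0 else 0


def g(u, planes):
    """Concatenate the bits h_r1(u), ..., h_r4(u) into one bucket label."""
    return "".join(str(h(r, u)) for r in planes)


planes = [
    [-1, 1, 1, -1, -1],   # r1
    [1, 1, 1, -1, -1],    # r2
    [-1, -1, 1, -1, 1],   # r3
    [-1, 1, -1, 1, -1],   # r4
]
users = {
    "u1": [5, 4, 0, 4, 1],
    "u2": [2, 1, 1, 1, 4],
    "u3": [4, 3, 0, 5, 2],
    "u4": [2, 1, 2, 1, 4],
}
labels = {name: g(u, planes) for name, u in users.items()}
print(labels)
# {'u1': '0101', 'u2': '0010', 'u3': '0101', 'u4': '0110'}
```

u1 and u3 receive the same label, 0101, so they land in the same bucket and become a candidate pair, while u2 and u4 fall into buckets of their own.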
![Page 15: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/15.jpg)
Applications of LSH
• Near neighbor search
• Entity Resolution
• Matching Fingerprints
• Matching Newspaper Articles
![Page 16: Similarity Search in High Dimensions via Hashing](https://reader035.fdocuments.in/reader035/viewer/2022062316/58abaac91a28abdf3c8b607f/html5/thumbnails/16.jpg)
Thank You
Q & A