Locality Sensitive Hashing: Basics and applications


Locality Sensitive Hashing

Basics and applications

A well-known problem

Given a large collection of documents, identify the near-duplicate documents

Web search engines face a proliferation of near-duplicate documents:

Legitimate – mirrors, local copies, updates, …
Malicious – spam, spider-traps, dynamic URLs, …

30% of web-pages are near-duplicates [1997]

Natural Approaches

Fingerprinting: only works for exact matches. Karp-Rabin (rolling hash) offers collision-probability guarantees; MD5 offers cryptographically-secure string hashes.

Edit-distance metric for approximate string-matching: expensive even for one pair of documents, impossible for a billion web documents.

Random Sampling: sample substrings (phrases, sentences, etc.); the hope is that similar documents yield similar samples. But even samples of the same document will differ.

Basic Idea: Shingling [Broder 1997]

dissect document into q-grams (shingles)

T = I live and study in Pisa, …

If we set q=3, the 3-grams are: <I live and> <live and study> <and study in> <study in Pisa> …

represent documents by sets of hash[shingle]

The problem reduces to set intersection among sets of integers
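A minimal sketch of this step in Python; the word-level shingling and the truncated SHA-1 used as a 64-bit hash are illustrative choices, not the slides' prescription:

```python
import hashlib

def shingles(text, q=3):
    """Dissect a document into word-level q-grams (shingles)."""
    words = text.split()
    return {" ".join(words[i:i + q]) for i in range(len(words) - q + 1)}

def hashed_shingles(text, q=3):
    """Represent a document by the set of 64-bit hashes of its shingles."""
    return {int(hashlib.sha1(s.encode()).hexdigest()[:16], 16)
            for s in shingles(text, q)}

doc = "I live and study in Pisa and I like it"
print(sorted(shingles(doc))[:3])   # first few 3-grams
print(len(hashed_shingles(doc)))   # size of the resulting integer set
```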

Basic Idea: Shingling [Broder 1997]

Set intersection ⇒ Jaccard similarity

[Figure: DocA with shingle set SA, DocB with shingle set SB]

Claim: A & B are near-duplicates if sim(SA,SB) is high, where

sim(SA,SB) = |SA ∩ SB| / |SA ∪ SB|
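As a small illustration, the Jaccard similarity is a one-line set computation (the toy shingle sets below are made up):

```python
def jaccard(sa, sb):
    """sim(SA, SB) = |SA ∩ SB| / |SA ∪ SB|."""
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

# Shingle sets of two near-duplicate documents (toy example).
SA = {"I live and", "live and study", "and study in", "study in Pisa"}
SB = {"I live and", "live and study", "and study in", "study in Rome"}
print(jaccard(SA, SB))  # 3 shared / 5 total = 0.6
```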

Sketching of a document

From each shingle-set we build a “sketch vector” (of ~200 components)

Postulate: Documents that share ≥ t components of their sketch-vectors are claimed to be near duplicates

Sec. 19.6

Sketching by Min-Hashing

Consider SA, SB ⊆ P = {0,…,p-1}

Pick a random permutation π of the whole set P (such as π(x) = ax+b mod p)

Pick the minimal element of SA: α = min{π(SA)}

Pick the minimal element of SB: β = min{π(SB)}

Lemma: Pr[α = β] = |SA ∩ SB| / |SA ∪ SB|

Strengthening it…

Similarity sketch sk(A): the d minimal elements under π(SA), or take d permutations and the min of each

Note: we can reduce the variance by using a larger d

Typically d is a few hundred minima (~200)
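A minimal sketch of the d-permutation variant, assuming linear permutations π(x) = (a·x + b) mod P over a large prime P; the prime, the constants, and d = 200 are illustrative:

```python
import random

P = (1 << 61) - 1   # a large prime, taken as the universe {0, ..., P-1}

def make_permutations(d, seed=0):
    """d random linear permutations pi(x) = (a*x + b) mod P."""
    rnd = random.Random(seed)
    return [(rnd.randrange(1, P), rnd.randrange(P)) for _ in range(d)]

def minhash_sketch(shingle_hashes, perms):
    """sk(A): for each permutation, keep the minimum of pi over the set."""
    return [min((a * x + b) % P for x in shingle_hashes) for (a, b) in perms]

perms = make_permutations(d=200)
SA = {17, 42, 99, 1234, 5678}   # hashed shingles of DocA (toy values)
SB = {17, 42, 99, 1234, 9999}   # hashed shingles of DocB
skA = minhash_sketch(SA, perms)
skB = minhash_sketch(SB, perms)
# skA and skB can now be compared component-wise (next slides)
```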

Computing Sketch[i] for Doc1

[Figure: Document 1, whose shingles are mapped into the space {0, …, 2^64 − 1}]

Start with the 64-bit values f(shingles), permute them with π_i, and pick the min value


Test if Doc1.Sketch[i] = Doc2.Sketch[i]

[Figure: the 64-bit shingle spaces of Document 1 and Document 2, each permuted with π_i]

Are these equal?

Use 200 random permutations (taking the minimum under each), thus creating one 200-dim vector per document, and evaluate the fraction of shared components


Claim: two components are equal with probability Size_of_intersection / Size_of_union
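A sketch of the comparison step, assuming two sketch vectors built as above; the names estimated_similarity / near_duplicates and the threshold t = 0.8 are illustrative:

```python
def estimated_similarity(sketch_a, sketch_b):
    """Fraction of shared components: an estimate of |SA ∩ SB| / |SA ∪ SB|."""
    assert len(sketch_a) == len(sketch_b)
    equal = sum(1 for x, y in zip(sketch_a, sketch_b) if x == y)
    return equal / len(sketch_a)

def near_duplicates(sketch_a, sketch_b, t=0.8):
    """Postulate: near-duplicates if they share at least a fraction t of components."""
    return estimated_similarity(sketch_a, sketch_b) >= t

# e.g. near_duplicates(skA, skB) with the sketches built in the previous snippet
```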

It’s even more difficult…

So we have squeezed a few KBs of data (a web page) into a few hundred bytes. But you still need a brute-force comparison (quadratic time) to compute all nearly-duplicate documents. This is too much even if it is executed in RAM.

Locality Sensitive Hashing

The case of the Hamming distance

How to quickly compute the fraction of differing components between d-dim vectors, i.e., how to quickly compute the Hamming distance between d-dim vectors

Fraction of different components = HammingDist / d

A warm-up

Consider the case of binary (sketch) vectors, thus living in the hypercube {0,1}^d

Hamming distance: D(p,q) = number of coordinates on which p and q differ

Define hash function h by choosing a set I of k random coordinates

h(p) = p|I = projection of p on I

Example: if p = 01011 (d=5) and we pick I = {1,4} (with k=2), then h(p) = 01. Note the similarity with the Bloom filter.
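A small sketch of the projection hash, with the slide's example rewritten in 0-based indexing:

```python
import random

def random_projection(d, k, seed=None):
    """Choose a set I of k random coordinates out of d (0-based)."""
    return sorted(random.Random(seed).sample(range(d), k))

def h(p, I):
    """h(p) = p|I : projection of the binary vector p on the coordinates I."""
    return "".join(p[i] for i in I)

p = "01011"                 # d = 5
I = [0, 3]                  # the slide's I = {1, 4}, written 0-based
print(h(p, I))              # -> "01"
# In general I is drawn at random, e.g. I = random_projection(d=5, k=2)
```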

A key property

Pr[picking an equal component] = (d − D(p,q)) / d

We can vary the probability by changing k

[Plot: Pr[h(p)=h(q)] as a function of the distance D(p,q) ∈ {1, 2, …, d}, for k=1 and k=2]

Pr[h(p) = h(q)] = (1 − D(p,q)/d)^k

What about false negatives?

Reiterate

Repeat L times the k-projections hi

Declare a «match» if at least one hi matches

Example: d=5, k=2, p = 01011 and q = 00101

•I1 = {2,4}, we have h1(p) = 11 and h1(q)=00

•I2 = {1,4}, we have h2(p) = 01 and h2(q)=00

•I3 = {1,5}, we have h3(p) = 01 and h3(q)=01

We set g( ) = < h1( ), h2( ), h3( )>

p and q match!
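The same example, as a sketch in Python (coordinates kept 1-based to match the slide):

```python
def h(p, I):
    """Projection of binary string p on coordinate set I (1-based, as on the slide)."""
    return "".join(p[i - 1] for i in I)

def g_matches(p, q, projections):
    """Declare a «match» if at least one of the L projections agrees."""
    return any(h(p, I) == h(q, I) for I in projections)

p, q = "01011", "00101"                  # d = 5
projections = [[2, 4], [1, 4], [1, 5]]   # I1, I2, I3 with k = 2
print(g_matches(p, q, projections))      # True: h3(p) = h3(q) = "01"
```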

Measuring the error prob.

The g() consists of L independent hashes hi

Pr[g(p) matches g(q)] = 1 − Pr[hi(p) ≠ hi(q) for all i = 1, …, L]

Pr[hi(p) = hi(q)] = (1 − D(p,q)/d)^k

Pr[g(p) matches g(q)] = 1 − (1 − (1 − D(p,q)/d)^k)^L

Writing s = 1 − D(p,q)/d, this is 1 − (1 − s^k)^L.

[Plot: this probability as a function of s is an S-curve, whose threshold falls around s ≈ (1/L)^(1/k)]
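A tiny numeric check of the formula; the values of d, k, L are taken from the earlier toy example:

```python
def match_probability(D, d, k, L):
    """Pr[g(p) matches g(q)] = 1 - (1 - (1 - D/d)**k)**L."""
    s = 1 - D / d
    return 1 - (1 - s ** k) ** L

# The probability is an S-curve in the similarity s = 1 - D/d:
for D in (0, 1, 2, 3, 4, 5):
    print(D, round(match_probability(D, d=5, k=2, L=3), 3))
```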

Find groups of similar items

SOL 1: Buckets provide the candidate similar items

«Merge» similar sets if they share items

[Figure: point p hashed by h1(p), h2(p), …, hL(p) into tables T1, T2, …, TL]

Points in a bucket are possibly similar objects

Find groups of similar items

SOL 1: Buckets provide the candidate similar items

SOL 2: Sort items by hi(), and pick as similar candidates the equal ones. Repeat L times, for all hi().

«Merge» candidate sets if they share items.

What about clustering ?

Check candidates !!!
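A sketch of SOL 1 in the Hamming setting used so far: every point is hashed by L random k-projections, and points sharing a bucket become candidate pairs. The name candidate_pairs and the toy points are mine, and the candidates must still be checked explicitly:

```python
from collections import defaultdict
from itertools import combinations
import random

def random_projection(d, k, rnd):
    return sorted(rnd.sample(range(d), k))

def candidate_pairs(points, d, k, L, seed=0):
    """Hash every point with L k-projections; points falling in the same
    bucket of some table are candidate similar items."""
    rnd = random.Random(seed)
    candidates = set()
    for _ in range(L):
        I = random_projection(d, k, rnd)
        buckets = defaultdict(list)
        for name, p in points.items():
            buckets["".join(p[i] for i in I)].append(name)
        for bucket in buckets.values():
            candidates.update(combinations(sorted(bucket), 2))
    return candidates          # candidates must then be verified explicitly

points = {"U1": "01011", "U2": "01100", "U3": "01111"}
print(candidate_pairs(points, d=5, k=2, L=3))
```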

LSH versus K-means

What about optimality ? K-means is locally optimal [recently, some researchers showed how to introduce some guarantee]

What about the Sim-cost ? K-means compares items in Θ(d) time and space [notice that d may be millions or billions]

What about the cost per iteration and their number? Typically K-means requires few iterations, each costing K·n·d time: in total, I·K·n·d

What about K ? In principle one has to iterate over K = 1, …, n

LSH needs sort(n) time; hence, on disk, a few passes over the data, with guaranteed error bounds.

Also: on-line queries.

Given a query q, check the buckets of hj(q) for j = 1, …, L

[Figure: query q hashed by h1(q), h2(q), …, hL(q) into tables T1, T2, …, TL]
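A sketch of the on-line query, assuming an index of L tables built with the same k-projections; the class and method names (LSHIndex, insert, query) are illustrative:

```python
from collections import defaultdict
import random

class LSHIndex:
    """L hash tables T1, ..., TL, one per k-projection h_j (a sketch:
    k, L and the projections are illustrative, not tuned)."""

    def __init__(self, d, k, L, seed=0):
        rnd = random.Random(seed)
        self.projections = [sorted(rnd.sample(range(d), k)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, p, j):
        return "".join(p[i] for i in self.projections[j])

    def insert(self, name, p):
        for j, table in enumerate(self.tables):
            table[self._key(p, j)].append(name)

    def query(self, q):
        """Check the buckets of h_j(q) for j = 1, ..., L and return candidates."""
        out = set()
        for j, table in enumerate(self.tables):
            out.update(table.get(self._key(q, j), []))
        return out

index = LSHIndex(d=5, k=2, L=3)
index.insert("p", "01011")
index.insert("r", "11100")
print(index.query("00101"))   # candidates, to be verified with the true distance
```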

Locality Sensitive Hashing and its applications

More problems, indeed

Another classic problem

The problem: Given U users, the goal is to find groups of similar users (or, similar to a user Q)

Features = personal data, preferences, purchases, navigational behavior, followers/following, +1s, …

A feature is typically a numerical value: binary or real

     1  2  3  4  5
U1   0  1  0  1  1
U2   0  1  1  0  0
U3   0  1  1  1  1

Hamming distance: #different components
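For the table above, the Hamming distances can be computed directly (a minimal sketch):

```python
def hamming(u, v):
    """Hamming distance: number of differing components."""
    assert len(u) == len(v)
    return sum(a != b for a, b in zip(u, v))

U1 = [0, 1, 0, 1, 1]
U2 = [0, 1, 1, 0, 0]
U3 = [0, 1, 1, 1, 1]
print(hamming(U1, U2), hamming(U1, U3), hamming(U2, U3))  # 3 1 2
```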

More than Hamming distance

Example: [Figure with query point q and its nearest neighbor P*]

q is the query point. P* is its Nearest Neighbor

Approximation helps

[Figure: query q, nearest neighbor p*, and a ball of radius r around q]

A slightly different problem

Approximate Nearest Neighbor

Given an error parameter ε > 0: for query q and nearest-neighbor p', return p such that

D(q,p) ≤ (1 + ε) · D(q,p')

Justification: mapping objects to a metric space is heuristic anyway, and we get a tremendous performance improvement.

A workable approach

Given an error parameter ε > 0 and a distance threshold t > 0, the (t,ε)-Approximate NN Query:

If there is no point p with D(q,p) < t, return FAILURE; else, return any p' with D(q,p') < (1+ε)t

Application: Approximate Nearest Neighbor. Assume the maximum distance is T. Run in parallel for t = 1, (1+ε), (1+ε)^2, (1+ε)^3, …, T

Time/space – O(log_{1+ε} T) overhead
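A sketch of the geometric grid of thresholds; thresholds is an illustrative helper and the values T = 1000, eps = 0.5 are made up:

```python
import math

def thresholds(T, eps):
    """Run a (t, eps)-approximate NN structure in parallel for
    t = 1, (1+eps), (1+eps)^2, ..., up to T: only O(log_{1+eps} T) levels."""
    t, out = 1.0, []
    while t <= T:
        out.append(t)
        t *= 1 + eps
    return out

ts = thresholds(T=1000, eps=0.5)
print(len(ts), math.ceil(math.log(1000, 1.5)))  # number of levels vs log_{1+eps} T
```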

Locality Sensitive Hashing and its applications

The analysis

LSH Analysis

For a fixed threshold r, we distinguish between:

Near: D(p,q) < r
Far: D(p,q) > (1+ε)r

A locality-sensitive hash h should guarantee

Near points are hashed together with Pr[h(a)=h(b)] ≥ P1

Far points may be mapped together but Pr[h(a)=h(c)] ≤ P2

where, of course, we have that P1 > P2

[Figure: point a, with b within distance r and c beyond distance (1+ε)r]

Family: hi(p) = p|{c1,…,ck}, where the coordinates ci are chosen randomly

If D(a,b) ≤ r, then Pr[hi(a) = hi(b)] = (1 − D(a,b)/d)^k ≥ (1 − r/d)^k = (p1)^k = P1

If D(a,c) > (1+ε)r, then Pr[hi(a) = hi(c)] = (1 − D(a,c)/d)^k < (1 − r(1+ε)/d)^k = (p2)^k = P2

where, of course, we have that p1 > p2 (as P1 > P2)

What about hamming distance?

LSH Analysis

The LSH-algorithm with the L mappings hi() correctly solves the (r,ε)-NN problem on a point query q if the following hold:

I. The total number of points FAR from q and belonging to the bucket hi(q) is a constant.

II. If p* is NEAR to q, then hi(p*) = hi(q) for some i (p* is in a visited bucket)

Theorem. Take k = log_{1/p2} n and L = n^{ln p1 / ln p2}; then the two properties above hold with probability at least 0.298.

Repeating the process O(1/δ) times, we ensure a probability of success of at least 1 − δ.

Space ≈ n·L = n^{1+ρ}, where ρ = (ln p1 / ln p2) < 1

Query time ≈ L buckets accessed, and they are n^ρ
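A sketch of how the analysis parameters could be instantiated; lsh_parameters and the example values (n, d, r, eps) are illustrative, and the rounding to integers is my choice:

```python
import math

def lsh_parameters(n, d, r, eps):
    """Instantiate the analysis: p1 = 1 - r/d, p2 = 1 - (1+eps)*r/d,
    k = log_{1/p2} n, rho = ln p1 / ln p2, L = n^rho (rounded up)."""
    p1 = 1 - r / d
    p2 = 1 - (1 + eps) * r / d
    k = math.ceil(math.log(n) / math.log(1 / p2))
    rho = math.log(p1) / math.log(p2)
    L = math.ceil(n ** rho)
    return {"p1": p1, "p2": p2, "k": k, "rho": rho, "L": L}

# Toy instantiation: one million points, d = 200, r = 20, eps = 0.5.
print(lsh_parameters(n=1_000_000, d=200, r=20, eps=0.5))
```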

Proof

p* is a point near to q: D(q,p*) < r

FAR(q) = set of points p s.t. D(q,p) > (1+ε) r

BUCKETi (q) = set of points p s.t. hi(p)= hi(q)

Let us define the following events:

E1 = Num of far points in the visited buckets ≤ 3 L

E2 = p* occurs in some visited bucket, i.e. ∃ j s.t. hj(q) = hj(p*)

E1: Σ_{j=1,…,L} |FAR(q) ∩ BUCKETj(q)| ≤ 3L

Bad collisions: more than 3L

Let p be a point in FAR(q):

Pr[for a fixed j, hj(p) = hj(q)] < P2 = (p2)^k

Given that k = log_{1/p2} n,

Pr[for a fixed j, a far point p satisfies hj(p) = hj(q)] < 1/n

By Markov's inequality, Pr[X > 3·E[X]] ≤ 1/3, it follows:

E[|FAR(q) ∩ BUCKETj(q)|] ≤ n · (1/n) = 1 for every fixed j, hence

E[ Σ_{j=1,…,L} |FAR(q) ∩ BUCKETj(q)| ] ≤ L

so that Pr[more than 3L bad collisions] ≤ 1/3, i.e. Pr[not E1] ≤ 1/3.

Good collision: p* occurs. For any hj,

Pr[hj(p*) = hj(q)] ≥ P1 = (p1)^k = (p1)^{log_{1/p2} n} = n^{−ln p1 / ln p2}

Given that L = n^{ln p1 / ln p2}, this is exactly 1/L.

So we have that Pr[not E2] = Pr[not finding p* in q's buckets] = (1 − Pr[hj(p*) = hj(q)])^L ≤ (1 − 1/L)^L ≤ 1/e

Finally, Pr[E1 and E2] ≥ 1 − Pr[not E1 OR not E2] ≥ 1 − (Pr[not E1] + Pr[not E2]) ≥ 1 − 1/3 − 1/e ≈ 0.298