A Gentle Introduction to Locality Sensitive Hashing with Apache Spark


A GENTLE INTRODUCTION TO APACHE SPARK AND LOCALITY-SENSITIVE HASHING

FRANÇOIS GARILLOT, (FORMERLY) TYPESAFE

francois@garillot.net

@huitseeker


LOCALITY-SENSITIVE HASHING

▸ A story: why LSH
▸ How it works & hash families
▸ LSH distribution
▸ Beware: WIP


SPARK TENETS

▸ broadcast variables
▸ per-partition commands
▸ shuffle sparsely
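A minimal sketch of these three tenets working together (the input path and hash logic are illustrative, not from the talk):

import org.apache.spark.SparkContext

def sketch(sc: SparkContext): Unit = {
  val hashers = Array(1, 2, 3)                  // small, read-only shared data
  val bcHashers = sc.broadcast(hashers)         // broadcast once per executor

  val records = sc.textFile("hdfs:///weblogs")  // hypothetical input
  val hashed = records.mapPartitions { iter =>  // per-partition command:
    val hs = bcHashers.value                    // fetch the broadcast once here
    iter.map(line => (math.abs(line.hashCode % 10), hs.map(_ * line.length)))
  }

  // shuffle sparsely: reduceByKey combines map-side before moving data
  hashed.reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y }).count()
}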


SEGMENTATION

▸ small sample: 289,421 users
▸ larger sample: 5,684,403 users

46K websites as features; run on 4 personal laptops plus 4 provided laptops


K-MEANS COMPLEXITY

Find k with the 'elbow method' on the within-cluster sum of squares. Then each Lloyd iteration costs O(n·k·d) for n points in d dimensions.
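In MLlib terms, the elbow method is a loop over k; a sketch assuming an RDD[Vector] of user features:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// train for a range of k and report the within-cluster sum of squares;
// look for the "elbow" where the curve flattens
def elbow(data: RDD[Vector]): Seq[(Int, Double)] =
  (2 to 20).map { k =>
    val model = KMeans.train(data, k, 20)  // k clusters, 20 Lloyd iterations
    (k, model.computeCost(data))
  }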


EM - GAUSSIAN MIXTURE

With d dimensions and k mixture components, each EM iteration costs O(n·k·d²) once full covariance matrices are involved.
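With MLlib this is one call; a sketch, again assuming an RDD[Vector] of user features:

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// fit a k-component Gaussian mixture with EM
def fitGmm(data: RDD[Vector], k: Int) = {
  val model = new GaussianMixture().setK(k).setMaxIterations(100).run(data)
  model.weights.zip(model.gaussians).foreach { case (w, g) =>
    println(s"weight=$w mean=${g.mu}")   // per-component weight and mean
  }
  model
}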


LOCALITY-SENSITIVE HASHING FUNCTIONS

A family H of hashing functions is (r1, r2, p1, p2)-sensitive if, for any points x, y and h drawn from H:

▸ if d(x, y) ≤ r1 then Pr[h(x) = h(y)] ≥ p1
▸ if d(x, y) ≥ r2 then Pr[h(x) = h(y)] ≤ p2


DISTANCES! (THESE AND MANY OTHERS)

▸ Hamming distance: h(x) = x_i, where i is a randomly chosen index
▸ Jaccard: J(A, B) = |A ∩ B| / |A ∪ B|, matched by MinHash: h(A) = min over a ∈ A of π(a), for a random permutation π
▸ Cosine distance: h(x) = sign(r · x), for a random hyperplane normal r
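A minimal Scala sketch of these three hash families (the function shapes and seeding are illustrative):

import scala.util.Random

// Hamming: sample one coordinate of a bit vector
def hammingHash(i: Int)(x: Array[Boolean]): Boolean = x(i)

// Jaccard / MinHash: minimum of a random permutation over the set's
// elements, approximated here by a random linear hash
def minHash(a: Long, b: Long, prime: Long = 2147483647L)(s: Set[Int]): Long =
  s.map(x => (a * x + b) % prime).min

// Cosine / SimHash: sign of the projection onto a random hyperplane
def simHash(r: Array[Double])(x: Array[Double]): Boolean =
  r.zip(x).map { case (ri, xi) => ri * xi }.sum >= 0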



EARTH MOVER'S DISTANCE

Find the optimal flow F = (f_ij) minimizing:

  Σ_i Σ_j f_ij · d_ij

Then:

  EMD(P, Q) = (Σ_i Σ_j f_ij · d_ij) / (Σ_i Σ_j f_ij)
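For reference, the usual flow constraints on F = (f_ij), for histograms P = (p_i) and Q = (q_j):

\[
\begin{aligned}
\min_{F = (f_{ij})} \; & \sum_i \sum_j f_{ij}\, d_{ij} \\
\text{s.t. } \; & f_{ij} \ge 0, \qquad \sum_j f_{ij} \le p_i, \qquad \sum_i f_{ij} \le q_j, \\
& \sum_i \sum_j f_{ij} = \min\Big(\sum_i p_i, \sum_j q_j\Big)
\end{aligned}
\]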


A WORD ON MODULARITY

LSH for EMD was introduced by Charikar in the SimHash paper (2002).

Yet existing implementations (e.g. scikit-learn, mrsqueeze) leave no place to plug in your own LSH family!


LSH AMPLIFICATION: CONCATENATION (AND) AND PARALLEL (OR) CONSTRUCTIONS

▸ basic LSH: Pr[h(x) = h(y)] = p
▸ AND (series) construction: concatenate r hashes into one key; collision probability p^r
▸ OR (parallel) construction: b independent bands; collision probability 1 - (1 - p)^b

Combined, b bands of r hashes collide with probability 1 - (1 - p^r)^b.
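A quick numeric check of that combined S-curve (p = 0.8 / 0.4, r = 5, b = 20 are illustrative values, not from the talk):

// collision probability after amplification: b OR-bands of r AND-hashes
def amplified(p: Double, r: Int, b: Int): Double =
  1 - math.pow(1 - math.pow(p, r), b)

amplified(0.8, 5, 20)  // ≈ 0.9996: similar pairs almost always collide
amplified(0.4, 5, 20)  // ≈ 0.186: dissimilar pairs are mostly filtered out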


BASIC LSH

val hashCollection = records.
  map(s => (getId(s), s)).
  mapValues(s => getHash(s, hashers))

val subArrays = hashCollection.flatMap { case (recordId, hash) =>
  // split each signature into bands, keyed by band index
  hash.grouped(hashLength / numberBands).zipWithIndex.map {
    case (band, bandIndex) => (bandIndex, (band, recordId))
  }
}


LOOKUP

def findCandidates(record: Iterable[String],
                   hashers: Array[Int => Int],
                   mBands: BandType) = {
  val hash = getHash(record, hashers)
  val subArrays = partitionArray(hash).zipWithIndex

  subArrays.flatMap { case (band, bandIndex) =>
    // fetch this band's table, then the bucket for this band value, if any
    val hashedBucket = mBands.lookup(bandIndex).
      headOption.
      flatMap(_.get(band))
    hashedBucket
  }.flatten.toSet
}


DISTRIBUTE RANDOM SEEDS, NOT PERMUTATION FUNCTIONS

How getHash(record, hashers) runs on the cluster:

records.mapPartitions { iter =>
  val rng = new scala.util.Random()
  iter.map(x => hashers.flatMap(h => getHashFunction(rng, h)(x)))
}
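getHashFunction is not spelled out in the deck; a hypothetical sketch, assuming each hasher is just an integer seed from which the function is rebuilt inside the partition:

import scala.util.Random

// hypothetical: rebuild a hash function from a small integer seed instead of
// serializing a permutation with every task; rng is available for extra salt
def getHashFunction(rng: Random, seed: Int): String => Array[Int] = {
  val r = new Random(seed)
  val a = r.nextInt(Int.MaxValue) | 1        // odd multiplier derived from seed
  val b = r.nextInt(Int.MaxValue)
  (record: String) => Array((a * record.hashCode + b) >>> 16)
}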


AND YET, OOM


BASIC LSH, WITH A 2-STABLE GAUSSIAN DISTRIBUTION

With n data points, choose k = O(log n) concatenated hashes and L = O(n^ρ) hash tables (ρ < 1, a function of the approximation factor), to solve the approximate nearest-neighbor problem.
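This family is usually instantiated as in Datar et al.: h(v) = floor((a·v + b) / w). A sketch, where the dimension d and bucket width w are free parameters:

import scala.util.Random

// one 2-stable LSH function for the Euclidean norm: project onto a
// Gaussian vector a, shift by a uniform offset b, quantize with width w
def pStableHash(d: Int, w: Double, rng: Random): Array[Double] => Int = {
  val a = Array.fill(d)(rng.nextGaussian())  // 2-stable: Gaussian entries
  val b = rng.nextDouble() * w               // uniform in [0, w)
  (v: Array[Double]) =>
    math.floor((a.zip(v).map { case (ai, vi) => ai * vi }.sum + b) / w).toInt
}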


WEB LOGS ARE SPARSE

Input: hits per user over 6 months, 2x50-ish integers per user (4 GB)

Output: 1000 integers per user, i.e. 10 (parallel) bands x 100 (concatenated) hashes

As 64-bit integers: 40 GB

Yet!

ENTROPY LSH (PANIGRAHY 2006): REPLACE TABLES BY OFFSETS

Query q + δ_1, ..., q + δ_t, with each offset δ_i chosen randomly from the surface of B(q, r), the sphere of radius r centered at the query q.
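Sampling uniformly from the surface of B(q, r) reduces to normalizing a Gaussian vector; a minimal sketch:

import scala.util.Random

// uniform sample on the sphere of radius r centered at q:
// draw a Gaussian vector, scale it to length r, add it to q
def randomOffset(q: Array[Double], r: Double, rng: Random): Array[Double] = {
  val g = Array.fill(q.length)(rng.nextGaussian())
  val norm = math.sqrt(g.map(x => x * x).sum)
  q.zip(g).map { case (qi, gi) => qi + r * gi / norm }
}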


ENTROPY LSH, WITH A 2-STABLE GAUSSIAN DISTRIBUTION

With n data points, choose O(n^(2ρ)) query offsets per lookup, to solve the problem with as few as O(1) hash tables.


BUT ... NETWORK COSTS

▸ Basic LSH: look up O(n^ρ) buckets, one per hash table

▸ Entropy LSH: search for O(n^(2ρ)) offsets


LAYERED LSH (BAHMANI ET AL. 2012)

The output of your LSH family is itself a point in a normed space, e.g. with a cosine norm.

For close points, the chance of their hashes landing in the same bucket is high!


LAYERED LSH

Have an LSH family G for your norm on that hash space.

It is then likely that G(H(q + δ_i)) = G(H(q)) for all offsets δ_i.


LAYERED LSH

Output of hash generation is (G(H(p)), (H(p), p)) for each point p.

In Spark: group by the outer key G(H(p)), or use a custom partitioner on the (H(p), p) RDD, so that all offsets of a query land on the same partition.

Network cost: each point is shipped once, to the partition chosen by G(H(p)), rather than once per offset (see Bahmani et al. for the exact bound).
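A minimal Spark sketch of this keying step, assuming illustrative placeholder functions H (inner LSH signature) and G (outer LSH over signatures):

import org.apache.spark.rdd.RDD

// layered LSH keying: G(H(p)) decides the partition, (H(p), p) travels with it
def layer(points: RDD[Array[Double]],
          H: Array[Double] => Seq[Int],    // inner LSH: signature in hash space
          G: Seq[Int] => Int)              // outer LSH: hashes whole signatures
    : RDD[(Int, Iterable[(Seq[Int], Array[Double])])] =
  points.map { p =>
    val h = H(p)
    (G(h), (h, p))                         // (GH(p), (H(p), p)) as on the slide
  }.groupByKey()                           // or partitionBy a custom partitioner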


PERFORMANCE


FUTURE WORK: HAVE A (BIG) WEBLOG?

▸ We've
▸ Yandex


FUTURE WORK: LOCALITY-SENSITIVE HASHING FORESTS!


RELEASE: github.com/huitseeker/spark-lsh

1 SEPT 2015
