A Gentle Introduction to Locality Sensitive Hashing with Apache Spark


A GENTLE INTRODUCTION TO APACHE SPARK AND LOCALITY-SENSITIVE HASHING

FRANÇOIS GARILLOT, (FORMERLY) TYPESAFE

francois@garillot.net

@huitseeker


LOCALITY-SENSITIVE HASHING

▸ A story: why LSH
▸ How it works & hash families
▸ LSH distribution
▸ Beware: WIP


SPARK TENETS

▸ broadcast variables
▸ per-partition commands
▸ shuffle sparsely
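A minimal sketch of these three tenets working together (the input path and hash logic are illustrative, not from the talk):

import org.apache.spark.SparkContext

def sketch(sc: SparkContext): Unit = {
  val hashers = Array(1, 2, 3)                  // small, read-only shared data
  val bcHashers = sc.broadcast(hashers)         // broadcast once per executor

  val records = sc.textFile("hdfs:///weblogs")  // hypothetical input
  val hashed = records.mapPartitions { iter =>  // per-partition command:
    val hs = bcHashers.value                    // fetch the broadcast once here
    iter.map(line => (math.abs(line.hashCode % 10), hs.map(_ * line.length)))
  }

  // shuffle sparsely: reduceByKey combines map-side before moving data
  hashed.reduceByKey((a, b) => a.zip(b).map { case (x, y) => x + y }).count()
}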


SEGMENTATION

▸ small sample: 289,421 users
▸ larger sample: 5,684,403 users

46K websites as features; run on 4 personal laptops plus 4 provided laptops


K-MEANS COMPLEXITY

Find k with the 'elbow method' on the within-cluster sum of squares. Then each Lloyd iteration costs O(n·k·d) for n points in d dimensions.
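In MLlib terms, the elbow method is a loop over k; a sketch assuming an RDD[Vector] of user features:

import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// train for a range of k and report the within-cluster sum of squares;
// look for the "elbow" where the curve flattens
def elbow(data: RDD[Vector]): Seq[(Int, Double)] =
  (2 to 20).map { k =>
    val model = KMeans.train(data, k, 20)  // k clusters, 20 Lloyd iterations
    (k, model.computeCost(data))
  }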


EM - GAUSSIAN MIXTURE

With d dimensions and k mixture components, each EM iteration costs O(n·k·d²) once full covariance matrices are involved.
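With MLlib this is one call; a sketch, again assuming an RDD[Vector] of user features:

import org.apache.spark.mllib.clustering.GaussianMixture
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD

// fit a k-component Gaussian mixture with EM
def fitGmm(data: RDD[Vector], k: Int) = {
  val model = new GaussianMixture().setK(k).setMaxIterations(100).run(data)
  model.weights.zip(model.gaussians).foreach { case (w, g) =>
    println(s"weight=$w mean=${g.mu}")   // per-component weight and mean
  }
  model
}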


LOCALITY-SENSITIVE HASHING FUNCTIONS

A family H of hashing functions is (r1, r2, p1, p2)-sensitive if, for any points x, y and h drawn from H:

▸ if d(x, y) ≤ r1 then Pr[h(x) = h(y)] ≥ p1
▸ if d(x, y) ≥ r2 then Pr[h(x) = h(y)] ≤ p2


DISTANCES! (THESE AND MANY OTHERS)

▸ Hamming distance: h(x) = x_i, where i is a randomly chosen index
▸ Jaccard: J(A, B) = |A ∩ B| / |A ∪ B|, matched by MinHash: h(A) = min over a ∈ A of π(a), for a random permutation π
▸ Cosine distance: h(x) = sign(r · x), for a random hyperplane normal r
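A minimal Scala sketch of these three hash families (the function shapes and seeding are illustrative):

import scala.util.Random

// Hamming: sample one coordinate of a bit vector
def hammingHash(i: Int)(x: Array[Boolean]): Boolean = x(i)

// Jaccard / MinHash: minimum of a random permutation over the set's
// elements, approximated here by a random linear hash
def minHash(a: Long, b: Long, prime: Long = 2147483647L)(s: Set[Int]): Long =
  s.map(x => (a * x + b) % prime).min

// Cosine / SimHash: sign of the projection onto a random hyperplane
def simHash(r: Array[Double])(x: Array[Double]): Boolean =
  r.zip(x).map { case (ri, xi) => ri * xi }.sum >= 0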



EARTH MOVER'S DISTANCE

Find the optimal flow F = (f_ij) minimizing:

  Σ_i Σ_j f_ij · d_ij

Then:

  EMD(P, Q) = (Σ_i Σ_j f_ij · d_ij) / (Σ_i Σ_j f_ij)
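For reference, the usual flow constraints on F = (f_ij), for histograms P = (p_i) and Q = (q_j):

\[
\begin{aligned}
\min_{F = (f_{ij})} \; & \sum_i \sum_j f_{ij}\, d_{ij} \\
\text{s.t. } \; & f_{ij} \ge 0, \qquad \sum_j f_{ij} \le p_i, \qquad \sum_i f_{ij} \le q_j, \\
& \sum_i \sum_j f_{ij} = \min\Big(\sum_i p_i, \sum_j q_j\Big)
\end{aligned}
\]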


A WORD ON MODULARITY

LSH for EMD was introduced by Charikar in the SimHash paper (2002).

Yet existing implementations (e.g. scikit-learn, mrsqueeze) leave no place to plug in your own LSH family!


LSH AMPLIFICATION: CONCATENATION (AND) AND PARALLEL (OR) CONSTRUCTIONS

▸ basic LSH: Pr[h(x) = h(y)] = p
▸ AND (series) construction: concatenate r hashes into one key; collision probability p^r
▸ OR (parallel) construction: b independent bands; collision probability 1 - (1 - p)^b

Combined, b bands of r hashes collide with probability 1 - (1 - p^r)^b.
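A quick numeric check of that combined S-curve (p = 0.8 / 0.4, r = 5, b = 20 are illustrative values, not from the talk):

// collision probability after amplification: b OR-bands of r AND-hashes
def amplified(p: Double, r: Int, b: Int): Double =
  1 - math.pow(1 - math.pow(p, r), b)

amplified(0.8, 5, 20)  // ≈ 0.9996: similar pairs almost always collide
amplified(0.4, 5, 20)  // ≈ 0.186: dissimilar pairs are mostly filtered out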


BASIC LSH

val hashCollection = records.
  map(s => (getId(s), s)).
  mapValues(s => getHash(s, hashers))

val subArrays = hashCollection.flatMap { case (recordId, hash) =>
  // split each signature into bands, keyed by band index
  hash.grouped(hashLength / numberBands).zipWithIndex.map {
    case (band, bandIndex) => (bandIndex, (band, recordId))
  }
}


LOOKUP

def findCandidates(record: Iterable[String],
                   hashers: Array[Int => Int],
                   mBands: BandType) = {
  val hash = getHash(record, hashers)
  val subArrays = partitionArray(hash).zipWithIndex

  subArrays.flatMap { case (band, bandIndex) =>
    // fetch this band's table, then the bucket for this band value, if any
    val hashedBucket = mBands.lookup(bandIndex).
      headOption.
      flatMap(_.get(band))
    hashedBucket
  }.flatten.toSet
}


DISTRIBUTE RANDOM SEEDS, NOT PERMUTATION FUNCTIONS

How getHash(record, hashers) runs on the cluster:

records.mapPartitions { iter =>
  val rng = new scala.util.Random()
  iter.map(x => hashers.flatMap(h => getHashFunction(rng, h)(x)))
}
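getHashFunction is not spelled out in the deck; a hypothetical sketch, assuming each hasher is just an integer seed from which the function is rebuilt inside the partition:

import scala.util.Random

// hypothetical: rebuild a hash function from a small integer seed instead of
// serializing a permutation with every task; rng is available for extra salt
def getHashFunction(rng: Random, seed: Int): String => Array[Int] = {
  val r = new Random(seed)
  val a = r.nextInt(Int.MaxValue) | 1        // odd multiplier derived from seed
  val b = r.nextInt(Int.MaxValue)
  (record: String) => Array((a * record.hashCode + b) >>> 16)
}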


AND YET, OOM


BASIC LSH, WITH A 2-STABLE GAUSSIAN DISTRIBUTION

With n data points, choose k = O(log n) concatenated hashes and L = O(n^ρ) hash tables (ρ < 1, a function of the approximation factor), to solve the approximate nearest-neighbor problem.
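This family is usually instantiated as in Datar et al.: h(v) = floor((a·v + b) / w). A sketch, where the dimension d and bucket width w are free parameters:

import scala.util.Random

// one 2-stable LSH function for the Euclidean norm: project onto a
// Gaussian vector a, shift by a uniform offset b, quantize with width w
def pStableHash(d: Int, w: Double, rng: Random): Array[Double] => Int = {
  val a = Array.fill(d)(rng.nextGaussian())  // 2-stable: Gaussian entries
  val b = rng.nextDouble() * w               // uniform in [0, w)
  (v: Array[Double]) =>
    math.floor((a.zip(v).map { case (ai, vi) => ai * vi }.sum + b) / w).toInt
}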


WEB LOGS ARE SPARSE

Input: hits per user over 6 months, 2x50-ish integers per user (4 GB)

Output: 1000 integers per user, i.e. 10 (parallel) bands x 100 (concatenated) hashes

As 64-bit integers: 40 GB

Yet!

ENTROPY LSH (PANIGRAHY 2006): REPLACE TABLES BY OFFSETS

Query q + δ_1, ..., q + δ_t, with each offset δ_i chosen randomly from the surface of B(q, r), the sphere of radius r centered at the query q.
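Sampling uniformly from the surface of B(q, r) reduces to normalizing a Gaussian vector; a minimal sketch:

import scala.util.Random

// uniform sample on the sphere of radius r centered at q:
// draw a Gaussian vector, scale it to length r, add it to q
def randomOffset(q: Array[Double], r: Double, rng: Random): Array[Double] = {
  val g = Array.fill(q.length)(rng.nextGaussian())
  val norm = math.sqrt(g.map(x => x * x).sum)
  q.zip(g).map { case (qi, gi) => qi + r * gi / norm }
}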


ENTROPY LSH, WITH A 2-STABLE GAUSSIAN DISTRIBUTION

With n data points, choose O(n^(2ρ)) query offsets per lookup, to solve the problem with as few as O(1) hash tables.


BUT ... NETWORK COSTS

▸ Basic LSH: look up O(n^ρ) buckets, one per hash table

▸ Entropy LSH: search for O(n^(2ρ)) offsets


LAYERED LSH (BAHMANI ET AL. 2012)

The output of your LSH family is itself a point in a normed space, e.g. with a cosine norm.

For close points, the chance of their hashes landing in the same bucket is high!


LAYERED LSH

Have an LSH family G for your norm on that hash space.

It is then likely that G(H(q + δ_i)) = G(H(q)) for all offsets δ_i.


LAYERED LSH

Output of hash generation is (G(H(p)), (H(p), p)) for each point p.

In Spark: group by the outer key G(H(p)), or use a custom partitioner on the (H(p), p) RDD, so that all offsets of a query land on the same partition.

Network cost: each point is shipped once, to the partition chosen by G(H(p)), rather than once per offset (see Bahmani et al. for the exact bound).
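A minimal Spark sketch of this keying step, assuming illustrative placeholder functions H (inner LSH signature) and G (outer LSH over signatures):

import org.apache.spark.rdd.RDD

// layered LSH keying: G(H(p)) decides the partition, (H(p), p) travels with it
def layer(points: RDD[Array[Double]],
          H: Array[Double] => Seq[Int],    // inner LSH: signature in hash space
          G: Seq[Int] => Int)              // outer LSH: hashes whole signatures
    : RDD[(Int, Iterable[(Seq[Int], Array[Double])])] =
  points.map { p =>
    val h = H(p)
    (G(h), (h, p))                         // (GH(p), (H(p), p)) as on the slide
  }.groupByKey()                           // or partitionBy a custom partitioner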


PERFORMANCE


FUTURE WORK: HAVE A (BIG) WEBLOG?

▸ We've
▸ Yandex


FUTURE WORK: LOCALITY-SENSITIVE HASHING FORESTS!


RELEASE: github.com/huitseeker/spark-lsh

1 SEPT 2015
