Post on 04-Mar-2021
5/12/17
1
Mining of Massive DatasetsJure Leskovec, Anand Rajaraman, Jeff Ullman Stanford University
http://www.mmds.org
Note to other teachers and users of these slides: We would be delighted if you found this our material useful in giving your own lectures. Feel free to use these slides verbatim, or to modify them to fit your own needs. If you make use of a significant portion of these slides in your own lecture, please include this message, or a link to our web site: http://www.mmds.org
High dim. data
Localitysensitivehashing
Clustering
Dimensionality
reduction
Graph data
PageRank,SimRank
NetworkAnalysis
SpamDetection
Infinite data
Filteringdata
streams
Webadvertising
Queriesonstreams
Machine learning
SVM
DecisionTrees
Perceptron,kNN
Apps
Recommendersystems
AssociationRules
Duplicatedocumentdetection
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 2
5/12/17
2
3J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
[Hays and Efros, SIGGRAPH 2007]
4J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
[Hays and Efros, SIGGRAPH 2007]
5/12/17
3
10 nearest neighbors from a collection of 20,000 imagesJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 5
[Hays and Efros, SIGGRAPH 2007]
10 nearest neighbors from a collection of 2 million imagesJ.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 6
[Hays and Efros, SIGGRAPH 2007]
5/12/17
4
¡ Manyproblemscanbeexpressedasfinding“similar”sets:§ Findnear-neighborsinhigh-dimensional space
¡ Examples:§ Pageswithsimilarwords
§ Forduplicatedetection,classificationbytopic§ Customerswhopurchasedsimilarproducts
§ Productswithsimilarcustomersets§ Imageswithsimilarfeatures
§ Userswhovisitedsimilarwebsites
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 7
¡ Given:Highdimensionaldatapoints𝒙𝟏, 𝒙𝟐, …§ Forexample: Imageisalongvectorofpixelcolors
1 2 10 2 10 1 0
→ [121021010]
¡ Andsomedistancefunction𝒅(𝒙𝟏, 𝒙𝟐)§ Whichquantifiesthe“distance”between𝒙𝟏 and𝒙𝟐
¡ Goal: Findallpairsofdatapoints(𝒙𝒊, 𝒙𝒋) thatarewithinsomedistancethreshold𝒅 𝒙𝒊, 𝒙𝒋 ≤ 𝒔
¡ Note: Naïvesolutionwouldtake𝑶 𝑵𝟐 Lwhere𝑵 isthenumberofdatapoints
¡ MAGIC:Thiscanbedonein𝑶 𝑵 !!How?J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 8
5/12/17
5
¡ Lasttime: Findingfrequentpairs
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 9
Item
s 1…
N
Items 1…N
Count of pair {i,j} in the data
Naïve solution:Single pass but requires space quadratic in the number of items
Item
s 1…
K
Items 1…K
Count of pair {i,j}in the data
A-Priori:First pass: Find frequent singletonsFor a pair to be a frequent pair candidate, its singletons have to be frequent!Second pass:Count only candidate pairs!
N … number of distinct itemsK … number of items with support ³ s
¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:
§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 10
Items 1…N
Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}
Buckets 1…B2 1
5/12/17
6
¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:
§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:
§ Pass2:§ Forapair{i,j}tobeacandidateforafrequentpair,itssingletons{i},{j}havetobefrequentandthepairhastohashtoafrequentbucket!
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 11
Items 1…N
Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}Basket 2: {1,2,4}Pairs: {1,2} {1,4} {2,4}
Buckets 1…B3 1 2
¡ Lasttime: Findingfrequentpairs¡ Furtherimprovement:PCY§ Pass1:
§ Countexactfrequencyofeachitem:§ Takepairsofitems{i,j},hashthemintoB bucketsandcountofthenumberofpairsthathashedtoeachbucket:
§ Pass2:§ Forapair{i,j}tobeacandidateforafrequentpair,itssingletonshavetobefrequentanditshastohashtoafrequentbucket!
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 12
Items 1…N
Basket 1: {1,2,3}Pairs: {1,2} {1,3} {2,3}Basket 2: {1,2,4}Pairs: {1,2} {1,4} {2,4}
Buckets 1…B3 1 2
Previous lecture: A-PrioriMain idea: CandidatesInstead of keeping a count of each pair, only keep a count of candidate pairs!
Today’s lecture: Find pairs of similar docsMain idea: Candidates-- Pass 1: Take documents and hash them to buckets such that documents that are similar hash to the same bucket-- Pass 2: Only compare documents that are candidates (i.e., they hashed to a same bucket)Benefits: Instead of O(N2) comparisons, we need O(N) comparisons to find similar documents
5/12/17
7
¡ Goal: Findnear-neighborsinhigh-dim.space§ Weformallydefine“nearneighbors”aspointsthatarea“smalldistance”apart
¡ Foreachapplication,wefirstneedtodefinewhat“distance”means
¡ Today:Jaccard distance/similarity§ TheJaccard similarity oftwosets isthesizeoftheirintersectiondividedbythesizeoftheirunion:sim(C1,C2)=|C1ÇC2|/|C1ÈC2|
§ Jaccard distance: d(C1,C2)=1- |C1ÇC2|/|C1ÈC2|
14J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
3 in intersection8 in unionJaccard similarity= 3/8Jaccard distance = 5/8
5/12/17
8
¡ Goal: Givenalargenumber(𝑵 inthemillionsorbillions)ofdocuments,find“nearduplicate”pairs
¡ Applications:§ Mirrorwebsites,orapproximatemirrors
§ Don’twanttoshowbothinsearchresults§ Similarnewsarticlesatmanynewssites
§ Clusterarticlesby“samestory”¡ Problems:
§ Manysmallpiecesofonedocumentcanappearoutoforderinanother
§ Toomanydocumentstocompareallpairs§ Documentsaresolargeorsomanythattheycannotfitinmainmemory
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 15
1. Shingling: Convertdocumentstosets
2. Min-Hashing: Convertlargesetstoshortsignatures,whilepreservingsimilarity
3. Locality-SensitiveHashing: Focusonpairsofsignatureslikelytobefromsimilardocuments
§ Candidatepairs!
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 16
5/12/17
9
17
Docu-ment
The setof stringsof length kthat appearin the doc-ument
Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity
Locality-SensitiveHashing
Candidatepairs:those pairsof signaturesthat we needto test forsimilarity
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
Step1: Shingling: Convertdocumentstosets
Docu-ment
The setof stringsof length kthat appearin the doc-ument
5/12/17
10
¡ Step1: Shingling: Convertdocumentstosets
¡ Simpleapproaches:§ Document=setofwordsappearingindocument§ Document=setof“important”words§ Don’tworkwellforthisapplication.Why?
¡ Needtoaccountfororderingofwords!¡ Adifferentway:Shingles!
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 19
¡ Ak-shingle (ork-gram)foradocumentisasequenceofktokensthatappearsinthedoc§ Tokenscanbecharacters,wordsorsomethingelse,dependingontheapplication
§ Assumetokens=charactersforexamples
¡ Example: k=2;documentD1=abcabSetof2-shingles:S(D1)={ab,bc,ca}§ Option: Shinglesasabag(multiset),countabtwice:S’(D1)={ab,bc,ca, ab}
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 20
5/12/17
11
¡ Tocompresslongshingles,wecanhash themto(say)4bytes
¡ Representadocumentbythesetofhashvaluesofitsk-shingles§ Idea: Twodocumentscould(rarely)appeartohaveshinglesincommon,wheninfactonlythehash-valueswereshared
¡ Example: k=2;documentD1= abcabSetof2-shingles:S(D1)={ab,bc,ca}Hashthesingles:h(D1)={1,5,7}
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 21
¡ DocumentD1isasetofitsk-shinglesC1=S(D1)¡ Equivalently,eachdocumentisa0/1vectorinthespaceofk-shingles§ Eachuniqueshingleisadimension§ Vectorsareverysparse
¡ AnaturalsimilaritymeasureistheJaccard similarity:
sim(D1,D2)=|C1ÇC2|/|C1ÈC2|
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 22
5/12/17
12
¡ Documentsthathavelotsofshinglesincommonhavesimilartext,evenifthetextappearsindifferentorder
¡ Caveat: Youmustpickk largeenough,ormostdocumentswillhavemostshingles§ k =5isOKforshortdocuments§ k =10isbetterforlongdocuments
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 23
¡ Supposeweneedtofindnear-duplicatedocumentsamong𝑵 = 𝟏milliondocuments
¡ Naïvely,wewouldhavetocomputepairwiseJaccard similaritiesforeverypairofdocs§ 𝑵(𝑵 − 𝟏)/𝟐 ≈5*1011 comparisons§ At105 secs/dayand106 comparisons/sec,itwouldtake5days
¡ For𝑵 = 𝟏𝟎million,ittakesmorethanayear…
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 24
5/12/17
13
Step2:Minhashing: Convertlargesets toshortsignatures,whilepreservingsimilarity
Docu-ment
The setof stringsof length kthat appearin the doc-ument
Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity
¡ Manysimilarityproblemscanbeformalizedasfindingsubsetsthathavesignificantintersection
¡ Encodesetsusing0/1(bit,boolean)vectors§ Onedimensionperelementintheuniversalset
¡ InterpretsetintersectionasbitwiseAND,andsetunionasbitwiseOR
¡ Example: C1 =10111;C2 =10011§ Sizeofintersection=3;sizeofunion=4,§ Jaccard similarity (notdistance)=3/4§ Distance:d(C1,C2)=1– (Jaccard similarity)=1/4
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 26
5/12/17
14
¡ Rows =elements(shingles)¡ Columns =sets(documents)§ 1inrowe andcolumns ifandonlyife isamemberofs
§ ColumnsimilarityistheJaccardsimilarityofthecorrespondingsets(rowswithvalue1)
§ Typicalmatrixissparse!¡ Eachdocumentisacolumn:
§ Example: sim(C1 ,C2)=?§ Sizeofintersection=3;sizeofunion=6,Jaccard similarity(notdistance)=3/6
§ d(C1,C2)=1– (Jaccard similarity)=3/627J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
0101
0111
1001
1000
10101011
0111 Documents
Shin
gles
¡ Wemightnotreallyrepresentthedatabyabooleanmatrix.
¡ Sparsematricesareusuallybetterrepresentedbythelistofplaceswherethereisanon-zerovalue.
¡ Butthematrixpictureisconceptuallyuseful.
28
5/12/17
15
1. Whenthesetsaresolargeorsomanythattheycannotfitinmainmemory.
2. Or,whentherearesomanysetsthatcomparingallpairsofsetstakestoomuchtime.
3. Orboth.
29
¡ Sofar:§ Documents® Setsofshingles§ Representsetsasboolean vectorsinamatrix
¡ Nextgoal:Findsimilarcolumnswhilecomputingsmallsignatures§ Similarityofcolumns==similarityofsignatures
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 30
5/12/17
16
¡ NextGoal:Findsimilarcolumns,Smallsignatures¡ Naïveapproach:
§ 1)Signaturesofcolumns: smallsummariesofcolumns§ 2)Examinepairsofsignatures tofindsimilarcolumns
§ Essential: Similaritiesofsignaturesandcolumnsarerelated
§ 3)Optional: Checkthatcolumnswithsimilarsignaturesarereallysimilar
¡ Warnings:§ Comparingallpairsmaytaketoomuchtime:JobforLSH
§ Thesemethodscanproducefalsenegatives,andevenfalsepositives(iftheoptionalcheckisnotmade)
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 31
¡ Keyidea: “hash”eachcolumnC toasmallsignature h(C),suchthat:§ (1) h(C) issmallenoughthatthesignaturefitsinRAM§ (2) sim(C1,C2) isthesameasthe“similarity”ofsignaturesh(C1)andh(C2)
¡ Goal: Findahashfunctionh(·) suchthat:§ Ifsim(C1,C2) ishigh,thenwithhighprob.h(C1)=h(C2)§ Ifsim(C1,C2) islow,thenwithhighprob.h(C1)≠h(C2)
¡ Hashdocsintobuckets.Expectthat“most”pairsofnearduplicatedocshashintothesamebucket!
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 32
5/12/17
17
¡ Goal: Findahashfunctionh(·) suchthat:§ ifsim(C1,C2) ishigh,thenwithhighprob.h(C1)=h(C2)§ ifsim(C1,C2) islow,thenwithhighprob.h(C1)≠h(C2)
¡ Clearly,thehashfunctiondependsonthesimilaritymetric:§ Notallsimilaritymetricshaveasuitablehashfunction
¡ ThereisasuitablehashfunctionfortheJaccard similarity: ItiscalledMin-Hashing
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 33
34
¡ Imaginetherowsoftheboolean matrixpermutedunderrandompermutationp
¡ Definea“hash”functionhp(C) =theindexofthefirst (inthepermutedorderp)rowinwhichcolumnC hasvalue1:
hp (C) = minp p(C)
¡ Useseveral(e.g.,100)independenthashfunctions(thatis,permutations)tocreateasignatureofacolumn
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
5/12/17
18
35
3
4
7
2
6
1
5
Signature matrix M
1212
5
7
6
3
1
2
4
1412
4
5
1
6
7
3
2
2121
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
2nd element of the permutation is the first to map to a 1
4th element of the permutation is the first to map to a 1
0101
0101
1010
1010
1010
1001
0101
Input matrix (Shingles x Documents) Permutation p
Note: Another (equivalent) way is to store row indexes: 1 5 1 5
2 3 1 36 4 6 4
¡ Choosearandompermutationp¡ Claim: Pr[hp(C1)=hp(C2)]=sim(C1,C2)¡ Why?
§ LetX beadoc(setofshingles),yÎ X isashingle§ Then: Pr[p(y)=min(p(X))]=1/|X|
§ ItisequallylikelythatanyyÎ X ismappedtothemin element
§ Lety bes.t. p(y)=min(p(C1ÈC2))§ Theneither: p(y)=min(p(C1))ifyÎ C1,or
p(y)=min(p(C2))ifyÎ C2§ Sotheprob.thatboth aretrueistheprob.y Î C1Ç C2§ Pr[min(p(C1))=min(p(C2))]=|C1ÇC2|/|C1ÈC2|=sim(C1,C2)
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 36
01
10
00
11
00
00
One of the twocols had to have1 at position y
5/12/17
19
¡ GivencolsC1 andC2,rowsmaybeclassifiedas:C1 C2
A 1 1B 1 0C 0 1D 0 0
§ a =#rowsoftypeA,etc.¡ Note: sim(C1,C2)=a/(a+b+c)¡ Then: Pr[h(C1)=h(C2)]=Sim(C1,C2)
§ LookdownthecolsC1 andC2 untilweseea1§ Ifit’satype-A row,thenh(C1)=h(C2)Ifatype-B ortype-C row,thennot
37J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
38
¡ Weknow:Pr[hp(C1)=hp(C2)]=sim(C1,C2)¡ Nowgeneralizetomultiplehashfunctions
¡ Thesimilarityoftwosignaturesisthefractionofthehashfunctionsinwhichtheyagree
¡ Note: BecauseoftheMin-Hashproperty,thesimilarityofcolumnsisthesameastheexpectedsimilarityoftheirsignatures
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
5/12/17
20
39J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
Similarities:1-3 2-4 1-2 3-4
Col/Col 0.75 0.75 0 0Sig/Sig 0.67 1.00 0 0
Signature matrix M
1212
5
7
6
3
1
2
4
1412
4
5
1
6
7
3
2
2121
0101
0101
1010
1010
1010
1001
0101
Input matrix (Shingles x Documents)
3
4
7
2
6
1
5
Permutation p
¡ PickK=100randompermutationsoftherows¡ Thinkofsig(C) asacolumnvector¡ sig(C)[i]= accordingtothei-th permutation,theindexofthefirstrowthathasa1incolumnCsig(C)[i] = min (pi(C))
¡ Note: Thesketch(signature)ofdocumentC issmall~𝟏𝟎𝟎 bytes!
¡ Weachievedourgoal! We“compressed”longbitvectorsintoshortsignatures
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 40
5/12/17
21
¡ Permutingrowsevenonceisprohibitive¡ Rowhashing!
§ PickK=100 hashfunctionski§ Orderingunderki givesarandomrowpermutation!
¡ One-passimplementation§ ForeachcolumnC andhash-func.ki keepa“slot”forthemin-hashvalue
§ Initializeallsig(C)[i]=¥§ Scanrowslookingfor1s
§ Supposerowj has1incolumnC§ Thenforeachki :
§ Ifki(j)<sig(C)[i],thensig(C)[i]¬ ki(j)J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 41
How to pick a randomhash function h(x)?Universal hashing:ha,b(x)=((a·x+b) mod p) mod Nwhere:a,b … random integersp … prime number (p > N)
¡ Suppose1billionrows.
¡ Hardtopickarandompermutationfrom1…billion.
¡ Representingarandompermutationrequires1billionentries.
¡ Accessingrowsinpermutedorderleadstothrashing.
42
5/12/17
22
¡ Agoodapproximationtopermutingrows:pick100(?)hashfunctions.
¡ Foreachcolumnc andeachhashfunctionhi,keepa“slot” M(i,c).
¡ Intent:M(i,c)willbecomethesmallestvalueofhi(r )forwhichcolumnc has1inrowr.§ I.e.,hi(r )givesorderofrowsfor i thpermuation.
43
Minhashing
InitializeM(i,c)to∞foralliandcfor eachrowrfor eachcolumnc
if chas1inrowrfor eachhashfunctionhi do
if hi(r)isasmallervaluethanM(i,c)then
M(i,c):=hi(r );
44
Minhashing
5/12/17
23
45
Row C1 C21 1 02 0 13 1 14 1 05 0 1
h(x) = x mod 5g(x) = 2x+1 mod 5
Sig1 Sig2
Minhashing
h(1) = 1 1 -g(1) = 3 3 -h(2) = 2 1 2g(2) = 0 3 0
h(3) = 3 1 2g(3) = 2 2 0h(4) = 4 1 2g(4) = 4 2 0h(5) = 0 1 0g(5) = 1 2 0
¡ Often,dataisgivenbycolumn,notrow.§ E.g.,columns=documents,rows=shingles.
¡ Ifso,sortmatrixoncesoitisbyrow.
¡ Andalways computehi(r)onlyonceforeachrow.
46
Minhashing
5/12/17
24
Step3:Locality-SensitiveHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments
Docu-ment
The setof stringsof length kthat appearin the doc-ument
Signatures:short integervectors thatrepresent thesets, andreflect theirsimilarity
Locality-SensitiveHashing
Candidatepairs:those pairsof signaturesthat we needto test forsimilarity
¡ Supposewehave,inmainmemory,datarepresentingalargenumberofobjects.§ Maybetheobjectsthemselves.§ Maybesignaturesasinminhashing.
¡ Wewanttocompareeachtoeach,findingthosepairs thataresufficientlysimilar.
48
5/12/17
25
¡ Whilethesignatures ofallcolumnsmayfitinmainmemory,comparingthesignaturesofallpairs ofcolumnsisquadratic inthenumberofcolumns.
¡ Example:106 columnsimplies5*1011 column-comparisons.
¡ At1microsecond/comparison:6days.
49
¡ Goal:FinddocumentswithJaccard similarityatleasts (forsomesimilaritythreshold,e.g., s=0.8)
¡ LSH– Generalidea: Useafunctionf(x,y) thattellswhetherx andy isacandidatepair: apairofelementswhosesimilaritymustbeevaluated
¡ ForMin-Hashmatrices:§ HashcolumnsofsignaturematrixM tomanybuckets§ Eachpairofdocumentsthathashesintothesamebucketisacandidatepair
50J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
1212
1412
2121
5/12/17
26
¡ Pickasimilaritythresholds (0<s<1)
¡ Columnsx andy ofM areacandidatepair iftheirsignaturesagreeonatleastfractions oftheirrows:M (i,x)=M (i,y) foratleastfrac.s valuesofi§ Weexpectdocumentsx andy tohavethesame(Jaccard)similarityastheirsignatures
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 51
1212
1412
2121
¡ Bigidea: HashcolumnsofsignaturematrixM severaltimes
¡ Arrangethat(only)similarcolumns arelikelytohashtothesamebucket,withhighprobability
¡ Candidatepairsarethosethathashtothesamebucket
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 52
1212
1412
2121
5/12/17
27
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 53
Signature matrix M
r rowsper band
b bands
Onesignature
1212
1412
2121
¡ DividematrixM intob bandsofr rows
¡ Foreachband,hashitsportionofeachcolumntoahashtablewithk buckets§ Makek aslargeaspossible
¡ Candidate columnpairsarethosethathashtothesamebucketfor≥ 1 band
¡ Tune b andr tocatchmostsimilarpairs,butfewnon-similarpairs
54J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
5/12/17
28
Matrix M
r rows b bands
BucketsColumns 2 and 6are probably identical (candidate pair)
Columns 6 and 7 aresurely different.
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 55
¡ Thereareenoughbuckets thatcolumnsareunlikelytohashtothesamebucketunlesstheyareidentical inaparticularband
¡ Hereafter,weassumethat“samebucket”means“identicalinthatband”
¡ Assumptionneededonlytosimplifyanalysis,notforcorrectnessofalgorithm
56J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
5/12/17
29
Assumethefollowingcase:¡ Suppose100,000columnsofM (100kdocs)¡ Signaturesof100integers(rows)¡ Therefore,signaturestake40Mb¡ Chooseb =20bandsofr =5integers/band
¡ Goal: Findpairsofdocumentsthatareatleasts =0.8 similar
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 57
1212
1412
2121
¡ Findpairsof³ s=0.8similarity,setb=20,r=5¡ Assume: sim(C1,C2)=0.8
§ Sincesim(C1,C2)³ s,wewantC1,C2 tobeacandidatepair:Wewantthemtohashtoat least1commonbucket(atleastonebandisidentical)
¡ ProbabilityC1,C2 identicalinoneparticularband:(0.8)5 =0.328
¡ ProbabilityC1,C2 arenot similarinallofthe20bands:(1-0.328)20 =0.00035§ i.e.,about1/3000thofthe80%-similarcolumnpairsarefalsenegatives (wemissthem)
§ Wewouldfind99.965%pairsoftrulysimilardocuments
58J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
1212
1412
2121
5/12/17
30
¡ Findpairsof³ s=0.8similarity,setb=20,r=5¡ Assume: sim(C1,C2)=0.3
§ Sincesim(C1,C2)<s wewantC1,C2 tohashtoNOcommonbuckets (allbandsshouldbedifferent)
¡ ProbabilityC1,C2 identicalinoneparticularband:(0.3)5 =0.00243
¡ ProbabilityC1,C2 identicalinatleast1of20bands:1- (1- 0.00243)20 =0.0474§ Inotherwords,approximately4.74%pairsofdocswithsimilarity0.3endupbecomingcandidatepairs§ Theyarefalsepositivessincewewillhavetoexaminethem(theyarecandidatepairs)butthenitwillturnouttheirsimilarityisbelowthresholds
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 59
1212
1412
2121
¡ Pick:§ ThenumberofMin-Hashes(rowsofM)§ Thenumberofbandsb,and§ Thenumberofrowsr perbandtobalancefalsepositives/negatives
¡ Example: Ifwehadonly15bandsof5rows,thenumberoffalsepositiveswouldgodown,butthenumberoffalsenegativeswouldgoup
60J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
1212
1412
2121
5/12/17
31
Similarity t =sim(C1, C2) of two sets
Probabilityof sharinga bucket
Sim
ilarit
y th
resh
old
sNo chance
if t < s
Probability = 1 if t > s
61J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 62
Remember:Probability ofequal hash-values= similarity
Similarity t =sim(C1, C2) of two sets
Probabilityof sharinga bucket
5/12/17
32
¡ ColumnsC1 andC2 havesimilarityt¡ Pickanyband(r rows)§ Prob.thatallrowsinbandequal= tr
§ Prob.thatsomerowinbandunequal=1- tr
¡ Prob.thatnobandidentical=(1- tr)b
¡ Prob.thatatleast1bandidentical=1- (1- tr)b
63J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
t r
All rowsof a bandare equal
1 -
Some rowof a bandunequal
( )b
No bandsidentical
1 -
At leastone bandidentical
s ~ (1/b)1/r
64J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
Similarity t=sim(C1, C2) of two sets
Probabilityof sharinga bucket
5/12/17
33
¡ Similaritythresholds¡ Prob.thatatleast1bandisidentical:
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 65
s 1-(1-sr)b
.2 .006
.3 .047
.4 .186
.5 .470
.6 .802
.7 .975
.8 .9996
¡ Pickingr andb togetthebestS-curve§ 50hash-functions(r=5,b=10)
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 66
0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 10
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Blue area: False Negative rateGreen area: False Positive rate
Similarity
Prob
. sha
ring
a bu
cket
5/12/17
34
¡ TuneM,b,r togetalmostallpairswithsimilarsignatures,buteliminatemostpairsthatdonothavesimilarsignatures
¡ Checkinmainmemorythatcandidatepairsreallydohavesimilarsignatures
¡ Optional: Inanotherpassthroughdata,checkthattheremainingcandidatepairsreallyrepresentsimilardocuments
67J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org
¡ Shingling: Convertdocumentstosets§ WeusedhashingtoassigneachshingleanID
¡ Min-Hashing:Convertlargesetstoshortsignatures,whilepreservingsimilarity§ Weusedsimilaritypreservinghashing togeneratesignatureswithpropertyPr[hp(C1)=hp(C2)]=sim(C1,C2)
§ Weusedhashingtogetaroundgeneratingrandompermutations
¡ Locality-SensitiveHashing:Focusonpairsofsignatureslikelytobefromsimilardocuments§ Weusedhashingtofindcandidatepairs ofsimilarity³ s
J.Leskovec,A.Rajaraman,J.Ullman:MiningofMassiveDatasets,http://www.mmds.org 68