Efficient Parallel Set-Similarity Joins Using MapReduce
Tilani Gunawardena
Content
• Introduction
• Preliminaries
• Self-Join case
• R-S Join case
• Handling insufficient memory
• Experimental evaluation
• Conclusions
Introduction
• Vast amounts of data:
– Google N-gram database: ~1 trillion records
– GenBank: 100 million records, size = 416 GB
– Facebook: 400 million active users
• Detecting similar pairs of records becomes a challenging problem
Examples
• Detecting near-duplicate web pages in web crawling
• Document clustering
• Plagiarism detection
• Master data management
– “John W. Smith”, “Smith, John”, “John William Smith”
• Making recommendations to users based on their similarity to other users in query refinement
• Mining in social networking sites
– Users [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] have similar interests
• Identifying coalitions of click fraudsters in online advertising
Preliminaries
• Problem statement: Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) ≥ λ
Set-similarity functions
• Jaccard or Tanimoto coefficient
– Jaccard(x, y) = |x ∩ y| / |x ∪ y|
• “I will call back” = [I, will, call, back]
• “I will call you soon” = [I, will, call, you, soon]
• Jaccard similarity = 3/6 = 0.5
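The Jaccard example above can be sketched in a few lines of Python (a minimal illustration, not part of the original system):

```python
def jaccard(x, y):
    """Jaccard similarity of two token collections: |x ∩ y| / |x ∪ y|."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

a = "I will call back".split()
b = "I will call you soon".split()
print(jaccard(a, b))  # 3 shared tokens out of 6 distinct -> 0.5
```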
Set-similarity with MapReduce
• Why Hadoop?
– Large amounts of data, shared-nothing architecture
• map (k1, v1) -> list(k2, v2)
• reduce (k2, list(v2)) -> list(k3, v3)
• Problem:
– Too much data to transfer
– Too many pairs to verify (two similar sets share at least 1 token)
Set-Similarity Filtering
• Efficient set-similarity join algorithms rely on effective filters
• string s = “I will call back”
• global token ordering: {back, call, will, I}
• prefix of length 2 of s = [back, call]
• The prefix filtering principle states that similar strings must share at least one common token in their prefixes
Prefix filtering: example
• Each set has 5 tokens
• “Similar”: they share at least 4 tokens
• Prefix length: 5 − 4 + 1 = 2
[Figure: prefixes of Record 1 and Record 2 under the global ordering]
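The example above can be sketched in Python. This is a minimal illustration (the global ordering and records are hypothetical): for an overlap threshold t, two sets of size n can share t tokens only if their prefixes of length n − t + 1 intersect.

```python
def prefix(tokens, order, overlap_threshold):
    """Prefix of a record under a global token ordering (rare tokens first)."""
    ranked = sorted(tokens, key=order.index)
    return set(ranked[: len(tokens) - overlap_threshold + 1])

def may_be_similar(r1, r2, order, t):
    # Prefix filter: a pair survives only if the two prefixes share a token.
    return bool(prefix(r1, order, t) & prefix(r2, order, t))

order = ["E", "D", "B", "A", "C", "F", "G"]       # hypothetical global ordering
r1 = ["A", "B", "C", "D", "E"]
r2 = ["A", "B", "C", "F", "G"]                    # shares only 3 tokens with r1
r3 = ["A", "B", "C", "D", "F"]                    # shares 4 tokens with r1
print(may_be_similar(r1, r2, order, t=4))         # pruned: prefixes disjoint
print(may_be_similar(r1, r3, order, t=4))         # survives: prefixes share D
```

Note that the filter only prunes pairs; surviving pairs still have to be verified against the actual similarity threshold.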
Parallel Set-Similarity Joins
• Stage I: Token Ordering
– Compute data statistics for good signatures
• Stage II: RID-Pair Generation
• Stage III: Record Join
– Generate actual pairs of joined records
Input Data
• RID = Row ID
• a: join column
• “A B C” is a string:
– Address: “14th Saarbruecker Strasse”
– Name: “John W. Smith”
Stage I: Token Ordering
• Basic Token Ordering (BTO)
• One-Phase Token Ordering (OPTO)
Token Ordering
• Creates a global ordering of the tokens in the join column, based on their frequency
RID | a         | b | c
1   | A B D A A | … | …
2   | B B D A E | … | …

Global ordering (based on frequency):
E(1)  D(2)  B(3)  A(4)
Basic Token Ordering (BTO)
• 2 MapReduce cycles:
– 1st: compute token frequencies
– 2nd: sort the tokens by their frequencies

Basic Token Ordering – 1st MapReduce cycle
map:
• tokenize the join value of each record
• emit each token with a count of 1
reduce:
• for each token, compute the total count (frequency)
Basic Token Ordering – 2nd MapReduce cycle
map:
• interchange key with value, i.e. emit (frequency, token)
reduce (use only 1 reducer):
• emit the tokens in increasing order of frequency
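The two BTO cycles can be simulated with plain Python functions (a sketch of the logic only, not actual Hadoop code):

```python
from collections import Counter

def cycle1(records):
    """Cycle 1: map tokenizes each record's join value and emits (token, 1);
    reduce sums the counts per token."""
    freq = Counter()
    for rec in records:
        freq.update(rec.split())
    return freq

def cycle2(freq):
    """Cycle 2: map swaps (token, freq) -> (freq, token); the single reducer
    sees pairs sorted by frequency and emits the tokens in that order."""
    return [tok for _, tok in sorted((f, t) for t, f in freq.items())]

records = ["A B D A A", "B B D A E"]   # the two example records from the slides
print(cycle2(cycle1(records)))         # -> ['E', 'D', 'B', 'A']
```

This reproduces the global ordering E, D, B, A from the token-ordering example (E occurs once, D twice, B three times, A four times).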
One-Phase Token Ordering (OPTO)
• alternative to Basic Token Ordering (BTO):
– Uses only one MapReduce cycle (less I/O)
– In-memory token sorting, instead of using a reducer
OPTO – Details
map:
• tokenize the join value of each record
• emit each token with a count of 1
reduce:
• for each token, compute the total count (frequency)
• use the tear_down method to order the tokens in memory
Stage II: RID-Pair Generation
• Basic Kernel (BK)
• Indexed Kernel (PK)
RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
• consists of only one MapReduce cycle
• uses the global ordering of tokens obtained in the previous stage
RID-Pair Generation: Map Phase
• scan input records and for each record:
– project it on RID & join attribute
– tokenize it
– extract the prefix according to the global ordering of tokens obtained in the Token Ordering stage
– route tokens to the appropriate reducer
Grouping/Routing Strategies
• Goal: distribute candidates to the right reducers to minimize the reducers’ workload
• Like hashing (projected) records to the corresponding candidate buckets
• Each reducer handles one or more candidate buckets
• 2 routing strategies:
– Using Individual Tokens
– Using Grouped Tokens
Routing: using individual tokens
• Treat each token as a key
• For each record, generate a (key, value) pair for each of its prefix tokens
Example:
• Given the global ordering:
Token:     A  B  E  D  G  C  F
Frequency: 10 10 22 23 23 40 48
• “A B C” => prefix of length 2: A, B => generate/emit 2 (key, value) pairs:
– (A, (1, A B C))
– (B, (1, A B C))
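The map-side emit for individual-token routing can be sketched as follows (a minimal illustration; the rank table encodes the hypothetical global ordering from the slide, rarest tokens first):

```python
def map_individual(rid, join_value, rank, prefix_len):
    """Emit one (token, record-projection) pair per prefix token."""
    tokens = sorted(join_value.split(), key=rank.get)
    return [(tok, (rid, join_value)) for tok in tokens[:prefix_len]]

# Global ordering from the example: A and B are the rarest tokens.
rank = {"A": 0, "B": 1, "E": 2, "D": 3, "G": 4, "C": 5, "F": 6}
print(map_individual(1, "A B C", rank, prefix_len=2))
# [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```

Records whose prefixes share a token are thus routed to the same reducer, which is exactly where the prefix filtering principle says similar pairs must meet.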
Grouping/Routing: using individual tokens
• Advantage:
– high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)
• Disadvantage:
– high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work)
Routing: Using Grouped Tokens
• Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
• For each record, generate a (key, value) pair for each of the groups of its prefix tokens
Example:
• Given the global ordering:
Token:     A  B  E  D  G  C  F
Frequency: 10 10 22 23 23 40 48
• “A B C” => prefix of length 2: A, B. Suppose A, B belong to group X and C belongs to group Y => generate/emit 2 (key, value) pairs:
– (X, (1, A B C))
– (Y, (1, A B C))
Grouping/Routing: Using Grouped Tokens
• The groups of tokens (X, Y) are formed by assigning tokens to groups in a round-robin manner
Token:     A  B  E  D  G  C  F
Frequency: 10 10 22 23 23 40 48
• With 3 groups: Group 1 = {A, D, F}, Group 2 = {B, G}, Group 3 = {E, C}
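The round-robin assignment described above can be sketched as (a minimal illustration, assuming the tokens are already sorted by frequency and the number of groups is 3):

```python
def round_robin_groups(tokens_by_freq, n_groups):
    """Assign frequency-sorted tokens to groups in round-robin order."""
    groups = [[] for _ in range(n_groups)]
    for i, tok in enumerate(tokens_by_freq):
        groups[i % n_groups].append(tok)
    return groups

tokens = ["A", "B", "E", "D", "G", "C", "F"]  # sorted by frequency
print(round_robin_groups(tokens, 3))
# [['A', 'D', 'F'], ['B', 'G'], ['E', 'C']]
```

Round-robin over the frequency-sorted list balances the groups: each group receives a mix of rare and frequent tokens rather than all the heavy ones landing in one bucket.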
Grouping/Routing: Using Grouped Tokens
• Advantage:
– less replication of record projections
• Disadvantage:
– the quality of grouping is not as high (records having no chance of being similar may be sent to the same reducer, which then checks their similarity)
– “ABCD” (A, B belong to group X; C belongs to group Y) => o/p: (X, _) & (Y, _)
– “EFG” (E belongs to group Y) => o/p: (Y, _)
RID-Pair Generation: Reduce Phase
• This is the core of the entire method
• Each reducer processes one or more buckets
• In each bucket, the reducer looks for pairs of join-attribute values satisfying the join predicate
• If the similarity of 2 candidates >= threshold => output their RIDs along with their similarity
RID-Pair Generation: Reduce Phase
• Computing the similarity of the candidates in a bucket comes in 2 flavors:
– Basic Kernel: uses 2 nested loops to verify each pair of candidates in the bucket
– Indexed Kernel: uses a PPJoin+ index
RID-Pair Generation: Basic Kernel
• Straightforward method for finding candidates satisfying the join predicate
• Quadratic complexity: O(#candidates²)
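A minimal sketch of the Basic Kernel's nested-loop verification, using Jaccard as the similarity function (the bucket contents here are hypothetical):

```python
def basic_kernel(bucket, threshold):
    """Verify every candidate pair in a bucket with two nested loops: O(n^2)."""
    results = []
    for i in range(len(bucket)):
        for j in range(i + 1, len(bucket)):
            rid1, toks1 = bucket[i]
            rid2, toks2 = bucket[j]
            s1, s2 = set(toks1), set(toks2)
            sim = len(s1 & s2) / len(s1 | s2)  # Jaccard similarity
            if sim >= threshold:
                results.append((rid1, rid2, sim))
    return results

bucket = [(1, "I will call back".split()),
          (2, "I will call you soon".split()),
          (3, "I will call back soon".split())]
print(basic_kernel(bucket, 0.5))
```

Every pair in the bucket is compared, which is why buckets routed by individual tokens (fewer false candidates) pay off here, and why the PPJoin+ indexed kernel below is faster on large buckets.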
RID-Pair Generation: PPJoin+ Indexed Kernel
• Uses a special index data structure
• Not so straightforward to implement
• map() – same as in the BK algorithm
• Much more efficient
Stage III: Record Join
• Until now we have only pairs of RIDs, but we need actual records
• Use the RID pairs generated in the previous stage to join the actual records
• Main idea:
– bring in the rest of each record (everything except the RID, which we already have)
• 2 approaches:
– Basic Record Join (BRJ)
– One-Phase Record Join (OPRJ)
Record Join: Basic Record Join
• Uses 2 MapReduce cycles
– 1st cycle: fills in the record information for each half of each pair
– 2nd cycle: brings together the previously filled-in records
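The two BRJ cycles can be sketched as follows (a plain-Python simulation under simplified assumptions: records are a dict keyed by RID, and each cycle is a single function rather than a distributed job):

```python
def basic_record_join(records, rid_pairs):
    """Simulate BRJ's two MapReduce cycles.

    Cycle 1: join each half of each RID pair with its full record
    (grouping by RID).  Cycle 2: group the two filled-in halves by
    their pair id to emit complete joined records.
    """
    # Cycle 1: emit (pair_id, filled-in half) for each side of each pair.
    halves = []
    for pair_id, (rid1, rid2) in enumerate(rid_pairs):
        halves.append((pair_id, records[rid1]))
        halves.append((pair_id, records[rid2]))
    # Cycle 2: group the halves by pair id.
    joined = {}
    for pair_id, rec in halves:
        joined.setdefault(pair_id, []).append(rec)
    return [tuple(v) for v in joined.values()]

records = {1: "John W. Smith", 2: "Smith, John", 3: "John William Smith"}
print(basic_record_join(records, [(1, 2), (1, 3)]))
```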
Record Join: One Phase Record Join
• Uses only one MapReduce cycle
R-S Join
• Challenge: we now have 2 different record sources => 2 different input streams
• MapReduce can work on only 1 input stream
• The 2nd and 3rd stages are affected
• Solution: extend the (key, value) pairs so that they include a relation tag for each record
Handling Insufficient Memory
• Map-Based Block Processing
• Reduce-Based Block Processing
Evaluation
• Cluster: 10-node IBM x3650, running Hadoop
• Data sets:
– DBLP: 1.2M publications
– CITESEERX: 1.3M publications
– Consider only the header of each paper (i.e., author, title, date of publication, etc.)
– Data size synthetically increased (by various factors)
• Measures:
– Absolute running time
– Speedup
– Scaleup
Self-Join running time
• Best algorithm: BTO-PK-OPRJ
• Most expensive stage: RID-pair generation
Self-Join Speedup
• Fixed data size, vary the cluster size
• Best time: BTO-PK-OPRJ
Self-Join Scaleup
• Increase data size and cluster size together by the same factor
• Best time: BTO-PK-OPRJ
Self-Join Summary
• Stage I: BTO was the best choice
• Stage II: PK was the best choice
• Stage III: the best choice depends on the amount of data and the size of the cluster
– OPRJ was somewhat faster, but the cost of loading the similar-RID pairs in memory stayed constant as the cluster size increased, and grew as the data size increased. For these reasons, we recommend BRJ as a good alternative
• Best scaleup was achieved by BTO-PK-BRJ
R-S Join Performance
Speedup
• Stage I: R-S join performance was identical to the first stage of the self-join case
• Stage II: we noticed a similar (almost perfect) speedup as in the self-join case
• Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach
Conclusions
• For both self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method.
• Useful in many data cleaning scenarios
• SSJoin and MapReduce: one solution for huge datasets
• Very efficient when based on prefix-filtering and PPJoin+
• Scales up nicely
Thank You!