Efficient Parallel Set-Similarity Joins Using MapReduce
Tilani Gunawardena
Content
• Introduction
• Preliminaries
• Self-Join case
• R-S Join case
• Handling insufficient memory
• Experimental evaluation
• Conclusions
Introduction
• Vast amounts of data:
– Google N-gram database: ~1 trillion records
– GenBank: 100 million records, size = 416 GB
– Facebook: 400 million active users
• Detecting similar pairs of records becomes a challenging problem
Examples
• Detecting near-duplicate web pages in web crawling
• Document clustering
• Plagiarism detection
• Master data management
– “John W. Smith”, “Smith, John”, “John William Smith”
• Making recommendations to users based on their similarity to other users in query refinement
• Mining in social networking sites
– Users [1,0,0,1,1,0,1,0,0,1] & [1,0,0,0,1,0,1,0,1,1] have similar interests
• Identifying coalitions of click fraudsters in online advertising
Preliminaries
• Problem statement: Given two collections of objects/items/records, a similarity metric sim(o1, o2) and a threshold λ, find the pairs of objects/items/records satisfying sim(o1, o2) ≥ λ
Set-similarity functions
• Jaccard or Tanimoto coefficient
– Jaccard(x, y) = |x ∩ y| / |x ∪ y|
• “I will call back” = [I, will, call, back]
• “I will call you soon” = [I, will, call, you, soon]
• Jaccard similarity = 3/6 = 0.5
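The Jaccard example above can be sketched in a few lines of Python (a minimal illustration, not part of the original system):

```python
def jaccard(x, y):
    """Jaccard similarity of two token collections: |x ∩ y| / |x ∪ y|."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

a = "I will call back".split()
b = "I will call you soon".split()
print(jaccard(a, b))  # 3 shared tokens out of 6 distinct -> 0.5
```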
Set-similarity with MapReduce
• Why Hadoop?
– Large amounts of data, shared-nothing architecture
• map (k1, v1) -> list(k2, v2)
• reduce (k2, list(v2)) -> list(k3, v3)
• Problem:
– Too much data to transfer
– Too many pairs to verify (two similar sets share at least 1 token)
Set-Similarity Filtering
• Efficient set-similarity join algorithms rely on effective filters
• string s = “I will call back”
• global token ordering: {back, call, will, I}
• prefix of length 2 of s = [back, call]
• The prefix filtering principle states that similar strings must share at least one common token in their prefixes
Prefix filtering: example
• Each set has 5 tokens
• “Similar”: they share at least 4 tokens
• Prefix length: 5 − 4 + 1 = 2
[Figure: prefixes of Record 1 and Record 2 under the global ordering]
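The example above can be sketched in Python. This is a minimal illustration (the global ordering and records are hypothetical): for an overlap threshold t, two sets of size n can share t tokens only if their prefixes of length n − t + 1 intersect.

```python
def prefix(tokens, order, overlap_threshold):
    """Prefix of a record under a global token ordering (rare tokens first)."""
    ranked = sorted(tokens, key=order.index)
    return set(ranked[: len(tokens) - overlap_threshold + 1])

def may_be_similar(r1, r2, order, t):
    # Prefix filter: a pair survives only if the two prefixes share a token.
    return bool(prefix(r1, order, t) & prefix(r2, order, t))

order = ["E", "D", "B", "A", "C", "F", "G"]       # hypothetical global ordering
r1 = ["A", "B", "C", "D", "E"]
r2 = ["A", "B", "C", "F", "G"]                    # shares only 3 tokens with r1
r3 = ["A", "B", "C", "D", "F"]                    # shares 4 tokens with r1
print(may_be_similar(r1, r2, order, t=4))         # pruned: prefixes disjoint
print(may_be_similar(r1, r3, order, t=4))         # survives: prefixes share D
```

Note that the filter only prunes pairs; surviving pairs still have to be verified against the actual similarity threshold.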
Parallel Set-Similarity Joins
• Stage I: Token Ordering
– Compute data statistics for good signatures
• Stage II: RID-Pair Generation
• Stage III: Record Join
– Generate actual pairs of joined records
Input Data
• RID = Row ID
• a: join column
• “A B C” is a string:
– Address: “14th Saarbruecker Strasse”
– Name: “John W. Smith”
Stage I: Token Ordering
• Basic Token Ordering (BTO)
• One-Phase Token Ordering (OPTO)
Token Ordering
• Creates a global ordering of the tokens in the join column, based on their frequency
RID | a         | b | c
1   | A B D A A | … | …
2   | B B D A E | … | …

Global ordering (based on frequency):
E(1)  D(2)  B(3)  A(4)
Basic Token Ordering (BTO)
• 2 MapReduce cycles:
– 1st: compute token frequencies
– 2nd: sort the tokens by their frequencies

Basic Token Ordering – 1st MapReduce cycle
map:
• tokenize the join value of each record
• emit each token with a count of 1
reduce:
• for each token, compute the total count (frequency)
Basic Token Ordering – 2nd MapReduce cycle
map:
• interchange key with value, i.e. emit (frequency, token)
reduce (use only 1 reducer):
• emit the tokens in increasing order of frequency
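The two BTO cycles can be simulated with plain Python functions (a sketch of the logic only, not actual Hadoop code):

```python
from collections import Counter

def cycle1(records):
    """Cycle 1: map tokenizes each record's join value and emits (token, 1);
    reduce sums the counts per token."""
    freq = Counter()
    for rec in records:
        freq.update(rec.split())
    return freq

def cycle2(freq):
    """Cycle 2: map swaps (token, freq) -> (freq, token); the single reducer
    sees pairs sorted by frequency and emits the tokens in that order."""
    return [tok for _, tok in sorted((f, t) for t, f in freq.items())]

records = ["A B D A A", "B B D A E"]   # the two example records from the slides
print(cycle2(cycle1(records)))         # -> ['E', 'D', 'B', 'A']
```

This reproduces the global ordering E, D, B, A from the token-ordering example (E occurs once, D twice, B three times, A four times).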
One-Phase Token Ordering (OPTO)
• alternative to Basic Token Ordering (BTO):
– Uses only one MapReduce cycle (less I/O)
– In-memory token sorting, instead of using a reducer
OPTO – Details
map:
• tokenize the join value of each record
• emit each token with a count of 1
reduce:
• for each token, compute the total count (frequency)
• use the tear_down method to order the tokens in memory
Stage II: RID-Pair Generation
• Basic Kernel (BK)
• Indexed Kernel (PK)
RID-Pair Generation
• scans the original input data (records)
• outputs the pairs of RIDs corresponding to records satisfying the join predicate (sim)
• consists of only one MapReduce cycle
• uses the global ordering of tokens obtained in the previous stage
RID-Pair Generation: Map Phase
• scan input records and for each record:
– project it on RID & join attribute
– tokenize it
– extract the prefix according to the global ordering of tokens obtained in the Token Ordering stage
– route tokens to the appropriate reducer
Grouping/Routing Strategies
• Goal: distribute candidates to the right reducers to minimize the reducers’ workload
• Like hashing (projected) records to the corresponding candidate buckets
• Each reducer handles one or more candidate buckets
• 2 routing strategies:
– Using Individual Tokens
– Using Grouped Tokens
Routing: using individual tokens
• Treat each token as a key
• For each record, generate a (key, value) pair for each of its prefix tokens
Example:
• Given the global ordering:
Token:     A  B  E  D  G  C  F
Frequency: 10 10 22 23 23 40 48
• “A B C” => prefix of length 2: A, B => generate/emit 2 (key, value) pairs:
– (A, (1, A B C))
– (B, (1, A B C))
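The map-side emit for individual-token routing can be sketched as follows (a minimal illustration; the rank table encodes the hypothetical global ordering from the slide, rarest tokens first):

```python
def map_individual(rid, join_value, rank, prefix_len):
    """Emit one (token, record-projection) pair per prefix token."""
    tokens = sorted(join_value.split(), key=rank.get)
    return [(tok, (rid, join_value)) for tok in tokens[:prefix_len]]

# Global ordering from the example: A and B are the rarest tokens.
rank = {"A": 0, "B": 1, "E": 2, "D": 3, "G": 4, "C": 5, "F": 6}
print(map_individual(1, "A B C", rank, prefix_len=2))
# [('A', (1, 'A B C')), ('B', (1, 'A B C'))]
```

Records whose prefixes share a token are thus routed to the same reducer, which is exactly where the prefix filtering principle says similar pairs must meet.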
Grouping/Routing: using individual tokens
• Advantage:
– high quality of grouping of candidates (pairs of records that have no chance of being similar are never routed to the same reducer)
• Disadvantage:
– high replication of data (the same records might be checked for similarity in multiple reducers, i.e. redundant work)
Routing: Using Grouped Tokens
• Multiple tokens are mapped to one synthetic key (different tokens can be mapped to the same key)
• For each record, generate a (key, value) pair for each of the groups of its prefix tokens
Example:
• Given the global ordering:
Token:     A  B  E  D  G  C  F
Frequency: 10 10 22 23 23 40 48
• “A B C” => prefix of length 2: A, B. Suppose A, B belong to group X and C belongs to group Y => generate/emit 2 (key, value) pairs:
– (X, (1, A B C))
– (Y, (1, A B C))
Grouping/Routing: Using Grouped Tokens
• The groups of tokens (X, Y) are formed by assigning tokens to groups in a round-robin manner
Token:     A  B  E  D  G  C  F
Frequency: 10 10 22 23 23 40 48
• With 3 groups: Group 1 = {A, D, F}, Group 2 = {B, G}, Group 3 = {E, C}
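The round-robin assignment described above can be sketched as (a minimal illustration, assuming the tokens are already sorted by frequency and the number of groups is 3):

```python
def round_robin_groups(tokens_by_freq, n_groups):
    """Assign frequency-sorted tokens to groups in round-robin order."""
    groups = [[] for _ in range(n_groups)]
    for i, tok in enumerate(tokens_by_freq):
        groups[i % n_groups].append(tok)
    return groups

tokens = ["A", "B", "E", "D", "G", "C", "F"]  # sorted by frequency
print(round_robin_groups(tokens, 3))
# [['A', 'D', 'F'], ['B', 'G'], ['E', 'C']]
```

Round-robin over the frequency-sorted list balances the groups: each group receives a mix of rare and frequent tokens rather than all the heavy ones landing in one bucket.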
Grouping/Routing: Using Grouped Tokens
• Advantage:
– less replication of record projections
• Disadvantage:
– the quality of grouping is not as high (records having no chance of being similar may be sent to the same reducer, which then checks their similarity)
– “ABCD” (A, B belong to group X; C belongs to group Y) => o/p: (X, _) & (Y, _)
– “EFG” (E belongs to group Y) => o/p: (Y, _)
RID-Pair Generation: Reduce Phase
• This is the core of the entire method
• Each reducer processes one or more buckets
• In each bucket, the reducer looks for pairs of join-attribute values satisfying the join predicate
• If the similarity of 2 candidates >= threshold => output their RIDs along with their similarity
RID-Pair Generation: Reduce Phase
• Computing the similarity of the candidates in a bucket comes in 2 flavors:
– Basic Kernel: uses 2 nested loops to verify each pair of candidates in the bucket
– Indexed Kernel: uses a PPJoin+ index
RID-Pair Generation: Basic Kernel
• Straightforward method for finding candidates satisfying the join predicate
• Quadratic complexity: O(#candidates²)
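A minimal sketch of the Basic Kernel's nested-loop verification, using Jaccard as the similarity function (the bucket contents here are hypothetical):

```python
def basic_kernel(bucket, threshold):
    """Verify every candidate pair in a bucket with two nested loops: O(n^2)."""
    results = []
    for i in range(len(bucket)):
        for j in range(i + 1, len(bucket)):
            rid1, toks1 = bucket[i]
            rid2, toks2 = bucket[j]
            s1, s2 = set(toks1), set(toks2)
            sim = len(s1 & s2) / len(s1 | s2)  # Jaccard similarity
            if sim >= threshold:
                results.append((rid1, rid2, sim))
    return results

bucket = [(1, "I will call back".split()),
          (2, "I will call you soon".split()),
          (3, "I will call back soon".split())]
print(basic_kernel(bucket, 0.5))
```

Every pair in the bucket is compared, which is why buckets routed by individual tokens (fewer false candidates) pay off here, and why the PPJoin+ indexed kernel below is faster on large buckets.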
RID-Pair Generation: PPJoin+ Indexed Kernel
• Uses a special index data structure
• Not so straightforward to implement
• map() – same as in the BK algorithm
• Much more efficient
Stage III: Record Join
• Until now we have only pairs of RIDs, but we need actual records
• Use the RID pairs generated in the previous stage to join the actual records
• Main idea:
– bring in the rest of each record (everything except the RID, which we already have)
• 2 approaches:
– Basic Record Join (BRJ)
– One-Phase Record Join (OPRJ)
Record Join: Basic Record Join
• Uses 2 MapReduce cycles
– 1st cycle: fills in the record information for each half of each pair
– 2nd cycle: brings together the previously filled-in records
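The two BRJ cycles can be sketched as follows (a plain-Python simulation under simplified assumptions: records are a dict keyed by RID, and each cycle is a single function rather than a distributed job):

```python
def basic_record_join(records, rid_pairs):
    """Simulate BRJ's two MapReduce cycles.

    Cycle 1: join each half of each RID pair with its full record
    (grouping by RID).  Cycle 2: group the two filled-in halves by
    their pair id to emit complete joined records.
    """
    # Cycle 1: emit (pair_id, filled-in half) for each side of each pair.
    halves = []
    for pair_id, (rid1, rid2) in enumerate(rid_pairs):
        halves.append((pair_id, records[rid1]))
        halves.append((pair_id, records[rid2]))
    # Cycle 2: group the halves by pair id.
    joined = {}
    for pair_id, rec in halves:
        joined.setdefault(pair_id, []).append(rec)
    return [tuple(v) for v in joined.values()]

records = {1: "John W. Smith", 2: "Smith, John", 3: "John William Smith"}
print(basic_record_join(records, [(1, 2), (1, 3)]))
```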
Record Join: One Phase Record Join
• Uses only one MapReduce cycle
R-S Join
• Challenge: we now have 2 different record sources => 2 different input streams
• MapReduce can work on only 1 input stream
• The 2nd and 3rd stages are affected
• Solution: extend the (key, value) pairs so that they include a relation tag for each record
Handling Insufficient Memory
• Map-Based Block Processing
• Reduce-Based Block Processing
Evaluation
• Cluster: 10-node IBM x3650, running Hadoop
• Data sets:
– DBLP: 1.2M publications
– CITESEERX: 1.3M publications
– Consider only the header of each paper (i.e., author, title, date of publication, etc.)
– Data size synthetically increased (by various factors)
• Measures:
– Absolute running time
– Speedup
– Scaleup
Self-Join running time
• Best algorithm: BTO-PK-OPRJ
• Most expensive stage: RID-pair generation
Self-Join Speedup
• Fixed data size, vary the cluster size
• Best time: BTO-PK-OPRJ
Self-Join Scaleup
• Increase data size and cluster size together by the same factor
• Best time: BTO-PK-OPRJ
Self-Join Summary
• Stage I: BTO was the best choice
• Stage II: PK was the best choice
• Stage III: the best choice depends on the amount of data and the size of the cluster
– OPRJ was somewhat faster, but the cost of loading the similar-RID pairs in memory stayed constant as the cluster size increased, and grew as the data size increased. For these reasons, we recommend BRJ as a good alternative
• Best scaleup was achieved by BTO-PK-BRJ
R-S Join Performance
Speedup
• Stage I: R-S join performance was identical to the first stage of the self-join case
• Stage II: we noticed a similar (almost perfect) speedup as in the self-join case
• Stage III: the OPRJ approach was initially the fastest (for the 2- and 4-node cases), but it eventually became slower than the BRJ approach
Conclusions
• For both self-join and R-S join cases, we recommend BTO-PK-BRJ as a robust and scalable method.
• Useful in many data cleaning scenarios
• SSJoin and MapReduce: one solution for huge datasets
• Very efficient when based on prefix-filtering and PPJoin+
• Scales up nicely
Thank You!