Theta join (M-bucket-I algorithm explained)
Transcript of Theta join (M-bucket-I algorithm explained)
![Page 1: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/1.jpg)
Processing Theta Joins using MapReduce
by Minsub Yim
![Page 2: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/2.jpg)
Processing pipeline at a reducer
Goal: We want to minimize job completion time. Since a reducer's running time depends on both its input size and its output size, we need a way to model both the input and the output of each reducer.
Mapper Output → Reducer → Join Output
Steps with time = f(input size):
• Receive mapper output
• Sort input by key
• Read input
Steps with time = f(output size):
• Run join algorithm
• Send join output
![Page 3: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/3.jpg)
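Theta Join Model

Dataset S

| S_id | Value |
|---|---|
| 1 | 5 |
| 2 | 6 |
| 3 | 6 |
| 4 | 8 |
| 5 | 8 |
| 6 | 10 |

Dataset T

| T_id | Value |
|---|---|
| 1 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 8 |
| 5 | 8 |
| 6 | 10 |

Assuming join condition: S.value = T.value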
[ Join Matrix M ] Rows: S values 5, 6, 6, 8, 8, 10. Columns: T values 5, 5, 6, 8, 8, 10. A marked cell is a tuple pair satisfying the join condition.
![Page 5: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/5.jpg)
Theta Join Model (Examples)
Each panel shows the join matrix with rows S = 5, 6, 6, 8, 8, 10 and columns T = 5, 5, 6, 8, 8, 10 for a different join condition:
• Join condition: S.value <= T.value
• Join condition: abs(S.value - T.value) < 2
• Join condition: S.value = T.value
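The three example matrices can be regenerated from the two value lists. The following is a minimal sketch, assuming the datasets S and T from the earlier slide; the helper names `join_matrix` and `true_cells` are hypothetical and not part of the original deck.

```python
# Values of S (rows) and T (columns) as listed on the slides.
S = [5, 6, 6, 8, 8, 10]
T = [5, 5, 6, 8, 8, 10]


def join_matrix(s_vals, t_vals, theta):
    """Return M where M[i][j] is True iff theta(S[i], T[j]) holds."""
    return [[theta(s, t) for t in t_vals] for s in s_vals]


def true_cells(matrix):
    """Count the cells that satisfy the join condition."""
    return sum(cell for row in matrix for cell in row)


# The three join conditions used in the examples.
CONDITIONS = {
    "S.value <= T.value": lambda s, t: s <= t,
    "abs(S.value - T.value) < 2": lambda s, t: abs(s - t) < 2,
    "S.value = T.value": lambda s, t: s == t,
}

if __name__ == "__main__":
    for name, theta in CONDITIONS.items():
        M = join_matrix(S, T, theta)
        print(f"{name}: {true_cells(M)} true cells")
        for row in M:
            print("  " + " ".join("X" if cell else "." for cell in row))
```

Only the predicate changes between the panels; the rest of the model stays the same, which is why the matrix view covers arbitrary theta joins.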
![Page 8: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/8.jpg)
Goal Revisited
• We want to minimize job completion time
• We need to assign every true cell to exactly one reducer. (find a mapping from M to R)
• Goal: Find a mapping from the join matrix M to reducers that minimizes job completion time
![Page 10: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/10.jpg)
Mappings from join matrix to reducers
Join condition: S.value = T.value
[ Two copies of the join matrix with rows S = 5, 6, 6, 8, 8, 10 and columns T = 5, 5, 6, 8, 8, 10, each with its true cells assigned to reducers (1) to (4) ]

Standard equi-join algorithm
[R1] Input: S1, T1, T2; Output: 2 tuples
[R2] Input: S2, S3, T3; Output: 2 tuples
[R3] Input: S4, S5, T4, T5; Output: 4 tuples
[R4] Input: S6, T6; Output: 1 tuple
Max-Reducer-Input (MRI): 4, Max-Reducer-Output (MRO): 4

Random
[R1] Input: S1, S4, S5, T1, T4, T5; Output: 3 tuples
[R2] Input: S2, S4, T3, T5; Output: 2 tuples
[R3] Input: S1, S5, T2, T4; Output: 2 tuples
[R4] Input: S3, S6, T3, T6; Output: 2 tuples
MRI: 6, MRO: 3
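The MRI and MRO figures of both mappings can be recomputed from the cells each reducer owns. This is a small sketch under two assumptions: a mapping is given as reducer id to a list of true (S, T) cells, and the cell lists for the random mapping are reconstructed from the inputs and outputs listed on the slide. The names `reducer_loads` and `mri_mro` are hypothetical.

```python
def reducer_loads(mapping):
    """Return reducer id -> (input tuples, output tuples).

    mapping: reducer id -> list of (s_id, t_id) true cells assigned to it.
    A reducer must receive every distinct S and T tuple touching its cells.
    """
    loads = {}
    for reducer, cells in mapping.items():
        s_ids = {s for s, _ in cells}
        t_ids = {t for _, t in cells}
        loads[reducer] = (len(s_ids) + len(t_ids), len(cells))
    return loads


def mri_mro(mapping):
    """Max-reducer-input and max-reducer-output of a mapping."""
    loads = list(reducer_loads(mapping).values())
    return max(i for i, _ in loads), max(o for _, o in loads)


# Standard equi-join: one reducer per join-key value.
STANDARD = {
    1: [("S1", "T1"), ("S1", "T2")],
    2: [("S2", "T3"), ("S3", "T3")],
    3: [("S4", "T4"), ("S4", "T5"), ("S5", "T4"), ("S5", "T5")],
    4: [("S6", "T6")],
}

# Random assignment (cells inferred from the slide's input/output lists).
RANDOM = {
    1: [("S1", "T1"), ("S4", "T4"), ("S5", "T5")],
    2: [("S2", "T3"), ("S4", "T5")],
    3: [("S1", "T2"), ("S5", "T4")],
    4: [("S3", "T3"), ("S6", "T6")],
}

if __name__ == "__main__":
    print("standard:", mri_mro(STANDARD))
    print("random:  ", mri_mro(RANDOM))
```

The random mapping balances output better (MRO 3 instead of 4) but duplicates input tuples across reducers, which raises MRI from 4 to 6.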
![Page 14: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/14.jpg)
Mappings from join matrix to reducers
Join condition: S.value = T.value
[ Join matrix with rows S = 5, 6, 6, 8, 8, 10 and columns T = 5, 5, 6, 8, 8, 10, partitioned into regions (1), (2) and (3) ]
[R1] Input: S1, S2, T1, T2; Output: 2 tuples
[R2] Input: S3, S4, T1, T2, T3; Output: 2 tuples
[R3] Input: S4, S5, S6, T4, T5, T6; Output: 5 tuples
Max-Reducer-Input: 6, Max-Reducer-Output: 5
![Page 16: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/16.jpg)
Mappings from join matrix to reducers
• We see that there can be many possible mappings from the join matrix to reducers.
• We will see, for different cases, which mapping is (close to) optimal, and algorithms to compute such a mapping.
![Page 17: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/17.jpg)
Lemma
We will use the following lemma repeatedly to show how (close to) optimal each mapping is.
[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.
[ Proof ] Consider a reducer r that receives m records from T and n records from S. These records cover at most $mn$ cells, so
$$mn \ge c \quad\Rightarrow\quad 2\sqrt{mn} \ge 2\sqrt{c},$$
and by the AM-GM inequality,
$$m + n \ge 2\sqrt{mn} \ge 2\sqrt{c}.$$
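Lemma 1 can also be checked numerically. The sketch below is an illustration rather than a proof: for a given c it brute-forces the smallest m + n over integer pairs with mn >= c and compares it with the bound. The function name `min_input_for_cells` is hypothetical.

```python
import math


def min_input_for_cells(c):
    """Smallest m + n over positive integers m, n with m * n >= c."""
    # For a fixed m, the smallest admissible n is ceil(c / m).
    return min(m + -(-c // m) for m in range(1, c + 1))


if __name__ == "__main__":
    for c in (1, 4, 9, 10, 36, 100):
        bound = 2 * math.sqrt(c)
        print(f"c={c:3d}  min input={min_input_for_cells(c):3d}  2*sqrt(c)={bound:.2f}")
```

The bound is tight exactly when c is a perfect square and the region is a square, which is why the mappings that follow prefer square regions.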
![Page 19: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/19.jpg)
Cross Product
• We first consider the cross product, where all pairs of tuples from the two datasets satisfy the join condition. The join matrix would look like the following:
Join condition: S × T
[ Join matrix with rows S = 5, 6, 6, 8, 8, 10 and columns T = 5, 5, 6, 8, 8, 10; every cell is marked ]
![Page 21: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/21.jpg)
Cross Product
• Since all entries of the join matrix are true, we can see that the maximum-reducer-output (MRO) satisfies $\text{MRO} \ge |S||T|/r$. (Otherwise, there would be tuples not mapped to a reducer.)
• Along with Lemma 1, we have a lower bound for the maximum-reducer-input (MRI): $\text{MRI} \ge 2\sqrt{|S||T|/r}$
[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.
![Page 23: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/23.jpg)
Cross Product
• We will revisit these two properties frequently to see the quality of join mappings:
$\text{MRO} \ge |S||T|/r$ and $\text{MRI} \ge 2\sqrt{|S||T|/r}$
![Page 24: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/24.jpg)
Cross Product
Case 1: Suppose |S| and |T| are multiples of $\sqrt{|S||T|/r}$. Namely, $|S| = c_S\sqrt{|S||T|/r}$ and $|T| = c_T\sqrt{|S||T|/r}$.
Then, partitioning the join matrix with squares of side length $\sqrt{|S||T|/r}$ is an optimal mapping.
Proof: Trivial. Each region mapped to a reducer has output size $|S||T|/r$ and input size $2\sqrt{|S||T|/r}$, which match the lower bounds on MRO and MRI.
![Page 26: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/26.jpg)
Cross Product
[ Join matrix with rows S = 5, 6, 6, 8, 8, 10 and columns T = 5, 5, 6, 8, 8, 10 ]
Suppose |S| = |T| = 6 and r = 9
$\text{MRO} = 4 = |S||T|/r$
$\text{MRI} = 4 = 2\sqrt{|S||T|/r}$
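The Case 1 example can be reproduced in a few lines. This is a minimal sketch, assuming the cross-product setting with |S| = |T| = 6 and r = 9; `square_partition` is a hypothetical helper that returns the top-left corner and side length of each square region.

```python
import math


def square_partition(n_s, n_t, r):
    """Case 1: tile the |S| x |T| matrix with r squares of side sqrt(|S||T|/r)."""
    cells_per_region, rem = divmod(n_s * n_t, r)
    side = math.isqrt(cells_per_region)
    if rem or side * side != cells_per_region or n_s % side or n_t % side:
        raise ValueError("Case 1 requires |S| and |T| to be multiples of sqrt(|S||T|/r)")
    # One region per (row offset, column offset) pair.
    return [(row, col, side)
            for row in range(0, n_s, side)
            for col in range(0, n_t, side)]


if __name__ == "__main__":
    regions = square_partition(6, 6, 9)
    mri = max(2 * side for _, _, side in regions)     # side rows + side columns
    mro = max(side * side for _, _, side in regions)  # every cell is true
    print(len(regions), "regions, MRI =", mri, ", MRO =", mro)
```

With nine 2 × 2 squares, each reducer receives 2 + 2 input tuples and produces 4 output tuples, so both lower bounds are met exactly.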
![Page 30: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/30.jpg)
Cross Product
Case 2: Suppose the cardinality of one dataset is significantly greater than that of the other (WLOG, assume $|S| < |T|/r$). Then, covering the join matrix with rectangles of size $|S| \times |T|/r$ is the optimal mapping.
(e.g., |S| = 3, |T| = 20, r = 5)
![Page 32: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/32.jpg)
Cross Product
Case 3: The remaining case, where $|T|/r \le |S| \le |T|$.
Let $c_S = \left\lfloor |S| / \sqrt{|S||T|/r} \right\rfloor$ and $c_T = \left\lfloor |T| / \sqrt{|S||T|/r} \right\rfloor$.
Then, covering M with $\sqrt{|S||T|/r} \times \sqrt{|S||T|/r}$ squares is a mapping worse than an optimal mapping by a factor no greater than 4.
![Page 33: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/33.jpg)
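Cross Product
If |S| and/or |T| is not a multiple of $\sqrt{|S||T|/r}$, scale each side by $\left(1 + \frac{1}{c_S}\right)$ and/or $\left(1 + \frac{1}{c_T}\right)$ respectively to cover M. Given $|T|/r \le |S| \le |T|$, we have $c_S \ge 1$ and $c_T \ge 1$, so we see that
$$\left(1 + \frac{1}{c_S}\right)\sqrt{\frac{|S||T|}{r}} \le 2\sqrt{\frac{|S||T|}{r}}$$
and likewise for the side scaled by $\left(1 + \frac{1}{c_T}\right)$.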
![Page 34: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/34.jpg)
Cross Product
Hence, $\text{MRI} \le 4\sqrt{|S||T|/r}$ and $\text{MRO} \le 4|S||T|/r$.
Comparing these with the lower bounds given above, we see that the MRO and MRI produced by this mapping are at most 4 times (twice for MRI) the lower bounds.
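To see the Case 3 bounds on concrete numbers, the sketch below stretches the squares so that $c_S \times c_T$ regions cover the whole matrix. It is an illustration under stated assumptions: region sides are treated as real numbers (no rounding to whole tuples), and the example sizes |S| = 100, |T| = 150, r = 10 are made up for the demonstration. `case3_cover` is a hypothetical name.

```python
import math


def case3_cover(n_s, n_t, r):
    """Cover the cross-product matrix with c_S x c_T stretched squares."""
    if not n_t / r <= n_s <= n_t:
        raise ValueError("Case 3 assumes |T|/r <= |S| <= |T|")
    side = math.sqrt(n_s * n_t / r)   # ideal square side
    c_s = math.floor(n_s / side)      # regions along the S axis
    c_t = math.floor(n_t / side)      # regions along the T axis
    rows = n_s / c_s                  # stretched side along S
    cols = n_t / c_t                  # stretched side along T
    return {
        "regions": c_s * c_t,
        "mri": rows + cols,
        "mro": rows * cols,
        "side": side,
    }


if __name__ == "__main__":
    n_s, n_t, r = 100, 150, 10        # hypothetical sizes
    cover = case3_cover(n_s, n_t, r)
    print("regions used:", cover["regions"], "of", r)
    print("MRI:", cover["mri"], "bounds:", 2 * cover["side"], 4 * cover["side"])
    print("MRO:", cover["mro"], "bounds:", n_s * n_t / r, 4 * n_s * n_t / r)
```

In this example only 6 of the 10 reducers are used, which is the price of keeping the regions square-like; MRI and MRO still stay within the factors stated above.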
![Page 35: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/35.jpg)
Implementation
• Now we know how to (nearly) optimally partition the join matrix. So let's run it!
• However, when a mapper is given a record (either from S or T), it does NOT know where exactly in the dataset (in which row/column of the join matrix) the record belongs.
• We could run another pre-processing step to get that information, but it can be avoided by running a randomized algorithm!
![Page 38: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/38.jpg)
Mapping & Randomized Algorithm
Algorithm 1: Map (Theta-Join)
Input: input tuple x ∈ S ∪ T
if x ∈ S then
    matrixRow = random(1, |S|)
    for all regionID in lookup.getRegions(matrixRow) do
        Output (regionID, (x, "S"))
else
    matrixCol = random(1, |T|)
    for all regionID in lookup.getRegions(matrixCol) do
        Output (regionID, (x, "T"))
1. Given a record x ∈ S (WLOG)
2. Pick a row uniformly at random
3. Get all the regions intersecting that row and output (regID, (x, "S")) for each of them
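The map and reduce steps can be simulated in a single process. The following is a sketch, not the authors' implementation: the three-region layout in `REGIONS` is an assumption (rows 1-3 split into two column ranges, rows 4-6 as one region), tuples are modelled as (id, value) pairs, and the names `map_tuple` and `one_bucket_theta` are hypothetical.

```python
import random
from collections import defaultdict

# Assumed layout: region id -> (first_row, last_row, first_col, last_col),
# 1-based and inclusive. The regions partition a 6 x 6 join matrix.
REGIONS = {1: (1, 3, 1, 3), 2: (1, 3, 4, 6), 3: (4, 6, 1, 6)}


def map_tuple(x, origin, n_s, n_t, rng):
    """Algorithm 1: send x to every region crossing a random row or column."""
    if origin == "S":
        row = rng.randint(1, n_s)
        return [(rid, (x, "S")) for rid, (r0, r1, _, _) in REGIONS.items()
                if r0 <= row <= r1]
    col = rng.randint(1, n_t)
    return [(rid, (x, "T")) for rid, (_, _, c0, c1) in REGIONS.items()
            if c0 <= col <= c1]


def one_bucket_theta(S, T, theta, seed=0):
    """Run map, group by region id, then join inside each reducer."""
    rng = random.Random(seed)
    buckets = defaultdict(lambda: {"S": [], "T": []})
    for origin, data in (("S", S), ("T", T)):
        for x in data:
            for rid, (record, tag) in map_tuple(x, origin, len(S), len(T), rng):
                buckets[rid][tag].append(record)
    output = []
    for rid in sorted(buckets):
        for s in buckets[rid]["S"]:
            for t in buckets[rid]["T"]:
                if theta(s[1], t[1]):
                    output.append((s[0], t[0]))
    return output


if __name__ == "__main__":
    S = [("S1", 5), ("S2", 7), ("S3", 7), ("S4", 8), ("S5", 9), ("S6", 9)]
    T = [("T1", 5), ("T2", 7), ("T3", 7), ("T4", 7), ("T5", 8), ("T6", 9)]
    print(sorted(one_bucket_theta(S, T, lambda a, b: a == b)))
```

Because the regions partition the matrix, any (row, column) pair falls in exactly one region, so every matching pair is produced once regardless of which random rows and columns are drawn.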
![Page 40: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/40.jpg)
Mapping & Randomized Algorithm
Join condition: S.value = T.value
[ Join matrix with rows S = 5, 7, 7, 8, 9, 9 and columns T = 5, 7, 7, 7, 8, 9, partitioned into regions (1), (2) and (3) ]

Map

| Input Tuple | Value (A) | Random Row/Col | Output |
|---|---|---|---|
| S1 | 5 | 3 | (1,S1) (2,S1) |
| S2 | 7 | 5 | (3,S2) |
| S3 | 7 | 1 | (1,S3) (2,S3) |
| S4 | 8 | 5 | (3,S4) |
| S5 | 9 | 1 | (1,S5) (2,S5) |
| S6 | 9 | 2 | (1,S6) (2,S6) |
| T1 | 5 | 6 | (2,T1) (3,T1) |
| T2 | 7 | 2 | (1,T2) (3,T2) |
| T3 | 7 | 2 | (1,T3) (3,T3) |
| T4 | 7 | 3 | (1,T4) (3,T4) |
| T5 | 8 | 6 | (2,T5) (3,T5) |
| T6 | 9 | 4 | (2,T6) (3,T6) |

Reduce
Reducer 1: key 1 (regID); Input: S1, S3, S5, S6, T2, T3, T4; Output: (S3,T2) (S3,T3) (S3,T4)
Reducer 2: key 2 (regID); Input: S1, S3, S5, S6, T1, T5, T6; Output: (S1,T1) (S5,T6) (S6,T6)
Reducer 3: key 3 (regID); Input: S2, S4, T1, T2, T3, T4, T5, T6; Output: (S2,T2) (S2,T3) (S2,T4) (S4,T5)
![Page 41: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/41.jpg)
Cross Product… NOT!
• We have verified that the 1-Bucket-Theta (1BT) algorithm is close to optimal when the join is a cross product.
• How does the 1-Bucket-Theta algorithm perform when the join is NOT a cross product?
• We will compare the quality of the 1-Bucket-Theta algorithm to that of any join algorithm.
![Page 44: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/44.jpg)
1BT vs ANY join algorithm
Let $1 \ge x > 0$. Any matrix-to-reducer mapping that has to cover at least $x|S||T|$ of the $|S||T|$ cells of the join matrix has, by Lemma 1, $\text{MRI} \ge 2\sqrt{x|S||T|/r}$.
[ LEMMA 1 ] A reducer that is assigned c cells of the join matrix M will receive at least $2\sqrt{c}$ input tuples.
As we have seen, 1BT guarantees that $\text{MRI} \le 4\sqrt{|S||T|/r}$. Hence,
$$\frac{\text{MRI}_{\text{1BT}}}{\text{MRI}_{\text{AnyJoinAlg}}} \le \frac{4\sqrt{|S||T|/r}}{2\sqrt{x|S||T|/r}} = \frac{2}{\sqrt{x}}$$
![Page 45: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/45.jpg)
1BT vs ANY join algorithm
![Page 46: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/46.jpg)
1BT vs ANY join algorithm
When $x = 0.5$, the ratio is $2/\sqrt{0.5} \approx 2.83 < 3$.
Hence, compared to ANY join algorithm that assigns more than 50% of the join matrix cells to reducers, the MRI of 1BT is at most 3 times the MRI of that algorithm.
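The ratio is easy to tabulate for other values of x. A small sketch follows, with the hypothetical helper name `mri_ratio_bound`:

```python
import math


def mri_ratio_bound(x):
    """Upper bound 2 / sqrt(x) on MRI(1BT) / MRI(any algorithm covering a fraction x)."""
    if not 0 < x <= 1:
        raise ValueError("x must satisfy 0 < x <= 1")
    return 2 / math.sqrt(x)


if __name__ == "__main__":
    for x in (1.0, 0.5, 0.25, 0.1, 0.01):
        print(f"x = {x:<5} ratio <= {mri_ratio_bound(x):.2f}")
```

The bound grows as x shrinks, so 1BT loses its guarantee for highly selective joins, where only a small fraction of the matrix has to be covered.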
![Page 48: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/48.jpg)
M-Bucket-I
• In the previous slide, we saw that mapping smaller regions, instead of covering the entire matrix, would yield a better MRI.
• Ideally, we only want to map the cells satisfying the join condition, but this cannot be done without knowing input statistics and/or the join condition.
• M-Bucket-I exploits statistics to improve over the 1-Bucket-Theta join algorithm.
![Page 51: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/51.jpg)
M-Bucket-I
[ Step 1 ] Approximate Equi-Depth Histograms
1) Sample approximately n records from S by keeping each record with probability n/|S|
2) Build k-quantiles (k buckets) from the sample, where k < n
3) Iterate through S and count the number of records in each bucket
4) Do the same for T and build the join matrix accordingly
![Page 52: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/52.jpg)
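M-Bucket-I
[ Step 1 ] Approximate Equi-Depth Histograms

Dataset S

| S_id | Value |
|---|---|
| 1 | 7 |
| 2 | 2 |
| 3 | 4 |
| 4 | 2 |
| 5 | 1 |
| 6 | 9 |
| 7 | 10 |
| 8 | 2 |
| 9 | 5 |
| 10 | 3 |

Dataset T

| T_id | Value |
|---|---|
| 1 | 5 |
| 2 | 5 |
| 3 | 6 |
| 4 | 8 |
| 5 | 8 |
| 6 | 10 |
| 7 | 2 |
| 8 | 4 |
| 9 | 1 |
| 10 | 3 |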
Sample S 7, 2, 2, 9, 2, 3
Sample T 5, 6, 8, 2, 1, 3
Buckets

| Dataset | Bucket boundaries | Records per bucket |
|---|---|---|
| S | 0, 2, 3, 9, ∞ | 4, 1, 4, 1 |
| T | 0, 1, 5, 8, ∞ | 1, 5, 3, 1 |
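The histogram step can be sketched as follows. This is a minimal illustration with stated assumptions: a bucket is taken to hold values v with lower < v <= upper (this reading reproduces the per-bucket counts shown for both datasets), and `quantile_boundaries` is only one possible quantile rule that need not reproduce the boundaries on the slide. All function names are hypothetical.

```python
import random
from bisect import bisect_left

S = [7, 2, 4, 2, 1, 9, 10, 2, 5, 3]
T = [5, 5, 6, 8, 8, 10, 2, 4, 1, 3]


def bernoulli_sample(values, n, rng):
    """Keep each record independently with probability n / |values|."""
    p = min(1.0, n / len(values))
    return [v for v in values if rng.random() < p]


def quantile_boundaries(sample, k):
    """Pick k - 1 interior bucket boundaries from the sorted sample."""
    if not sample:
        raise ValueError("empty sample")
    ordered = sorted(sample)
    return [ordered[(i * len(ordered)) // k] for i in range(1, k)]


def bucket_counts(values, boundaries):
    """Count records per bucket; bucket i holds boundaries[i-1] < v <= boundaries[i]."""
    counts = [0] * (len(boundaries) + 1)
    for v in values:
        counts[bisect_left(boundaries, v)] += 1
    return counts


if __name__ == "__main__":
    # Interior boundaries taken from the slide (0 and infinity are implicit).
    print("S:", bucket_counts(S, [2, 3, 9]))
    print("T:", bucket_counts(T, [1, 5, 8]))
    sample = bernoulli_sample(S, 6, random.Random(1))
    print("sample of S:", sample)
```

Only the boundaries come from the sample; the counts are taken over the full dataset, which is what makes the histogram accurate enough to estimate region input sizes.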
![Page 59: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/59.jpg)
M-Bucket-I
[ Step 1 ] Approximate Equi-Depth Histograms
Join condition: S.value = T.value
[ 10 × 10 join matrix of S against T, divided by the S bucket boundaries 2, 3, 9 and the T bucket boundaries 1, 5, 8 ]
We now have candidate cells. How do we map these cells to reducers?
![Page 60: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/60.jpg)
M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm
Algorithm: M-Bucket-I
Input: maxInput, r, M
row = 0
while row < M.noOfRows do
    (row, r) = CoverSubMatrix(row, maxInput, r, M)
    if r < 0 then
        return false
return true
![Page 61: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/61.jpg)
M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm
Algorithm: CoverSubMatrix
Input: row_s, maxInput, r, M
maxScore = -1, rUsed = 0
for i = 1 to maxInput-1 do
    R_i = CoverRows(row_s, row_s + i, maxInput, M)
    area = totalCandidateArea(row_s, row_s + i, M)
    score = area / R_i.size
    if score >= maxScore then
        maxScore = score
        bestRow = row_s + i
        rUsed = R_i.size
r = r - rUsed
return (bestRow + 1, r)
![Page 62: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/62.jpg)
M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm
Algorithm: CoverRows
Input: row_f, row_l, maxInput, M
Regions = ∅; r = newRegion()
for all c_i in M.getColumns do
    if r.cap < c_i.candidateInputCosts then
        Regions = Regions ∪ r
        r = newRegion()
    r.Cells = r.Cells ∪ c_i.candidateCells
return Regions
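The three procedures fit together as in the sketch below. It is a simplified reading, not the original implementation, under these assumptions: the matrix M is a boolean grid in which every row and column stands for one input tuple (no histogram buckets), a region is a block of consecutive rows plus a list of columns, blocks start at a single row (i = 0), and the last open region of a block is also emitted. The function names mirror the pseudocode but the signatures are my own.

```python
def cover_rows(M, row_f, row_l, max_input):
    """Cover candidate cells of rows row_f..row_l with regions of input <= max_input."""
    height = row_l - row_f + 1
    cap = max_input - height          # columns a region may hold next to its rows
    regions, cols = [], []
    for c in range(len(M[0])):
        if not any(M[r][c] for r in range(row_f, row_l + 1)):
            continue                  # no candidate cell in this column of the block
        if len(cols) == cap:          # current region is full, start a new one
            regions.append((row_f, row_l, cols))
            cols = []
        cols.append(c)
    if cols:
        regions.append((row_f, row_l, cols))
    return regions


def cover_sub_matrix(M, row_s, max_input, r):
    """Pick the row block starting at row_s with the best area / regions score."""
    max_score, best = -1.0, None
    for i in range(max_input - 1):
        row_l = row_s + i
        if row_l >= len(M):
            break
        regions = cover_rows(M, row_s, row_l, max_input)
        area = sum(sum(M[x]) for x in range(row_s, row_l + 1))
        score = area / len(regions) if regions else 0.0
        if score >= max_score:
            max_score, best = score, (row_l, regions)
    row_l, regions = best
    return row_l + 1, r - len(regions), regions


def m_bucket_i(M, max_input, r):
    """Return the list of regions, or None if more than r regions are needed."""
    if max_input < 2:
        raise ValueError("max_input must be at least 2")
    row, mapping = 0, []
    while row < len(M):
        row, r, regions = cover_sub_matrix(M, row, max_input, r)
        if r < 0:
            return None
        mapping.extend(regions)
    return mapping


if __name__ == "__main__":
    S = [5, 6, 6, 8, 8, 10]
    T = [5, 5, 6, 8, 8, 10]
    M = [[s == t for t in T] for s in S]
    for first, last, cols in m_bucket_i(M, max_input=4, r=4):
        print(f"rows {first}..{last}, columns {cols}")
```

On this small equi-join matrix the heuristic recovers one region per join-key value, each with at most 4 input tuples.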
![Page 68: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/68.jpg)
M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm
Run the algorithm with r = 6, maxInput = 5
• row: 0, score: 4
• row: 1, score: 13/3 ≈ 4.33
• row: 2, score: 22/4 = 5.5
• row: 3, score: 31/7 ≈ 4.43
We choose the mapping with the highest score (row 2), which uses regions (1) to (4).
![Page 69: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/69.jpg)
M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm
Run the algorithm with r = 6, maxInput = 5
Regions (1) to (4) are fixed; the search continues with the next block of rows:
• row: 3, score: 3
So on and so forth…
![Page 70: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/70.jpg)
M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm
Run the algorithm with r = 6, maxInput = 5
Final mapping: the candidate cells are covered by 13 regions, (1) to (13).
![Page 71: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/71.jpg)
M-Bucket-I
[ Step 2 ] M-Bucket-I Algorithm
Run the algorithm with r = 6, maxInput = 5
However, we have mapped the candidate cells to 13 regions, which is more than r = 6 reducers.
We do a binary search on maxInput until we get a mapping to <= r reducers.
![Page 74: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/74.jpg)
M-Bucket-I
[ Step 3 ] Binary Search
• MaxInput = |S|+|T| = 20: Num. Reducers = 1
• MaxInput = 5: Num. Reducers = 13
• MaxInput = 12: Num. Reducers = 3
• MaxInput = 8: Num. Reducers = 5
Since 7 reducers (more than r = 6) are required when MaxInput = 7, we stop the binary search here and output the mapping with MRI = 8.
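The search itself is an ordinary binary search over maxInput. The sketch below assumes that the number of regions never increases as maxInput grows. `regions_needed` stands in for a full run of the Step 2 heuristic; here it is replaced by a hypothetical stub, `replay`, which returns the reducer counts reported on the slides and, for values not shown, reuses the count of the next smaller reported maxInput (an assumption made only for this demonstration).

```python
def smallest_max_input(lo, hi, r, regions_needed):
    """Smallest maxInput in (lo, hi] whose mapping fits into r reducers.

    Assumes lo is infeasible, hi is feasible, and regions_needed is
    non-increasing in maxInput.
    """
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if regions_needed(mid) <= r:
            hi = mid      # feasible: try a smaller input limit
        else:
            lo = mid      # too many regions: allow more input per reducer
    return hi


# Reducer counts reported on the slides, keyed by maxInput.
REPORTED = {5: 13, 7: 7, 8: 5, 12: 3, 20: 1}


def replay(max_input):
    """Stub for the heuristic: look up the nearest reported value at or below."""
    key = max(k for k in REPORTED if k <= max_input)
    return REPORTED[key]


if __name__ == "__main__":
    print("chosen maxInput:", smallest_max_input(5, 20, 6, replay))
```

Each probe costs one run of the heuristic, so the search needs only a logarithmic number of runs in |S| + |T|.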
![Page 75: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/75.jpg)
Performance
Skew Resistance of 1-Bucket-Theta

| Data set | Output size (billion) | 1-Bucket-Theta: Output Imbalance | 1-Bucket-Theta: Runtime (secs) | Standard Equi-Join: Output Imbalance | Standard Equi-Join: Runtime (secs) |
|---|---|---|---|---|---|
| Synth - 0 | 25.00 | 1.0030 | 657 | 1.0124 | 701 |
| Synth - 0.4 | 24.99 | 1.0023 | 650 | 1.2541 | 722 |
| Synth - 0.6 | 24.98 | 1.0033 | 676 | 1.7780 | 923 |
| Synth - 0.8 | 24.95 | 1.0068 | 678 | 3.0103 | 1482 |
| Synth - 1 | 24.91 | 1.0089 | 667 | 5.3124 | 2489 |

Rows are ordered from least to most skewed.
Where Output Imbalance = MRI / Ave.RI
![Page 77: Theta join (M-bucket-I algorithm explained)](https://reader031.fdocuments.in/reader031/viewer/2022021918/58ae44c31a28abad338b58cb/html5/thumbnails/77.jpg)
Performance
M-Bucket-I cost details (seconds), by number of buckets

| Step | 1 | 10 | 100 | 1000 | 10,000 | 100,000 | 1,000,000 |
|---|---|---|---|---|---|---|---|
| Quantiles | 0 | 115 | 120 | 117 | 122 | 124 | 122 |
| Histogram | 0 | 140 | 145 | 147 | 157 | 167 | 604 |
| Heuristic | 74.01 | 9.21 | 0.84 | 1.50 | 16.67 | 118.03 | 111.27 |
| Join | 49384 | 10905 | 1157 | 595 | 548 | 540 | 536 |
| Total | 49,458.01 | 11,169.21 | 1,422.84 | 860.5 | 843.67 | 949.03 | 1,373.27 |