Multi pass algorithms. Nested-Loop joins Tuple-Based Nested-loop Join Algorithm: FOR each tuple s in...
isabella-baker -
Multi pass algorithms
Nested-Loop joins
• Tuple-Based Nested-loop Join Algorithm:
FOR each tuple s in S DO
    FOR each tuple r in R DO
        IF r and s join to make a tuple t THEN
            output t
What’s the complexity?
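The loop above can be sketched in Python as follows. This is a minimal illustration, assuming R(X, Y) and S(Y, Z) are in-memory lists of tuples joined on Y; the function name and sample data are hypothetical. The inner comparison runs once per pair of tuples, so the work grows with T(R)·T(S), the product of the tuple counts.

```python
def tuple_nested_loop_join(R, S):
    """Join R(X, Y) with S(Y, Z) on Y by comparing every pair of tuples."""
    out = []
    for s in S:                 # outer loop over S
        for r in R:             # inner loop rescans all of R for each s
            if r[1] == s[0]:    # r and s agree on Y
                out.append((r[0], r[1], s[1]))  # the joined tuple t
    return out

# Hypothetical sample relations, not taken from the slides.
R = [("x1", 1), ("x2", 2), ("x3", 2)]
S = [(2, "z1"), (3, "z2")]
result = tuple_nested_loop_join(R, S)
```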
Block-based nested loops
• Assume B(S) ≤ B(R), and B(S) > M
• Read M-1 blocks of S into main memory and compare to all of R, block by block
FOR each chunk of M-1 blocks of S DO
    FOR each block b of R DO
        FOR each tuple t of b DO
            find the tuples of S in memory that join with t
            output the join of t with each of these tuples
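A rough Python sketch of the block-based loop, under a few assumptions that are not in the slides: a relation is modelled as a list of blocks, each block is a list of tuples, S has schema (Y, Z), R has schema (X, Y), and a dictionary plays the role of the in-memory search structure. The counter `io` only tallies blocks read, to make the cost visible.

```python
from collections import defaultdict

def block_nested_loop_join(S_blocks, R_blocks, M):
    """Join S(Y, Z) with R(X, Y) on Y; return (output tuples, blocks read)."""
    out, io = [], 0
    chunk_size = M - 1                        # one buffer is reserved for R
    for i in range(0, len(S_blocks), chunk_size):
        chunk = S_blocks[i:i + chunk_size]    # read up to M-1 blocks of S
        io += len(chunk)
        index = defaultdict(list)             # in-memory lookup on Y
        for block in chunk:
            for s in block:
                index[s[0]].append(s)
        for b in R_blocks:                    # scan all of R, block by block
            io += 1
            for t in b:
                for s in index.get(t[1], []):
                    out.append((t[0], t[1], s[1]))
    return out, io

# Hypothetical data: 4 blocks of S, 2 blocks of R, M = 3 buffers.
S_blocks = [[(1, "a")], [(2, "b")], [(3, "c")], [(4, "d")]]
R_blocks = [[("x", 1), ("y", 2)], [("z", 4)]]
joined, blocks_read = block_nested_loop_join(S_blocks, R_blocks, M=3)
```

With M = 3 the outer loop runs twice, and each iteration reads 2 blocks of S plus both blocks of R.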
Example
• B(R) = 1000, B(S) = 500, M = 101
• Outer loop iterates 5 times
• At each iteration we read M-1 (i.e. 100) blocks of S and all of R (i.e. 1000 blocks).
• Total time: 5*(100 + 1000) = 5500 I/O’s
• Question: What if we reversed the roles of R and S?
• We would iterate 10 times, and in each we would read 100+500 blocks, for a total of 10*(100+500) = 6000 I/O’s.
• Compare with one-pass join, if it could be done!
• We would need 1500 disk I/O’s if B(S) ≤ M-1
Analysis of block-based nested loops
• Number of disk I/O’s:
[B(S)/(M-1)] * (M-1 + B(R))
or
B(S) + B(S)B(R)/(M-1)
or approximately B(S)*B(R)/M
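The formula can be checked numerically with a small helper. This is a sketch that assumes the bracket denotes the number of outer-loop iterations, rounded up; the function name is hypothetical.

```python
import math

def block_nested_loop_io(b_outer, b_inner, M):
    """Disk I/O's when the outer relation is read in chunks of M-1 blocks."""
    iterations = math.ceil(b_outer / (M - 1))
    return iterations * ((M - 1) + b_inner)

cost_s_outer = block_nested_loop_io(500, 1000, 101)   # S in the outer loop
cost_r_outer = block_nested_loop_io(1000, 500, 101)   # roles reversed
approx = 500 * 1000 / 101                             # B(S)*B(R)/M
```

The smaller relation belongs in the outer loop: 5500 I/O's versus 6000 for the sizes used in the example. The approximation B(S)*B(R)/M gives roughly 4950, since it drops the B(S) term.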
Two-pass algorithms based on sorting
• This special case of multi-pass algorithms is sufficient for most relation sizes.
Main idea for unary operations on R
• Suppose B(R) > M (main memory size in blocks)
• First pass (repeated until R is exhausted):
– Read M blocks of R into main memory
– Sort the content of main memory
– Write the sorted result (sublist/run) into M blocks on disk.
• Second pass: create final result
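The first pass can be sketched as below. The sketch assumes a relation is a list of blocks (each block a list of comparable tuples) and returns the sorted sublists as Python lists instead of writing them to disk; `make_sorted_runs` is a hypothetical name.

```python
def make_sorted_runs(blocks, M):
    """Pass 1: sort R in memory-sized pieces and return the sorted sublists."""
    runs = []
    for start in range(0, len(blocks), M):
        # Load up to M blocks, the most that fit in main memory at once.
        in_memory = [t for block in blocks[start:start + M] for t in block]
        in_memory.sort()
        runs.append(in_memory)      # stands in for writing the run to disk
    return runs

# Hypothetical data: 5 blocks of 2 tuples each, M = 2 buffers.
blocks = [[5, 2], [9, 1], [7, 3], [4, 8], [6, 0]]
runs = make_sorted_runs(blocks, M=2)
```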
Duplicate elimination using sorting
• In the second phase (merging) we don’t sort but copy each tuple just once.
• We can do that because the identical tuples will show up “at the same time,” i.e. they will all be at the front of the buffers (for the sorted sublists).
• As usual, if one buffer gets empty we refill it.
Duplicate-Elimination using Sorting Example
• Assume M=3, each buffer holds 2 records and relation R consists of the following 17 tuples:
2, 5, 2, 1, 2, 2, 4, 5, 4, 3, 4, 2, 1, 5, 2, 1, 3
• After the first pass the following sorted sub-lists are created:
1, 2, 2, 2, 2, 5
2, 3, 4, 4, 4, 5
1, 1, 2, 3, 5
• In the second pass we dedicate a memory buffer to each sub-list.
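The second pass on these three sub-lists can be sketched with Python's standard `heapq.merge`, which repeatedly takes the smallest head among the inputs. This is an illustration only: the per-sublist buffers and their refilling are hidden inside the merge, and `merge_distinct` is a hypothetical name.

```python
import heapq

def merge_distinct(runs):
    """Pass 2: merge sorted sublists, copying each distinct tuple once."""
    out = []
    for t in heapq.merge(*runs):        # smallest remaining head first
        if not out or out[-1] != t:     # duplicates arrive consecutively
            out.append(t)
    return out

# The sorted sub-lists from the example above.
runs = [
    [1, 2, 2, 2, 2, 5],
    [2, 3, 4, 4, 4, 5],
    [1, 1, 2, 3, 5],
]
distinct = merge_distinct(runs)
```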
Analysis of δ(R)
• 2B(R) to create sorted sublists, B(R) to read each sublist in phase 2. Total: 3B(R)
• How large can R be?
– There can be no more than M sublists since we need one buffer for each one. So, B(R)/M ≤ M (B(R)/M is the number of sublists), i.e. B(R) ≤ M²
• To compute δ(R) we need at least sqrt(B(R)) blocks of main memory.
Sort-based ∪, ∩, −
Example: set union.
• Create sorted sublists of R and S
• Use input buffers for sorted sublists of R and S, one buffer per sublist.
• Output each tuple once.
– We can do that since all the identical tuples appear “at the same time.”
• Analysis: 3(B(R) + B(S)) disk I/O’s
• Condition: B(R) + B(S) ≤ M²
• Similar algorithms for sort-based intersection and difference (bag or set versions).
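One way to sketch the set versions of union, intersection and difference in a single merge is to tag each tuple with the relation it came from. The tagging scheme, the function name and the sample runs are assumptions made for illustration; the slides only describe the union case.

```python
import heapq
from itertools import groupby

def sort_based_set_op(r_runs, s_runs, op):
    """Merge sorted sublists of R and S and apply a set operation."""
    tagged = [[(t, "R") for t in run] for run in r_runs]
    tagged += [[(t, "S") for t in run] for run in s_runs]
    out = []
    # Equal tuples come out of the merge next to each other, so one group
    # per distinct value tells us which relations contain that value.
    for value, group in groupby(heapq.merge(*tagged), key=lambda p: p[0]):
        sources = {src for _, src in group}
        if op == "union":
            out.append(value)
        elif op == "intersection" and sources == {"R", "S"}:
            out.append(value)
        elif op == "difference" and sources == {"R"}:   # R minus S
            out.append(value)
    return out

# Hypothetical sorted sublists.
r_runs = [[1, 3, 5], [2, 3]]
s_runs = [[3, 4], [5, 6]]
union = sort_based_set_op(r_runs, s_runs, "union")
common = sort_based_set_op(r_runs, s_runs, "intersection")
only_r = sort_based_set_op(r_runs, s_runs, "difference")
```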
Join
• A problem for joins but not for the previous operators: The number of joining tuples from the two relations can exceed what fits in memory.
• A solution?
• Maximize the number of output buffers.
• Minimize the number of sorted sublists (since we need a buffer for each one of them).
Simple sort-based join
• For R(X,Y) ⋈ S(Y,Z) with M buffers of memory:
• Completely sort R on Y and S on Y
Merge phase
• Use 2 input buffers: 1 for R, 1 for S.
• Pick tuple t with smallest Y value in the buffer for R
• If t doesn’t match with the first tuple in the buffer for S, then just remove t.
• Otherwise, read all the tuples from R with the same Y value as t and put them in the M-2 part of the memory.
• When the input buffer for R is exhausted, refill it, as many times as needed.
• Then, read the tuples of S that match. For each one we produce the join of it with all the tuples of R in the M-2 part of the memory.
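The merge phase can be sketched as follows, assuming both relations are already fully sorted on Y and held as Python lists, so buffer refills are not modelled. Unlike the simplified bullets, the sketch also advances S when its head has the smaller Y value. The function name and sample data are hypothetical.

```python
def merge_join(R, S):
    """R: list of (x, y) sorted by y. S: list of (y, z) sorted by y."""
    out, i, j = [], 0, 0
    while i < len(R) and j < len(S):
        y_r, y_s = R[i][1], S[j][0]
        if y_r < y_s:
            i += 1                      # head of R has no match, remove it
        elif y_r > y_s:
            j += 1                      # head of S has no match, remove it
        else:
            # Collect every R tuple with this Y value (the "M-2 part").
            i_end = i
            while i_end < len(R) and R[i_end][1] == y_r:
                i_end += 1
            group = R[i:i_end]
            # Join each matching S tuple with the whole group.
            while j < len(S) and S[j][0] == y_r:
                for x, y in group:
                    out.append((x, y, S[j][1]))
                j += 1
            i = i_end
    return out

R = [("a", 1), ("b", 2), ("c", 2), ("d", 4)]
S = [(2, "p"), (2, "q"), (3, "r"), (4, "s")]
joined = merge_join(R, S)
```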
Example of sort join
• B(R) = 1000, B(S) = 500, M = 101
• To sort R, we need 4*B(R) I/O’s, same for S.
– Number of I/O’s = 4*(B(R) + B(S))
• Doing the join in the merge phase:
– Number of I/O’s = B(R) + B(S)
• Total disk I/O’s = 5*(B(R) + B(S)) = 7500
• Memory requirement?
• To be able to do the sort, we should have B(R) ≤ M² and B(S) ≤ M²
• Recall: for nested-loop join, we needed 5500 disk I/O’s, but its I/O cost was quadratic in the relation sizes (here it is linear), i.e., nested-loop join is not good for joining relations that are much larger than main memory.
Potential problem ...
R(X, Y)
-----------
x1  a
x2  a
…
xn  a

S(Y, Z)
---------
a  z1
a  z2
...
a  zm

What if the size of the n+1 tuples > M-1 and the size of the m+1 tuples > M-1?
• If the tuples from R (or S) with the same value y of Y do not fit in M-1 buffers, then we use all M-1 buffers to do a nested-loop join on the tuples with Y-value y from both relations.
• Observe that we can “smoothly” continue with the nested loop join when we see that the R tuples with Y-value y do not fit in M-1 buffers.
Can We Improve on Sort Join?
• Do we really need the fully sorted files?
• Suppose we are not worried about many common Y values
[Figure: R and S each split into sorted runs, with the question of whether the runs can be joined directly.]
A more efficient sort-based join
• Suppose we are not worried about many common Y values
• Create Y-sorted sublists of R and S
• Bring first block of each sublist into a buffer (assuming we have at most M sublists)
• Find smallest Y-value from heads of buffers. Join with other tuples in heads of buffers; use other available buffers if there are “many” tuples with the same Y value.
• Disk I/O: 3*(B(R) + B(S))
• Requirement: B(R) + B(S) ≤ M²
Example
• B(R) = 1000, B(S) = 500, M = 101
• Total of 15 sorted sublists
• If too many tuples join on a value y, use the remaining 86 main-memory buffers for a one-pass join on y
• Total cost: 3*(1000 + 500) = 4500 disk I/O’s
• M² = 10201 > B(R) + B(S), so the requirement is satisfied
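The numbers in this example can be reproduced with a short calculation. The helper below is a hypothetical sketch: it assumes each sorted sublist holds at most M blocks, and it also reports the 5*(B(R) + B(S)) cost of the simple sort-based join for comparison.

```python
import math

def sort_join_plan(b_r, b_s, M):
    """Sublist count, spare buffers and I/O cost for the two sort joins."""
    sublists = math.ceil(b_r / M) + math.ceil(b_s / M)
    return {
        "sublists": sublists,              # one input buffer per sublist
        "free_buffers": M - sublists,      # left for tuples sharing a Y value
        "simple_io": 5 * (b_r + b_s),      # full sort of both, then merge
        "efficient_io": 3 * (b_r + b_s),   # join directly from the sublists
        "fits": b_r + b_s <= M * M,        # B(R) + B(S) <= M^2
    }

plan = sort_join_plan(1000, 500, 101)
```

R contributes 10 sublists and S contributes 5, which leaves 86 of the 101 buffers free.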
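Summary of sort-based algorithms

| Operators | Approx. M required | Disk I/O |
|---|---|---|
| γ, δ | Sqrt( B ) | 3B |
| ∪, ∩, − | Sqrt( B(R) + B(S) ) | 3(B(R)+B(S)) |
| ⋈ (simple sort-based join) | Sqrt( max( B(R),B(S) ) ) | 5(B(R)+B(S)) |
| ⋈ (more efficient sort-based join) | Sqrt( B(R)+B(S) ) | 3(B(R)+B(S)) |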
Two-pass algorithms based on hashing
Main idea:
• Instead of sorted sublists, create partitions, based on hashing.
• Second pass creates result from partitions using one-pass algorithms.
Creating partitions
• Partitions (buckets) are created based on all attributes of the relation, except for grouping and join, where the partitions are based on the grouping and join attributes respectively.
• Why bucketize? Tuples with “matching” values end up in the same bucket.
Initialize M-1 buckets using M-1 empty buffers;
FOR each block b of relation R DO
    read block b into the M-th buffer;
    FOR each tuple t in b DO
        IF the buffer for bucket h(t) has no room for t THEN
            copy the buffer to disk;
            initialize a new empty block in that buffer;
        ENDIF;
        copy t to the buffer for bucket h(t);
    ENDFOR;
ENDFOR;
FOR each bucket DO
    IF the buffer for this bucket is not empty THEN
        write the buffer to disk;
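The partitioning pass can be sketched in Python as below. Several things are assumptions for illustration: blocks are lists of tuples, `block_capacity` is the number of tuples per block, Python's built-in `hash` stands in for h, and "disk" is a list of written blocks per bucket.

```python
def partition(blocks, M, block_capacity, key=lambda t: t):
    """Hash the tuples of a relation into M-1 buckets of blocks."""
    n = M - 1
    buffers = [[] for _ in range(n)]    # one in-memory buffer per bucket
    disk = [[] for _ in range(n)]       # disk[i] = blocks written for bucket i
    for b in blocks:                    # b occupies the M-th buffer
        for t in b:
            i = hash(key(t)) % n
            if len(buffers[i]) == block_capacity:   # no room for t
                disk[i].append(buffers[i])          # copy the buffer to disk
                buffers[i] = []                     # start a new empty block
            buffers[i].append(t)
    for i in range(n):
        if buffers[i]:                  # flush partially filled buffers
            disk[i].append(buffers[i])
    return disk

# Hypothetical data: integer tuples, M = 4 (so 3 buckets), 2 tuples per block.
blocks = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
buckets = partition(blocks, M=4, block_capacity=2)
```

The `key` parameter selects the attributes to hash on: the whole tuple for duplicate elimination, or the grouping or join attributes otherwise.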
Hash-based duplicate elimination
• Pass 1: create partitions by hashing on all attributes
• Pass 2: for each partition, use the one-pass method for duplicate elimination
• Cost: 3B(R) disk I/O’s
• Requirement: B(R) ≤ M*(M-1)
(B(R)/(M-1) is the approximate size of one bucket)
i.e. the req. is approximately B(R) ≤ M²
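A compact sketch of the two passes, assuming the relation is a flat list of hashable tuples and that each bucket fits in memory during pass 2. The function name is hypothetical, and the sample data reuses the 17 tuples from the sort-based example.

```python
def hash_distinct(tuples, M):
    """Two-pass duplicate elimination with M-1 hash buckets."""
    n = M - 1
    buckets = [[] for _ in range(n)]
    for t in tuples:                     # pass 1: hash on all attributes
        buckets[hash(t) % n].append(t)
    out = []
    for bucket in buckets:               # pass 2: one-pass method per bucket
        seen = set()
        for t in bucket:
            if t not in seen:
                seen.add(t)
                out.append(t)
    return out

R = [2, 5, 2, 1, 2, 2, 4, 5, 4, 3, 4, 2, 1, 5, 2, 1, 3]
distinct = hash_distinct(R, M=3)
```

Copies of the same tuple always hash to the same bucket, so removing duplicates inside each bucket is enough. Unlike the sort-based version, the output is not sorted.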
Hash-based grouping and aggregation
• Pass 1: create partitions by hashing on grouping attributes
• Pass 2: for each partition, use one-pass method.
• Cost: 3B(R), Requirement: B(R) ≤ M²
• More exactly the requirement is:
B(δ(π_L(R))) / (M-1) ≤ M
Hash-based set union
• Pass 1: create partitions R_1,…,R_{M-1} of R, and S_1,…,S_{M-1} of S (with the same hash function)
• Pass 2: for each pair R_i, S_i compute R_i ∪ S_i using the one-pass method.
• Cost: 3(B(R) + B(S))
• Requirement: min(B(R),B(S)) ≤ M²
• Similar algorithms for intersection and difference (set and bag versions)
Partition hash-join
• Pass 1: create partitions R_1,…,R_{M-1} of R, and S_1,…,S_{M-1} of S, based on the join attributes (the same hash function for both R and S)
• Pass 2: for each pair R_i, S_i compute R_i ⋈ S_i using the one-pass method.
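Both passes can be sketched as follows, assuming R(X, Y) and S(Y, Z) are in-memory lists joined on Y and that Python's built-in `hash` plays the role of the shared hash function. In pass 2 the sketch builds a lookup table on the R partition and probes it with the S partition, which is one possible choice of one-pass method.

```python
from collections import defaultdict

def partition_hash_join(R, S, M):
    """Join R(X, Y) with S(Y, Z) on Y using M-1 hash partitions."""
    n = M - 1
    r_parts = [[] for _ in range(n)]
    s_parts = [[] for _ in range(n)]
    for r in R:                          # pass 1: same hash function on Y
        r_parts[hash(r[1]) % n].append(r)
    for s in S:
        s_parts[hash(s[0]) % n].append(s)
    out = []
    for r_i, s_i in zip(r_parts, s_parts):   # pass 2: join matching pairs
        table = defaultdict(list)            # in-memory table on R_i
        for r in r_i:
            table[r[1]].append(r)
        for s in s_i:                        # probe with tuples of S_i
            for x, y in table.get(s[0], []):
                out.append((x, y, s[1]))
    return out

R = [("a", 1), ("b", 2), ("c", 2), ("d", 4)]
S = [(2, "p"), (4, "s"), (5, "t"), (2, "q")]
joined = partition_hash_join(R, S, M=3)
```

Tuples with equal Y values land in partitions with the same index, so no pair R_i, S_j with i ≠ j needs to be compared.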
Example
• B(R) = 1000 blocks
• B(S) = 500 blocks
• Memory available = 101 blocks
• R ⋈ S on common attribute C
• Use 100 buckets
– Read R
– Hash
– Write buckets
[Figure: R hashed into 100 buckets of 10 blocks each. Same for S.]
• Read one R bucket
• Build memory hash table
• Read corresponding S bucket block by block.
[Figure: buckets of R and S on disk, with one R bucket loaded into memory.]
Cost
• “Bucketize:”
– Read + write R
– Read + write S
• Join
– Read R
– Read S
Total cost = 3*[1000+500] = 4500
In general:
Cost: 3(B(R) + B(S))
Req.: min(B(R),B(S)) ≤ M²
Summary of hash-based methods

| Operators | Approx. M required | Disk I/O |
|---|---|---|
| γ, δ | Sqrt(B) | 3B |
| ∪, ∩, − | Sqrt(B(S)) | 3(B(R)+B(S)) |
| ⋈ | Sqrt(B(S)) | 3(B(R)+B(S)) |
Sort vs. Hash based algorithms
• Hash-based algorithms have a size requirement that depends only on the smaller of the two arguments rather than on the sum of the argument sizes, as for sort-based algorithms.
• Sort-based algorithms allow us to produce the result in sorted order and take advantage of that sort later. The result can be used in another sort-based algorithm later.
• Hash-based algorithms depend on the buckets being of nearly equal size. What about a join with very few distinct values for the join attribute? Then the tuples pile up in a few buckets.