CS186 Week 0 Out of Core Algorithms. Today External Merge Sort External Hashing.
anna-hodge -
CS186 Week 0
Out of Core Algorithms
Today
• External Merge Sort
• External Hashing
Sorting
• Goal: minimize number of I/Os (especially “random” I/Os)
• Classic interview question: how to sort if data don’t fit in memory?
But first, what is a sorted run?
(name = Bob, sid = 1) (name = Jill, sid = 2) (name = Sam, sid = 3)
(name = Sue, sid = 6) (name = Kev, sid = 8) (name = Jack, sid = 9)
(name = Joe, sid = 10) (name = Sid, sid = 12) (name = Sal, sid = 15)

(name = Bit, sid = 1) (name = Bat, sid = 2) (name = Tam, sid = 3)
(name = Foo, sid = 6) (name = Bar, sid = 8) (name = Bam, sid = 9)
(name = Ke, sid = 10) (name = Kay, sid = 12) (name = Al, sid = 15)
A sorted subset of a table.
Another common interview question: How to sort a bunch of sorted sublists into one list?
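One common answer to the question above is a k-way merge driven by a min-heap. The sketch below is a minimal illustration rather than a reference solution; the function name `merge_runs` and the `(sid, name)` tuple layout are assumptions chosen to match the example records.

```python
import heapq

def merge_runs(runs):
    """Merge already-sorted runs of (sid, name) tuples into one sorted list."""
    # Heap entries are (sid, run_index, position). The run index breaks ties
    # between equal sids, so names are never compared.
    heap = [(run[0][0], i, 0) for i, run in enumerate(runs) if run]
    heapq.heapify(heap)
    merged = []
    while heap:
        _, i, pos = heapq.heappop(heap)
        merged.append(runs[i][pos])
        if pos + 1 < len(runs[i]):
            # Advance within the run we just consumed from.
            heapq.heappush(heap, (runs[i][pos + 1][0], i, pos + 1))
    return merged

run_a = [(1, "Bob"), (2, "Jill"), (3, "Sam")]
run_b = [(1, "Bit"), (2, "Bat"), (3, "Tam")]
merged = merge_runs([run_a, run_b])
```

Only the current head of each run needs to be in memory at any time, which is what makes the same idea work when the runs live on disk.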
Sorting: 2-Way
• Pass 0 (conquer):
– read a page, sort it, write it.
– only one buffer page is used
– a repeated “batch job”
[Figure: INPUT → sort in a single I/O buffer in RAM → OUTPUT]
• Pass 1, 2, 3, …, etc. (merge):
– requires 3 buffer pages
• note: this has nothing to do with double buffering!
– merge pairs of runs into runs twice as long
– a streaming algorithm, as in the previous slide!
[Figure: INPUT 1 and INPUT 2 buffers in RAM, merged into one OUTPUT buffer]
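The merge pass can be sketched as below. This is a simplified in-memory simulation, assuming each run is a list of pages and each page is a short list of keys; `merge_two_runs` and `page_size` are illustrative names, not part of the slides. At most one page per input run and one output page are held at a time, mirroring the 3 buffer pages.

```python
def merge_two_runs(run1, run2, page_size=2):
    """Merge two sorted runs (lists of pages) while holding only 3 pages."""
    pages1, pages2 = iter(run1), iter(run2)
    in1 = list(next(pages1, []))   # INPUT 1 buffer
    in2 = list(next(pages2, []))   # INPUT 2 buffer
    out, merged = [], []           # OUTPUT buffer, and the run "written to disk"
    while in1 or in2:
        # Move the smaller head record into the output buffer.
        if in1 and (not in2 or in1[0] <= in2[0]):
            out.append(in1.pop(0))
            if not in1:
                in1 = list(next(pages1, []))  # refill INPUT 1 from its run
        else:
            out.append(in2.pop(0))
            if not in2:
                in2 = list(next(pages2, []))  # refill INPUT 2 from its run
        if len(out) == page_size:
            merged.append(out)     # output page is full: write it out
            out = []
    if out:
        merged.append(out)         # flush the final partial page
    return merged
```

For example, merging the runs `[[2, 3], [4, 6]]` and `[[4, 7], [8, 9]]` yields a single run twice as long.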
Two-Way External Merge Sort
• Conquer and Merge: sort subfiles and merge
• Each pass we read + write each page in file.
• N pages in the file. So, the number of passes is: $1 + \lceil \log_2 N \rceil$
• So total cost is: $2N \, (1 + \lceil \log_2 N \rceil)$
• Why 2N * num passes ?
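[Figure: two-way external merge sort of a 7-page file; "|" separates runs]
Input file:            3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2
PASS 0 (1-page runs):  3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2
PASS 1 (2-page runs):  2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2
PASS 2 (4-page runs):  2,3 4,4 6,7 8,9 | 1,2 3,5 6
PASS 3 (8-page runs):  1,2 2,3 3,4 4,5 6,6 7,8 9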
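The example above sorts N = 7 pages in four passes (PASS 0 through PASS 3). A small sketch, assuming the usual count of one sorting pass plus $\lceil \log_2 N \rceil$ merge passes, reproduces that number and the resulting I/O cost. The function names are illustrative.

```python
def two_way_passes(n_pages):
    """One sorting pass plus ceil(log2(N)) merge passes, for N >= 1."""
    # (N - 1).bit_length() equals ceil(log2(N)) for positive integers N.
    return 1 + (n_pages - 1).bit_length()

def two_way_cost(n_pages):
    """Each pass reads and writes every page once: 2N I/Os per pass."""
    return 2 * n_pages * two_way_passes(n_pages)

passes = two_way_passes(7)   # the 7-page example file
cost = two_way_cost(7)
```

Integer arithmetic is used instead of `math.log2` to avoid floating-point rounding at exact powers of two.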
Merging Runs
General External Merge Sort
• More than 3 buffer pages. How can we utilize them?
• To sort a file with N pages using B buffer pages:
– Pass 0: use B buffer pages. Produce sorted runs of B pages each.
– Pass 1, 2, …, etc.: merge B-1 runs.
[Figure: runs on disk feed B-1 input buffers in RAM (INPUT 1, INPUT 2, . . ., INPUT B-1), which merge into one OUTPUT buffer written back to disk]
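The two kinds of pass can be sketched as an in-memory simulation. This is a minimal illustration under simplifying assumptions: pages are Python lists, "disk" is just a list of runs, and `heapq.merge` stands in for the B-1 input buffers and one output buffer. `external_merge_sort` is a hypothetical name.

```python
import heapq

def external_merge_sort(pages, num_buffers):
    """Simulate external merge sort; returns (sorted_records, number_of_passes)."""
    if num_buffers < 3:
        raise ValueError("need at least 2 input buffers and 1 output buffer")
    # Pass 0: load B pages at a time, sort them in memory, emit one run each.
    runs = []
    for start in range(0, len(pages), num_buffers):
        chunk = [rec for page in pages[start:start + num_buffers] for rec in page]
        runs.append(sorted(chunk))
    passes = 1
    # Pass 1, 2, ...: merge B-1 runs at a time until a single run remains.
    fan_in = num_buffers - 1
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + fan_in]))
                for i in range(0, len(runs), fan_in)]
        passes += 1
    return (runs[0] if runs else []), passes

pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
records, passes_b4 = external_merge_sort(pages, num_buffers=4)
_, passes_b3 = external_merge_sort(pages, num_buffers=3)
```

With B = 4 the 7-page file needs only two passes, while B = 3 needs three, because both the run length after Pass 0 and the merge fan-in grow with B.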
Cost of External Merge Sort
• Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$
• Cost = 2N * (# of passes)
– Why?
• How big of a table can we sort in two passes?
– Each “sorted run” after Pass 0 is of size B
– Can merge up to B-1 sorted runs in Pass 1
• Answer: B(B-1).
– Sort N pages of data in about $\sqrt{N}$ space
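The pass count and the B(B-1) limit can be checked with a few lines of integer arithmetic. This is a sketch with illustrative function names; it counts runs pass by pass instead of evaluating a logarithm.

```python
def num_passes(n_pages, num_buffers):
    """Passes needed to sort N pages with B buffer pages (B >= 3)."""
    if num_buffers < 3:
        raise ValueError("need at least 3 buffer pages")
    runs = -(-n_pages // num_buffers)            # ceil(N / B) runs after Pass 0
    passes = 1
    while runs > 1:
        runs = -(-runs // (num_buffers - 1))     # each merge pass combines B-1 runs
        passes += 1
    return passes

def sort_cost(n_pages, num_buffers):
    """Every pass reads and writes all N pages."""
    return 2 * n_pages * num_passes(n_pages, num_buffers)

B = 5
limit = B * (B - 1)                     # largest file sortable in two passes
at_limit = num_passes(limit, B)
over_limit = num_passes(limit + 1, B)
```

One page past B(B-1) forces a third pass, since Pass 0 then produces B runs and only B-1 can be merged at once.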
HASHING
Cats by fur color (Hashing)
Black cats…
Grey cats…
Orange cats…
White cats…
Zorro cats…
Hashing: How To
• Goal: Group kitties by fur color so we can d’aww them.
• Setup: 12 kitties, 2 can fit per page. We have 8 kitties worth of memory.
• N = 6 pages (12 kitties, 2 per page)
• B = 4 pages (8 kitties worth of memory, 2 per page)
Hashing: How To
• N = 6, B = 4
• Step 1: Partition
– Create partitions on disk such that all kitties of a particular fur color are guaranteed to be within the same partition. (Though there may not be a whole partition for each color.)
How to assign cats to partitions? Hashing!
What does that mean? Map each fur color to a bucket: {B, G, O, W, Z} -> {1, 2, 3}
What hash function? Let’s say we’ll map each color to whichever THIRD of the alphabet the first letter lies in: {B, G} -> 1; {O} -> 2; {W, Z} -> 3.
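The partitioning step with this hash function can be sketched as follows. The exact cut points between thirds are an assumption (A-I, J-R, S-Z), and the sketch collects partitions in memory instead of streaming them through output buffers to disk; `alphabet_third` and `partition` are illustrative names.

```python
from collections import defaultdict

def alphabet_third(color):
    """Hash a fur color to 1, 2, or 3 by the third of the alphabet of its first letter."""
    index = ord(color[0].upper()) - ord("A")   # A -> 0, ..., Z -> 25
    return index * 3 // 26 + 1                 # assumed split: A-I, J-R, S-Z

def partition(cats):
    """Route every cat to the partition chosen by the hash of its color."""
    partitions = defaultdict(list)
    for color in cats:
        partitions[alphabet_third(color)].append(color)
    return dict(partitions)

cats = ["Black", "Grey", "Orange", "White", "Zorro", "Black"]
parts = partition(cats)
```

All cats of one color land in the same partition, but a partition may still mix colors (Black and Grey share partition 1), which is why a second step is needed.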
Hashing: How To
• N = 6, B = 4
• Step 1: Partition
• Step 2: Re-Hash
– Create an in-memory hash table for each partition
[Figure: in-memory hash table for one partition, with entries Grey -> … and Black -> …]
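The re-hash step can be sketched with a Python `dict`, which is itself an in-memory hash table. This is a minimal illustration assuming each partition fits in memory and holds `(color, name)` pairs; the cat names and the function name `rehash` are made up for the example.

```python
def rehash(partition):
    """Group one partition's cats by exact fur color using an in-memory hash table."""
    table = {}
    for color, name in partition:
        table.setdefault(color, []).append(name)
    return table

# Hypothetical contents of partition 1, which mixes Grey and Black cats.
partition_1 = [("Grey", "Tom"), ("Black", "Salem"), ("Grey", "Smokey")]
groups = rehash(partition_1)
```

After this step every color has its own group, even though the partitioning hash function put several colors in the same partition.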
Two Phases
• Partition: (Divide)
[Figure: Original Relation on disk → one INPUT buffer → hash function hp → B-1 OUTPUT buffers (B main memory buffers in total) → Partitions 1, 2, . . ., B-1 on disk]
• Rehash: (Conquer)
[Figure: Partitions on disk → hash table for partition Ri (k <= B pages), built with hash fn hr in B main memory buffers → Result]
Memory Requirement
• How big of a table can we hash in two passes?
– B-1 “partitions” result from Pass 1
– Each should be no more than B pages in size
– Answer: B(B-1).
• We can hash a table of size N pages in about $\sqrt{N}$ space
– Note: assumes hash function distributes records evenly!
• Have a bigger table? Recursive partitioning!
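The two-pass bound can be written out as a short derivation. It assumes, as noted above, that the hash function spreads records evenly across partitions.

```latex
\begin{align*}
N &\le \underbrace{(B-1)}_{\text{partitions}} \cdot \underbrace{B}_{\text{pages per partition}} = B(B-1) < B^2 \\
\Rightarrow\quad B &> \sqrt{N}
\end{align*}
```

For the kitty example, B = 4 gives B(B-1) = 12 pages, which comfortably covers the N = 6 pages of cats, so two passes suffice.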
Cost of External Hashing
• Divide, then Conquer
• cost = 4*N I/Os (two phases, each reading and writing all N pages)
Cost of External Sorting
• Conquer, then Merge
Summary
• Sort/Hash Duality
– Hashing is Divide & Conquer
– Sorting is Conquer & Merge
• Sorting is overkill for rendezvous
– But sometimes a win anyhow
• Sorting is sensitive to the internal sort algorithm
– QuickSort vs. HeapSort
– In practice, QuickSort tends to win
• Don’t forget double buffering