CS186 Week 0 Out of Core Algorithms. Today External Merge Sort External Hashing.


CS186 Week 0

Out of Core Algorithms

Today

• External Merge Sort
• External Hashing

Sorting

• Goal: minimize number of I/Os (especially “random” I/Os)

• Classic interview question: how to sort if data don’t fit in memory?

But first, what is a sorted run?

A sorted subset of a table.

Two example runs, each sorted by sid:

(name = Bob, sid = 1) (name = Jill, sid = 2) (name = Sam, sid = 3)
(name = Sue, sid = 6) (name = Kev, sid = 8) (name = Jack, sid = 9)
(name = Joe, sid = 10) (name = Sid, sid = 12) (name = Sal, sid = 15)

(name = Bit, sid = 1) (name = Bat, sid = 2) (name = Tam, sid = 3)
(name = Foo, sid = 6) (name = Bar, sid = 8) (name = Bam, sid = 9)
(name = Ke, sid = 10) (name = Kay, sid = 12) (name = Al, sid = 15)

Another common interview question: How to sort a bunch of sorted sublists into one list?
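One common answer to this question is a k-way merge driven by a min-heap. The sketch below is only an illustration, not something from the slides: the function name `merge_sorted_lists` and the sample lists are made up, and everything is assumed to fit in memory.

```python
import heapq

def merge_sorted_lists(sorted_lists):
    """Merge already-sorted lists into one sorted list using a min-heap."""
    heap = []
    for list_idx, lst in enumerate(sorted_lists):
        if lst:
            # Heap entries: (value, which list it came from, position in that list).
            heapq.heappush(heap, (lst[0], list_idx, 0))
    merged = []
    while heap:
        value, list_idx, pos = heapq.heappop(heap)
        merged.append(value)
        # Refill the heap from the list we just consumed from, if it has more.
        if pos + 1 < len(sorted_lists[list_idx]):
            nxt = sorted_lists[list_idx][pos + 1]
            heapq.heappush(heap, (nxt, list_idx, pos + 1))
    return merged

print(merge_sorted_lists([[1, 4, 9], [2, 3, 10], [5]]))
```

The heap never holds more than one entry per sublist, which is the property that lets the same idea work with one buffer page per run when the sublists live on disk.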

Sorting: 2-Way

[Figure: a single I/O buffer page in RAM; each page is read from INPUT, sorted, and written to OUTPUT]

• Pass 0 (conquer):
– read a page, sort it, write it.
– only one buffer page is used
– a repeated “batch job”

• Pass 1, 2, 3, …, etc. (merge):
– requires 3 buffer pages
  • note: this has nothing to do with double buffering!
– merge pairs of runs into runs twice as long
– a streaming algorithm, as in the previous slide!

[Figure: two input buffers (INPUT 1, INPUT 2) and one OUTPUT buffer in RAM]
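The merge step with three buffer pages can be sketched as below. This is a simulation under assumptions, not the course's code: a run is modeled as a Python list of pages, `merge_two_runs` and `page_size` are hypothetical names, and records are assumed never to be `None`.

```python
def merge_two_runs(run1, run2, page_size):
    """Merge two sorted runs (lists of pages), yielding output pages.

    Only one page of each input run and one output page are "in RAM" at a time.
    """
    def records(run):
        for page in run:        # simulated page read into an INPUT buffer
            yield from page

    it1, it2 = records(run1), records(run2)
    a, b = next(it1, None), next(it2, None)
    out = []                    # the single OUTPUT buffer
    while a is not None or b is not None:
        # Take from run 1 if run 2 is exhausted or run 1 has the smaller head.
        if b is None or (a is not None and a <= b):
            out.append(a)
            a = next(it1, None)
        else:
            out.append(b)
            b = next(it2, None)
        if len(out) == page_size:
            yield out           # simulated page write; buffer is reused
            out = []
    if out:
        yield out               # flush a final, partially filled page

print(list(merge_two_runs([[3, 4]], [[2, 6]], page_size=2)))
```

Because the function is a generator, output pages are produced as soon as the output buffer fills, which mirrors the streaming behavior described in the bullets.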

Two-Way External Merge Sort

• Conquer and Merge: sort subfiles and merge

• Each pass we read + write each page in file.

• N pages in the file. So, the number of passes is: $\lceil \log_2 N \rceil + 1$

• So total cost is: $2N \left( \lceil \log_2 N \rceil + 1 \right)$

• Why 2N * num passes?

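Input file (N = 7 pages, 2 records per page; "|" separates runs):
3,4 | 6,2 | 9,4 | 8,7 | 5,6 | 3,1 | 2

PASS 0 (1-page runs):
3,4 | 2,6 | 4,9 | 7,8 | 5,6 | 1,3 | 2

PASS 1 (2-page runs):
2,3 4,6 | 4,7 8,9 | 1,3 5,6 | 2

PASS 2 (4-page runs):
2,3 4,4 6,7 8,9 | 1,2 3,5 6

PASS 3 (8-page runs):
1,2 2,3 3,4 4,5 6,6 7,8 9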
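As a worked check of this trace, assume the standard two-way formulas: the number of passes is $\lceil \log_2 N \rceil + 1$ and each pass reads and writes every page. With the N = 7 pages of the example:

```latex
\begin{align*}
\text{passes} &= \lceil \log_2 7 \rceil + 1 = 3 + 1 = 4 \\
\text{cost}   &= 2N \cdot \text{passes} = 2 \cdot 7 \cdot 4 = 56 \text{ page I/Os}
\end{align*}
```

The four passes match PASS 0 through PASS 3 in the trace.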

General External Merge Sort

• More than 3 buffer pages. How can we utilize them?
• To sort a file with N pages using B buffer pages:
– Pass 0: use B buffer pages. Produce sorted runs of B pages each.
– Pass 1, 2, …, etc.: merge B-1 runs.

[Figure: Merging Runs. B-1 input buffers (INPUT 1, INPUT 2, . . ., INPUT B-1) and one OUTPUT buffer in RAM; runs are read from and written to Disk]
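The two kinds of passes can be simulated in a few lines. This is a rough in-memory sketch, not a disk-based implementation: `external_merge_sort` is a hypothetical name, the file is modeled as a list of pages, and `heapq.merge` from the Python standard library stands in for the B-1 way merge.

```python
import heapq

def external_merge_sort(pages, B):
    """Simulate external merge sort with B buffer pages.

    `pages` is a non-empty list of pages (lists of keys).
    Returns (sorted_keys, number_of_passes).
    """
    if B < 3:
        raise ValueError("need at least 3 buffer pages")
    # Pass 0: load B pages at a time, sort them in memory, emit one run each.
    runs = []
    for i in range(0, len(pages), B):
        chunk = [key for page in pages[i:i + B] for key in page]
        runs.append(sorted(chunk))
    passes = 1
    # Later passes: B-1 input buffers plus 1 output buffer,
    # so up to B-1 runs are merged at a time.
    while len(runs) > 1:
        runs = [list(heapq.merge(*runs[i:i + B - 1]))
                for i in range(0, len(runs), B - 1)]
        passes += 1
    return runs[0], passes

pages = [[3, 4], [6, 2], [9, 4], [8, 7], [5, 6], [3, 1], [2]]
print(external_merge_sort(pages, B=4))
```

With the 7-page file from the two-way example and B = 4, pass 0 produces two runs and a single merge pass finishes the sort, so the simulation reports 2 passes instead of 4.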

Cost of External Merge Sort

• Number of passes: $1 + \lceil \log_{B-1} \lceil N/B \rceil \rceil$
• Cost = 2N * (# of passes)
– Why?

• How big of a table can we sort in two passes?
– Each “sorted run” after Pass 0 is of size B
– Can merge up to B-1 sorted runs in Pass 1

• Answer: B(B-1).
– Sort N pages of data in about sqrt(N) space
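The pass count and cost can also be computed by counting runs directly, which avoids floating-point logarithms. The function names below are hypothetical, and the values N = 108 and B = 5 are only an illustrative choice.

```python
def num_passes(N, B):
    """Passes needed to sort N pages with B buffer pages (B >= 3)."""
    runs = -(-N // B)                  # ceil(N / B) sorted runs after pass 0
    passes = 1
    while runs > 1:
        runs = -(-runs // (B - 1))     # each merge pass combines B-1 runs
        passes += 1
    return passes

def sort_cost(N, B):
    """Total page I/Os: every pass reads and writes all N pages."""
    return 2 * N * num_passes(N, B)

def max_two_pass_pages(B):
    """Largest file sortable in two passes: B-1 runs of B pages each."""
    return B * (B - 1)

print(num_passes(108, 5), sort_cost(108, 5))   # 4 passes, 864 I/Os
print(max_two_pass_pages(5))                   # 20 pages
```

For B = 5, a 20-page file still sorts in two passes, while a 21-page file needs a third, which is the B(B-1) boundary from the slide.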

HASHING

Cats by fur color (Hashing)

Black cats…

Grey cats…

Orange cats…

White cats…

Zorro cats…

Hashing: How To

• Goal: Group kitties by fur color so we can d’aww them.

• Setup: 12 kitties, 2 can fit per page. We have 8 kitties worth of memory.

• N = 6 pages (12 kitties / 2 per page)
• B = 4 pages (8 kitties' worth of memory / 2 per page)

Hashing: How To

• N = 6, B = 4
• Step 1: Partition

– Create partitions on disk such that all kitties of a particular fur color are guaranteed to be within the same partition. (Though there may not be a whole partition for each color.)

How to assign cats to partitions? Hashing!

What does that mean? Map each fur color to a bucket: {B, G, O, W, Z} -> {1, 2, 3}

What hash function? Let’s say we’ll map each color to whichever THIRD of the alphabet the first letter lies in: {B, G} -> 1; {O} -> 2; {W, Z} -> 3.
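The "third of the alphabet" hash function can be written out as follows. The slide does not say where the thirds begin and end, so the split used here (A-I, J-R, S-Z) is an assumption, as is the name `fur_color_bucket`; it does reproduce the mapping given above.

```python
def fur_color_bucket(color):
    """Map a color name to bucket 1, 2, or 3 by the third of the
    alphabet its first letter falls in (assumed split: A-I, J-R, S-Z)."""
    letter_index = ord(color[0].upper()) - ord("A")   # A -> 0, ..., Z -> 25
    return letter_index * 3 // 26 + 1

for color in ["Black", "Grey", "Orange", "White", "Zorro"]:
    print(color, "->", fur_color_bucket(color))
```

All cats of one color land in the same bucket, but a bucket may hold several colors (Black and Grey share bucket 1), which is why a second step is still needed after partitioning.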

Hashing: How To

• N = 6, B = 4
• Step 1: Partition
• Step 2: Re-Hash
– Create in-memory hash table for each partition

[Figure: in-memory hash table for one partition, with entries Grey -> and Black -> pointing to the matching cats]

Two Phases

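• Partition: (Divide)

[Figure: the Original Relation is read from Disk through one INPUT buffer; hash function hp routes each record to one of B-1 OUTPUT buffers (B main memory buffers in total), which are written to Partitions 1, 2, . . ., B-1 on Disk]

• Rehash: (Conquer)

[Figure: each partition Ri (k <= B pages) is read from Disk into B main memory buffers, where hash function hr builds the hash table for that partition; the Result is written out]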
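Both phases can be sketched together as an in-memory simulation. Everything named here is hypothetical (`external_hash`, `thirds`, the list of cats), partitions are plain Python lists rather than disk files, and Python's built-in dict hashing stands in for the second hash function hr.

```python
from collections import defaultdict

def external_hash(records, key, B, partition_hash):
    """Group records by key in two phases, simulating B buffer pages."""
    # Phase 1 (partition / divide): one input buffer and B-1 output
    # buffers, so each record is routed to one of B-1 partitions.
    partitions = defaultdict(list)
    for rec in records:
        partitions[partition_hash(key(rec)) % (B - 1)].append(rec)

    # Phase 2 (rehash / conquer): each partition is assumed to fit in
    # memory, so build a hash table for it and emit its groups.
    groups = []
    for part in partitions.values():
        table = defaultdict(list)
        for rec in part:
            table[key(rec)].append(rec)
        groups.extend(table.values())
    return groups

def thirds(color):
    # Assumed split of the alphabet into thirds: A-I, J-R, S-Z -> 0, 1, 2.
    return (ord(color[0].upper()) - ord("A")) * 3 // 26

cats = ["Black", "Grey", "Orange", "White", "Zorro", "Black",
        "Grey", "Orange", "White", "Black", "Zorro", "Grey"]
for group in external_hash(cats, key=lambda c: c, B=4, partition_hash=thirds):
    print(group)
```

With B = 4 there are three partitions, matching the three buckets of the kitty example, and each printed group contains cats of a single color.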

Memory Requirement

• How big of a table can we hash in two passes?
– B-1 “partitions” result from Pass 1
– Each should be no more than B pages in size
– Answer: B(B-1).

• We can hash a table of size N pages in about $\sqrt{N}$ space

– Note: assumes hash function distributes records evenly!

• Have a bigger table? Recursive partitioning!
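The step from the two-pass bound B(B-1) to "about square root of N" buffer pages can be spelled out as below, together with a check on the kitty example (N = 6, B = 4). This is only a sketch of the arithmetic and relies on the even-distribution assumption noted above.

```latex
\begin{align*}
N \le B(B-1) \approx B^2 \quad &\Longrightarrow \quad B \gtrsim \sqrt{N} \\
N = 6,\; B = 4: \quad B(B-1) = 4 \cdot 3 &= 12 \ge 6
\end{align*}
```

So the six pages of kitties fit comfortably within what four buffer pages can hash in two passes.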

Cost of External Hashing

cost = 4*N I/Os

Cost of External Sorting

[Figure: side-by-side phase diagrams. Hashing: Divide, then Conquer. Sorting: Conquer, then Merge.]

Summary

• Sort/Hash Duality
– Hashing is Divide & Conquer
– Sorting is Conquer & Merge

• Sorting is overkill for rendezvous
– But sometimes a win anyhow

• Sorting sensitive to internal sort alg
– QuickSort vs. HeapSort
– In practice, QuickSort tends to win

• Don’t forget double buffering
