CSE 326: Hashing. David Kaplan, Dept of Computer Science & Engineering, Autumn 2001.
CSE 326: Hashing
David Kaplan
Dept of Computer Science & Engineering
Autumn 2001
Reminder: Dictionary ADT
Dictionary operations: create, destroy, insert, find, delete
Stores values associated with user-specified keys
- values may be any (homogeneous) type
- keys may be any (homogeneous) comparable type
Sample entries: Adrien: Roller-blade demon; Hannah: C++ guru; Dave: Older than dirt; Donald: l33t haxtor
Example: insert(Donald, "l33t haxtor"); find(Adrien) returns "Roller-blade demon"
Dictionary Implementations So Far

                          Insert    Find      Delete
  Unsorted list           O(1)      O(n)      O(n)
  Trees                   O(log n)  O(log n)  O(log n)
  Sorted array            O(n)      O(log n)  O(n)
  Array, special case of
  known keys {1, ..., K}  O(1)      O(1)      O(1)
ADT Legalities: A Digression on Keys
Methods are the contract between an ADT and the outside agent (client code)
- Ex: Dictionary contract is {insert, find, delete}
- Ex: Priority Q contract is {insert, deleteMin}
Keys are the currency used in transactions between an outside agent and an ADT
- Ex: insert(key), find(key), delete(key)
So ... how about O(1) insert/find/delete for any key type?
Hash Table Goal: Key as Index
We can access a record as a[5]; we want to access a record as a["Hannah"]
[Figure: an array indexed by integers (a[2] holds Adrien, roller-blade demon; a[5] holds Hannah, C++ guru) next to an array indexed directly by the keys a["Adrien"], a["Hannah"]]
Hash Table Approach
[Figure: keys Hannah, Dave, Adrien, Donald, Ed are mapped by a function f(x) into slots of a table]
But... is there a problem with this pipe-dream?
Hash Table Dictionary Data Structure
Hash function: maps keys to integers
- Result: can quickly find the right spot for a given entry
Unordered and sparse table
- Result: cannot efficiently list all entries
- Result: cannot efficiently find min, max, or ordered ranges
Hash Table Taxonomy
[Figure: keys (Hannah, Dave, Adrien, Donald, Ed) pass through the hash function f(x) into the table; two keys landing in the same slot is a collision]
load factor λ = (# of entries in table) / tableSize
Agenda: Hash Table Design Decisions
- What should the hash function be?
- What should the table size be?
- How should we resolve collisions?
Hash Function
A hash function maps a key to a table index:

    Value & find(Key & key) {
        int index = hash(key) % tableSize;
        return Table[index];
    }
What Makes a Good Hash Function?
- Fast: O(1) runtime, and fast in practical terms
- Distributes the data evenly: ideally hash(a) % size ≠ hash(b) % size for a ≠ b
- Uses the whole hash table: for all 0 ≤ i < size, there is some k such that hash(k) % size = i
Good Hash Function for Integer Keys
Choose tableSize prime; hash(n) = n
Example: tableSize = 7
insert(4), insert(17), find(12), insert(9), delete(17)
[Figure: the operations land in slots n % 7: 4 in slot 4, 17 in slot 3, 12 in slot 5, 9 in slot 2]
Good Hash Function for Strings?
Let s = s1 s2 s3 s4 ... sn. Choose
hash(s) = s1 + s2*128 + s3*128^2 + s4*128^3 + ... + sn*128^(n-1)
Think of the string as a base-128 (aka radix-128) number.
Problems:
- hash("really, really big") = well... something really, really big
- hash("one thing") % 128 = hash("other thing") % 128 (both begin with 'o', and s1 is the low-order digit)
String Hashing: Issues and Techniques
Minimize collisions
- Make tableSize and radix relatively prime; typically, make tableSize not a multiple of 128
Simplify computation: use Horner's Rule

    int hash(String s) {
        int h = 0;
        for (int i = s.length() - 1; i >= 0; i--) {
            h = (s[i] + 128*h) % tableSize;
        }
        return h;
    }
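The Horner's-Rule hash above can be exercised directly. A runnable sketch, assuming std::string keys and an example tableSize of 97 (any prime that is not a multiple of 128 works; the name hornerHash is ours):

```cpp
#include <string>

// Radix-128 string hash via Horner's Rule, reducing mod tableSize each
// step so intermediate values stay small.
int hornerHash(const std::string & s, int tableSize) {
    int h = 0;
    for (int i = (int)s.length() - 1; i >= 0; i--) {
        h = (s[i] + 128 * h) % tableSize;
    }
    return h;
}
```

Reducing mod tableSize inside the loop keeps the running value below 128 * tableSize, which is the point of combining Horner's Rule with the mod.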
Good Hashing: Multiplication Method
Hash function is defined by size plus a parameter A, where 0 < A < 1:
hA(k) = floor(size * (k*A mod 1))
Example: size = 10, A = 0.485
hA(50) = floor(10 * (50*0.485 mod 1)) = floor(10 * (24.25 mod 1)) = floor(10 * 0.25) = 2
- no restriction on size!
- when building a static table, we can try several values of A
- more computationally intensive than a single mod
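The multiplication method is short enough to sketch in full; the function name multHash is ours, and the parameter values in the usage below are the slide's example:

```cpp
#include <cmath>

// Multiplication-method hash: h_A(k) = floor(size * frac(k * A)),
// where frac(x) = x mod 1 is the fractional part.
int multHash(int k, int size, double A) {
    double prod = k * A;
    double frac = prod - std::floor(prod);   // k*A mod 1
    return (int)std::floor(size * frac);
}
```

With size = 10 and A = 0.485, multHash(50, 10, 0.485) reproduces the slide's value 2.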
Hashing Dilemma
Suppose your Worst Enemy 1) knows your hash function, and 2) gets to decide which keys to send you.
Faced with this enticing possibility, Worst Enemy decides to:
a) Send you keys which maximize collisions for your hash function.
b) Take a nap.
Moral: No single hash function can protect you!
Faced with this dilemma, you:
a) Give up and use a linked list for your Dictionary.
b) Drop out of software, and choose a career in fast foods.
c) Run and hide.
d) Proceed to the next slide, in hope of a better alternative.
Universal Hashing [1]
Suppose we have a set K of possible keys, and a finite set H of hash functions that map keys to entries in a hashtable of size m.
Definition: H is a universal collection of hash functions if and only if:
for any two keys k1, k2 in K, there are at most |H|/m functions in H for which h(k1) = h(k2).
So ... if we randomly choose a hash function from H, our chances of collision are no more than if we get to choose hash table entries at random!
[Figure: keys k1, k2 in K; different functions hi, hj in H map them to entries 0 ... m-1]
[1] Motivation: see previous slide (or visit http://www.burgerking.com/jobs)
Random Hashing – Not!
How can we "randomly choose a hash function"?
- Certainly we cannot randomly choose hash functions at runtime, interspersed amongst the inserts, finds, and deletes! Why not? A key inserted under one hash function could not be found under another.
- We can, however, randomly choose a hash function each time we initialize a new hashtable.
Conclusions:
- Worst Enemy never knows which hash function we will choose – neither do we!
- No single input (set of keys) can always evoke worst-case behavior.
Good Hashing: Universal Hash Function A (UHFa)
Parameterized by a prime table size and a vector a = <a0 a1 ... ar>, where 0 <= ai < size
Represent each key as r + 1 integers, each ki < size:
- size = 11, key = 39752 ==> <3,9,7,5,2>
- size = 29, key = "hello world" ==> <8,5,12,12,15,23,15,18,12,4>
ha(k) = (sum for i = 0 to r of ai * ki) mod size
UHFa: Example
Context: hash strings of length 3 in a table of size 131
let a = <35, 100, 21>
ha("xyz") = (35*120 + 100*121 + 21*122) % 131 = 129
(using the ASCII codes x = 120, y = 121, z = 122)
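The dot-product form of UHFa is easy to make concrete. A sketch (function names are ours) that reproduces the example above, using the characters' ASCII codes as the ki's:

```cpp
#include <string>
#include <vector>

// UHFa: dot product of the coefficient vector a with the key's
// components, mod the (prime) table size.
int uhfA(const std::string & key, const std::vector<int> & a, int size) {
    long sum = 0;
    for (std::size_t i = 0; i < key.length(); i++) {
        sum += (long)a[i] * key[i];   // a_i * k_i
    }
    return (int)(sum % size);
}

// The slide's example: size 131, a = <35, 100, 21>, key "xyz".
int uhfAExample() {
    std::vector<int> a;
    a.push_back(35); a.push_back(100); a.push_back(21);
    return uhfA("xyz", a, 131);
}
```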
Thinking about UHFa
Strengths:
- works on any key type, as long as you can form the ki's
- if we're building a static table, we can try many values of the hash vector <a>
- a random <a> has guaranteed good properties no matter what we're hashing
Weaknesses:
- must choose a prime table size larger than any ki
Good Hashing: Universal Hash Function 2 (UHF2)
Parameterized by j, a, and b:
- j * size should fit into an int
- a and b must be less than size
hj,a,b(k) = ((a*k + b) mod (j*size)) / j    (integer division by j)
UHF2: Example
Context: hash integers in a table of size 16
let j = 32, a = 100, b = 200
hj,a,b(1000) = ((100*1000 + 200) % (32*16)) / 32 = (100200 % 512) / 32 = 360 / 32 = 11
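UHF2 is a one-liner in practice; a sketch (the name uhf2 is ours) using long arithmetic so a*k does not overflow for the example values:

```cpp
// UHF2: h(k) = ((a*k + b) mod (j*size)) / j, with integer division by j.
int uhf2(long k, long j, long a, long b, long size) {
    return (int)(((a * k + b) % (j * size)) / j);
}
```

uhf2(1000, 32, 100, 200, 16) reproduces the slide's example value 11. When j and size are powers of 2, the mod becomes a bit mask and the division a shift.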
Thinking about UHF2
Strengths:
- if we're building a static table, we can try many parameter values
- random a, b has guaranteed good properties no matter what we're hashing
- can choose any size table
- very efficient if j and size are powers of 2 (why?)
Weaknesses:
- need to turn non-integer keys into integers
Hash Function Summary
Goals of a hash function:
- reproducible mapping from key to table index
- evenly distribute keys across the table
- separate commonly occurring keys (neighboring keys?)
- fast runtime
Some hash function candidates:
- h(n) = n % size
- h(n) = (string as base-128 number) % size
- Multiplication hash: compute percentage through the table
- Universal hash function A: dot product with a random vector
- Universal hash function 2: next pseudo-random number
Hash Function Design Considerations
- Know what your keys are
- Study how your keys are distributed
- Try to include all important information in a key in the construction of its hash
- Try to make "neighboring" keys hash to very different places
- Prune the features used to create the hash until it runs "fast enough" (very application dependent)
Handling Collisions
The pigeonhole principle says we can't avoid all collisions:
- try to hash n keys into m slots without collision, with n > m
- try to put 6 pigeons into 5 holes
What do we do when two keys hash to the same entry?
- Separate Chaining: put a little dictionary in each entry
- Open Addressing: pick a next entry to try within the hashtable
Terminology madness :-(
- Separate Chaining is sometimes called Open Hashing
- Open Addressing is sometimes called Closed Hashing
Separate Chaining
Put a little dictionary at each entry
- commonly, an unordered linked list (chain)
- or choose another Dictionary type as appropriate (search tree, hashtable, etc.)
Properties:
- λ can be greater than 1
- performance degrades with length of chains
- an alternate Dictionary type (e.g. search tree, hashtable) can speed up the secondary search
[Figure: a 7-slot table where h(a) = h(d) and h(e) = h(b); colliding keys share a chain]
Separate Chaining Code

    private:
    Dictionary & findBucket(const Key & k) {
        return table[hash(k) % table.size];
    }

    public:
    void insert(const Key & k, const Value & v) {
        findBucket(k).insert(k, v);
    }
    Value & find(const Key & k) {
        return findBucket(k).find(k);
    }
    void remove(const Key & k) {    // "delete" is a reserved word in C++
        findBucket(k).remove(k);
    }
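A minimal, runnable version of the findBucket pattern above; the bucket type (std::list), int keys, string values, and all names are our assumptions for illustration:

```cpp
#include <list>
#include <string>
#include <utility>
#include <vector>

// Separate-chaining table: each slot holds a little dictionary (here an
// unordered linked list of key/value pairs).
class ChainedTable {
    std::vector<std::list<std::pair<int, std::string> > > table;
    std::list<std::pair<int, std::string> > & findBucket(int k) {
        return table[k % table.size()];
    }
public:
    explicit ChainedTable(std::size_t size) : table(size) {}
    void insert(int k, const std::string & v) {
        findBucket(k).push_back(std::make_pair(k, v));
    }
    bool find(int k, std::string & out) {
        std::list<std::pair<int, std::string> > & b = findBucket(k);
        for (std::list<std::pair<int, std::string> >::iterator it = b.begin();
             it != b.end(); ++it)
            if (it->first == k) { out = it->second; return true; }
        return false;
    }
};

// 4 and 11 collide in a 7-slot table (11 % 7 = 4); chaining keeps both.
bool chainDemo() {
    ChainedTable t(7);
    t.insert(4, "four");
    t.insert(11, "eleven");
    std::string v;
    return t.find(11, v) && v == "eleven" && t.find(4, v) && v == "four"
        && !t.find(18, v);
}
```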
Load Factor in Separate Chaining
Search cost (average, in terms of the load factor λ):
- unsuccessful search: λ probes (traverse the whole chain)
- successful search: about 1 + λ/2 probes (half the chain, on average, plus the match)
Desired load factor: λ ≈ 1
Open Addressing
Allow one key at each table entry
- two objects that hash to the same spot can't both go there
- first one there gets the spot; the next must go in another spot
Properties:
- λ ≤ 1
- performance degrades with difficulty of finding the right spot
[Figure: a 7-slot table where h(a) = h(d) and h(e) = h(b); d and b are displaced to other slots]
Probing
Requires a collision resolution function f(i)
Probing how-to:
- First probe: given a key k, hash to h(k)
- Second probe: if h(k) is occupied, try h(k) + f(1)
- Third probe: if h(k) + f(1) is occupied, try h(k) + f(2)
- And so forth
Probing properties:
- we force f(0) = 0
- the ith probe is to (h(k) + f(i)) mod size
- if i reaches size - 1, the probe has failed
- depending on f(), the probe may fail sooner
- long sequences of probes are costly!
Linear Probing
f(i) = i
Probe sequence: h(k) mod size, (h(k) + 1) mod size, (h(k) + 2) mod size, ...

    bool findEntry(const Key & k, Entry *& entry) {
        int probePoint = hash(k);
        do {
            entry = &table[probePoint];
            probePoint = (probePoint + 1) % size;
        } while (!entry->isEmpty() && entry->key != k);
        return !entry->isEmpty();   // note: assumes the table is not full
    }
Linear Probing Example (size = 7)
insert(76): 76 % 7 = 6; slot 6 empty (1 probe)
insert(93): 93 % 7 = 2; slot 2 empty (1 probe)
insert(40): 40 % 7 = 5; slot 5 empty (1 probe)
insert(47): 47 % 7 = 5; slots 5 and 6 occupied; slot 0 empty (3 probes)
insert(10): 10 % 7 = 3; slot 3 empty (1 probe)
insert(55): 55 % 7 = 6; slots 6 and 0 occupied; slot 1 empty (3 probes)
Final table: [0: 47, 1: 55, 2: 93, 3: 10, 4: empty, 5: 40, 6: 76]
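The linear-probing example can be replayed in a few lines; a sketch over a plain int table where -1 marks an empty slot (names and that sentinel are our choices):

```cpp
#include <vector>

// Linear-probing insert (f(i) = i); assumes the table is not full.
int linearInsert(std::vector<int> & table, int key) {
    int size = (int)table.size();
    int p = key % size;
    int probes = 1;
    while (table[p] != -1) {          // -1 marks an empty slot
        p = (p + 1) % size;
        probes++;
    }
    table[p] = key;
    return probes;                    // number of probes used
}

// Replays the worked example: 76, 93, 40, 47, 10, 55 into a size-7 table.
bool linearProbingDemo() {
    std::vector<int> t(7, -1);
    const int keys[] = {76, 93, 40, 47, 10, 55};
    for (int i = 0; i < 6; i++) linearInsert(t, keys[i]);
    return t[0] == 47 && t[1] == 55 && t[2] == 93 && t[3] == 10
        && t[4] == -1 && t[5] == 40 && t[6] == 76;
}
```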
Load Factor in Linear Probing
For any λ < 1, linear probing will find an empty slot.
Search cost (for large table sizes):
- successful search: (1/2)(1 + 1/(1 - λ)) probes on average
- unsuccessful search: (1/2)(1 + 1/(1 - λ)^2) probes on average
Linear probing suffers from primary clustering.
Performance quickly degrades for λ > 1/2.
Quadratic Probing
f(i) = i^2
Probe sequence: h(k) mod size, (h(k) + 1) mod size, (h(k) + 4) mod size, (h(k) + 9) mod size, ...

    bool findEntry(const Key & k, Entry *& entry) {
        int probePoint = hash(k), i = 0;
        do {
            entry = &table[probePoint];
            i++;
            probePoint = (probePoint + 2*i - 1) % size;  // h(k)+i^2 = h(k)+(i-1)^2 + (2i-1)
        } while (!entry->isEmpty() && entry->key != k);
        return !entry->isEmpty();
    }
Good Quadratic Probing Example (size = 7)
insert(76): 76 % 7 = 6; slot 6 empty (1 probe)
insert(40): 40 % 7 = 5; slot 5 empty (1 probe)
insert(48): 48 % 7 = 6; slot 6 occupied; (6 + 1) % 7 = 0 empty (2 probes)
insert(5): 5 % 7 = 5; slots 5 and (5 + 1) % 7 = 6 occupied; (5 + 4) % 7 = 2 empty (3 probes)
insert(55): 55 % 7 = 6; slots 6 and 0 occupied; (6 + 4) % 7 = 3 empty (3 probes)
Final table: [0: 48, 1: empty, 2: 5, 3: 55, 4: empty, 5: 40, 6: 76]
Bad Quadratic Probing Example (size = 7)
insert(76): 76 % 7 = 6; slot 6 empty (1 probe)
insert(93): 93 % 7 = 2; slot 2 empty (1 probe)
insert(40): 40 % 7 = 5; slot 5 empty (1 probe)
insert(35): 35 % 7 = 0; slot 0 empty (1 probe)
insert(47): 47 % 7 = 5; the probe sequence visits slots 5, 6, 2, 0, 0, 2, 6, ... over and over; the insert fails even though slots 1, 3, and 4 are empty!
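The failure above is easy to reproduce in code; a sketch (names and the -1 empty-slot sentinel are ours) that caps the probe count at size, as the probing slide suggests:

```cpp
#include <vector>

// Quadratic-probing insert (f(i) = i^2), advancing incrementally by
// 2i - 1 and giving up after `size` probes; -1 marks an empty slot.
bool quadInsert(std::vector<int> & table, int key) {
    int size = (int)table.size();
    int p = key % size;
    for (int i = 1; i <= size; i++) {
        if (table[p] == -1) { table[p] = key; return true; }
        p = (p + 2*i - 1) % size;   // moves from h + (i-1)^2 to h + i^2
    }
    return false;   // probe sequence cycled through occupied slots only
}

// The "bad" example: after 76, 93, 40, 35, inserting 47 fails even
// though slots 1, 3, 4 of the size-7 table are still empty.
bool badQuadDemo() {
    std::vector<int> t(7, -1);
    const int keys[] = {76, 93, 40, 35};
    bool ok = true;
    for (int i = 0; i < 4; i++) ok = quadInsert(t, keys[i]) && ok;
    return ok && !quadInsert(t, 47) && t[1] == -1 && t[3] == -1 && t[4] == -1;
}
```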
Quadratic Probing Succeeds for λ ≤ ½
If size is prime and λ ≤ ½, then quadratic probing will find an empty slot in size/2 probes or fewer.
Show: for all 0 ≤ i < j ≤ size/2,
(h(x) + i^2) mod size ≠ (h(x) + j^2) mod size
By contradiction: suppose that for some such i, j:
(h(x) + i^2) mod size = (h(x) + j^2) mod size
=> i^2 mod size = j^2 mod size
=> (i^2 - j^2) mod size = 0
=> [(i + j)(i - j)] mod size = 0
Since size is prime, it must divide (i + j) or (i - j). But how can i + j = 0 or i + j = size when i ≠ j and i, j ≤ size/2? The same argument applies to (i - j) mod size = 0.
Quadratic Probing May Fail for λ > ½
For any i larger than size/2, there is some j smaller than i that adds with i to equal size (or a multiple of size), so probe i revisits probe j's slot. D'oh!
Load Factor in Quadratic Probing
- For any λ ≤ ½, quadratic probing will find an empty slot
- For λ > ½, quadratic probing may fail to find a slot
- Quadratic probing does not suffer from primary clustering
- Quadratic probing does suffer from secondary clustering (keys with the same h(k) follow the same probe sequence)
How could we possibly solve this?
Double Hashing
f(i) = i * hash2(k)
Probe sequence: h1(k) mod size, (h1(k) + 1*hash2(k)) mod size, (h1(k) + 2*hash2(k)) mod size, ...

    bool findEntry(const Key & k, Entry *& entry) {
        int probePoint = hash1(k), delta = hash2(k);
        do {
            entry = &table[probePoint];
            probePoint = (probePoint + delta) % size;
        } while (!entry->isEmpty() && entry->key != k);
        return !entry->isEmpty();
    }
A Good Double Hash Function...
...is quick to evaluate.
...differs from the original hash function.
...never evaluates to 0 (mod size).
One good choice: choose a prime p < size and let hash2(k) = p - (k mod p)
Double Hashing Example (size = 7, p = 5, hash2(k) = 5 - (k mod 5))
insert(76): 76 % 7 = 6; slot 6 empty (1 probe)
insert(93): 93 % 7 = 2; slot 2 empty (1 probe)
insert(40): 40 % 7 = 5; slot 5 empty (1 probe)
insert(47): 47 % 7 = 5 occupied; hash2(47) = 5 - (47 % 5) = 3; (5 + 3) % 7 = 1 empty (2 probes)
insert(10): 10 % 7 = 3; slot 3 empty (1 probe)
insert(55): 55 % 7 = 6 occupied; hash2(55) = 5 - (55 % 5) = 5; (6 + 5) % 7 = 4 empty (2 probes)
Final table: [0: empty, 1: 47, 2: 93, 3: 10, 4: 55, 5: 40, 6: 76]
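The double-hashing example can likewise be replayed directly; a sketch with -1 as the empty-slot sentinel and all names ours:

```cpp
#include <vector>

// Double-hashing insert with h1(k) = k % size and hash2(k) = p - (k % p),
// p = 5, size = 7, replaying the worked example above.
bool doubleHashDemo() {
    const int size = 7, p = 5;
    std::vector<int> t(size, -1);
    const int keys[] = {76, 93, 40, 47, 10, 55};
    for (int i = 0; i < 6; i++) {
        int k = keys[i];
        int pos = k % size;
        int delta = p - (k % p);     // in 1..p, so never 0 (mod size)
        while (t[pos] != -1) pos = (pos + delta) % size;
        t[pos] = k;
    }
    // final layout from the worked example
    return t[0] == -1 && t[1] == 47 && t[2] == 93 && t[3] == 10
        && t[4] == 55 && t[5] == 40 && t[6] == 76;
}
```

Because size = 7 is prime and delta < 7, each key's probe sequence visits every slot, so the loop always terminates while the table has a hole.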
Load Factor in Double Hashing
For any λ < 1, double hashing will find an empty slot (given an appropriate table size and hash2).
Search cost appears to approach optimal (random hash):
- successful search: (1/λ) ln(1/(1 - λ)) probes on average
- unsuccessful search: 1/(1 - λ) probes on average
No primary clustering and no secondary clustering
One extra hash calculation
Deletion in Open Addressing
[Figure: a size-7 table with h(k) = k % 7, holding 0, 1, 2 in slots 0-2 and 7, displaced by linear probing, in slot 3. After delete(2) empties slot 2, find(7) probes slots 0 and 1, hits the now-empty slot 2, and stops. Where is it?!]
Must use lazy deletion!
- On insertion, treat a (lazily) deleted item as an empty slot
The Squished Pigeon Principle
- Insert using Open Addressing cannot work with λ ≥ 1.
- Insert using Open Addressing with quadratic probing may not work with λ > ½.
- With Separate Chaining or Open Addressing, large load factors lead to poor performance!
How can we relieve the pressure on the pigeons?
- Hint: what happens when we overrun array storage in a {queue, stack, heap}? What else must happen with a hashtable?
Rehashing
When λ gets "too large" (over some constant threshold), rehash all elements into a new, larger table:
- takes O(n), but amortized O(1) as long as we (just about) double the table size on the resize
- spreads keys back out, and may drastically improve performance
- gives us a chance to retune parameterized hash functions
- avoids failure for Open Addressing techniques
- allows arbitrarily large tables starting from a small table
- clears out lazily deleted items
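A minimal rehashing sketch for a linear-probing table of non-negative ints; the λ > ½ threshold and the doubling follow the slides' advice, while the struct, names, and -1 sentinel are our assumptions:

```cpp
#include <vector>

// Linear-probing table that rehashes (doubles and re-inserts everything)
// whenever an insert would push the load factor over 1/2.
struct RehashingTable {
    std::vector<int> slots;   // -1 marks an empty slot
    int entries;
    explicit RehashingTable(int size) : slots(size, -1), entries(0) {}
    void insert(int k) {
        if (2 * (entries + 1) > (int)slots.size()) rehash();
        int p = k % (int)slots.size();
        while (slots[p] != -1) p = (p + 1) % (int)slots.size();
        slots[p] = k;
        entries++;
    }
    void rehash() {   // O(n) now, amortized O(1) per insert
        std::vector<int> old = slots;
        slots.assign(old.size() * 2, -1);
        entries = 0;
        for (std::size_t i = 0; i < old.size(); i++)
            if (old[i] != -1) insert(old[i]);
    }
};

// Ten inserts into a size-4 table force three rehashes: 4 -> 8 -> 16 -> 32.
bool rehashDemo() {
    RehashingTable t(4);
    for (int k = 0; k < 10; k++) t.insert(7 * k);
    return t.entries == 10 && t.slots.size() == 32;
}
```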
Case Study
Spelling dictionary:
- 30,000 words
- static
- arbitrary(ish) preprocessing time
Goals:
- fast spell checking
- minimal storage
Practical notes:
- almost all searches are successful – why?
- words average about 8 characters in length
- 30,000 words at 8 bytes/word is about 0.25 MB
- pointers are 4 bytes
- there are many regularities in the structure of English words
Case Study: Design Considerations
Possible solutions:
- sorted array + binary search
- Separate Chaining
- Open Addressing + linear probing
Issues:
- Which data structure should we use?
- Which type of hash function should we use?
Case Study: Storage
Assume words are strings and table entries are pointers to strings.
[Figure: the three layouts side by side (array + binary search, Separate Chaining, Open Addressing)]
How many pointers does each use?
Case Study: Analysis
- Binary search: storage n pointers + words = 360KB; time log2(n), about 15 probes per access, worst case
- Separate Chaining: storage n + n/λ pointers + words (λ = 1: 600KB); time 1 + λ/2 probes per access on average (λ = 1: 1.5 probes)
- Open Addressing: storage n/λ pointers + words (λ = 0.5: 480KB); time (1 + 1/(1 - λ))/2 probes per access on average (λ = 0.5: 1.5 probes)
What to do, what to do? ...
Perfect Hashing
When we know the entire key set in advance...
- Examples: programming language keywords, CD-ROM file list, spelling dictionary, etc.
...then perfect hashing lets us achieve:
- Worst-case O(1) time complexity!
- Worst-case O(n) space complexity!
Perfect Hashing Technique
- Static set of n known keys
- Separate chaining, two-level hash
- Primary hash table of size n
- jth secondary hash table of size nj^2 (where nj keys hash to slot j in the primary hash table)
- Universal hash functions in all hash tables
- Conduct (a few!) random trials, until we get collision-free hash functions
[Figure: a primary hash table whose slots point to secondary hash tables]
Perfect Hashing Theorems [1]
Theorem: If we store n keys in a hash table of size n^2 using a randomly chosen universal hash function, then the probability of any collision is < ½.
Theorem: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function, then
E[ sum for j = 0 to m-1 of nj^2 ] < 2n
where nj is the number of keys hashing to slot j.
Corollary: If we store n keys in a hash table of size m = n using a randomly chosen universal hash function and we set the size of each secondary hash table to mj = nj^2, then:
a) The expected amount of storage required for all secondary hash tables is less than 2n.
b) The probability that the total storage used for all secondary hash tables exceeds 4n is less than ½.
[1] Intro to Algorithms, 2nd ed. Cormen, Leiserson, Rivest, Stein
Perfect Hashing Conclusions
The perfect hashing theorems set tight expected bounds on the sizes and collision behavior of all the hash tables (primary and all secondaries).
Conduct a few random trials of universal hash functions, by simply varying the UHF parameters, until we get a set of UHFs and associated table sizes which deliver...
- Worst-case O(1) time complexity!
- Worst-case O(n) space complexity!
Extendible Hashing: Cost of a Database Query
[Figure: cost breakdown of a database query]
I/O to CPU ratio is 300-to-1!
Extendible Hashing
Hashing technique for huge data sets:
- optimizes to reduce disk accesses
- each hash bucket fits on one disk block
- better than B-Trees if order is not important – why?
Table contains:
- buckets, each fitting in one disk block, with the data
- a directory that fits in one disk block, used to hash to the correct bucket
Extendible Hash Table
- Directory entry: a key prefix (first k bits) and a pointer to the bucket with all keys starting with its prefix
- Each bucket contains keys matching on their first j ≤ k bits, plus the data associated with each key
Directory for k = 3: 000 001 010 011 100 101 110 111
Buckets (with their local prefix length j):
(2) 00001 00011 00100 00110
(2) 01001 01011 01100
(3) 10001 10011
(3) 10101 10110 10111
(2) 11001 11011 11100 11110
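Finding the right directory entry is just a prefix extraction; a sketch (the function name is ours) for the 5-bit keys and k = 3 directory shown above:

```cpp
// Directory lookup for an extendible hash table: the directory index is
// the key's top k bits. keyBits is the total key width (5 in the figure).
int directoryIndex(int key, int keyBits, int k) {
    return key >> (keyBits - k);   // top k bits select the directory entry
}
```

For example, the key 10110 (decimal 22) has top bits 101, so it lands in directory entry 5.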
Inserting (the easy case)
insert(11011): the directory entries for prefix 11 point to bucket (2) 11001 11100 11110, which has room, so the key is simply added:
before: (2) 11001 11100 11110
after: (2) 11001 11011 11100 11110
Splitting a Leaf
insert(11000): the bucket (2) 11001 11011 11100 11110 is full, so it splits into two buckets distinguished by a third bit:
(3) 11000 11001 11011
(3) 11100 11110
The directory entries 110 and 111 now point to different buckets.
HashingCSE 326 Autumn 2001
62
Splitting the Directory1. insert(10010)
But, no room to insert and no adoption!
2. Solution: Expand directory
3. Then, it’s just a normal split.
01 10 1100
(2)01101
(2)10000100011001110111
(2)1100111110
001 010 011 110 111 101000 100
HashingCSE 326 Autumn 2001
63
If Extendible Hashing Doesn't Cut It
Store only pointers to the items:
+ (potentially) much smaller M
+ fewer items in the directory
– one extra disk access!
Rehash:
+ potentially better distribution over the buckets
+ fewer unnecessary items in the directory
– can't solve the problem if there's simply too much data
What if these don't work?
- use a B-Tree to store the directory!
Hash Wrap
Collision resolution:
- Separate Chaining: expands beyond the hashtable via secondary Dictionaries; allows λ > 1
- Open Addressing: expands within the hashtable via secondary probing {linear, quadratic, double hash}; λ ≤ 1 (by definition!), λ ≤ ½ (by preference!)
Rehashing: tunes up the hashtable when λ crosses the line
Hash functions:
- simple integer hash: prime table size
- multiplication method
- universal hashing guarantees no (always) bad input
Perfect hashing:
- requires a known, fixed keyset
- achieves O(1) time, O(n) space - guaranteed!
Extendible hashing:
- for disk-based data
- combine with a B-Tree directory if needed