Hash Tables Universal Families of Hash Functions Bloom Filters Wednesday, July 23 rd 1.


Hash Tables

Universal Families of Hash Functions

Bloom Filters

Wednesday, July 23rd


2

Outline For Today

1. Hash Tables and Universal Hashing

2. Bloom Filters


4

Hash Tables

Randomized Data Structure

Implementing the “Dictionary” abstract data type (ADT)

Insert

Delete

Lookup

We’ll assume no duplicates

Applications

Symbol Tables/Compilers: which variables are already declared?

ISPs: is an IP address spam/blacklisted?

many others


5

Setup

Universe U of all possible elements

All possible 2^32 IP addresses

All possible variables that can be declared

Maintain a possibly evolving subset S ⊆ U

|S| = m and |U| >> m


6

Naïve Dictionary Implementations (1)

Option 1: Bit Vectors

An array A, keeping one bit 0/1 for each element of U

Insert element i => A[i] = 1

Delete element i => A[i] = 0

Lookup element i => return A[i] == 1

Time complexity of every operation is O(1)

Space: O(|U|) (e.g. 2^32 bits for IP addresses)

Quick but Not Scalable!
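
A minimal sketch of the bit-vector option, assuming the universe is the integers 0, …, |U|-1. The class and method names are hypothetical and chosen only for illustration; a `bytearray` stands in for the array A.

```python
class BitVectorDictionary:
    """Direct-address dictionary over the universe {0, ..., universe_size - 1}."""

    def __init__(self, universe_size: int) -> None:
        # One bit per possible element: space is O(|U|) no matter how small S is.
        self.bits = bytearray((universe_size + 7) // 8)

    def insert(self, i: int) -> None:
        # A[i] = 1
        self.bits[i >> 3] |= 1 << (i & 7)

    def delete(self, i: int) -> None:
        # A[i] = 0 (mask kept within one byte)
        self.bits[i >> 3] &= ~(1 << (i & 7)) & 0xFF

    def lookup(self, i: int) -> bool:
        # return A[i] == 1
        return bool(self.bits[i >> 3] & (1 << (i & 7)))


d = BitVectorDictionary(1000)
d.insert(5)
```

Every operation touches a single byte, which is why the time is O(1), while the allocation in the constructor is what makes the structure unscalable for a universe such as all IP addresses.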


7

Naïve Dictionary Implementations (2)

Option 2: Linked List

One entry for each element in S

Insert element i => check if i exists, if not append to list

Delete element i => find i in the list and remove

Lookup element i => go through the entire list

Time complexity of every operation is O(|S|)

Space: O(|S|)

Scalable but Not Quick!


8

Hash Tables: Best of Both Worlds

Randomized Dictionary that is:

Quick: O(1) expected time for each operation

Scalable: O(|S|) space


9

Hash Tables: High-level Idea

Buckets: distinct locations in the hash table

Let n be # buckets

n ≈ m (recall m = |S|)

i.e., Load factor: m/n = O(1)

Hash Function h: U -> {0, 1, …, n-1}

We store each element x in bucket h(x)

U: universe, m = size of S, n = # buckets


10

Hash Tables: High-level Idea

[Figure: the hash function h maps the universe U to buckets 0, 1, …, n-2, n-1]


11

Collisions

Multiple elements are hashed to the same bucket.

Assume we are about to insert a new x and bucket h(x) is already full

Resolving Collisions:

Chaining: Linked list per bucket; append x to the list

Open Addressing: If h(x) is already full, we deterministically assign x to another empty bucket

Saves space


12

Chaining

[Figure sequence, slides 12 to 17: elements are inserted one at a time into buckets 0, 1, …, n-2, n-1, all initially Null]

Insert e3: h(e3) = 1

Insert e7: h(e7) = n-2

Insert e5: h(e5) = n-1

Insert e1: h(e1) = 1

Insert e4: h(e4) = 1

Final state: bucket 0: Null; bucket 1: e3, e1, e4; bucket n-2: e7; bucket n-1: e5


18

Operations (With Chaining)

Insert(x): Go to bucket h(x); if x is not in the list, append it.

Delete(x): Go to bucket h(x); if x is in the list, delete it.

Lookup(x): Go to bucket h(x); return true if x is in the list.
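
The three operations can be sketched as follows. This is an illustrative sketch, not the lecture's code: Python lists stand in for the per-bucket linked lists, the class name is hypothetical, and `x % 8` is only a placeholder hash function.

```python
class ChainedHashTable:
    """Hash table with chaining; each bucket holds a list of its elements."""

    def __init__(self, n_buckets, hash_fn):
        self.n = n_buckets
        self.h = hash_fn                      # h: U -> {0, ..., n-1}
        self.buckets = [[] for _ in range(n_buckets)]

    def insert(self, x):
        chain = self.buckets[self.h(x)]       # go to bucket h(x)
        if x not in chain:                    # no duplicates
            chain.append(x)

    def delete(self, x):
        chain = self.buckets[self.h(x)]
        if x in chain:
            chain.remove(x)

    def lookup(self, x):
        return x in self.buckets[self.h(x)]


# Placeholder hash function (an assumption for the demo): x mod 8.
table = ChainedHashTable(8, lambda x: x % 8)
for e in (3, 11, 19, 6):
    table.insert(e)
```

Each operation scans only the chain of bucket h(x), so its cost is proportional to that chain's length.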


19

Running Time of Operations

Assume evaluating the hash function is constant time

May not be true for all hash functions

Consider an element x

[Figure: the chained hash table from the chaining example]

Lookup: O(|Linked list h(x)|)

Insert: O(|Linked list h(x)|)

Delete: O(|Linked list h(x)|)


20

Worst & Best Scenarios

Assume m: # elements in the hash table

Worst Case: O(m)

Best Case: O(1)

|Linked lists| depends on the quality of the hash function!

Fundamental Question: How can we choose “good” hash functions?


21

Bad Hash Functions

Recall our IP addresses example: 32 bits

# buckets n = 2^8

Idea: Use most significant 8 bits

Big correlations with geography of how IP addresses are assigned: 171, 172 as the first 8 bits is common

Lots of addresses would get mapped to the same bucket

In practice we should be very careful when picking hash functions!
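
A small sketch of why the top-8-bits idea fails. The addresses below are synthetic and made up purely for illustration (all of them start with 171 or 172); the function name is hypothetical.

```python
from collections import Counter


def top_byte_hash(ip: int) -> int:
    """Bad hash: bucket = most significant 8 bits of a 32-bit address."""
    return ip >> 24


# Synthetic data (an assumption): 1000 distinct addresses in 171.x.x.x / 172.x.x.x.
addresses = [((171 + (i % 2)) << 24) | (i * 7919) for i in range(1000)]

# Count how many addresses land in each of the 2^8 buckets.
load = Counter(top_byte_hash(ip) for ip in addresses)
```

With this input only 2 of the 256 buckets are ever used, so each chain holds half of the data set and every operation degrades toward O(m).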


22

Is There A Single Good Hash Function?

Idea: Design a clever hash function h that spreads every data set evenly across the buckets.

Problem: Cannot exist!

[Figure: h maps the universe U to buckets 0, 1, …, n-2, n-1]

Recall |U| >> m ≈ n

By pigeonhole: ∃ bucket i to which h maps at least |U|/n elements of U

If all of S comes from bucket i, then all operations are O(m)!


23

No Single Good Hash Function!

Claim: For every single hash function h, there is a pathological data set!

Proof: By pigeonhole principle
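
The pigeonhole argument is constructive, which the sketch below tries to show: given any fixed h, group the universe by bucket and take the data set from the fullest bucket. The function name and the sample hash `(31 * x + 7) % n` are hypothetical choices for the demo.

```python
from collections import defaultdict


def pathological_set(h, universe, m):
    """Return m elements of `universe` that h sends to one single bucket."""
    preimages = defaultdict(list)
    for x in universe:
        preimages[h(x)].append(x)
    # By pigeonhole the fullest bucket receives at least |U|/n elements.
    fullest = max(preimages.values(), key=len)
    return fullest[:m]


n = 16
h = lambda x: (31 * x + 7) % n        # any fixed hash function would do
bad = pathological_set(h, range(10_000), m=16)
```

All 16 elements of `bad` collide under h, so a chained table built with this h stores them in one list.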


24

Solution: Pick a Hash Function Randomly

Design a set or a “family” H of hash functions, s.t. ∀ data sets S, if we pick an h ∈ H randomly, then almost always we spread S out evenly across buckets.

Question: Why couldn’t you have randomness inside your hash function?


Clarification on Proposed Analysis

[Diagram: Input S -> Hash Table (pick h randomly from H) -> Performance]

We’ll analyze the expected performance on an arbitrary but fixed input S.


Clarification on Proposed Analysis

[Diagram: the same input S is run t times; run i picks h_i randomly from H and yields Performance_i, for i = 1, …, t]


27

Roadmap

1. Define H being “Universal”

2. If H is universal and we pick h ∈ H randomly, then our hash table has O(1) expected cost

3. Show simple and practical H exist.


28

1. Universal Family of Hash Functions

Let H be a set of functions from U to {0, 1, …, n-1}.

Definition: H is universal if ∀ x, y ∈ U s.t. x ≠ y, if h is chosen uniformly at random from H then:

Pr(h(x) = h(y)) ≤ 1/n

I.e., the fraction of hash functions of H that make x & y collide is at most 1/n

Why 1/n?

“As if we were independently mapping x, y to buckets (& uniformly at random).”


29

2. Universality => Operations Are O(1)

Let H be a universal family of hash functions from U to {0, 1, …, n-1}.

Recall m = O(n)

Claim: If h is picked uniformly at random from H, then for any data set S, hash table operations take O(1) expected time.

30

2. Universality => Operations Are O(1)

Proof:

[Figure: data set S stored in a hash table with buckets 0, 1, …, n-2, n-1, using h picked randomly from H]

A new element x arrives. Say we want to perform Lookup(x).

Cost: O(# elements in bucket h(x)).

This quantity is a random variable. Call it Z.


31

2. Universality => Operations Are O(1)

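Proof Continued: Z = # elements in bucket h(x).

For each element y ∈ S, let X_y be 1 if h(y) = h(x), and 0 otherwise. Then Z = Σ_{y ∈ S} X_y.

By linearity of expectation and universality:

E[Z] = Σ_{y ∈ S} E[X_y] = Σ_{y ∈ S} Pr(h(y) = h(x)) ≤ 1 + Σ_{y ∈ S, y ≠ x} 1/n ≤ 1 + m/n = O(1)

The 1 is in case x is already there (then X_x = 1)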

Q.E.D
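
The bound E[Z] ≤ 1 + m/n can be checked empirically. The sketch below is a simulation under a stated assumption: h is modeled as a fully random function (each element gets an independent uniform bucket), which is one particular universal choice, and the query x is not in S. The function name and parameters are hypothetical.

```python
import random


def average_bucket_load(m, n, trials=2000, seed=0):
    """Average of Z = #elements of S in bucket h(x), over fresh random h."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        hx = rng.randrange(n)                 # bucket of the query x
        # Each y in S lands in x's bucket with probability 1/n.
        total += sum(rng.randrange(n) == hx for _ in range(m))
    return total / trials


avg = average_bucket_load(m=64, n=64)
```

With m = n the average should come out near m/n = 1, comfortably below the bound 1 + m/n = 2.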


32

3. Universal Families of HF Exist (1)

Let n = 2^b, |U| = 2^t and t > b

Represent each x as a t-bit binary vector

Ex: |U| = 2^7 = 128, hash table has size 2^4 = 16

h(x) = Mx, where M is a random 0/1 b x t matrix and multiplication is mod 2

M =

0 1 1 0 1 0 1

1 0 0 1 1 0 0

1 1 1 0 0 0 1

0 0 1 1 0 0 0

x = (0, 1, 1, 0, 1, 0, 0), i.e. element 52

h(x) = Mx = (1, 1, 0, 1), i.e. bucket 13


3. Universal Families of HF Exist (2)

33

h(x) = Mx maps {0,1}^t -> {0,1}^b, i.e. U -> {0, 1, …, n-1}

H = the set of all possible b x t 0/1 matrices

[Figure: the example M, x and h(x) = Mx, with M a random 0/1 b x t matrix, multiplication mod 2, |U| = 2^t and n = 2^b]
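
A minimal sketch of the matrix family, assuming each row of M is stored as a t-bit integer so that a row-vector product mod 2 becomes the parity of a bitwise AND. The function names are hypothetical.

```python
import random


def random_matrix(b, t, rng):
    """b rows, each a t-bit integer whose bits are independent fair coin flips."""
    return [rng.getrandbits(t) for _ in range(b)]


def matrix_hash(rows, x):
    """h(x) = Mx over GF(2): output bit i is the parity of (row i AND x)."""
    out = 0
    for row in rows:
        parity = bin(row & x).count("1") & 1
        out = (out << 1) | parity             # first row becomes the top bit
    return out


rng = random.Random(0)
M = random_matrix(b=4, t=7, rng=rng)          # |U| = 2^7, n = 2^4
bucket = matrix_hash(M, 52)                   # some bucket in {0, ..., 15}
```

Because the map is linear over GF(2), h(x XOR y) = h(x) XOR h(y), which is the property the universality proof relies on.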


Proof that H is Universal (1)

34

Need to prove that ∀ x ≠ y, Pr(h(x) = h(y)) ≤ 1/n = 1/2^b, when M is picked uniformly at random from H,

=> equivalently, when each cell of M is picked independently and uniformly at random.

[Figure: the example M, x and h(x) = Mx, with |U| = 2^t and n = 2^b]


Proof that H is Universal (2)

35

x, y differ in at least one bit (say w.l.o.g., the last bit)

Let z = x - y (mod 2), so the last bit of z is 1, and Mx = My iff Mz = 0

[Figure: M multiplied by z = (z_1, …, z_{t-1}, 1) gives Mz]

Q: Pr(Mz = 0)?


Proof that H is Universal (3)

36

Pr(Mz = 0) = Pr(Mz[1] = 0 & Mz[2] = 0 & … & Mz[b] = 0)

**Event Mz[i] = 0 is independent from Mz[j] = 0 (i ≠ j), since the coin flips for row i of M are independent from the coin flips for row j**

Pr(Mz = 0) = Pr(Mz[1] = 0) Pr(Mz[2] = 0) … Pr(Mz[b] = 0)

[Figure: M multiplied by z = (z_1, …, z_{t-1}, 1) gives Mz]

Q: Pr(Mz[i] = 0)?


Proof that H is Universal (4)

37

Pr(Mz[i] = 0)

[Figure: row i of M multiplied by z = (z_1, …, z_{t-1}, 1)]

Mz[i] = m_{i1} z_1 + m_{i2} z_2 + … + m_{i,t-1} z_{t-1} + 1 * m_{it} (mod 2)

Let s be the (modulo 2) sum of the first t-1 multiplications; Mz[i] = 0 iff m_{it} is equal to s (since -s = s mod 2)!


Proof that H is Universal (5)

38

Pr(Mz[i] = 0) = 1/2

Irrespective of the first t-1 coin flips, it all depends on the last coin flip.

Pr(Mx = My) = Pr(Mz = 0) = 1/2^b = 1/n

Q.E.D
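
For small parameters the claim can be verified exhaustively rather than by sampling. The sketch below enumerates every b x t 0/1 matrix (rows encoded as t-bit integers, an encoding chosen here for convenience) and computes the exact collision probability for every pair x ≠ y. The function names are hypothetical.

```python
from fractions import Fraction
from itertools import product


def mat_vec_mod2(rows, x):
    """Mx over GF(2), with each row of M given as a t-bit integer."""
    return tuple(bin(r & x).count("1") & 1 for r in rows)


def collision_probability(b, t, x, y):
    """Exact Pr(Mx = My) when M is uniform over all b x t 0/1 matrices."""
    matrices = list(product(range(1 << t), repeat=b))     # all 2^(bt) matrices
    hits = sum(mat_vec_mod2(M, x) == mat_vec_mod2(M, y) for M in matrices)
    return Fraction(hits, len(matrices))


b, t = 2, 4
probs = {collision_probability(b, t, x, y)
         for x in range(1 << t) for y in range(x + 1, 1 << t)}
```

For b = 2 and t = 4 the set `probs` contains the single value 1/4 = 1/2^b, matching the proof.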


Storing and Evaluating Hash Function h (M)

39

Q: How much space do we need to store the random matrix M?

A: bt bits = O(log|U| log(n))

Q: How much time to evaluate Mx?

A: Naïve: O(bt) = O(log|U| log(n))

Summary: H is a relatively fast and practical universal family of hash functions


Another Possible Family

40

We’re hashing from U -> {0, 1, …, n-1}

Let H be the set of all such functions

Question: Is H universal?


Another Possible Family

41

We’re hashing from U -> {0, 1, …, n-1}

Q1: # such functions?

A1: n^|U|

Q2: # functions in which h(x) = h(y) = j?

A2: n^(|U|-2)

Q3: # functions in which h(x) = h(y)?

A3: n * n^(|U|-2) = n^(|U|-1)

Q4: Pr(h(x) = h(y))?

Answer: n^(|U|-1) / n^|U| = 1/n => H is universal!


Why is H Impractical?

42

There are n^|U| functions in H.

What’s the cost of storing a function h from H?

log(|H|) = O(|U| log(n)) bits!

Not Practical!
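
To put numbers on the comparison, the sketch below evaluates both storage costs for the IP example, assuming |U| = 2^t with t = 32 and n = 2^b with b = 8. The function names are hypothetical.

```python
def bits_for_arbitrary_function(t: int, b: int) -> int:
    """log2(n^|U|) = |U| * log2(n) = 2^t * b: one b-bit bucket index per element of U."""
    return (1 << t) * b


def bits_for_random_matrix(t: int, b: int) -> int:
    """A b x t 0/1 matrix needs b * t bits."""
    return b * t


t, b = 32, 8
table_bits = bits_for_arbitrary_function(t, b)    # 2^35 bits, i.e. 4 GiB
matrix_bits = bits_for_random_matrix(t, b)        # 256 bits
```

Storing an arbitrary function costs as much as a lookup table over the whole universe, which defeats the purpose of hashing, while the matrix family needs only a few hundred bits.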


43

Summary

1. Hash Tables

2. Defined Universal Family of Hash Functions

3. Universal family => Hash Table ops are expected O(1) time

4. Universal families exist


45

Bloom Filters

Randomized Data Structure

Implementing a limited version of Dictionary ADT: Insert, Lookup

Applications

Website caches for ISPs

Compared to Hash Tables:

Pros

More Space Efficient: no pointers to actual objects inserted

Cons

No Deletes

Not Always Correct Output to Lookup(x) => false positives


46

Same Setup As Hash Tables

Universe U of all possible elements

All possible 2^32 IP addresses

Maintain a subset S ⊆ U

|S|=m and |U| >> m


47

Bloom Filters

A Bloom Filter consists of:

A bit array of size n, initially all 0 (not buckets)

k hash functions h_1, …, h_k

Space cost per element = n/m bits

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0


48

Insertions

Insert(a): set all bits h_i(a) to 1 => O(k)

Let k = 3

Initially: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Insert x: h_1(x) = 2, h_2(x) = 9, h_3(x) = 0

1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0

Insert y: h_1(y) = 1, h_2(y) = 5, h_3(y) = 9

1 1 1 0 0 1 0 0 0 1 0 0 0 0 0 0

Insert z: h_1(z) = 10, h_2(z) = 11, h_3(z) = 5

1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0

Do you see why there would be false positives?


49

Lookup

Lookup(a): return true if all bits h_i(a) are 1 => O(k)

Bits: 1 1 1 0 0 1 0 0 0 1 1 1 0 0 0 0

Index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15

x: h_1(x) = 2, h_2(x) = 9, h_3(x) = 0 => Lookup(x) = true

w: h_1(w) = 3, h_2(w) = 9, h_3(w) = 4 => Lookup(w) = false

t: h_1(t) = 0, h_2(t) = 1, h_3(t) = 2 => Lookup(t) = true (a false positive: t was never inserted)
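
Insert and Lookup can be sketched as below. The slides do not say how h_1, …, h_k are obtained; as an assumption for this example, k salted SHA-256 digests from Python's `hashlib` stand in for them. The class name is hypothetical.

```python
import hashlib


class BloomFilter:
    """Bit array of size n with k hash functions; supports insert and lookup only."""

    def __init__(self, n_bits, k):
        self.n = n_bits
        self.k = k
        self.bits = [0] * n_bits              # initially all 0

    def _positions(self, item):
        # Assumption: salted SHA-256 digests play the role of h_1, ..., h_k.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n

    def insert(self, item):
        for p in self._positions(item):       # set all bits h_i(item) to 1
            self.bits[p] = 1

    def lookup(self, item):
        # True if all k bits are 1; may be a false positive, never a false negative.
        return all(self.bits[p] for p in self._positions(item))


bf = BloomFilter(n_bits=1024, k=3)
for word in ("alice", "bob", "carol"):
    bf.insert(word)
```

Every inserted item is always reported as present; an item that was never inserted is reported as present only if other insertions happened to set all k of its bits.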


50

Can Bloom Filters Be Useful?

Can Bloom Filters be both space efficient and have a low false positive rate?

What is the probability of false positives as a function of n, m and k?


51

Probability of False Positive

We have inserted m elements into the Bloom filter.

A new element z arrives that has not been inserted before.

Q: What’s Pr(false positive for z)?

Assume h_1(z) = j_1, …, h_k(z) = j_k

**Simplifying (Unjustified) Assumption**: All hashing is totally random!

∀ h_i, ∀ x, h_i(x) is uniformly random from {0, …, n-1} and independent from all h_j(y) for all y.

Warning: To simplify analysis. Won’t hold in practice.


52

Pr(bit j is 1 after m insertions)?

Consider a particular bit j in the array.

Q1: Fix h_i and an element x. Pr(h_i(x) turns j to 1)?

A1: 1/n

Q2: Pr(x turns j to 1)? (Prob. that one of h_1(x), …, h_k(x) equals j?)

A2: 1 - Pr(x does not turn j to 1) = 1 - (1 - 1/n)^k

Q3: Pr(bit j = 1 after m insertions)?

A3: 1 - Pr(no element turns j to 1) = 1 - (1 - 1/n)^{km}


53

Pr(false positive for x)?

Recall for x we check k bits: h_1(x) = j_1, …, h_k(x) = j_k

Pr(bit j_i = 1) = 1 - (1 - 1/n)^{km}

Pr(false positive) = Pr(all j_i = 1) = (1 - (1 - 1/n)^{km})^k

Recall Calculus fact: 1 + x ≤ e^x

From the same fact: around x = 0, 1 + x ≈ e^x

With x = -1/n: (1 - 1/n)^{km} ≈ e^{-km/n}, so

Pr(false positive) ≈ (1 - e^{-km/n})^k
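
The two expressions can be compared numerically. The sketch below evaluates both under the slides' simplifying random-hashing assumption; the function names and the parameter values n = 8000, m = 1000, k = 5 are arbitrary choices for illustration.

```python
import math


def fp_random_hashing(n, m, k):
    """(1 - (1 - 1/n)^(km))^k, the expression derived under the random-hashing assumption."""
    return (1 - (1 - 1 / n) ** (k * m)) ** k


def fp_approx(n, m, k):
    """(1 - e^(-km/n))^k, using 1 + x ≈ e^x with x = -1/n."""
    return (1 - math.exp(-k * m / n)) ** k


n, m, k = 8000, 1000, 5
gap = abs(fp_random_hashing(n, m, k) - fp_approx(n, m, k))
```

For these values both expressions give roughly 2.2%, and they differ only far beyond the digits that matter in practice.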


54

How Does Failure Rate Change With k, n?

Observation 1: as n increases, failure rate decreases.

Observation 2: as k increases (the # hash functions)

more bits to check => less likely to fail

more bits/object => more likely to fail

unclear if it increases or decreases

Question: What’s the optimal k for fixed n/m?

Answer (by taking derivatives): k = ln(2) n/m ≈ 0.69 * n/m

Failure rate = (1/2)^{ln(2) n/m} ≈ 0.6185^{n/m}


55

How Does Failure Rate Change With k, n?

For fixed n/m, with optimal k = ln(2) n/m

Failure rate: (1/2)^{ln(2) n/m} ≈ 0.6185^{n/m}

Already at n = 8m, rate is about 2%.

Exponentially decreases with n/m.
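
The optimal k and the resulting rate at n = 8m can be reproduced with a few lines. This is a sketch using the approximation (1 - e^{-km/n})^k; the function names are hypothetical, and since a real filter needs an integer number of hash functions, the example also searches over integer k.

```python
import math


def optimal_k(bits_per_element):
    """k = ln(2) * n/m, the minimizer of the approximate failure rate."""
    return math.log(2) * bits_per_element


def failure_rate(bits_per_element, k):
    """(1 - e^(-k m/n))^k, written in terms of n/m."""
    return (1 - math.exp(-k / bits_per_element)) ** k


ratio = 8                                      # n = 8m
k_star = optimal_k(ratio)                      # about 5.5
rate_at_k_star = failure_rate(ratio, k_star)   # equals (1/2)^k_star
best_integer_k = min(range(1, 17), key=lambda k: failure_rate(ratio, k))
```

At 8 bits per element the rate is a little above 2%, and rounding k to the nearest useful integer changes it only marginally.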


56

Next Week: Dynamic Programming