Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table...

73
Hash tables Hash tables

Transcript of Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table...

Page 1: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash tables

Hash tables

Page 2: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Dictionary

Definition

A dictionary is a data structure that stores a set of elementswhere each element has a unique key, and supports thefollowing operations:

Search(S , k) Return the element whose key is k .

Insert(S , x) Add x to S .

Delete(S , x) Remove x from S

Assume that the keys are from U = 0, . . . , u − 1.

Hash tables

Page 3: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Direct addressing

S is stored in an array T [0..u − 1]. The entry T [k]contains a pointer to the element with key k , if suchelement exists, and NULL otherwise.

Search(T , k): return T [k].

Insert(T , x): T [x .key]← x .

Delete(T , x): T [x .key]← NULL.

What is the problem with this structure?0123456789

23

5

8

keysatellitedata

1

94

07

6

25

38

U

S

Hash tables

Page 4: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hashing

In order to reduce the space complexity of directaddressing, map the keys to a smaller range0, . . .m − 1 using a hash function.

There is a problem of collisions (two or more keys thatare mapped to the same value).

There are several methods to handle collisions.

ChainingOpen addressing

Hash tables

Page 5: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash table with chaining

Let h : U → 0, . . . ,m − 1 be a hash function (m < u).

S is stored in a table T [0..m − 1] of linked lists. Theelement x ∈ S is stored in the list T [h(x .key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from the list containing x .

S=30,9,26,19,1,6

m=5, h(x)=x mod 5

Hash tables

Page 6: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash table with chaining

Let h : U → 0, . . . ,m − 1 be a hash function (m < u).

S is stored in a table T [0..m − 1] of linked lists. Theelement x ∈ S is stored in the list T [h(x .key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from the list containing x .

S=30,9,26,19,1,6

m=5, h(x)=x mod 5

30

Hash tables

Page 7: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash table with chaining

Let h : U → 0, . . . ,m − 1 be a hash function (m < u).

S is stored in a table T [0..m − 1] of linked lists. Theelement x ∈ S is stored in the list T [h(x .key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from the list containing x .

S=30,9,26,19,1,6

m=5, h(x)=x mod 5

30

9

Hash tables

Page 8: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash table with chaining

Let h : U → 0, . . . ,m − 1 be a hash function (m < u).

S is stored in a table T [0..m − 1] of linked lists. Theelement x ∈ S is stored in the list T [h(x .key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from the list containing x .

S=30,9,26,19,1,6

m=5, h(x)=x mod 5

9

26

30

Hash tables

Page 9: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash table with chaining

Let h : U → 0, . . . ,m − 1 be a hash function (m < u).

S is stored in a table T [0..m − 1] of linked lists. Theelement x ∈ S is stored in the list T [h(x .key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from the list containing x .

S=30,9,26,19,1,6

m=5, h(x)=x mod 5

19 9

26

30

Hash tables

Page 10: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash table with chaining

Let h : U → 0, . . . ,m − 1 be a hash function (m < u).

S is stored in a table T [0..m − 1] of linked lists. Theelement x ∈ S is stored in the list T [h(x .key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from the list containing x .

S=30,9,26,19,1,6

m=5, h(x)=x mod 5

19 9

261

30

Hash tables

Page 11: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash table with chaining

Let h : U → 0, . . . ,m − 1 be a hash function (m < u).

S is stored in a table T [0..m − 1] of linked lists. Theelement x ∈ S is stored in the list T [h(x .key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from the list containing x .

S=30,9,26,19,1,6

m=5, h(x)=x mod 5

19 9

6 261

30

Hash tables

Page 12: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Analysis — Worst case

The worst case running times of the operations are:

Search: Θ(n).

Insert: Θ(1).

Delete: Θ(n) (can be reduced to Θ(1) using doublylinked list).

Hash tables

Page 13: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Analysis

Assumption of simple uniform hashing: any element isequally likely to hash into any of the m slots,independently of where other elements have hashed into.

The above assumption is true when the keys are chosenuniformly and independently at random (withrepetitions), h(x) = x mod m, and u is divisible by m.

We want to analyze the performance of hashing under theassumption of simple uniform hashing. This is the ballsinto bins problem.

Suppose we randomly place n balls into m bins. Let X bethe number of balls in bin 0.

The expected time of an unsuccessful search in a hashtable is Θ(1 + E [X ]).

Hash tables

Page 14: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

The expectation of X

Each ball has probability 1/m to be in bin 0.

The random variable X has binomial distribution withparameters n and 1/m.

Therefore, E [X ] = n/m.

The expected time of an unsuccessful search in a hashtable is Θ(1 + α), where α = n/m.

The value α is called the load factor.

The expected time of searching a random key from S isalso Θ(1 + α).

Hash tables

Page 15: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Example

If α = 1 and n is large:

Pr[X = 0] ≈ 0.368 Pr[X = 4] ≈ 0.015

Pr[X = 1] ≈ 0.368 Pr[X = 5] ≈ 0.003

Pr[X = 2] ≈ 0.184 Pr[X = 6] ≈ 0.0005

Pr[X = 3] ≈ 0.061 Pr[X = 7] ≈ 0.00007

Hash tables

Page 16: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

The maximum number of balls in a bin

Suppose that n = m = 106. Most bins contain 0–3 balls.

The probability that a specific bin contains at least 8 ballsis ≈ 0.000009.

However, it is very likely that some bin will contain 8balls.

Theorem

If m ∈ Θ(n) then the size of the largest bin isΘ(log n/ log log n) with probability at least 1− 1/nΘ(1).

Hash tables

Page 17: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Universal hash functions

Random keys + fixed hash function (e.g. mod)⇒ The hashed keys are random numbers.

A fixed set of keys + random hash function (selectedfrom a universal set)⇒ The hashed keys are semi-random numbers.

Hash tables

Page 18: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Universal hash functions

Definition

A collection H of hash functions is a universal if for every pairof distinct keys x , y ∈ U , Prh∈H[h(x) = h(y)] ≤ 1

m.

Example

Let p be a prime number larger than u.fa,b(x) = ((ax + b) mod p) mod mHp,m = fa,b|a ∈ 1, 2, . . . , p − 1, b ∈ 0, 1, . . . , p − 1

Page 19: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Universal hash functions

Theorem

Suppose that H is a universal collection of hash functions.If a hash table for S is built using a randomly chosen h ∈ H,then for every k ∈ U, the expected length of T [h(k)] is atmost 1 + n/m.

Proof.

Let X = length of T [h(k)].X =

∑y∈S Iy where Iy = 1 if h(y .key) = h(k) and Iy = 0

otherwise.

E [X ] = E

[∑y∈S

Iy

]=

∑y∈S

E [Iy ] =∑y∈S

Prh∈H

[h(y .key) = h(k)]

If k /∈ S , E [X ] ≤ n · 1m

. Otherwise, E [X ] ≤ 1 + (n − 1) 1m

.

Hash tables

Page 20: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Rehashing

We wish to maintain n ∈ O(m) in order to have Θ(1)expected search time.

This can be achieved by rehashing. Suppose we wantn ≤ m. When the table has m elements and a newelement is inserted, create a new table of size 2m andcopy all elements into the new table.

19 34

6 21

h(x)=x mod 5

30

34

21

6

30

h'(x)=x mod 10

19

Hash tables

Page 21: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Rehashing

We wish to maintain n ∈ O(m) in order to have Θ(1)expected search time.

This can be achieved by rehashing. Suppose we wantn ≤ m. When the table has m elements and a newelement is inserted, create a new table of size 2m andcopy all elements into the new table.The cost of rehashing is Θ(n).

Hash tables

Page 22: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Rehashing

In order to keep the space usage Θ(n), we needn ∈ Ω(m).

We can keep α ∈ [ 14, 1]. If an insert would make α > 1,

create a new table of size 2m. If a delete would makeα < 1

4, create a new table of size m/2.

Hash tables

Page 23: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Time complexity of chaining

Under the assumption of simple uniform hashing, theexpected time of a search is Θ(1 + α) time.

If α = Θ(1), and under the assumption of simple uniformhashing, the worst case time of a search isΘ(log n/ log log n), with probability at least 1− 1/nΘ(1).

If the hash function is chosen from a universal collectionat random, the expected time of a search is Θ(1 + α).

The worst case time of insert is Θ(1) if there is norehashing.

Using doubly linked lists, the worst case time of delete isΘ(1).

Hash tables

Page 24: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

h(k)=k mod 10

Hash tables

Page 25: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

insert(T,14)

Hash tables

Page 26: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

insert(T,20)

Hash tables

Page 27: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

34

insert(T,34)

Hash tables

Page 28: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

34

19insert(T,19)

Hash tables

Page 29: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

34

19

55

insert(T,55)

Hash tables

Page 30: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Linear probing

In the following, we assume that the elementsin the hash table are keys with no satelliteinformation.

To insert element k , try inserting to T [h(k)].If T [h(k)] is not empty, try T [h(k) + 1 mod m],then try T [h(k) + 2 mod m] etc.

Insert(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = NULL(4) T [j ]← k(5) return(6) j ← j + 1 mod m(7) error “hash table overflow”

0123456789

14

20

34

19

55

49

insert(T,49)

Hash tables

Page 31: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Searching

Search(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = k(4) return j(5) if T [j ] = NULL(6) return NULL(7) j ← j + 1 mod m(8) return NULL

0123456789

14

20

34

19

55

49

search(T,55)Hash tables

Page 32: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Searching

Search(T , k)

(1) j ← h(k)(2) for i = 0 to m − 1(3) if T [j ] = k(4) return j(5) if T [j ] = NULL(6) return NULL(7) j ← j + 1 mod m(8) return NULL

0123456789

14

20

34

19

55

49

search(T,25)Hash tables

Page 33: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Deletion

If we want to delete the element T [i ], we cannotwrite NULL into T [i ].

0123456789

14

20

34

19

55

49

23

delete 14

Hash tables

Page 34: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Deletion

If we want to delete the element T [i ], we cannotwrite NULL into T [i ].

0123456789

20

34

19

55

49

23

delete 14

Hash tables

Page 35: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Deletion

Delete method 1: To delete element T [i ], storein T [i ] a special value DELETED.

Change line (3) in Insert to:

if T [j ] = NULL OR T [j ] = DELETED

0123456789

20

34

19

55

49

23

delete 14

D

Hash tables

Page 36: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Deletion

Delete method 2: Erase T [i ] from the table (re-place it by NULL) and also erase the consecutiveblock of elements following T [i ]. Then, reinsertthe latter elements to the table.

0123456789

14

20

34

19

55

49

23

delete 14

Hash tables

Page 37: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Deletion

Delete method 2: Erase T [i ] from the table (re-place it by NULL) and also erase the consecutiveblock of elements following T [i ]. Then, reinsertthe latter elements to the table.

0123456789

20

19

49

23

delete 14

Hash tables

Page 38: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Deletion

Delete method 2: Erase T [i ] from the table (re-place it by NULL) and also erase the consecutiveblock of elements following T [i ]. Then, reinsertthe latter elements to the table.

0123456789

20

19

49

23

delete 14

34

Hash tables

Page 39: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Deletion

Delete method 2: Erase T [i ] from the table (re-place it by NULL) and also erase the consecutiveblock of elements following T [i ]. Then, reinsertthe latter elements to the table.

0123456789

20

19

49

23

delete 14

3455

Hash tables

Page 40: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Time complexity of linear probing

Assuming simple uniform hashing and that α ≤ 1− ε for somefixed constant ε > 0:

The expected time of search is Θ(1).

The expected time of insert/delete without rehashing isΘ(1).

Hash tables

Page 41: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Open addressing

Open addressing is a generalization oflinear probing.

Leth : U × 0, . . .m − 1 → 0, . . .m − 1be a hash function such thath(k , 0), h(k , 1), . . . h(k ,m − 1) is apermutation of 0, . . .m − 1 for everyk ∈ U .

The slots examined during search/insertare h(k , 0), then h(k , 1), h(k , 2) etc.

In the example on the right,h(k , i) = (h1(k) + ih2(k)) mod 13 whereh1(k) = k mod 13h2(k) = 1 + (k mod 11)

0123456789

101112

79

6998

72

50

14

insert(T,14)

h(14,0)

h(14,1)

h(14,2)

Hash tables

Page 42: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Open addressing

Insertion is the same as in linear probing:Insert(T , k)

(1) for i = 0 to m − 1(2) j ← h(k , i)(3) if T [j ] = NULL OR T [j ] = DELETED(4) T [j ]← k(5) return(6) error “hash table overflow”

Deletion is done using delete method 1 defined above(using special value DELETED).

Hash tables

Page 43: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Double hashing

In the double hashing method,

h(k , i) = (h1(k) + ih2(k)) mod m

for some hash functions h1 and h2.The value h2(k) must be relatively prime to m for theentire hash table to be searched. This can be ensured byeither

Taking m to be a power of 2, and the image of h2

contains only odd numbers.Taking m to be a prime number, and the image of h2

contains integers from 1, . . . ,m − 1.For example,

h1(k) = k mod m

h2(k) = 1 + (k mod m′)

where m is prime and m′ < m.Hash tables

Page 44: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Comparison of hash table methods – Search

8 10 12 14 16 18 20 220

50

100

150

200

250

300

log n

Clo

ck C

ycle

s

Successful Lookup

CuckooTwo−Way ChainingChained HashingLinear Probing

Hash tables

Page 45: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Comparison of hash table methods – Search

8 10 12 14 16 18 20 220

50

100

150

200

250

300

log n

Clo

ck C

ycle

s

Unsuccessful Lookup

CuckooTwo−Way ChainingChained HashingLinear Probing

Hash tables

Page 46: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Comparison of hash table methods – Insert

8 10 12 14 16 18 20 220

50

100

150

200

250

300

350

400

450

log n

Clo

ck C

ycle

s

Insert

CuckooTwo−Way ChainingChained HashingLinear Probing

Hash tables

Page 47: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Comparison of hash table methods – Delete

8 10 12 14 16 18 20 220

50

100

150

200

250

300

350

log n

Clo

ck C

ycle

s

Delete

CuckooTwo−Way ChainingChained HashingLinear Probing

Hash tables

Page 48: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash function for strings

A string can be viewed as a sequence of integer, whereeach character is replaced by its ASCII value.

For example, the string ABXY is the sequence65,66,88,89.

To hash a sequence of integers x0, . . . , xr−1 from1, . . . , u, we use a prime number p > u. We select arandom number z ∈ 0, . . . , p − 1 and use the functionhz(x0, . . . , xr−1) = (x0z0 + x1z1 + · · ·+ xr−1z r−1) mod p.

For example h300(65, 66, 88, 89) =(65 + 66 · 300 + 88 · 3002 + 89 · 3003) mod p.

Hash tables

Page 49: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Hash function for strings

We definedhz(x0, . . . , xr−1) = (x0z0 + x1z1 + · · ·+ xr−1z r−1) mod p

Theorem

Given two different sequences x0, . . . , xr−1 and y0, . . . , yr−1

over 1, . . . , u,Pr[hz(x0, . . . , xr−1) = hz(y0, . . . , yr−1)] ≤ (r − 1)/p when z ischosen randomly from 0, . . . , p − 1.

Proof.

A non-trivial polynomial of degree d over the field Zp has atmost d roots. Therefore, the polynomialQ(z) = hz(x0, . . . , xr−1)− hz(y0, . . . , yr−1) =(x0 − y0)z0 + · · ·+ (xr−1 − yr−1)z r−1

has at most r − 1 roots over the field Zp.The prob. of choosing z to be a root of Q is ≤ (r − 1)/p.

Hash tables

Page 50: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Applications of hash functions

Applications of hash functions

Page 51: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Applications of hash functions

Data deduplication: Suppose that we have many files, andsome files have duplicates. In order to save storage space,we want to store only one instance of each distinct file.

Rabin-Karp pattern matching algorithm To find theoccurrences of a string P of length m in a string T ,compute a hash function on every substring of T oflength m and compare with h(P).

Distributed storage: Storing files on several servers.

Applications of hash functions

Page 52: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Rabin-Karp pattern matching algorithm

Suppose we are given strings P and T , and we want tofind all occurrences of P in T .

The Rabin-Karp algorithm is as follows:Compute h(P)for every substring T ′ of T of length |P |

if h(T ′) = h(P) check whether T ′ = P .

The values h(T ′) for all T ′ can be computed in Θ(|T |)time using rolling hash.

Example

Let T = CABXYZ, P = ABXY.

h300(CABX ) = (67 + 65 · 300 + 66 · 3002 + 88 · 3003) mod p

h300(ABXY ) = (65 + 66 · 300 + 88 · 3002 + 89 · 3003) mod p

= ((h300(CABX )− 67)/300 + 89 · 3003) mod p

Applications of hash functions

Page 53: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Distributed storage

Suppose we have many files, and we want to store themon several servers.

Let m denote the number of servers.

The simple solution is to use a hash functionh : U → 1, . . . ,m, and assign file x to server h(x).

The problem with this solution is that if we add orremove a server, we need to do rehashing which will movemost files between servers.

Applications of hash functions

Page 54: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Distributed storage: Consistent hashing

Suppose that the server have identifiers s1, . . . , sm.

Let h : U → [0, 1] be a hash function.

For each server i associate a point h(si) on the unit circle.

For each file f , assign f to the server whose point is thefirst point encountered when traversing the unit cycleanti-clockwise starting from h(f ).

0

server 1

server 3

server 2

Applications of hash functions

Page 55: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Distributed storage: Consistent hashing

Suppose that the server have identifiers s1, . . . , sm.

Let h : U → [0, 1] be a hash function.

For each server i associate a point h(si) on the unit circle.

For each file f , assign f to the server whose point is thefirst point encountered when traversing the unit cycleanti-clockwise starting from h(f ).

0

server 1

server 3

server 2

Applications of hash functions

Page 56: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Distributed storage: Consistent hashing

When a new server m + 1 is added, let i be the serverwhose point is the first server point after h(sm+1).

We only need to reassign some of the files that wereassigned to server i .

The expected number of files reassignments is n/(m + 1).

0

server 1

server 3

server 2

server 4

Applications of hash functions

Page 57: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Bloom Filter

Bloom Filter

Page 58: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Blocking malicious web pages

Suppose we want to implement blocking of malicious webpages in a web browser.

Assume we have a list of 10,000,000 malicious pages.

We can store all pages in a hash table, but this requires alarge amount of memory.

Bloom filter is a solution for the problem that uses smallamount of memory.

Bloom filter stores a set of elements and supportMembership and Insert operations.

For an element that is in the set, the Membershipoperation always returns a positive answer.

For an element that is not in the set, the Membershipoperation can return either a negative answer, or apositive answer (false positive).

Bloom Filter

Page 59: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Blocking malicious web pages

Suppose that the list of malicious pages is stored in aBloom filter.

For a query URL, if the answer is negative, we know thatthe URL is not a malicious page.

If the answer is positive, we do not know the correctanswer. In this case, we get the answer from a server.

We want a small probability for a false positive.

Bloom Filter

Page 60: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

0

Bloom Filter

Page 61: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01

Bloom Filter

Page 62: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

011

Bloom Filter

Page 63: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

011

Bloom Filter

Page 64: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Bloom filter — simple version

Suppose we want to store a set S .

We use an array A of size m initialized to 0.

We then choose a hash function h, and set A[h(x)]← 1for every x ∈ S .

For a query y , if A[h(y)] = 0, we know that y /∈ S .

If A[h(y)] = 1, either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

011

Bloom Filter

Page 65: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Probability of false positive

Assume simple uniform hashing: any element is equallylikely to hash into any of the m slots, independently ofwhere other elements have hashed into.

For fixed i , the probability that A[i ] = 0 isp = (1− 1/m)n.

For y /∈ S , the probability that A[h(y)] = 1 is 1− p.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

011

Bloom Filter

Page 66: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Reducing the false positive probability

We choose t hash functions h1, h2, . . . , ht , and setA[hj(x)]← 1 for j = 1, 2, . . . t, for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

0

Bloom Filter

Page 67: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Reducing the false positive probability

We choose t hash functions h1, h2, . . . , ht , and setA[hj(x)]← 1 for j = 1, 2, . . . t, for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 11

Bloom Filter

Page 68: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Reducing the false positive probability

We choose t hash functions h1, h2, . . . , ht , and setA[hj(x)]← 1 for j = 1, 2, . . . t, for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 1111

Bloom Filter

Page 69: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Reducing the false positive probability

We choose t hash functions h1, h2, . . . , ht , and setA[hj(x)]← 1 for j = 1, 2, . . . t, for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 1111

Bloom Filter

Page 70: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Reducing the false positive probability

We choose t hash functions h1, h2, . . . , ht , and setA[hj(x)]← 1 for j = 1, 2, . . . t, for every x ∈ S .

For a query y , if A[hj(y)] = 0 for some j , we know thaty /∈ S .

If A[hj(y)] = 1 for all j , either y is in S or not.

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 1111

Bloom Filter

Page 71: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Probability of false positive

For fixed i , the probability that A[i ] = 0 isp = (1− 1/m)tn.

For y /∈ S , the probability that A[h(y)] = 1 is (1− p)t .

000 0 0 0 0 0 0 00 1 2 3 4 5 6 7 8 9

01 1111

Bloom Filter

Page 72: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Optimizing the probability of false positive

For the previous example (n = 2, m = 10):

For t = 1, p = 0.81, (1− p)1 = 0.19.

For t = 2, p ≈ 0.656, (1− p)2 ≈ 0.118.

For t = 3, p ≈ 0.531, (1− p)3 ≈ 0.103.

For t = 4, p ≈ 0.430, (1− p)4 ≈ 0.105.

For t = 5, p ≈ 0.349, (1− p)5 ≈ 0.117.

Bloom Filter

Page 73: Hash tablesds202/wiki.files/08-hash.pdf · 2020-04-29 · Open addressing Hash tables. Hash table with chaining Let h : U !f0;:::;m 1gbe a hash function (m

Optimizing the probability of false positive

Let f (t) be the probability of false positive when using thash functions.

p = (1− 1/m)tn ≈ e−tn/m

f (t) = (1− p)t ≈ (1− e−tn/m)t = et·ln(1−e−tn/m)

Minimizing f (t) is equivalent to minimizingg(t) = t · ln(1− e−tn/m).dgdt

= ln(1− e−tn/m) + tnm

e−tn/m

1−e−tn/m .

The minimum of g(t) is when dg/dt = 0 =⇒ t = mn

ln 2.

The minimum of f (t) is ≈ 0.6185m/n.

Example: For m = 8n, mn

ln 2 ≈ 5.55. f (5) ≈ 0.0217,f (6) ≈ 0.0216.

Bloom Filter