Hash tables - cs.bgu.ac.ilds202/wiki.files/hash.pdf · Open addressing Hash tables. Hash table with...

Post on 14-Jul-2020

13 views 0 download

Transcript of Hash tables - cs.bgu.ac.ilds202/wiki.files/hash.pdf · Open addressing Hash tables. Hash table with...

Hash tables

Hash tables

Basic Probability Theory

Two events A,B are independent if

Pr[A ∩ B] = Pr[A] · Pr[B]

Conditional probability:

Pr[A|B] =Pr[A ∩ B]

Pr[B]

The expectation of a (discrete) random variable X is

E [X ] =∑k

k · Pr[X = k]

In particular, if the values of X are 0 or 1, thenE [X ] = Pr[X = 1].

Hash tables

Basic Probability Theory

Union bound: For events A1, . . . ,Ak ,

Pr

[k⋃

i=1

Ai

]≤

k∑i=1

Pr[Ai ]

Linearity of expectation: For random variables X1, . . . ,Xk ,

E

[k∑

i=1

Xi

]=∑i=1

E [Xi ]

Hash tables

Dictionary

Definition

A dictionary is a data-structure that stores a set S ofelements, where each element x has a field x .key, andsupports the following operations:

Search(S , k) Return an element x ∈ S with x .key = k

Insert(S , x) Add x to S .

Delete(S , x) Delete x from S .

We will assume that

The keys are from U = {0, . . . , u − 1}.The keys are distinct.

Hash tables

Direct addressing

S is stored in an array T [0..m − 1]. The entry T [k]contains a pointer to the element with key k , if suchelement exists, and NULL otherwise.

Search(T , k): return T [k].

Insert(T , x): T [x .key]← x .

Delete(T , x): T [x .key]← NULL.

What is the problem with this structure?

0123456789

23

5

8

keysatellitedata

1

94

07

6

25

38

U

S

Hash tables

Hashing

In order to reduce the space complexity of directaddressing, map the keys to a smaller range{0, . . .m − 1} using a hash function.

There is a problem of collisions (two or more keys thatare mapped to the same value).

There are two ways to handle collisions:

ChainingOpen addressing

Hash tables

Hash table with chaining

Let h : U → {0, . . . ,m − 1} be a hash function (m < u).

S is stored in a table T [0..m − 1] of linked lists. Theelement x ∈ S is stored in the list T [h(x .key)].

Search(T , k): Search the list T [h(k)].

Insert(T , x): Insert x at the head of T [h(x .key)].

Delete(T , x): Delete x from T [h(x .key)].

19 9

6 26S={6,9,19,26,30}

m=5, h(x)=x mod 5

30

Hash tables

Analysis

Assumption of simple uniform hashing: any element isequally likely to hash into any of the m slots,independently of where other elements have hashed into.

The above assumption is true when the keys are chosenuniformly and independently at random (withrepetitions), and the hash function satisfies|{k ∈ U : h(k) = i}| = u/m for every i ∈ {0, . . . ,m− 1}.We want to analyze the performance of hashing under theassumption of simple uniform hashing. This is the ballsinto bins problem.

Suppose we randomly place n balls into m bins. Let X bethe number of balls in bin 1.

The time complexity of a random search in a hash table isΘ(1 + X ).

Hash tables

The expectation of X

Let α = n/m.

Claim

E [X ] = α.

Proof.

Let Ij be a random variable which is 1 if the j-th ball is inbin 1, and 0 otherwise. X =

∑nj=1 Ij , so

E [X ] = E [n∑

j=1

Ij ] =n∑

j=1

E [Ij ] =n∑

j=1

Pr[Ij = 1] = n · 1

m

Hash tables

The distribution of X

Claim

Pr[X = r ] ≈ e−ααr

r !. (i.e., X has approximately Poisson

distribution).

Proof.

Pr[X = r ] =

(n

r

)(1

m

)r (1− 1

m

)n−r

=n(n − 1) · · · (n − r + 1)

r !

1

mr

(1− 1

m

)n−r

If m and n are large, n(n − 1) · · · (n − r + 1) ≈ nr and(1− 1

m

)n−r ≈ e−n/m. Thus, Pr[X = r ] ≈ e−n/m(n/m)r

r !.

Hash tables

The maximum number of balls in a bin

Let Y be the maximum number of balls in some bin.

Claim

For m = n, Pr[Y ≥ R] ≤ 1n1.5 , where R = e ln n/ ln ln n.

Proof.

Let Xi be the number of balls in bin i .

Pr[Y ≥ r ] ≤n∑

i=1

Pr[Xi ≥ r ] = n · Pr[X ≥ r ]

Pr[X ≥ r ] ≤(n

r

)(1

n

)r

≤ nr

r !· 1

nr=

1

r !≤(er

)rwhere the last inequality is true since er =

∑∞i=0

r i

i!≥ r r

r !.

Hash tables

The distribution of maxi Xi

Proof (continued).

Pr[Y ≥ R] ≤ n ·( eR

)R= n

(ln ln n

ln n

)e ln n/ ln ln n

= e ln ne(ln ln ln n−ln ln n)·e ln n/ ln ln n

= e−(e−1) ln n+ln n· e ln ln ln nln ln n

≤ e−1.5 ln n =1

n1.5.

Hash tables

Rehashing

We wish to maintain n = O(m) in order to have Θ(1)search time.

This can be achieved by rehashing. Suppose we wantn ≤ m. When the table has m elements and a newelement is inserted, create a new table of size 2m andcopy all elements into the new table.

The cost of rehashing is Θ(n).

Hash tables

Universal hash functions

Definition

A collection H of hash functions is a universal if for every pairof distinct keys x , y ∈ U , Prh∈H [h(x) = h(y)] ≤ 2

m.

Example

Let p be a prime number larger than u.fa,b(x) = ((ax + b) mod p) mod mHp,m = {fa,b|a ∈ {1, 2, . . . , p − 1}, b ∈ {0, 1, . . . , p − 1}}

Universal hash functions

Theorem

Suppose that H is a universal collection of hash functions.If a hash table for S is built using a randomly chosen h ∈ H,then for every k ∈ U, the expected time of Search(S , k) isΘ(1 + n/m).

Proof.

Let X = length of T [h(k)].X =

∑y∈S Iy where Iy = 1 if h(y .key) = h(k) and Iy = 0

otherwise.

E [X ] = E

[∑y∈S

Iy

]=∑y∈S

E [Iy ] =∑y∈S

Prh∈H

[h(y .key) = h(k)]

≤ 1 + n · 2

m.

Hash tables

Time complexity of chaining

Under the assumption of simple uniform hashing, theexpected time of a search is Θ(1 + α) time.

If α = Θ(1), and under the assumption of simple uniformhashing, the worst case time of a search isΘ(log n/ log log n), with probability at least 1− 1/nΘ(1).

If the hash function is chosen from a universal collectionat random, the expected time of a search is Θ(1 + α).

The worst case time of insert is Θ(1) if there is norehashing.

Hash tables

Interpreting keys as natural numbers

How can we convert floats or ASCII strings to naturalnumbers?

An ASCII string can be interpreted as a number in base128.

Example

For the string CLRS, the ASCII values of the characters areC = 67, L = 76, R = 82, S = 83. So CLRS is(67·1283)+(76·1282)+(82·1281)+(83·1280) = 141, 764, 947.

Hash tables

Horner’s rule

Horner’s rule:

adxd + ad−1x

d−1 + · · ·+ a1x1 + a0 =

(· · · ((adx + ad−1)x + ad−2)x + · · · )x + a0.

Code:y = adfor i = d − 1 to 0

y = ai + xy

If d is large the value of y is too big.

Solution: evaluate the polynomial modulo p:y = adfor i = d − 1 to 0

y = (ai + xy) mod p

Hash tables

Rabin-Karp pattern matching algorithm

Suppose we are given strings P and T , and we want tofind all occurences of P in T .The Rabin-Karp algorithm is as follows:

Compute h(P)for every substring T ′ of T of length |P |

if h(T ′) = h(P) check whether T ′ = P .The values h(T ′) for all T ′ can be computed in Θ(|T |)time using rolling hash.

Example

Let T = BGUCS and P = GUC. Let T1 = BGU, T2 = GUC.

h(T1) = (66 · 1282 + 71 · 128 + 85) mod p

h(T2) = (71 · 1282 + 85 · 128 + 67) mod p

= (h(T1)− 66 · 1282) · 128 + 67 mod p

Hash tables

Applications

Data deduplication: Suppose that we have many files, andsome files have duplicates. In order to save storage space,we want to store only one instance of each distinct file.

Distributed storage: Suppose we have many files, and wewant to store them on several servers.

Hash tables

Distributed storage

Let m denote the number of servers.

The simple solution is to use a hash functionh : U → {1, . . . ,m}, and assign file x to server h(x).

The problem with this solution is that if we add a server,we need to do rehashing which will move most filesbetween servers.

Hash tables

Distributed storage: Consistent hashing

Suppose that the server have identifiers s1, . . . , sm.

Let h : U → [0, 1] be a hash function.

For each server i associate a point h(si) on the unit circle.

For each file f , assign f to the server whose point is thefirst point encountered when traversing the unit cycleanti-clockwise starting from h(f ).

0

server 1

server 3

server 2

Hash tables

Distributed storage: Consistent hashing

Suppose that the server have identifiers s1, . . . , sm.

Let h : U → [0, 1] be a hash function.

For each server i associate a point h(si) on the unit circle.

For each file f , assign f to the server whose point is thefirst point encountered when traversing the unit cycleanti-clockwise starting from h(f ).

0

server 1

server 3

server 2

Hash tables

Distributed storage: Consistent hashing

When a new server m + 1 is added, let i be the serverwhose point is the first server point after h(sm+1).

We only need to reassign some of the files that wereassigned to server i .

The expected number of files reassignments is n/(m + 1).

0

server 1

server 3

server 2

server 4

Hash tables

Linear probing

In the following, we assume that theelements in the hash table are keys withno satellite information.

To insert an element k , try to insert toT [h(k)].If T [h(k)] is not empty, tryT [h(k) + 1 mod m], then tryT [h(k) + 2 mod m] etc.

Code:for i = 0, . . . ,m − 1

j = h(k) + i mod mif T [j ] = NULL OR ...

T [j ] = kreturn

error “hash table overflow”

0123456789

h(k)=k mod 10

Hash tables

Linear probing

In the following, we assume that theelements in the hash table are keys withno satellite information.

To insert an element k , try to insert toT [h(k)].If T [h(k)] is not empty, tryT [h(k) + 1 mod m], then tryT [h(k) + 2 mod m] etc.

Code:for i = 0, . . . ,m − 1

j = h(k) + i mod mif T [j ] = NULL OR ...

T [j ] = kreturn

error “hash table overflow”

0123456789

14

insert(T,14)

Hash tables

Linear probing

In the following, we assume that theelements in the hash table are keys withno satellite information.

To insert an element k , try to insert toT [h(k)].If T [h(k)] is not empty, tryT [h(k) + 1 mod m], then tryT [h(k) + 2 mod m] etc.

Code:for i = 0, . . . ,m − 1

j = h(k) + i mod mif T [j ] = NULL OR ...

T [j ] = kreturn

error “hash table overflow”

0123456789

14

20

insert(T,20)

Hash tables

Linear probing

In the following, we assume that theelements in the hash table are keys withno satellite information.

To insert an element k , try to insert toT [h(k)].If T [h(k)] is not empty, tryT [h(k) + 1 mod m], then tryT [h(k) + 2 mod m] etc.

Code:for i = 0, . . . ,m − 1

j = h(k) + i mod mif T [j ] = NULL OR ...

T [j ] = kreturn

error “hash table overflow”

0123456789

14

20

34

insert(T,34)

Hash tables

Linear probing

In the following, we assume that theelements in the hash table are keys withno satellite information.

To insert an element k , try to insert toT [h(k)].If T [h(k)] is not empty, tryT [h(k) + 1 mod m], then tryT [h(k) + 2 mod m] etc.

Code:for i = 0, . . . ,m − 1

j = h(k) + i mod mif T [j ] = NULL OR ...

T [j ] = kreturn

error “hash table overflow”

0123456789

14

20

34

19insert(T,19)

Hash tables

Linear probing

In the following, we assume that theelements in the hash table are keys withno satellite information.

To insert an element k , try to insert toT [h(k)].If T [h(k)] is not empty, tryT [h(k) + 1 mod m], then tryT [h(k) + 2 mod m] etc.

Code:for i = 0, . . . ,m − 1

j = h(k) + i mod mif T [j ] = NULL OR ...

T [j ] = kreturn

error “hash table overflow”

0123456789

14

20

34

19

55

insert(T,55)

Hash tables

Linear probing

In the following, we assume that theelements in the hash table are keys withno satellite information.

To insert an element k , try to insert toT [h(k)].If T [h(k)] is not empty, tryT [h(k) + 1 mod m], then tryT [h(k) + 2 mod m] etc.

Code:for i = 0, . . . ,m − 1

j = h(k) + i mod mif T [j ] = NULL OR ...

T [j ] = kreturn

error “hash table overflow”

0123456789

14

20

34

19

55

49

insert(T,49)

Hash tables

Insert

How to perform Search?

0123456789

14

20

34

19

55

49

search(T,55)

Hash tables

Insert

Search(T , k) is performed as follows:for i = 0, . . . ,m − 1

j = h(k) + i mod mif T [j ] = k

return jif T [j ] = NULL

return NULLreturn NULL

0123456789

14

20

34

19

55

49

search(T,25)

Hash tables

Delete

Delete method 1: To delete element k , store inT [h(k)] a special value DELETED.

0123456789

14

20

19

55

49

delete(T,34)

D

Hash tables

Delete

Delete method 2: Erase k from the table (re-place it by NULL) and also erase all the elementsin the block of k . Then, reinsert the latter ele-ments to the table.

0123456789

14

20

34

19

55

49

delete(T,34)

Hash tables

Delete

Delete method 2: Erase k from the table (re-place it by NULL) and also erase all the elementsin the block of k . Then, reinsert the latter ele-ments to the table.

0123456789

2049

19

Hash tables

Delete

Delete method 2: Erase k from the table (re-place it by NULL) and also erase all the elementsin the block of k . Then, reinsert the latter ele-ments to the table.

0123456789

2049

19

1455

Hash tables

Open addressing

Open addressing is a generalization oflinear probing.

Leth : U × {0, . . .m − 1} → {0, . . .m − 1}be a hash function such that{h(k , 0), h(k , 1), . . . h(k ,m − 1)} is apermutation of {0, . . .m − 1} for everyk ∈ U .

The slots examined during search/insertare h(k , 0), then h(k , 1), h(k , 2) etc.

In the example on the right,h(k , i) = (h1(k) + ih2(k)) mod 13 whereh1(k) = k mod 13h2(k) = 1 + (k mod 11)

0123456789

101112

79

6998

72

50

14

insert(T,14)Hash tables

Open addressing

Insertion is the same as in linear probing:for i = 0, . . . ,m − 1

j = h(k , i)if T [j ] = NULL OR T [j ] = DELETED

T [j ] = kreturn

error “hash table overflow”

Deletion is done using delete method 1 defined above(using special value DELETED).

Hash tables

Double hashing

In the double hashing method,

h(k , i) = (h1(k) + ih2(k)) mod m

for some hash functions h1 and h2.The value h2(k) must be relatively prime to m for theentire hash table to be searched. This can be ensured byeither

Taking m to be a power of 2, and the image of h2

contains only odd numbers.Taking m to be a prime number, and the image of h2

contains integers from {1, . . . ,m − 1}.For example,

h1(k) = k mod m

h2(k) = 1 mod m′

where m is prime and m′ < m.

Hash tables

Time complexity of open addressing

Assume uniform hashing: the probe sequence of each keyis equally likely to be any of the m! permutations of{0, . . .m − 1}.Assuming uniform hashing and no deletions, the expectednumber of probes in a search is

At most 11−α for unsuccessful search.

At most 1α ln 1

1−α for successful search.

Hash tables