COMP211slides_11

36
E.G.M. Petrakis Hashing 1 Hashing Data organization in main memory or disk sequential, binary trees, … The location of a key depends on other keys => unnecessary key comparisons to find a key Question: find key with a single comparison Hashing: the location of a record is computed using its key only Fast for random accesses - slow for range queries

description

δομές αρχείων

Transcript of COMP211slides_11

Page 1: COMP211slides_11

E.G.M. Petrakis Hashing 1

Hashing

Data organization in main memory or disksequential, binary trees, …

The location of a key depends on other keys => unnecessary key comparisons to find a key Question: find key with a single comparison Hashing: the location of a record is computed using its key only

Fast for random accesses - slow for range queries

Page 2: COMP211slides_11

E.G.M. Petrakis Hashing 2

Hash Table

Hash Function: transforms keys to array indices

n...43210

h(key)dataindex

h(key): Hash Function

m

Page 3: COMP211slides_11

E.G.M. Petrakis Hashing 3

0 4967000 1 2 8421002 3 . . . . . .

395 396 4618396 397 4957397 398 399 1286399 400 401

. .

. .

. . 990 0000990 991 0000991 992 1200992 993 0047993 994 995 9846995 996 4618996 997 4967997 998 999 0001999

position key record

h(key) = key mod 1000h(key) = key mod 1000

Page 4: COMP211slides_11

E.G.M. Petrakis Hashing 4

Good Hash Functions1.

Uniform: distribute keys evenly in space

2.

Perfect: two records cannot occupy the same location or

3.

Order preserving:Difficult to find such hash functionsProperty 2 is the most essential Most functions are no better thanh(key) = key mod mHash collision: )h(k)h(k:kk jiji =≠∃

)()(:, jiji khkhjikk ≠≠≠∃)()(:, jiji khkhjikk ≤≠≤∃

Page 5: COMP211slides_11

E.G.M. Petrakis Hashing 5

Collision Resolution

1.

Open Addressing (rehashing): compute new position to store the key in the table (no extra space)

i.

linear probingii.

double hashing

2.

Separate Chaining: lists of keys mapped to the same position (uses extra space)

Page 6: COMP211slides_11

E.G.M. Petrakis Hashing 6

Open AddressingComputes a new address to store the key if it is occupied (rehashing)

if occupied too, compute a new address, …until an empty position is found primary hash function: i=h(key)rehash function: rh(i)=rh(h(key))hash sequence: (h0,h1,h2…) = (h(key), rh(h(key)), rh(rh(h(key)))…)

To find a key follow the same hash sequence

Page 7: COMP211slides_11

E.G.M. Petrakis Hashing 7

Example

i=h(key)=key mod 100rh(i) = (i+1) mod 100

key: 193i=h(193)=93rh(i)=(93+1)=94Key 193 will occupy position 94

0 100 1 101 2 . . .

.

.

. 90 990 91 991 92 992 93 993 94 . . .

.

.

. 100

193

Page 8: COMP211slides_11

E.G.M. Petrakis Hashing 8

Problem 1: Locate Empty Positions

No empty position can be foundi.

the table is full

check on number of empty positionsii.

the hash function fails to find an empty position although the table is not full !!

i=h(key) = key mod 1000rh(i) = (i + 200) mod 1000 => checks only 5positions on a table of 1000 positionsrh(i) = (i+1) mod 1000 successive positionsrh(i) = (i+c) mod 1000 where GCD(c,m) = 1

Page 9: COMP211slides_11

E.G.M. Petrakis Hashing 9

Problem 2: Primary Clustering

Different keys that hash into different addresses compete with each other in successive rehashes

i=h(key) = key mod 100rh(i) = (i+1) mod 100keys: 1990, 1991, 1992, 1993, 1994 => 94

0 100 1 101 2 . . .

.

.

. 90 990 91 991 92 992 93 993 94 . . .

.

.

. 100

Page 10: COMP211slides_11

E.G.M. Petrakis Hashing 10

Problem 3: Secondary Clustering

Different keys which hash to the same hash value have the same rehash sequence

i=h(key) = key mod 10rh(i,j) = (i + j) mod 10

i.

key 23 : h(23) = 3rh

= 4, 6, 9, 3, …

ii.

key 13 : h(13) = 3rh

= 4, 6, 9, 3, …

0 10 1 2 3 53 4 14 5 15 6 46 7 8 9

Page 11: COMP211slides_11

E.G.M. Petrakis Hashing 11

Linear Probing

Store the key into the next free position

h0 = h(key) usually h0 = key mod mhi = (hi-1 + 1) mod m, i >= 1

0 1 301 2 22 3 102 4 452 5 35 6 7 8 9 99

S = {22, 35, 301, 99, 102, 452}

Page 12: COMP211slides_11

E.G.M. Petrakis Hashing 12

Observation 1

Different insertion sequences => different hash sequences

S1={11,3,27,99,8,50,77,22,12,31,33,40,53}=>28 probesS2={53,40,33,31,12,22,77,50,8,99,27,3,11}=> 30 probes

H(key) = key mod 13H(key) = key mod 13

number of

probes0 17 21 27 12 12 43 3 14 40 45 31 16 53 67 33 18 99 19 8 210 22 211 11 112 50 2

Page 13: COMP211slides_11

E.G.M. Petrakis Hashing 13

Observation 2Deletions are not easy:

i=h(key) = key mod 10rh(i) = (i+1) mod 10

Action: delete(65) and search(5) Problem: search will stop at theempty position and will not find 5Solution:

mark position as deleted rather than empty the marked position can be reused

0 70 1 2 12 3 33 4 14 5 55 6 65 7 75 8 85 9 5

Page 14: COMP211slides_11

E.G.M. Petrakis Hashing 14

Observation 3

Linear probing tends to create long sequences of occupied positions

the longer a sequence is, the longer it tends to become P: probability to use a position in the cluster

Βmm

1BP +=

Page 15: COMP211slides_11

E.G.M. Petrakis Hashing 15

Observation 4

Linear probing suffers from both primary and secondary clusteringSolution: double hashing

uses two hash functions h1, h2 and arehashing function rh

Page 16: COMP211slides_11

E.G.M. Petrakis Hashing 16

Double Hashing

Two hash functions and a rehashing function

primary hash function i=h1(key)= key mod msecondary hash function h2(key)rehashing function: rh(key) = (i + h2(key)) mod m

h2(m,key) is some function of m, keyhelps rh in computing random positions in the hash tableh2 is computed once for each key!

Page 17: COMP211slides_11

E.G.M. Petrakis Hashing 17

Example of Double Hashing

i.

hash function: h1(key) = key mod m

q = (key div m) mod m ii.

rehash function: rh(i, key) = (i + h2(key)) mod m

⎩⎨⎧

≠=

=0qq0q2 div m(key)h2

Page 18: COMP211slides_11

E.G.M. Petrakis Hashing 18

Example (continued)A.

m = 10, key = 23h1

(23) = 3, h2

(23) = 2rh(3,2)=(3+2) mod 10 = 5rehash sequence: 5, 7, 9, 1, …

m = 10, key = 13h1

(key)=3, h2

(13)=1, rh(3,1)=(3+1)mod10=4rehash sequence: 4, 5, 6,…

Page 19: COMP211slides_11

E.G.M. Petrakis Hashing 19

Performance of Open Addressing

Distinguish betweensuccessful andunsuccessful search

Assume a series of probes to random positions

independent eventsload factor: λ = n/mλ: probability to probe an occupied positioneach position has the same probability P=1/m

Page 20: COMP211slides_11

E.G.M. Petrakis Hashing 20

Unsuccessful Search

The hash sequence is exhausted let u be the expected number of probes u equals the expected length of the hash sequenceP(k): probability to search k positions in the hash sequence

Page 21: COMP211slides_11

E.G.M. Petrakis Hashing 21

L

L

L

L

+≥+≥

++++

+++++

+

== ∑≥

2probes)P(1probes)P(________________________

P(k)P(k)P(k)

P(3)P(3)P(3)P(2)P(2)

P(1)

kP(k)u1k

Page 22: COMP211slides_11

E.G.M. Petrakis Hashing 22

λ11u

λλ

)ocupied positions 1k P(first

probes) kP(u

1k 1k

k1κ1k

1k

−=

⇒≤

=−

=≥=

∑ ∑

≥ ≥

−≥

independent events

u increases with λ =>performance drops as

λ increases

Page 23: COMP211slides_11

E.G.M. Petrakis Hashing 23

Successful Search

The hash sequence is not exhausted the number of probes to find a key equals the number of probes s at the time the key was inserted plus 1λ was less at that time consider all values of λ

)λ1

1ln(λ111)dx(u

λ1s

λ

0 −+=+= ∫

increases with λ

u: equivalent to unsuccessful search

approximation

Page 24: COMP211slides_11

E.G.M. Petrakis Hashing 24

Performance

The performance drops as λ increasesthe higher the value of λ is, the higher the probability of collisions

Unsuccessful search is more expensive than successful search

unsuccessful search exhausts the hash sequence

Page 25: COMP211slides_11

E.G.M. Petrakis Hashing 25

Experimental ResultsSUCCESSFUL UNSUCCESSFUL

LOAD FACTOR

LINEAR i + bkey DOUBLE LINEAR i + bkey DOUBLE

25% 1.17 1.16 1.15 1.39 1.37 1.33

50% 1.50 1.44 1.39 2.50 2.19 2.00

75% 2.50 2.01 1.85 8.50 4.64 4.00

90% 5.50 2.85 2.56 50.50 11.40 10.00

95% 10.50 3.52 3.15 200.50 22.04 20.00

Page 26: COMP211slides_11

E.G.M. Petrakis Hashing 26

Performance on Full TableSUCCESSFUL

TABLE SIZE (m)

LINEAR i + bkey DOUBLE UNSUCCESSFUL LOG2m

100 6.60 4.62 4.12 50.50 6.64

500 14.35 6.22 5.72 250.50 8.97

1000 20.15 6.91 6.41 500.50 9.97

5000 44.64 8.52 8.02 2500.5 12.29

10000 63.00 9.21 8.71 5000.50 13.29

Page 27: COMP211slides_11

E.G.M. Petrakis Hashing 27

Separate Chaining

Keys hashing to the same hash value are stored in separate lists

one list per hash positioncan store more than m records easy to implementthe keys in each list can be ordered

Page 28: COMP211slides_11

E.G.M. Petrakis Hashing 28

0

1

2

3 nil

4 nil

5

6

7

8 nil

9

nil

nil

nil

nil

nil

nil

nil

40 130

91

42

75

192

87 67

49

66 16

372

227417

h(key) = key mod m

Page 29: COMP211slides_11

E.G.M. Petrakis Hashing 29

Performance of Separate Chaining

Depends on the average chain sizeinsertions are independent eventslet P(c,n,m): probability that a position has been selected c times after ninsertions on a table of size mP(c,n,m): probability that the chain has length c => binomial distribution

cncqpcnm)n,P(c, −⎟⎟⎠

⎞⎜⎜⎝

⎛=

p=1/m: success caseq=1-p: failure case

Page 30: COMP211slides_11

E.G.M. Petrakis Hashing 30

nc

cnc

m11

m11

m1cn

mn

c!1

m11

m1

cnm)n,P(c,

⎟⎠⎞

⎜⎝⎛ −⎟

⎠⎞

⎜⎝⎛ −

+−

=⎟⎠⎞

⎜⎝⎛ −⎟

⎠⎞

⎜⎝⎛⎟⎟⎠

⎞⎜⎜⎝

⎛=

λn

c

em11

1m11

λm

1cn

→⎟⎠⎞

⎜⎝⎛ −

→⎟⎠⎞

⎜⎝⎛ −

→+−

=> P(c,n,m)=(1/c!)λce-λ

Poison

∞→mn, =>

Page 31: COMP211slides_11

E.G.M. Petrakis Hashing 31

Unsuccessful Search

The entire chain is searched the average number of comparisons equals its average length u

λec!λcλ)cP(c,u λ

0c0c=== −

≥≥∑∑

Page 32: COMP211slides_11

E.G.M. Petrakis Hashing 32

Successful Search

Not the whole chain is searchedthe average number of comparisons equals the length s of the chain at time the key was inserted plus 1the performance at the time a key was inserted equals that of unsuccessful search!

∫∫ +=+=+=λ

0

λ

0 2λ11)dx(x

λ11)dx(u

λ1s

Page 33: COMP211slides_11

E.G.M. Petrakis Hashing 33

Performance

The performance drops with the length of the chains

worst case: all keys are stored in a single chainworst case performance: O(N)unsuccessful search performs better than successful search!! WHY ?no problem with deletions!!

Page 34: COMP211slides_11

E.G.M. Petrakis Hashing 34

Coalesced Hashing

The hash sequence is implemented as a linked list within the hash table

no rehash function the next hash position is the next available position in linked listextra space for the list

0 … … 1 … … 2 49 7 3 … … 4 … … 5 29 2 6 … … 7 59 - 8 … … 9 19 5

h(key) = key mod 10keys: 19, 29, 49, 59

Page 35: COMP211slides_11

E.G.M. Petrakis Hashing 35

0 nilkey -1 1 nilkey 0 2 . 1 3 . 2 4 . 3 5 . 4 6 . 5 7 . 6 8 . 7 9 nilkey 8

avail

initially:

avail = 9h(key) = key mod 10

keys: 14,29,34,28,42,39,84,38

0 nilkey -1 1 nilkey 0 2 42 -1 3 38 -1 4 14 8 5 84 -1 6 39 -1 7 28 3 8 34 5 9 29 6

initialization

List of empty positions

Holds lists ofrehashingpositions and list of emptypositions

Page 36: COMP211slides_11

E.G.M. Petrakis Hashing 36

Performance of Coalesced Hashing

Unsuccessful search

Successful search

0.752λe

41 2λ +−

0.754λ

8λ1e2λ

++−

probes/search

probes/search