E.G.M. Petrakis Hashing 1
Hashing
Data organization in main memory or on disk: sequential, binary trees, …
  The location of a key depends on other keys => unnecessary key comparisons to find a key
Question: can a key be found with a single comparison?
Hashing: the location of a record is computed using its key only
  Fast for random accesses, slow for range queries
Hash Table
Hash Function: transforms keys to array indices
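[Figure: a hash table drawn as an array with positions 0, 1, 2, 3, 4, …, n and columns "index" and "data"; the hash function h(key) maps each key to a position]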
Example: h(key) = key mod 1000

position   key record
0          4967000
1
2          8421002
3
...
395
396        4618396
397        4957397
398
399        1286399
400
401
...
990        0000990
991        0000991
992        1200992
993        0047993
994
995        9846995
996        4618996
997        4967997
998
999        0001999
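As a minimal sketch of the example above, the hash function h(key) = key mod 1000 could be written in Python as follows. The names `h`, `M` and `table` are illustrative only, and the keys are a few of those shown in the table, written without leading zeros.

```python
M = 1000  # table size used in the example

def h(key: int, m: int = M) -> int:
    """Hash function from the example: the position is the key modulo m."""
    return key % m

# Place a few of the example keys directly at their hashed positions.
# No collision handling here: these keys all hash to distinct positions.
table = [None] * M
for key in (4967000, 8421002, 4618396, 1286399, 9846995, 1999):
    table[h(key)] = key

print(h(4967000), h(8421002), h(1999))
```

Looking up a key is then a single array access at `table[h(key)]`, as long as no two keys share a position.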
Good Hash Functions
1. Uniform: distribute keys evenly in space
2. Perfect: two records cannot occupy the same location: $\forall k_i, k_j,\ i \neq j:\ h(k_i) \neq h(k_j)$
3. Order preserving: $\forall k_i, k_j:\ k_i < k_j \Rightarrow h(k_i) < h(k_j)$
Difficult to find such hash functions
  Property 2 is the most essential
  Most functions are no better than h(key) = key mod m
Hash collision: $k_i \neq k_j:\ h(k_i) = h(k_j)$
Collision Resolution
1. Open Addressing (rehashing): compute a new position to store the key in the table (no extra space)
   i. linear probing
   ii. double hashing
2. Separate Chaining: lists of keys mapped to the same position (uses extra space)
Open Addressing
Computes a new address to store the key if its position is occupied (rehashing)
  if that position is occupied too, compute a new address, … until an empty position is found
  primary hash function: i = h(key)
  rehash function: rh(i) = rh(h(key))
  hash sequence: (h0, h1, h2, …) = (h(key), rh(h(key)), rh(rh(h(key))), …)
To find a key, follow the same hash sequence
Example
i = h(key) = key mod 100
rh(i) = (i + 1) mod 100
key: 193
  i = h(193) = 93
  rh(i) = (93 + 1) mod 100 = 94
Key 193 will occupy position 94

position   key
0          100
1          101
2
...
90         990
91         991
92         992
93         993
94         <- 193
...
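The insertion and lookup steps of open addressing can be sketched as below. This is only an illustration under stated assumptions: the function names `insert` and `search`, the use of `None` for an empty position, and the limit of m probes are choices made here, not part of the slides.

```python
from typing import Callable, List, Optional

def insert(table: List[Optional[int]], key: int,
           h: Callable[[int], int], rh: Callable[[int], int]) -> int:
    """Store key at the first free position of its hash sequence."""
    i = h(key)
    for _ in range(len(table)):      # assumption: give up after m probes
        if table[i] is None:
            table[i] = key
            return i
        i = rh(i)                    # position occupied: rehash
    raise RuntimeError("no empty position found")

def search(table: List[Optional[int]], key: int,
           h: Callable[[int], int], rh: Callable[[int], int]) -> Optional[int]:
    """Follow the same hash sequence; stop at the key or at an empty position."""
    i = h(key)
    for _ in range(len(table)):
        if table[i] is None:
            return None
        if table[i] == key:
            return i
        i = rh(i)
    return None

m = 100
h = lambda key: key % m
rh = lambda i: (i + 1) % m

table: List[Optional[int]] = [None] * m
for key in (100, 101, 990, 991, 992, 993):
    insert(table, key, h, rh)

print(insert(table, 193, h, rh))     # position chosen for key 193
```

With the keys of the example already in place, key 193 hashes to the occupied position 93 and is stored one rehash later.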
Problem 1: Locate Empty Positions
No empty position can be found
i. the table is full
   check the number of empty positions
ii. the hash function fails to find an empty position although the table is not full!
   i = h(key) = key mod 1000
   rh(i) = (i + 200) mod 1000 => checks only 5 positions on a table of 1000 positions
   rh(i) = (i + 1) mod 1000 => checks successive positions
   rh(i) = (i + c) mod 1000 where GCD(c, m) = 1 => checks all positions
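The effect of the step c on the number of positions a rehash function can reach is easy to check numerically. The sketch below counts the distinct positions visited by rh(i) = (i + c) mod m; the helper name `positions_checked` and the extra step c = 7 are illustrative choices.

```python
from math import gcd

def positions_checked(start: int, c: int, m: int) -> int:
    """Count the distinct positions visited by rh(i) = (i + c) mod m."""
    seen = set()
    i = start
    while i not in seen:
        seen.add(i)
        i = (i + c) % m
    return len(seen)

m = 1000
for c in (200, 1, 7):
    # prints the step, GCD(c, m) and the number of positions reached
    print(c, gcd(c, m), positions_checked(0, c, m))
```

In general the sequence reaches m / GCD(c, m) positions, so only steps with GCD(c, m) = 1 cover the whole table.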
Problem 2: Primary Clustering
Different keys that hash into different addresses compete with each other in successive rehashes
  i = h(key) = key mod 100
  rh(i) = (i + 1) mod 100
  keys: 1990, 1991, 1992, 1993, 1994 => 94

position   key
0          100
1          101
2
...
90         990
91         991
92         992
93         993
94
...
Problem 3: Secondary Clustering
Different keys which hash to the same hash value have the same rehash sequence
  i = h(key) = key mod 10
  rh(i, j) = (i + j) mod 10, where j = 1, 2, 3, … counts the rehashes
  i. key 23: h(23) = 3, rh = 4, 6, 9, 3, …
  ii. key 13: h(13) = 3, rh = 4, 6, 9, 3, …

position   key
0          10
1
2
3          53
4          14
5          15
6          46
7
8
9
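The two rehash sequences above can be reproduced with a few lines of Python. The sketch assumes that j counts the rehashes (1, 2, 3, …) and that i is the position reached by the previous step, which is the reading that yields 4, 6, 9, 3; the function name `rehash_sequence` is illustrative.

```python
def rehash_sequence(key: int, m: int = 10, steps: int = 4) -> list:
    """Positions produced by rh(i, j) = (i + j) mod m for j = 1..steps."""
    i = key % m                  # primary hash position
    sequence = []
    for j in range(1, steps + 1):
        i = (i + j) % m          # the j-th rehash starts from the previous position
        sequence.append(i)
    return sequence

print(rehash_sequence(23))
print(rehash_sequence(13))
```

Because the sequence depends only on the initial position, every key with h(key) = 3 follows exactly the same path.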
Linear Probing
Store the key into the next free position
  h_0 = h(key), usually h_0 = key mod m
  h_i = (h_{i-1} + 1) mod m, i >= 1

S = {22, 35, 301, 99, 102, 452}

position   key
0
1          301
2          22
3          102
4          452
5          35
6
7
8
9          99
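A minimal sketch of linear probing insertion, assuming h(key) = key mod m with m = 10 and `None` as the marker for a free position (both the function name and the marker are choices made for this illustration):

```python
def insert(table: list, key: int) -> int:
    """Linear probing: start at key mod m, then try the next position (mod m)."""
    m = len(table)
    i = key % m
    for _ in range(m):
        if table[i] is None:
            table[i] = key
            return i
        i = (i + 1) % m
    raise RuntimeError("table is full")

table = [None] * 10
for key in (22, 35, 301, 99, 102, 452):   # the set S, in the order listed
    insert(table, key)

print(table)
```

Keys 102 and 452 both hash to position 2, so they end up in positions 3 and 4.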
Observation 1
Different insertion sequences => different hash sequences
  S1 = {11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53} => 28 probes
  S2 = {53, 40, 33, 31, 12, 22, 77, 50, 8, 99, 27, 3, 11} => 30 probes
H(key) = key mod 13

Table after inserting S1:

position   key   number of probes
0          77    2
1          27    1
2          12    4
3          3     1
4          40    4
5          31    1
6          53    6
7          33    1
8          99    1
9          8     2
10         22    2
11         11    1
12         50    2
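The probe counts for S1 can be checked with a short sketch that counts one probe per position examined, including the free position where the key is finally stored. The helper name `insert_counting` is an assumption of this example.

```python
def insert_counting(table: list, key: int):
    """Linear probing insert; returns (position, number of probes used)."""
    m = len(table)
    i = key % m
    probes = 1
    while table[i] is not None:
        if probes == m:
            raise RuntimeError("table is full")
        i = (i + 1) % m
        probes += 1
    table[i] = key
    return i, probes

S1 = [11, 3, 27, 99, 8, 50, 77, 22, 12, 31, 33, 40, 53]
table = [None] * 13
probes = {key: insert_counting(table, key)[1] for key in S1}
total = sum(probes.values())

for position, key in enumerate(table):
    print(position, key, probes[key])
print("total probes:", total)
```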
Observation 2
Deletions are not easy:
  i = h(key) = key mod 10
  rh(i) = (i + 1) mod 10
Action: delete(65) and search(5)
Problem: search will stop at the empty position and will not find 5
Solution:
  mark the position as deleted rather than empty
  the marked position can be reused

position   key
0          70
1
2          12
3          33
4          14
5          55
6          65
7          75
8          85
9          5
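The "mark as deleted" solution can be sketched as follows. The class name, the sentinel objects `EMPTY` and `DELETED`, and the insertion order used to rebuild the table are assumptions of this example; the sketch also assumes that keys are distinct.

```python
EMPTY = object()     # position never used
DELETED = object()   # position whose key was deleted (reusable)

class LinearProbingTable:
    def __init__(self, m: int):
        self.m = m
        self.slots = [EMPTY] * m

    def _hash_sequence(self, key: int):
        i = key % self.m
        for _ in range(self.m):
            yield i
            i = (i + 1) % self.m

    def insert(self, key: int) -> int:
        for i in self._hash_sequence(key):
            if self.slots[i] is EMPTY or self.slots[i] is DELETED:
                self.slots[i] = key          # deleted positions are reused
                return i
        raise RuntimeError("table is full")

    def search(self, key: int):
        for i in self._hash_sequence(key):
            if self.slots[i] is EMPTY:       # only a truly empty position stops the search
                return None
            if self.slots[i] == key:         # DELETED never equals a key
                return i
        return None

    def delete(self, key: int) -> None:
        i = self.search(key)
        if i is not None:
            self.slots[i] = DELETED

t = LinearProbingTable(10)
for key in (70, 12, 33, 14, 55, 65, 75, 85, 5):
    t.insert(key)
t.delete(65)
print(t.search(5))
```

After `delete(65)`, position 6 holds the deleted mark, so the search for 5 continues past it and still reaches position 9.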
Observation 3
Linear probing tends to create long sequences of occupied positions
  the longer a sequence is, the longer it tends to become
P: probability to use a position in the cluster

$$P = \frac{B + 1}{m}$$

where B is the length of the cluster and m is the table size
Observation 4
Linear probing suffers from both primary and secondary clustering
Solution: double hashing
  uses two hash functions h1, h2 and a rehashing function rh
Double Hashing
Two hash functions and a rehashing function
  primary hash function: i = h1(key) = key mod m
  secondary hash function: h2(key)
  rehashing function: rh(i, key) = (i + h2(key)) mod m
h2(m, key) is some function of m and key
  helps rh in computing random positions in the hash table
  h2 is computed once for each key!
Example of Double Hashing
i. hash function: h1(key) = key mod m
   q = (key div m) mod m

   $$h_2(key) = \begin{cases} q & \text{if } q \neq 0 \\ m \ \mathrm{div}\ 2 & \text{if } q = 0 \end{cases}$$

ii. rehash function: rh(i, key) = (i + h2(key)) mod m
Example (continued)
A. m = 10, key = 23
   h1(23) = 3, h2(23) = 2
   rh(3,2) = (3 + 2) mod 10 = 5
   rehash sequence: 5, 7, 9, 1, …
B. m = 10, key = 13
   h1(13) = 3, h2(13) = 1
   rh(3,1) = (3 + 1) mod 10 = 4
   rehash sequence: 4, 5, 6, …
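The two rehash sequences can be reproduced with the sketch below. It assumes h2(key) = q when q is nonzero and m div 2 otherwise, with q = (key div m) mod m; the function names are illustrative.

```python
def h1(key: int, m: int) -> int:
    return key % m

def h2(key: int, m: int) -> int:
    """Secondary hash (assumed form): q if q != 0, else m div 2."""
    q = (key // m) % m
    return q if q != 0 else m // 2

def rehash_sequence(key: int, m: int, steps: int) -> list:
    """Positions visited by repeatedly applying rh(i, key) = (i + h2(key)) mod m."""
    i = h1(key, m)
    step = h2(key, m)            # computed once per key
    sequence = []
    for _ in range(steps):
        i = (i + step) % m
        sequence.append(i)
    return sequence

print(rehash_sequence(23, 10, 4))
print(rehash_sequence(13, 10, 3))
```

Keys 23 and 13 collide at position 3 but follow different rehash sequences, which is what removes secondary clustering.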
Performance of Open Addressing
Distinguish between successful and unsuccessful search
Assume a series of probes to random positions
  independent events
  load factor: λ = n/m
  λ: probability to probe an occupied position
  each position has the same probability P = 1/m
Unsuccessful Search
The hash sequence is exhausted
  let u be the expected number of probes
  u equals the expected length of the hash sequence
  P(k): probability to search k positions in the hash sequence
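$$u = \sum_{k=1}^{\infty} k\,P(k)$$

Writing each term k P(k) as k copies of P(k) and summing by columns:

P(1)
P(2)            P(2)
P(3)            P(3)            P(3)
...
P(k)            P(k)            P(k)            ...
________________________________________________________
P(>= 1 probes)  P(>= 2 probes)  ...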
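$$u = \sum_{k=1}^{\infty} P(\geq k \text{ probes}) = \sum_{k=1}^{\infty} P(\text{first } k-1 \text{ positions occupied}) = \sum_{k=1}^{\infty} \lambda^{k-1} = \frac{1}{1-\lambda}$$

The step to $\lambda^{k-1}$ uses the assumption that the probes are independent events
u increases with λ => performance drops as λ increases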
Successful Search
The hash sequence is not exhausted
  the number of probes to find a key equals the number of probes s at the time the key was inserted plus 1
  λ was less at that time
  consider all values of λ

$$s = \frac{1}{\lambda} \int_0^{\lambda} u(x)\,dx = \frac{1}{\lambda} \int_0^{\lambda} \frac{dx}{1-x} = \frac{1}{\lambda} \ln\frac{1}{1-\lambda}$$

u: equivalent to unsuccessful search
s is an approximation and increases with λ
Performance
The performance drops as λ increases
  the higher the value of λ is, the higher the probability of collisions
Unsuccessful search is more expensive than successful search
  unsuccessful search exhausts the hash sequence
Experimental Results

               SUCCESSFUL                      UNSUCCESSFUL
LOAD FACTOR    LINEAR   i + bkey   DOUBLE      LINEAR   i + bkey   DOUBLE
25%            1.17     1.16       1.15        1.39     1.37       1.33
50%            1.50     1.44       1.39        2.50     2.19       2.00
75%            2.50     2.01       1.85        8.50     4.64       4.00
90%            5.50     2.85       2.56        50.50    11.40      10.00
95%            10.50    3.52       3.15        200.50   22.04      20.00
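For comparison, the closed forms for random probing, u = 1/(1 - λ) for unsuccessful search and s = (1/λ) ln(1/(1 - λ)) for successful search, can be evaluated at the same load factors. The sketch below does only that; the function names are illustrative, and the observation that the values agree with the DOUBLE columns to two decimals is a check made here, not a claim from the slides.

```python
import math

def unsuccessful(lam: float) -> float:
    """Expected probes of an unsuccessful search: u = 1 / (1 - lambda)."""
    return 1.0 / (1.0 - lam)

def successful(lam: float) -> float:
    """Expected probes of a successful search: s = (1/lambda) * ln(1 / (1 - lambda))."""
    return math.log(1.0 / (1.0 - lam)) / lam

for lam in (0.25, 0.50, 0.75, 0.90, 0.95):
    print(f"{lam:.0%}  successful={successful(lam):.2f}  unsuccessful={unsuccessful(lam):.2f}")
```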
Performance on Full Table

                 SUCCESSFUL
TABLE SIZE (m)   LINEAR   i + bkey   DOUBLE   UNSUCCESSFUL   LOG2 m
100              6.60     4.62       4.12     50.50          6.64
500              14.35    6.22       5.72     250.50         8.97
1000             20.15    6.91       6.41     500.50         9.97
5000             44.64    8.52       8.02     2500.50        12.29
10000            63.00    9.21       8.71     5000.50        13.29
Separate Chaining
Keys hashing to the same hash value are stored in separate lists
  one list per hash position
  can store more than m records
  easy to implement
  the keys in each list can be ordered
Example: h(key) = key mod m, with positions 0 to 9

position   chain
0          40 -> 130 -> nil
1          91 -> nil
2          42 -> 192 -> 372 -> nil
3          nil
4          nil
5          75 -> nil
6          66 -> 16 -> nil
7          87 -> 67 -> 227 -> 417 -> nil
8          nil
9          49 -> nil
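A minimal sketch of separate chaining, using one Python list per hash position in place of a linked list. The class name, the unordered append at the end of a chain, and the insertion order of the keys are assumptions made for this illustration.

```python
class ChainedHashTable:
    def __init__(self, m: int):
        self.m = m
        self.chains = [[] for _ in range(m)]   # one (initially empty) chain per position

    def _chain(self, key: int) -> list:
        return self.chains[key % self.m]       # h(key) = key mod m

    def insert(self, key: int) -> None:
        self._chain(key).append(key)

    def search(self, key: int) -> bool:
        return key in self._chain(key)         # scans only one chain

    def delete(self, key: int) -> bool:
        chain = self._chain(key)
        if key in chain:
            chain.remove(key)                  # no "deleted" marks are needed
            return True
        return False

t = ChainedHashTable(10)
for key in (40, 130, 91, 42, 192, 372, 75, 66, 16, 87, 67, 227, 417, 49):
    t.insert(key)

for position, chain in enumerate(t.chains):
    print(position, chain)
```

The table holds 14 keys in 10 positions, which open addressing could not do.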
Performance of Separate Chaining
Depends on the average chain size
  insertions are independent events
  let P(c,n,m): probability that a position has been selected c times after n insertions on a table of size m
  P(c,n,m): probability that the chain has length c => binomial distribution

$$P(c,n,m) = \binom{n}{c} p^{c} q^{n-c}$$

p = 1/m: success case
q = 1 - p: failure case
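$$P(c,n,m) = \binom{n}{c} \left(\frac{1}{m}\right)^{c} \left(1-\frac{1}{m}\right)^{n-c} = \frac{1}{c!} \cdot \frac{n(n-1)\cdots(n-c+1)}{m^{c}} \left(1-\frac{1}{m}\right)^{n} \left(1-\frac{1}{m}\right)^{-c}$$

For $n, m \to \infty$ with $\lambda = n/m$:

$$\frac{n(n-1)\cdots(n-c+1)}{m^{c}} \to \lambda^{c}, \qquad \left(1-\frac{1}{m}\right)^{n} \to e^{-\lambda}, \qquad \left(1-\frac{1}{m}\right)^{-c} \to 1$$

=> $P(c,n,m) = \frac{1}{c!} \lambda^{c} e^{-\lambda}$ (Poisson)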
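The quality of the Poisson approximation can be checked numerically. The sketch below compares the binomial probability with its Poisson limit for an arbitrarily chosen large table (n = 5000, m = 10000, so λ = 0.5); these values and the function names are assumptions of the example.

```python
from math import comb, exp, factorial

def binomial(c: int, n: int, m: int) -> float:
    """P(c, n, m): probability that a position is selected c times in n insertions."""
    p = 1.0 / m
    return comb(n, c) * p**c * (1.0 - p)**(n - c)

def poisson(c: int, lam: float) -> float:
    """Limit for large n, m with lam = n / m."""
    return lam**c * exp(-lam) / factorial(c)

n, m = 5000, 10000
lam = n / m
for c in range(5):
    print(c, binomial(c, n, m), poisson(c, lam))

# Expected chain length under the Poisson model (the tail beyond c = 29 is negligible)
mean_length = sum(c * poisson(c, lam) for c in range(30))
print(mean_length)
```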
Unsuccessful Search
The entire chain is searched
  the average number of comparisons equals its average length u

$$u = \sum_{c \geq 0} c\,P(c,\lambda) = \sum_{c \geq 0} c\,\frac{\lambda^{c}}{c!}\,e^{-\lambda} = \lambda$$
Successful Search
Not the whole chain is searched
  the average number of comparisons equals the length s of the chain at the time the key was inserted plus 1
  the performance at the time a key was inserted equals that of unsuccessful search!

$$s = \frac{1}{\lambda}\int_0^{\lambda} (u(x) + 1)\,dx = \frac{1}{\lambda}\int_0^{\lambda} (x + 1)\,dx = 1 + \frac{\lambda}{2}$$
Performance
The performance drops with the length of the chains
  worst case: all keys are stored in a single chain
  worst case performance: O(N)
  unsuccessful search performs better than successful search!! WHY?
  no problem with deletions!!
Coalesced Hashing
The hash sequence is implemented as a linked list within the hash table
  no rehash function
  the next hash position is the next available position in the linked list
  extra space for the list

h(key) = key mod 10
keys: 19, 29, 49, 59

position   key   link
0          …     …
1          …     …
2          49    7
3          …     …
4          …     …
5          29    2
6          …     …
7          59    -
8          …     …
9          19    5
h(key) = key mod 10
keys: 14, 29, 34, 28, 42, 39, 84, 38
The table holds the lists of rehashing positions and the list of empty positions

Initialization (list of empty positions; initially avail = 9):

position   key      link
0          nilkey   -1
1          nilkey   0
2          .        1
3          .        2
4          .        3
5          .        4
6          .        5
7          .        6
8          .        7
9          nilkey   8

After inserting the keys:

position   key      link
0          nilkey   -1
1          nilkey   0
2          42       -1
3          38       -1
4          14       8
5          84       -1
6          39       -1
7          28       3
8          34       5
9          29       6
Performance of Coalesced Hashing
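Unsuccessful search:

$$\frac{1}{4}\,e^{2\lambda} - \frac{\lambda}{2} + 0.75 \ \text{probes/search}$$

Successful search:

$$\frac{e^{2\lambda} - 1}{8\lambda} + \frac{\lambda}{4} + 0.75 \ \text{probes/search}$$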