AADS_14_Hash Tables & Hash Functions

Transcript of AADS_14_Hash Tables & Hash Functions

Page 1: AADS_14_Hash Tables & Hash Functions

Hash Tables & Hash Functions

AADS-14

Page 2: AADS_14_Hash Tables & Hash Functions

Significance of Search Complexity

In an unordered list, the time required to find a value is O(n).

In an ordered list the search time can be improved (binary search on a sorted array takes O(log n)), but the modification operations still leave room for improvement.

In a binary search tree the search time can improve to O(log n) when the tree is balanced.

AVL trees guarantee the same O(log n) bound.

Page 3: AADS_14_Hash Tables & Hash Functions

Dictionary Data Structure

A dictionary is a general-purpose data structure that stores keys and values.

It can be implemented using array or linked-list structures.

In a dictionary, each element can be addressed directly by using its key as the index, provided the dictionary is large enough to hold a slot for every possible key.

Page 4: AADS_14_Hash Tables & Hash Functions

Key Search/Dictionary Storage

In any complex application, however, memory is used by many processes at the same time.

The keys may also be accessed frequently at run time.

So there is a need to reduce both the space used and the search time.

Page 5: AADS_14_Hash Tables & Hash Functions

Example 1

A 4-digit number used as a key may need 10,000 locations (0000 to 9999) under direct addressing.

If the key is the employee ID of a company with 500 employees, only 500 of those locations are actually used once all the keys are placed in memory.

Page 6: AADS_14_Hash Tables & Hash Functions

Example 2

A hospital may have a large number of patients, both inpatients and outpatients.

The database system can be modeled to group the patients and then index them so that records can be retrieved quickly.

Another way is not to group them, but to assign a single number to each case.

Page 7: AADS_14_Hash Tables & Hash Functions

Search Time of O(1)

Whether the amount of data is large or small, an average search time of O(1), or close to it, can be achieved if we know the location of the data or key we are looking for.

This location can be obtained by mapping the key to a new, hashed key using a suitable function.

Page 8: AADS_14_Hash Tables & Hash Functions

Hashing

Hashing can provide, for each key, either a unique location or a reference to a shorter list, from which the data for that key can be retrieved easily.

It can also use less memory: instead of one large array, we can use a short array or linked list.

Page 9: AADS_14_Hash Tables & Hash Functions

Hash Table

A hash table is a data structure.

Hash tables provide O(1) average-case time for search, insert, and delete on any value in the set stored in the table.

Page 10: AADS_14_Hash Tables & Hash Functions

Hash Table?

A hash table is an array, say T[1..m], where m is a positive integer called the table size.

When we try to put an item into a spot in the hash table that is already occupied, the situation is called a collision.

It is resolved using a collision resolution policy.

Page 11: AADS_14_Hash Tables & Hash Functions

Hashing: Mathematical Definition

Hashing is a mapping operation.

Consider a set K of keys.

Let H be a function that maps the keys to a new set L, such that

H: K → L

Page 12: AADS_14_Hash Tables & Hash Functions

Hash Function & Hash Address

The function H is called the HASH FUNCTION.

The mapping performed by the function H is called HASHING.

The object L is the Hash table

Each cell/location in L is identified using the Hash address

Page 13: AADS_14_Hash Tables & Hash Functions

Hash Address

Let k be a key in K, that is, k ∈ K.

Then k has a mapped address in L given by H(k), known as the hash address.

The hash address d of a key k is the mapped address/location given by the hashing operation

d = H(k)

Page 14: AADS_14_Hash Tables & Hash Functions

Indexing on the Hash Table

The hash address d points directly to a location in L.

This address d is also called the hash code for the key k.

The process of hashing is also called compression.

Page 15: AADS_14_Hash Tables & Hash Functions

Notes

There is no meaningful relationship between the actual data value k and the hash key d.

So there is no practical way to traverse a hash table other than a direct search using d.

Hash table items are not in any order.

There is no mapping function from d back to k, other than the hash table itself.

The purpose of hash tables is to provide fast lookups.

Page 16: AADS_14_Hash Tables & Hash Functions

Illustration: Bucket Array Structure for Hash Table

Address   Key
1         k1
2         k2
3         k3
...       ...
L-1       kN-1
L         kN

Page 17: AADS_14_Hash Tables & Hash Functions

Uses of Hash Tables

Compilers use hash tables for symbol storage.

The Linux kernel uses hash tables to manage memory pages and buffers.

High-speed routing tables use hash tables.

Database systems use hash tables.

Page 18: AADS_14_Hash Tables & Hash Functions

Operations on Hash Tables

Initialize

Insert(k)

Search(k)

Remove(k)

Sizeof

Isempty

Page 19: AADS_14_Hash Tables & Hash Functions

Types of Hashing

There are two types:

1. Open hashing, also called open chaining, closed addressing, or separate chaining

2. Closed hashing, also called open addressing

Page 20: AADS_14_Hash Tables & Hash Functions

Open Hashing (Open Chaining)

Used when the amount of data to be stored is high.

Uses a hash function to obtain the hash address.

All data with the same hash address are stored as a shorter list, referenced from the location given by that hash address.

Page 21: AADS_14_Hash Tables & Hash Functions

Bucket in Open Hashing

Each hash location in the hash table is said to be a bucket for the data with that index.

Data within a bucket are best organized as a linked list.

Address   Bucket
1         k1
2         k2
3         k3
...       ...
L-1       kN-1
L         kN
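The operations listed earlier can be sketched on top of such buckets. This is only an illustrative sketch: the class name `ChainedHashTable`, the method names, the choice of `key % m` as the hash function, and the use of Python lists in place of linked lists are all assumptions, not something the slides prescribe.

```python
class ChainedHashTable:
    """Open hashing sketch: one bucket (a Python list) per hash address."""

    def __init__(self, m: int = 11):
        # Initialize: m empty buckets (m = 11 is an arbitrary default).
        self.m = m
        self.buckets = [[] for _ in range(m)]
        self.count = 0

    def _address(self, key: int) -> int:
        # Assumed hash function for this sketch: key mod m.
        return key % self.m

    def insert(self, key: int) -> None:
        bucket = self.buckets[self._address(key)]
        if key not in bucket:          # ignore duplicate keys
            bucket.append(key)
            self.count += 1

    def search(self, key: int) -> bool:
        # Only the one bucket selected by the hash address is scanned.
        return key in self.buckets[self._address(key)]

    def remove(self, key: int) -> bool:
        bucket = self.buckets[self._address(key)]
        if key in bucket:
            bucket.remove(key)
            self.count -= 1
            return True
        return False

    def size(self) -> int:
        return self.count

    def is_empty(self) -> bool:
        return self.count == 0
```

Each operation touches a single bucket, so its cost depends on the length of that bucket's list rather than on the total number of stored keys.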

Page 22: AADS_14_Hash Tables & Hash Functions

Closed Hashing (Open Addressing)

Closed hashing uses a fixed space.

Hashing maps a key into one of the locations in the earmarked space.

If multiple keys are hashed to the same address (a collision), the tie must be resolved.

A bucket may be small enough to hold only one value at a time.

Page 23: AADS_14_Hash Tables & Hash Functions

Topics in Hashing

Basically there are two subareas under “Hashing”:

1. Hash Functions

2. Collision Resolutions

Page 24: AADS_14_Hash Tables & Hash Functions

Hash Functions

1. The Hash Function H should be easy to compute

2. The function H should, as far as possible, uniformly distribute the hash addresses throughout the set L so that there are a minimum number of collisions

Page 25: AADS_14_Hash Tables & Hash Functions

Hash Functions

Page 26: AADS_14_Hash Tables & Hash Functions

Requirement of Hash Functions

The main idea of using a hash function H is that, for a key k, H yields a value H(k) that serves as an index into the hash table cell/bucket, so that the key k can be located easily in the hash table for search/insert.

Page 27: AADS_14_Hash Tables & Hash Functions

Hash Functions

Division Method

Mid Square Method

Multiplication Method

Page 28: AADS_14_Hash Tables & Hash Functions

Division Method

Choose a prime number that is not close to a power of 2.

Let m be the selected number.

Then m is also the size of the hash table in the ideal case, with one cell in each bucket.

The hash address/bucket address is given by

H(k) = k mod m

Page 29: AADS_14_Hash Tables & Hash Functions

Example

Given keys are

4845, 5679, 6381, 3636, 7180, 8126, 1127

Use table size m = 7 and hash the keys to a table with 7 cells.

Repeat the exercise with m = 11 and m = 8.

Page 30: AADS_14_Hash Tables & Hash Functions

Answer (m = 7)

HASH ADDRESS   KEY
0              1127
1              4845
2              5679
3              3636
4              6381
5              7180
6              8126
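The division method is short enough to check by machine. The sketch below, in Python, recomputes the addresses for all three table sizes named in the example; the function name `division_hash` is an assumption made for illustration.

```python
def division_hash(key: int, m: int) -> int:
    """Division method: the hash address is the remainder of key / m."""
    return key % m

keys = [4845, 5679, 6381, 3636, 7180, 8126, 1127]

for m in (7, 11, 8):
    # Map each key to its hash address for this table size.
    addresses = {k: division_hash(k, m) for k in keys}
    print(f"m = {m}: {addresses}")
```

With m = 7 every key gets its own cell. With m = 11 the pairs (4845, 1127) and (7180, 8126) collide, and with m = 8 three pairs collide, so a collision policy is needed even for a small key set.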

Page 31: AADS_14_Hash Tables & Hash Functions

Choosing Table Size in Division Method

When using the division method, ample consideration must be given to the size of the table.

The best choice for table size is usually a prime number not too close to a power of 2.

Page 32: AADS_14_Hash Tables & Hash Functions

Division Method for Chaining

Here, the hash table will have many cells.

The hash function maps multiple keys to a single location, so there can be multiple entries in one location.

These multiple entries under a single hash code are held as a linked list.

Page 33: AADS_14_Hash Tables & Hash Functions

Illustration

Take the table size m as 11 to map a set of keys.

Keys: 122, 221, 661, 90, 167, 57, 69

Divide each key modulo 11 to get the hash addresses.

Page 34: AADS_14_Hash Tables & Hash Functions

Answer: We get the following table

Address   Keys
0         (empty)
1         122 → 221 → 661
2         90 → 167 → 57
3         69
4 to 10   (empty)
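The chained table for the illustration keys (122, 221, 661, 90, 167, 57, 69 with m = 11) can be rebuilt with a few lines of Python. This is a sketch only; `build_chains` is a hypothetical helper name, and Python lists stand in for the linked lists.

```python
def build_chains(keys, m):
    """Place each key in the bucket given by key mod m, in arrival order."""
    table = [[] for _ in range(m)]
    for k in keys:
        table[k % m].append(k)
    return table

table = build_chains([122, 221, 661, 90, 167, 57, 69], 11)
for address, chain in enumerate(table):
    print(address, chain)
```

Addresses 1 and 2 each hold a chain of three keys, address 3 holds one key, and the remaining eight addresses stay empty.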

Page 35: AADS_14_Hash Tables & Hash Functions

Load Factor

Let there be m slots in a hash table.

At the instant of observation, the number of elements is n.

Therefore the load factor is α = n/m.

This is the average number of elements stored per slot of the hash table; it can be less than, equal to, or greater than 1.

Page 36: AADS_14_Hash Tables & Hash Functions

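Find the Load Factor

[Figure: a chained hash table with 11 slots, numbered 0 to 10, holding the keys 110, 89, 45, 68, 167, 57, 225, 554, 108, 82, 109; some slots are vacant.]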

Page 37: AADS_14_Hash Tables & Hash Functions

Solution

There are 11 slots and 11 elements, so α = 11/11 = 1.

So α indicates the average number of elements per position.

Also, we get α = 1 even though there are vacant slots, because α shows only the average.
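A small Python sketch makes the point about vacant slots concrete. It uses the 11 keys that appear in the exercise figure and assumes they were placed by k mod 11 with chaining; that placement rule and the name `load_factor` are assumptions for illustration.

```python
def load_factor(table):
    """alpha = n / m for a chained table given as a list of buckets."""
    n = sum(len(bucket) for bucket in table)
    return n / len(table)

keys = [110, 89, 45, 68, 167, 57, 225, 554, 108, 82, 109]
m = 11
table = [[] for _ in range(m)]
for k in keys:
    table[k % m].append(k)   # assumed placement rule: k mod 11

alpha = load_factor(table)
vacant = sum(1 for bucket in table if not bucket)
print(alpha, vacant)
```

Under that assumption the load factor is exactly 1 while four of the eleven slots are vacant, because some buckets hold two or three keys.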

Page 38: AADS_14_Hash Tables & Hash Functions

Notes on α

The load factor α takes various values as the number of keys in the hash table changes.

Accordingly, α can be less than, equal to, or greater than one in a hash table formed using separate chaining (open hashing).

In a hash table formed using open addressing (closed hashing), α can never exceed one, because each slot holds at most one key.

α decides the complexity of operations on the hash table such as insert, search, and delete.

Page 39: AADS_14_Hash Tables & Hash Functions

Hashing the Strings

Page 40: AADS_14_Hash Tables & Hash Functions

Exercise

Map the following keys using the hash function defined by these steps:

Find the ASCII values of the first and last characters.

If there is only one character, it serves as both the first and the last.

Multiply the ASCII value of the first character by 256 and add the ASCII value of the last character.

Apply mod m division to the resulting number.

Page 41: AADS_14_Hash Tables & Hash Functions

Keys

A, BABU, CHOWHAN, SUMAN, DILIP

Page 42: AADS_14_Hash Tables & Hash Functions

The 5 symbols are:

AA, BU, CN, SN, DP

These 5 symbols are then converted to numerical codes by the rule given previously, using the ASCII values of the characters in each symbol.

Page 43: AADS_14_Hash Tables & Hash Functions

ASCII Values

A-65  B-66  C-67  D-68  E-69  F-70  G-71  H-72  I-73

J-74  K-75  L-76  M-77  N-78  O-79  P-80  Q-81  R-82

S-83  T-84  U-85  V-86  W-87  X-88  Y-89  Z-90

Page 44: AADS_14_Hash Tables & Hash Functions

Example: Answer

AA   256*65 + 65 = 16705
BU   256*66 + 85 = 16981
CN   256*67 + 78 = 17230
SN   256*83 + 78 = 21326
DP   256*68 + 80 = 17488

Page 45: AADS_14_Hash Tables & Hash Functions

Solution

Take m = 7 and obtain the hash addresses.

AA   16705 mod 7 = 3
BU   16981 mod 7 = 6
CN   17230 mod 7 = 3
SN   21326 mod 7 = 4
DP   17488 mod 7 = 2

Page 46: AADS_14_Hash Tables & Hash Functions

Solution

Address   Keys
0
1
2         DILIP
3         A → CHOWHAN
4         SUMAN
5
6         BABU

Page 47: AADS_14_Hash Tables & Hash Functions

Symbol Table

Compilers use a method similar to the previous one to build a symbol table used during parsing.

Page 48: AADS_14_Hash Tables & Hash Functions

Hash Functions for String Hashing

A hash function for strings performs two separate tasks:

1. Convert the string to a key.

2. Constrain the key to a non-negative value less than the size of the table.

The best strategy is to keep the two tasks separate so that there is only one part to change if the size of the table changes.
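The two tasks can be kept apart as two small functions. The Python sketch below applies the first/last-character rule from the string exercise to the keys A, BABU, CHOWHAN, SUMAN, DILIP; the function names `string_to_key` and `compress` are assumptions for illustration.

```python
def string_to_key(s: str) -> int:
    """Task 1: 256 * ASCII(first character) + ASCII(last character).

    For a one-character string the same character is both first and last.
    """
    return 256 * ord(s[0]) + ord(s[-1])

def compress(key: int, m: int) -> int:
    """Task 2: constrain the key to the range 0 .. m-1."""
    return key % m

names = ["A", "BABU", "CHOWHAN", "SUMAN", "DILIP"]
m = 7
for name in names:
    key = string_to_key(name)
    print(name, key, compress(key, m))
```

Running it gives the numeric keys 16705, 16981, 17230, 21326, 17488 and, for m = 7, the addresses 3, 6, 3, 4, 2. Changing the table size only means passing a different m to `compress`; `string_to_key` stays untouched.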

Page 49: AADS_14_Hash Tables & Hash Functions

Notes: Chaining Method

In principle, the chaining method gives unbounded space in the hash table.

But in practical applications, only limited memory is allotted to a hash table.

Collisions need no special resolution in chaining: keys that hash to the same address are simply added to that bucket's list.

Page 50: AADS_14_Hash Tables & Hash Functions

Collisions

Page 51: AADS_14_Hash Tables & Hash Functions

Collision

In closed hashing (open addressing), H would ideally give a distinct address in L for each member of K; in practice, two or more keys may LEAD TO A SINGLE hash address under a given hash function.

This situation is called a collision.

We need some method to resolve collisions.

The method is called a “Collision Resolution Policy”.

Page 52: AADS_14_Hash Tables & Hash Functions

Collision Resolution Policy

Linear Probing

Quadratic Probing

Double Hashing

Page 53: AADS_14_Hash Tables & Hash Functions

Linear Probing

For insert: if a collision occurs, look for the next free location and store the key there.

For search: if the key is not found at its home position, look for it in the following cells one by one, stopping at the key or at an empty cell.

Page 54: AADS_14_Hash Tables & Hash Functions

Example

Let H be k mod 11.

Let the keys 56, 78, 100 arrive in this order for hashing.

All of these have position 1 as their home.

The table is treated as a circular array.

Address   Key
0
1         56
2         78
3         100
4 to 10   (empty)

Page 55: AADS_14_Hash Tables & Hash Functions

Exercise

Hash 45, 39, 66, 74 in that order with table size m = 7.

45 mod 7 = 3
39 mod 7 = 4
66 mod 7 = 3
74 mod 7 = 4

Address   Key
0
1
2
3         45
4         39
5         66
6         74

Page 56: AADS_14_Hash Tables & Hash Functions

Exercise

Let H be k mod 11.

Let the keys 46, 122, 222, 441 arrive in this order for hashing.

46 mod 11 = 2
122 mod 11 = 1
222 mod 11 = 2
441 mod 11 = 1

Page 57: AADS_14_Hash Tables & Hash Functions

Solution

Address   Key
0
1         122
2         46
3         222
4         441
5 to 10   (empty)
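The linear probing exercises can be replayed with a short Python sketch. The function names are assumptions, the home position is taken as k mod m as in the exercises, and deletion is deliberately left out.

```python
def linear_probe_insert(table, key):
    """Store key at the first free slot at or after its home position.

    The table is treated as a circular array; None marks a free slot.
    """
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

def linear_probe_search(table, key):
    """Return the slot holding key, or None if the key is absent."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is None:      # an empty slot ends the search
            return None
        if table[slot] == key:
            return slot
    return None

table = [None] * 11
for k in (46, 122, 222, 441):
    print(k, "->", linear_probe_insert(table, k))
```

The keys 46, 122, 222, 441 end up at addresses 2, 1, 3, 4. Key 441 has home 1 but needs four probes, because the slots it passes are filled by keys from two different home positions.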

Page 58: AADS_14_Hash Tables & Hash Functions

More on Hash Functions

Page 59: AADS_14_Hash Tables & Hash Functions

Mid Square Method of hashing

Page 60: AADS_14_Hash Tables & Hash Functions

Mid Square Method

1. The key k is squared to get k².

2. This value is then treated as a string of digits.

3. The hash function is defined as H(k) = f.

4. f is obtained by deleting digits from both ends of k².

5. Once chosen, the same digit positions of k² must be used consistently for all keys.

Page 61: AADS_14_Hash Tables & Hash Functions

Example

k      3205         7148         2345
k²     10 272 025   51 093 904   5 499 025
H(k)   72           93           99

Here H(k) is formed from the fourth and fifth digits of k², counting from the right.
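In the example, the retained digits are the fourth and fifth digits of k² counted from the right. A Python sketch under that reading follows; the parameter names `skip_right` and `width` are assumptions introduced to make the chosen digit positions explicit.

```python
def mid_square_hash(key: int, skip_right: int = 3, width: int = 2) -> int:
    """Square the key, drop `skip_right` digits on the right, keep `width` digits.

    The same positions are used for every key, as the method requires.
    """
    squared = key * key
    return (squared // 10 ** skip_right) % 10 ** width

for k in (3205, 7148, 2345):
    print(k, k * k, mid_square_hash(k))
```

For the three keys this yields 72, 93, and 99. Counting positions from the right keeps the rule consistent even though the squares have different lengths (8, 8, and 7 digits).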

Page 62: AADS_14_Hash Tables & Hash Functions

Multiplication Method for hashing

Page 63: AADS_14_Hash Tables & Hash Functions

Multiplication Method for Hashing

This method uses a hash function different from that of the division method.

The function takes the form

H(k) = floor(m * (kA mod 1))

where 0 < A < 1 and kA mod 1 refers to the fractional part of kA.

Since 0 ≤ kA mod 1 < 1, the range of H(k) is from 0 to m - 1.

Page 64: AADS_14_Hash Tables & Hash Functions

Advantage of Multiplication Method

The advantage of the multiplication method is that it works equally well with any table size m.

A should be chosen carefully.

Rational numbers should not be chosen for A.

An example of a good choice for A is

A = (√5 - 1)/2 ≈ 0.618

Page 65: AADS_14_Hash Tables & Hash Functions

Obtain the hash codes for the keys

2343, 4345, 6567, 3476, 1215

m = 11, A = 0.618 (approximating A = (√5 - 1)/2)

Key    Computation                        Hash code
2343   floor(11 * (2343 * 0.618 mod 1))   10
4345   floor(11 * (4345 * 0.618 mod 1))   2
6567   floor(11 * (6567 * 0.618 mod 1))   4
3476   floor(11 * (3476 * 0.618 mod 1))   1
1215   floor(11 * (1215 * 0.618 mod 1))   9

MATLAB command: floor(11*mod((k*0.618),1))

Page 66: AADS_14_Hash Tables & Hash Functions

Solution

Address   Key
0
1         3476
2         4345
3
4         6567
5 to 8    (empty)
9         1215
10        2343
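A Python counterpart of the MATLAB one-liner, as a sketch only. The function name `multiplication_hash` is an assumption; A = 0.618 and m = 11 are the values used in the exercise.

```python
import math

def multiplication_hash(key: int, m: int, A: float = 0.618) -> int:
    """H(k) = floor(m * (k*A mod 1)), with 0 < A < 1."""
    fractional = (key * A) % 1.0   # fractional part of k*A
    return math.floor(m * fractional)

for k in (2343, 4345, 6567, 3476, 1215):
    print(k, multiplication_hash(k, 11))
```

This reproduces the hash codes 10, 2, 4, 1, 9. Because the fractional part is always below 1, the result never reaches m, so the addresses stay within 0 to m - 1.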

Page 67: AADS_14_Hash Tables & Hash Functions

More on Collision Resolution

Page 68: AADS_14_Hash Tables & Hash Functions

Quadratic Probing for Collision Resolution

Page 69: AADS_14_Hash Tables & Hash Functions

Notes on Linear Probing

Linear probing is simple to program.

Linear probing has better locality of reference and hence better cache performance.

Page 70: AADS_14_Hash Tables & Hash Functions

Primary Clustering in Linear Probing

Linear probing uses the probe sequence H+1, H+2, H+3, and so on to find space for a key whose primary hash value is H.

This leads to hash codes clustering near some cells, which is called primary clustering.

The larger the cluster, the lower the search efficiency.

Page 71: AADS_14_Hash Tables & Hash Functions

Uniform Hashing & Random Probing

If we generate hash codes in a uniformly distributed manner and use a larger table size, many collisions can be avoided.

Even if collisions occur, we may use a pseudo-random sequence to probe the locations.

But this approach reduces locality of reference, since the next probe position becomes effectively random.

So it is better to use a middle path between linear probing and random probing.

Page 72: AADS_14_Hash Tables & Hash Functions

Quadratic Probing

Instead of traversing the hash table slots linearly after a collision, quadratic probing puts more spacing between the slots that are tried.

This reduces the clustering effect seen in linear probing.

Clustering can still occur, because quadratic probing is not immune to it.

Quadratic probing preserves some locality of reference and hence gives good cache performance, though lower than that of linear probing.

Page 73: AADS_14_Hash Tables & Hash Functions

Hash Function for Quadratic Probing

H(k, i) = (H'(k) + c1*i + c2*i²) mod m

where c1 and c2 are constants (auxiliary constants).

H' is an auxiliary hash function; it could be k mod m.

i = 0, 1, 2, …, m-1 is called the probe number.

For a given hash table, c1 and c2 remain constant.

Choices for c1 and c2 include c1 = c2 = ½; c1 = c2 = 1; and c1 = 0, c2 = 1.

Page 74: AADS_14_Hash Tables & Hash Functions

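Example

Take c1 = c2 = ½ and m = 11.

Let the keys 46, 122, 222, 441 arrive in this order for hashing.

46 mod 11 = 2, stored at 2

122 mod 11 = 1, stored at 1

222 mod 11 = 2 (occupied); i = 1: (2 + 0.5*1 + 0.5*1²) mod 11 = 3

441 mod 11 = 1 (occupied); i = 1: (1 + 0.5*1 + 0.5*1²) mod 11 = 2 (occupied); i = 2: (1 + 0.5*2 + 0.5*2²) mod 11 = 4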

Page 75: AADS_14_Hash Tables & Hash Functions

Exercise

Apply quadratic probing for the following hash addresses:

78 mod 11 = 1
89 mod 11 = 1
111 mod 11 = 1
166 mod 11 = 1

Page 76: AADS_14_Hash Tables & Hash Functions

Answer

Key   Home             Probe                          Address
78    78 mod 11 = 1    (none needed)                  1
89    89 mod 11 = 1    (1 + 0.5*1 + 0.5*1²) mod 11    2
111   111 mod 11 = 1   (1 + 0.5*2 + 0.5*2²) mod 11    4
166   166 mod 11 = 1   (1 + 0.5*3 + 0.5*3²) mod 11    7
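A Python sketch of quadratic probing for the specific choice c1 = c2 = ½ with H'(k) = k mod m. For these constants the offset 0.5*i + 0.5*i² equals i(i+1)/2, which is always a whole number, so the sketch uses integer arithmetic. The function name is an assumption.

```python
def quadratic_probe_insert(table, key):
    """Insert with H(k, i) = (k mod m + i/2 + i*i/2) mod m, for i = 0 .. m-1."""
    m = len(table)
    home = key % m
    for i in range(m):
        slot = (home + (i + i * i) // 2) % m   # i + i*i is always even
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found within m probes")

table = [None] * 11
for k in (78, 89, 111, 166):
    print(k, "->", quadratic_probe_insert(table, k))
```

The four keys land at addresses 1, 2, 4, 7. Note that a quadratic probe sequence need not visit every slot, so an insert can fail even when the table still has free cells.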

Page 77: AADS_14_Hash Tables & Hash Functions

Notes

If two keys have the same initial probe position, then their probe sequences are the same, since H(k1, 0) = H(k2, 0) implies H(k1, i) = H(k2, i).

This property leads to a milder form of clustering called secondary clustering.

Page 78: AADS_14_Hash Tables & Hash Functions

Clustering

Page 79: AADS_14_Hash Tables & Hash Functions

Problems with Linear Probing

Linear probing leads to primary clustering: hashed keys share substantial segments of their probe sequences, because keys hashed to the same home position follow the same probe sequence.

Keys that collide at a home address, say b, extend the cluster.

Page 80: AADS_14_Hash Tables & Hash Functions

Primary Clustering

As we have seen, once a block of a few contiguous occupied positions emerges in the hash table, it becomes a “target” for subsequent collisions.

As clusters grow, they also merge to form larger clusters.

Primary clustering means that elements which hash to different cells probe the same alternative cells.

Clustering is reduced only if the hash addresses have different home positions.

Page 81: AADS_14_Hash Tables & Hash Functions

Example

Suppose we have 10 hash codes with value 1 and 5 hash codes with value 2.

All these codes will cluster around positions 1 and 2.
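This situation can be simulated in Python. The sketch assumes a table size of m = 23 and invents keys of the form 1 + 23j and 2 + 23j so that ten keys have home position 1 and five have home position 2; these values are chosen only for illustration.

```python
def linear_probe_insert(table, key):
    """Store key at the first free slot at or after its home position."""
    m = len(table)
    for i in range(m):
        slot = (key % m + i) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("hash table is full")

m = 23                                         # assumed table size
table = [None] * m
home_1_keys = [1 + m * j for j in range(10)]   # ten keys with home position 1
home_2_keys = [2 + m * j for j in range(5)]    # five keys with home position 2
for key in home_1_keys + home_2_keys:
    linear_probe_insert(table, key)

occupied = [slot for slot, key in enumerate(table) if key is not None]
print(occupied)
```

The fifteen keys fill one unbroken run of slots, 1 through 15. Any later key whose home falls anywhere in that run must walk to slot 16, which is how a cluster keeps growing.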

Page 82: AADS_14_Hash Tables & Hash Functions

Problems with Quadratic Probing

Clusters can still form along shared probe sequences.

This is called secondary clustering.

It happens because keys that have the same home hash address follow the same probe sequence.

In quadratic probing too, the probe sequence is a function of the home position and not of the original key value.

Page 83: AADS_14_Hash Tables & Hash Functions

Double hashing for Collision Resolution

Page 84: AADS_14_Hash Tables & Hash Functions

Double Hashing

To avoid secondary clustering, the probe sequence must make use of the original key value in its decision process.

This is achieved using double hashing, so called because the hashing is done in two stages.

A second hash function is used as well, so as to reduce collisions.

Page 85: AADS_14_Hash Tables & Hash Functions

Double Hashing

Let H1(k) and H2(k) be two hash functions for the same key k.

The hash address for the i-th probe is obtained as

H(k, i) = (H1(k) + i * H2(k)) mod m

If the table size m is a prime number and H2(k) is never a multiple of m, the above sequence visits all locations in the hash table.

Page 86: AADS_14_Hash Tables & Hash Functions

Notes

The functions H1(k) and H2(k) are auxiliary hash functions, selected like any other hash function: so that the keys are distributed in a uniform and random manner.

Page 87: AADS_14_Hash Tables & Hash Functions

Example 1

We let H1(k) = k mod m and H2(k) = 1 + (k mod m'), where m' is slightly less than m, say m - 1 or m - 2.

For example, m = 11 and m' = 9.
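The following Python sketch shows why a key-dependent step matters. It compares the probe sequences of two keys that share home position 5 when m = 11: under quadratic probing (with c1 = c2 = ½, as in the earlier quadratic probing example) and under double hashing with m' = 9. The keys 2227 and 4537 and the function names are chosen for illustration.

```python
def quadratic_sequence(key, m):
    """Probe sequence for c1 = c2 = 1/2: (home + (i + i*i)/2) mod m."""
    return [(key % m + (i + i * i) // 2) % m for i in range(m)]

def double_hash_sequence(key, m, m_prime):
    """Probe sequence (H1 + i*H2) mod m with H1 = k mod m, H2 = 1 + k mod m'."""
    h1 = key % m
    h2 = 1 + key % m_prime
    return [(h1 + i * h2) % m for i in range(m)]

a, b = 2227, 4537   # both have home position 5 when m = 11
print(quadratic_sequence(a, 11) == quadratic_sequence(b, 11))
print(double_hash_sequence(a, 11, 9)[:3], double_hash_sequence(b, 11, 9)[:3])
```

The quadratic sequences are identical, which is secondary clustering. With double hashing the two keys start at 5 but then step by 5 and by 2 respectively, so their sequences diverge after the first probe.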

Page 88: AADS_14_Hash Tables & Hash Functions

Example 2

First use the mid square method and then apply modulo division.

Page 89: AADS_14_Hash Tables & Hash Functions

Double Hashing

Double hashing can be used to avoid both primary and secondary clustering.

H2(k) must be chosen with care.

m and H2(k) must be relatively prime; this can be ensured by making m a prime number and keeping H2(k) between 1 and m - 1.

If m is a power of two, choose an H2(k) that is always odd.
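The relatively-prime requirement can be checked directly. The Python sketch below tests, for a given step H2 and table size m, whether m probes reach every slot, and compares the result with gcd(H2, m) = 1. The function names are assumptions.

```python
from math import gcd

def probe_sequence(h1, h2, m):
    """The m slots visited by (h1 + i*h2) mod m for i = 0 .. m-1."""
    return [(h1 + i * h2) % m for i in range(m)]

def visits_every_slot(h2, m):
    """True when a probe sequence with step h2 covers all m slots."""
    return len(set(probe_sequence(0, h2, m))) == m

print(visits_every_slot(4, 11))   # m prime, any step from 1 to 10 works
print(visits_every_slot(4, 8))    # even step with m a power of two
print(visits_every_slot(3, 8))    # odd step with m a power of two
```

A step of 4 covers a table of size 11 but only two of the eight slots when m = 8, while the odd step 3 covers all eight. This is the reason for the advice to keep H2(k) odd when m is a power of two.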

Page 90: AADS_14_Hash Tables & Hash Functions

Example

Generate hash codes using double hashing for the following keys:

2227, 3545, 4537, 8981, 7857, 3433, 6965

Use the division method with H1(k) = k mod m and H2(k) = 1 + (k mod m').

We have H(k, i) = (H1(k) + i * H2(k)) mod m.

Use m = 11 and m' = 9.

Page 91: AADS_14_Hash Tables & Hash Functions

Steps

First generate hash codes with H1(k) = k mod m, using m = 11.

Then apply the second hash function wherever a collision occurs, taking m' = 9.

Page 92: AADS_14_Hash Tables & Hash Functions

Step 1: Answer

2227 mod 11 = 5
3545 mod 11 = 3
4537 mod 11 = 5
8981 mod 11 = 5
7857 mod 11 = 3
3433 mod 11 = 1
6965 mod 11 = 2

Page 93: AADS_14_Hash Tables & Hash Functions

Step 2

To resolve collisions, use the second hash function: twice for hash code 5 and once for hash code 3, and see how the mapping evolves.

Page 94: AADS_14_Hash Tables & Hash Functions

Answer: Step 2

Key    H1(k) = k mod 11   H2(k) = 1 + (k mod 9)
2227   5                  5
3545   3                  9
4537   5                  2
8981   5                  9
7857   3                  1
3433   1                  5
6965   2                  9

Page 95: AADS_14_Hash Tables & Hash Functions

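Step 3

Insert the keys in the given order, probing with H(k, i) = (H1(k) + i * H2(k)) mod 11 until a free address is found.

2227: address 5

3545: address 3

4537: 5 is occupied; (5 + 1*2) mod 11 = 7

8981: 5 is occupied; (5 + 1*9) mod 11 = 3 is occupied; (5 + 2*9) mod 11 = 1

7857: 3 is occupied; (3 + 1*1) mod 11 = 4

3433: 1 is occupied by 8981; (1 + 1*5) mod 11 = 6

6965: address 2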
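The whole double hashing example can be replayed with a short Python sketch. The function name `double_hash_insert` is an assumption; H1, H2, m = 11, and m' = 9 are the ones given in the example.

```python
def double_hash_insert(table, key, m_prime):
    """Insert using H(k, i) = (k mod m + i * (1 + k mod m')) mod m."""
    m = len(table)
    h1 = key % m
    h2 = 1 + key % m_prime
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = key
            return slot
    raise RuntimeError("no free slot found within m probes")

table = [None] * 11
for k in (2227, 3545, 4537, 8981, 7857, 3433, 6965):
    print(k, "->", double_hash_insert(table, k, 9))
```

The keys are placed at addresses 5, 3, 7, 1, 4, 6, 2. Key 3433 has home position 1 and no other key shares that home, yet it still needs a second probe because 8981 was pushed into address 1 earlier.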

Page 96: AADS_14_Hash Tables & Hash Functions

Sparse Matrices