AADS_14_Hash Tables & Hash Functions

Post on 28-Apr-2015

Hash Tables & Hash Functions

AADS-14

Significance of complexity of Search

In an unordered list, the time required to find a value is O(n).

In an ordered list this time can be improved, and there could also be improvement in the modification operations.

In a Binary Search Tree the search time can improve to O(log n), provided the tree stays balanced.

The same bound applies to AVL trees.

Dictionary Data Structure

A Dictionary is a general form of data structure for storing keys and values.

It can be implemented using Array or Linked List structures.

In a Dictionary, each element can be addressed directly by using its key value as the index, provided the Dictionary is large enough to hold every possible key.

Key Search/Dictionary Storage

But in any complex application the memory is used simultaneously by many processes.

Also, there could be frequent accesses to the keys at runtime.

So there is a need to reduce both the space used and the search time.

Example-1

A 4-digit number used as the key may need 9999 locations.

If the key stands for the Employee ID of a company with 500 employees, then only 500 of those locations are used once all the keys are placed in memory.

Example 2

A hospital might have a large number of patients, both inpatients and outpatients.

The database system can be modeled to group the patients and then index them so that records are retrieved quickly.

Another way is not to group, but to assign only one number to each case.

Search time of O(1)

Whether the amount of data is large or small, an amortized time of O(1), or close to it, can be achieved if we know the location of the data or key we are looking for.

This location can be obtained by mapping the key to a new hashed key using a suitable function.

Hashing

Hashing can provide either a unique location for each key or a reference to a shorter list of keys, from which we can easily get the data pertaining to one key.

It may also use less space in memory: instead of a large array, we can use a short array or linked list.

Hash Table

A Hash Table is a data structure.

Hash tables provide O(1) time on average for search, insert and delete on any value stored in the table.

Hash Table?

A hash table is an array, say T[1..m], where m is a positive integer called the table size.

When we try to put an item into a spot in the hash table that is already occupied, the situation is called a collision.

It is resolved using a collision resolution policy.

Hashing-Mathematical Definition

Hashing is a mapping operation.

Consider a set K of keys.

Let H be a function that maps the keys to a new set L, such that

H: K → L

Hash Function & Hash Address

The function H is called the HASH FUNCTION.

The mapping done by the function H is called HASHING.

The object L is the Hash Table.

Each cell/location in L is identified using the Hash Address.

Hash Address

Let k be a key in K, that is, k ∈ K.

Then k has a mapped address in L given by H(k), known as the Hash Address.

The Hash Address d of a key k is the mapped address/location given by the hashing operation

d = H(k)

Indexing on the Hash table

The hash address d points directly to a location in L.

This address d is also called the Hash Address or Hash Code for the key k.

The process of hashing is also called Compression.

Notes

There is no meaningful relationship between the actual data value k and the hash key d.

So there is no practical way to traverse a hash table, except a direct search using d.

Hash table items are not in any order.

There is no mapping function from d back to k, except the hash table itself.

The purpose of hash tables is to provide fast lookups.

Illustration- Bucket Array Structure for Hash Table

CELL   KEY
1      k1
2      k2
3      k3
...    ...
L-1    kN-1
L      kN

Uses of Hash Tables

Compilers use hash tables for symbol storage.

The Linux Kernel uses hash tables to manage memory pages and buffers.

High speed routing tables use hash tables.

Database systems use hash tables.

Operations on Hash Tables

Initialize

Insert(k)

Search(k)

Remove(k)

Sizeof

Isempty
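The operations listed above can be sketched as a small Python class. This is only an illustration under stated assumptions: the class and method names are hypothetical, the hash address is taken as k mod m, and each bucket is a Python list (separate chaining), both of which the later slides describe.

```python
class ChainedHashTable:
    """Illustrative hash table with separate chaining (all names are hypothetical)."""

    def __init__(self, m=11):
        # Initialize: m empty buckets, each bucket is a Python list.
        self.m = m
        self.buckets = [[] for _ in range(m)]
        self.count = 0

    def _address(self, k):
        # Assumed hash function: H(k) = k mod m.
        return k % self.m

    def insert(self, k):
        bucket = self.buckets[self._address(k)]
        if k not in bucket:  # duplicate keys are ignored
            bucket.append(k)
            self.count += 1

    def search(self, k):
        # Only the one bucket addressed by H(k) is examined.
        return k in self.buckets[self._address(k)]

    def remove(self, k):
        bucket = self.buckets[self._address(k)]
        if k in bucket:
            bucket.remove(k)
            self.count -= 1
            return True
        return False

    def size(self):
        return self.count

    def is_empty(self):
        return self.count == 0
```

Each operation touches a single bucket, which is why the cost stays close to O(1) as long as the buckets remain short.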

Types of hashing

There are two types

1. Open hashing- Open Chaining-Closed Addressing-Separate Chaining

2. Closed hashing- Open Addressing

Open hashing-Open Chaining

Used when the amount of data to be stored is high.

Uses a hash function to obtain the hash address.

All data with the same hash address are stored as a shorter list, referenced from that hash address.

Bucket in Open hashing

Each hash location in the Hash Table is said to be a bucket for the data with that index.

Data within the bucket are best organized as a Linked List.

[Figure: bucket array with cells 1, 2, 3, ..., L-1, L and keys k1, k2, k3, ..., kN-1, kN]

Closed hashing-Open Addressing

Closed hashing uses a fixed space.

Hashing maps a key into one of the locations in the earmarked space.

If multiple keys hash to the same address (collision), the tie must be resolved.

A bucket may be small enough to hold only one value at a time.

Topics in Hashing

Basically there are two subareas under “Hashing”

1. Hash Functions

2. Collision Resolutions

Hash Functions

1. The Hash Function H should be easy to compute

2. The function H should, as far as possible, uniformly distribute the hash addresses throughout the set L so that there are a minimum number of collisions

Hash Functions

Requirement of Hash Functions

The main idea of using a Hash Function H is that, for a key k, H yields a value H(k) that serves as an index into a hash table cell/bucket, so that we can easily locate the key k in the Hash Table for search/insert.

Hash Functions

Division Method

Mid Square Method

Multiplication Method

Division Method

Choose a prime number that is not close to a power of 2.

Let m be the selected number.

Then m also indicates the size of the Hash Table in the ideal case with one cell in each bucket.

The hash address/bucket address is given by

H(k) = k mod m

Example

Given keys are

4845, 5679, 6381, 3636, 7180, 8126, 1127

Use Table size m=7

Hash to a Table with 7 cells

Also use m=11

and m=8 to repeat the exercise

Answer

HASH ADDRESS   KEY
0              1127
1              4845
2              5679
3              3636
4              6381
5              7180
6              8126
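The division-method example above can be checked with a few lines of Python. This is a minimal sketch; the function names `division_hash` and `addresses` are chosen here for illustration only.

```python
def division_hash(k, m):
    """Division method: H(k) = k mod m."""
    return k % m


def addresses(keys, m):
    """Map every key to its hash address for table size m."""
    return {k: division_hash(k, m) for k in keys}


keys = [4845, 5679, 6381, 3636, 7180, 8126, 1127]

# Repeat the exercise for the three table sizes named in the example.
for m in (7, 11, 8):
    print("m =", m, addresses(keys, m))
```

With m = 7 every key receives a distinct address, while with m = 11 and m = 8 some keys share an address, so those tables would need a collision resolution policy.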

Choosing Table size in Division Method

When using the division method, ample consideration must be given to the size of the table.

The best choice for table size is usually a prime number not too close to a power of 2.

Division Method for Chaining

Here, the Hash Table will have many cells.

Hash addresses map multiple keys to a single location, so there could be multiple entries in one location.

These multiple entries under a single hash code are held as a linked list.

Illustration

Take table size m as 11 to map a set of keys.

Keys: 122, 221, 661, 90, 167, 57, 69

Modulo divide each by 11 and get the hash addresses.

Answer- We get the following Table

HASH ADDRESS   KEYS
0
1              122, 221, 661
2              90, 167, 57
3              69
4
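A short sketch of how the chained table in this illustration could be built, assuming Python lists stand in for the linked lists and using the keys 122, 221, 661, 90, 167, 57, 69 given in the illustration. The function name `build_chained_table` is hypothetical.

```python
def build_chained_table(keys, m):
    """Place each key in the chain at address k mod m (separate chaining)."""
    table = [[] for _ in range(m)]
    for k in keys:
        table[k % m].append(k)
    return table


table = build_chained_table([122, 221, 661, 90, 167, 57, 69], 11)

# Print only the addresses that actually hold keys.
for address, chain in enumerate(table):
    if chain:
        print(address, chain)
```

Addresses 1, 2 and 3 receive chains of length 3, 3 and 1; every other address stays empty.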

Load Factor

Let there be m slots in a Hash Table.

At the instant of observation the number of elements is n.

Therefore the load factor is α = n/m.

This is the average number of elements stored per slot; it can be less than, equal to or greater than 1.

Find the Load Factor

SLOT   KEYS
0      110
1      89, 45
2      68, 167, 57
3
4      554
5      82, 225
6-8
9      108
10     109

Solution

There are 11 slots and 11 elements, so α = 11/11 = 1.

So α indicates the average number of elements per position.

Also, we get α = 1 even though there are vacant slots, because it only shows the average.
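The load factor calculation can be sketched in Python as follows. The chains below are an assumed reading of the exercise table (11 keys over 11 slots, with slots 3, 6, 7 and 8 vacant), and `load_factor` is a name chosen here for illustration.

```python
def load_factor(table):
    """Load factor = n / m: stored elements divided by number of slots."""
    n = sum(len(chain) for chain in table)
    m = len(table)
    return n / m


# Assumed reading of the exercise table: index = slot, list = chain of keys.
table = [
    [110],           # slot 0
    [89, 45],        # slot 1
    [68, 167, 57],   # slot 2
    [],              # slot 3 (vacant)
    [554],           # slot 4
    [82, 225],       # slot 5
    [], [], [],      # slots 6 to 8 (vacant)
    [108],           # slot 9
    [109],           # slot 10
]

print(load_factor(table))
```

The result is 1.0 even though four slots are vacant, which is the point the solution makes about the load factor being only an average.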

Notes on α

The load factor α assumes various values as the number of keys in the Hash Table changes.

Accordingly, α could be less than, equal to, or greater than one in a Hash Table formed using Separate Chaining (Open Hashing).

In a Hash Table formed using Open Addressing (Closed Hashing), α can never exceed one, because each slot holds at most one key.

α decides the complexity of the operations on the Hash Table such as insert, search and delete.

Hashing the Strings

Exercise

Map the following keys using the hash function defined as follows:

Find the ASCII values of the first and last characters.

If there is only one character, it serves as both the first and the last.

Add the ASCII value of the last character to the ASCII value of the first multiplied by 256.

Apply mod m division to the resulting number.

Keys

A, BABU, CHOWHAN, SUMAN, DILIP

The 5 symbols are: AA, BU, CN, SN, DP

These 5 symbols are then converted to a numerical code using the rule given previously, by employing the ASCII values of the characters in the symbols.

ASCII Values

A-65  B-66  C-67  D-68  E-69  F-70  G-71  H-72  I-73

J-74  K-75  L-76  M-77  N-78  O-79  P-80  Q-81  R-82

S-83  T-84  U-85  V-86  W-87  X-88  Y-89  Z-90

Example- Answer

AA   256*65 + 65 = 16705
BU   256*66 + 85 = 16981
CN   256*67 + 78 = 17230
SN   256*83 + 78 = 21326
DP   256*68 + 80 = 17488

Solution

Take m=7 and obtain the Hash Addresses.

AA   16705 mod 7 = 3
BU   16981 mod 7 = 6
CN   17230 mod 7 = 3
SN   21326 mod 7 = 4
DP   17488 mod 7 = 2

HASH ADDRESS   KEYS
0
1
2              DILIP
3              A, CHOWHAN
4              SUMAN
5
6              BABU

Symbol Table

Compilers use a method similar to the previous one to form a symbol table for parsing during compilation.

Hash Functions for string hashing

Hash Functions perform two separate functions:

1 – Convert the string to a key.

2 – Constrain the key to a positive value less than the size of the table.

The best strategy is to keep the two functions separate so that there is only one part to change if the size of the table changes.
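Keeping the two steps separate can be sketched like this, using the first/last character rule from the string exercise. The function names `string_to_key`, `compress` and `hash_string` are hypothetical, and the keys are assumed to be plain ASCII strings.

```python
def string_to_key(s):
    """Step 1: 256 * ASCII(first character) + ASCII(last character)."""
    return 256 * ord(s[0]) + ord(s[-1])


def compress(key, m):
    """Step 2: constrain the key to an address in the range 0..m-1."""
    return key % m


def hash_string(s, m):
    # Only compress() depends on the table size m.
    return compress(string_to_key(s), m)


names = ["A", "BABU", "CHOWHAN", "SUMAN", "DILIP"]
for name in names:
    print(name, string_to_key(name), hash_string(name, 7))
```

If the table size changes, only the `m` passed to `compress` changes; `string_to_key` is untouched. Note that 256*67 + 78 evaluates to 17230, so CHOWHAN is placed at address 3 for m = 7.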

Notes-Chaining method

In principle, the chaining method gives infinite space in the hash table.

But in practical applications, only limited space is allotted for one hash table in memory.

A collision needs no separate resolution in chaining: colliding keys are simply added to the same list.

Collisions

Collision

In the case of closed hashing (open addressing), even though H ideally gives distinct addresses in L for each member of K, in a real situation two or more keys may LEAD TO A SINGLE Hash Address when a given Hash Function is used.

This situation is called a collision.

We need some method to resolve the collision.

The method is called a “Collision Resolution Policy”.

Collision Resolution Policy

Linear Probing

Quadratic Probing

Double Hashing

Linear Probing

If a collision occurs during an insert, look for the next free location and store the key there.

During a search, if the key is not found at its home position, look for it in the following cells in a linear manner.

Example

Let H be mod 11.

Let the keys 56, 78, 100 appear in this order for hashing.

All of these have position 1 as home.

The table is considered a circular array.

HASH ADDRESS   KEY
0
1              56
2              78
3              100
4
...
8
9
10

Exercise

Hash 45, 39, 66, 74 in that order with table size m=7.

45 mod 7 = 3
39 mod 7 = 4
66 mod 7 = 3
74 mod 7 = 4

HASH ADDRESS   KEY
0
1
2
3              45
4              39
5              66
6              74

Exercise

Let H be mod 11.

Let the keys 46, 122, 222, 441 appear in this order for hashing.

46 mod 11 = 2
122 mod 11 = 1
222 mod 11 = 2
441 mod 11 = 1

Solution

HASH ADDRESS   KEY
0
1              122
2              46
3              222
4              441
...
8
9
10
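The linear probing examples above can be reproduced with a short sketch. It assumes H(k) = k mod m, a table stored as a Python list with `None` marking a free cell, and no deletions (so an empty cell can end a search). All function names are chosen here for illustration.

```python
def linear_probe_insert(table, k):
    """Insert k at its home position or at the next free cell (circular)."""
    m = len(table)
    home = k % m
    for i in range(m):
        slot = (home + i) % m  # wrap around: the table is a circular array
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("hash table is full")


def linear_probe_search(table, k):
    """Return the slot holding k, or None if k is absent."""
    m = len(table)
    home = k % m
    for i in range(m):
        slot = (home + i) % m
        if table[slot] is None:  # a free cell ends the search (no deletions assumed)
            return None
        if table[slot] == k:
            return slot
    return None


def build(keys, m):
    table = [None] * m
    for k in keys:
        linear_probe_insert(table, k)
    return table


print(build([56, 78, 100], 11))
print(build([45, 39, 66, 74], 7))
print(build([46, 122, 222, 441], 11))
```

In the last table, 441 has home position 1 but ends up in slot 4, three cells away, which already hints at the clustering behaviour discussed later.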

More on Hash Functions

Mid Square Method of hashing

Mid square method

1. The key k is squared to get k².

2. This value is now treated as a string of digits.

3. Then the hash function H(k) is defined as H(k) = f.

4. This f is obtained by deleting digits from both ends of k².

5. Once chosen, the same positions of k² must be used for all keys consistently.

Example

k       k²            H(k)
3205    10 272 025    72
7148    51 093 904    93
2345    5 499 025     99

Here H(k) is formed from the fifth and fourth digits of k², counted from the right.
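A minimal sketch of the mid-square example, assuming the hash is formed from the fifth and fourth digits of k² counted from the right, which is the choice that reproduces 72, 93 and 99. The function name `mid_square_hash` is hypothetical.

```python
def mid_square_hash(k):
    """Square k, drop the last three digits, keep the next two digits."""
    squared = k * k
    return (squared // 1000) % 100


for k in (3205, 7148, 2345):
    print(k, k * k, mid_square_hash(k))
```

Integer division by 1000 removes the three rightmost digits, and the remainder modulo 100 discards everything to the left of the chosen pair, so the same digit positions are used for every key.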

Multiplication Method for hashing

Multiplication method for Hashing

This method uses a hashing scheme that is different from the Division method.

The function takes the form

H(k) = floor(m * (kA mod 1))

where 0 < A < 1 and kA mod 1 refers to the fractional part of kA.

Since 0 ≤ kA mod 1 < 1, the range of H(k) is from 0 to m-1.

Advantage of Multiplication Method

The advantage of the multiplication method is that it works equally well with any table size m.

A should be chosen carefully.

Rational numbers should not be chosen for A.

An example of a good choice for A is

A = (√5 - 1)/2 ≈ 0.618

Example

Obtain the Hash Codes for the keys 2343, 4345, 6567, 3476, 1215 with m=11, A=0.618.

KEY    COMPUTATION                         HASH CODE
2343   floor(11 * (2343 * 0.618 mod 1))    10
4345   floor(11 * (4345 * 0.618 mod 1))    2
6567   floor(11 * (6567 * 0.618 mod 1))    4
3476   floor(11 * (3476 * 0.618 mod 1))    1
1215   floor(11 * (1215 * 0.618 mod 1))    9

MATLAB command: floor(11*mod((k*0.618),1))

Solution

HASH ADDRESS   KEY
0
1              3476
2              4345
3
4              6567
...
8
9              1215
10             2343
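The same computation can be sketched in Python as a counterpart to the MATLAB one-liner. The function name `multiplication_hash` is chosen here for illustration, and floating-point arithmetic is assumed to be accurate enough for keys of this size.

```python
import math


def multiplication_hash(k, m, A=0.618):
    """Multiplication method: H(k) = floor(m * (k*A mod 1))."""
    fractional_part = (k * A) % 1.0
    return math.floor(m * fractional_part)


keys = [2343, 4345, 6567, 3476, 1215]
for k in keys:
    print(k, multiplication_hash(k, 11))
```

Because the fractional part is always below 1, the result never reaches m, so the addresses fall in 0 to m-1 for any table size.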

More on Collision Resolution

Quadratic Probing for Collision Resolution

Notes on Linear Probing

Linear probing is simple to program.

Linear probing has better locality of reference and hence better cache performance.

Primary Clustering in Linear Probing

Linear probing uses the probe sequence H+1, H+2, H+3 and so on to find space for a key whose primary hash value is H.

This leads to clustering of hash codes near some cells, called primary clustering.

The larger the cluster, the lower the search efficiency.

Uniform Hashing & Random Probing

If we generate hash codes in a uniformly distributed manner and use a larger table size, the process may avoid collisions.

Even if collisions occur, we may use a pseudo-random sequence to probe the locations.

But this approach reduces the locality of reference, which then becomes a random variable.

So it is better to use a middle path between linear probing and random probing.

Quadratic Probing

Instead of traversing the hash table slots linearly after a collision, quadratic probing introduces more spacing between the slots we try.

This reduces the clustering effect seen in linear probing.

Clustering can still occur, because Quadratic Probing is not immune to it.

Quadratic Probing preserves some locality of reference and hence gives good cache performance, though lower than that of Linear Probing.

Hash Function for quadratic probing

H(k,i) = (H'(k) + c1*i + c2*i²) mod m

where c1 and c2 are auxiliary constants.

H' is an auxiliary hash function. It could be k mod m.

i = 0, 1, 2, …, m-1 is called the probe number.

For a given Hash table, c1 and c2 remain constant.

Choices for c1 and c2 are c1 = c2 = ½; c1 = c2 = 1; c1 = 0, c2 = 1.

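Example

c1 = c2 = ½. Take m = 11.

Let the keys 46, 122, 222, 441 appear in this order for hashing.

46 mod 11 = 2, so 46 goes to slot 2.

122 mod 11 = 1, so 122 goes to slot 1.

222 mod 11 = 2 (occupied); i=1: (2 + 0.5*1 + 0.5*1²) mod 11 = 3, so 222 goes to slot 3.

441 mod 11 = 1 (occupied); i=1: (1 + 0.5*1 + 0.5*1²) mod 11 = 2 (occupied); i=2: (1 + 0.5*2 + 0.5*2²) mod 11 = 4, so 441 goes to slot 4.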

Exercise

Apply Quadratic Probing for the following Hash Addresses:

78 mod 11 = 1
89 mod 11 = 1
111 mod 11 = 1
166 mod 11 = 1

Answer

KEY    HOME              PROBE                                    SLOT
78     78 mod 11 = 1     (none needed)                            1
89     89 mod 11 = 1     i=1: (1 + 0.5*1 + 0.5*1²) mod 11         2
111    111 mod 11 = 1    i=2: (1 + 0.5*2 + 0.5*2²) mod 11         4
166    166 mod 11 = 1    i=3: (1 + 0.5*3 + 0.5*3²) mod 11         7
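The quadratic probing exercise can be traced with the sketch below. It assumes c1 = c2 = ½ and H'(k) = k mod m as in the example; since i + i² is always even, the offset (i + i²)/2 is computed with integer division. The function name `quadratic_probe_insert` is hypothetical.

```python
def quadratic_probe_insert(table, k):
    """Insert k using H(k, i) = (k mod m + i/2 + i*i/2) mod m."""
    m = len(table)
    home = k % m
    for i in range(m):
        offset = (i + i * i) // 2  # c1 = c2 = 1/2; i + i*i is always even
        slot = (home + offset) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("no free slot found along the probe sequence")


table = [None] * 11
slots = [quadratic_probe_insert(table, k) for k in (78, 89, 111, 166)]
print(slots)
```

All four keys share home position 1, so each one retraces the probes of the previous keys before finding its own slot, which is the behaviour the following notes call secondary clustering.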

Notes

If two keys have the same initial probe position, then their probe sequences are the same, since H(k1, 0) = H(k2, 0) implies H(k1, i) = H(k2, i).

This property leads to a milder form of clustering called secondary clustering.

Clustering

Problems with Linear Probing

Linear probing leads to Primary Clustering: the hashed keys share substantial segments of a probe sequence, because keys hashed into the same home position have the same probe sequence.

And the hash addresses that collide at a home address, say b, extend the cluster.

Primary Clustering

As we have seen, once a block of a few contiguous occupied positions emerges in the Hash Table, it becomes a “target” for subsequent collisions.

As clusters grow, they also merge to form larger clusters.

Primary clustering means that elements which hash to different cells probe the same alternative cells.

Clustering is reduced only if the hash addresses have different home positions.

Example

Suppose we have 10 Hash Codes with value 1 and 5 Hash Codes with value 2.

All these codes will cluster around positions 1 and 2.

Problems with Quadratic Probing

There could be adjacent clusters that join to form composite clusters.

This is called secondary clustering.

This happens because keys that have the same home hash address follow the same probe sequence.

In Quadratic Probing too, the probe sequence is a function of the home position and not of the original key value.

Double hashing for Collision Resolution

Double Hashing

To avoid secondary clustering, we need a probe sequence that makes use of the original key value in its decision process.

This is achieved using Double Hashing, so called because the hashing is done in two stages.

We use a second hash function as well, so as to reduce the collisions.

Double Hashing

Let H1(k) and H2(k) be two hash functions for the same key k.

Then H(k,i) is obtained as

H(k,i) = {H1(k) + i*H2(k)} mod m for the i-th probe in the sequence.

If the table size m is a prime number and H2(k) is never a multiple of m, the above sequence visits all locations in the Hash Table.

Notes

The functions H1(k) and H2(k) are auxiliary hash functions, selected like any hash function so that the keys are distributed in a uniform and random manner.

Example 1

We let H1(k) = k mod m and H2(k) = 1 + (k mod m'), where m' is slightly less than m, say, m - 1 or m - 2.

For example, m=11 and m'=9.

Example 2

First use the Mid Square Method and then use Modulo Division.

Double hashing

Double hashing can be used to avoid both primary and secondary clustering.

H2(k) must be chosen with care.

m and H2(k) must be relatively prime; this can be ensured by making m a prime number and keeping 0 < H2(k) < m.

If m is a power of two, then choose an H2(k) that is always odd.

Example

Generate Hash Codes using Double Hashing for the following:

2227, 3545, 4537, 8981, 7857, 3433, 6965

Use the Division Method with H1(k) = k mod m and H2(k) = 1 + (k mod m').

We have H(k,i) = {H1(k) + i*H2(k)} mod m.

Use m=11 and m'=9.

Steps

First generate hash codes with H1(k) = k mod m using m=11.

Then apply the second hash function wherever collisions occur. Take m'=9.

Step 1-Answer

2227 mod 11 = 5
3545 mod 11 = 3
4537 mod 11 = 5
8981 mod 11 = 5
7857 mod 11 = 3
3433 mod 11 = 1
6965 mod 11 = 2

Step 2

To resolve collisions, use the second Hash Function: two times for Hash Code 5 and once for Hash Code 3, and see how the mapping evolves.

Answer-Step 2

2227 mod 9 + 1 = 5
3545 mod 9 + 1 = 9
4537 mod 9 + 1 = 2
8981 mod 9 + 1 = 9
7857 mod 9 + 1 = 1
3433 mod 9 + 1 = 5
6965 mod 9 + 1 = 9

Step 3

KEY    PROBE                 HASH ADDRESS
2227   5                     5
3545   3                     3
4537   (5 + 1*2) mod 11      7
8981   (5 + 2*9) mod 11      1
7857   (3 + 1*1) mod 11      4
3433   (1 + 1*5) mod 11      6
6965   2                     2

8981 needs a second probe because its first probe, (5 + 1*9) mod 11 = 3, is already occupied by 3545. 3433 finds its home address 1 taken by 8981 and therefore moves to 6.
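The double hashing steps above can be sketched as follows, assuming H1(k) = k mod m, H2(k) = 1 + (k mod m') and a Python list with `None` marking a free cell. The function name `double_hash_insert` and the parameter name `m_prime` are chosen here for illustration.

```python
def double_hash_insert(table, k, m_prime):
    """Insert k using H(k, i) = (H1(k) + i * H2(k)) mod m."""
    m = len(table)
    h1 = k % m              # H1(k) = k mod m
    h2 = 1 + (k % m_prime)  # H2(k) = 1 + (k mod m'), never zero
    for i in range(m):
        slot = (h1 + i * h2) % m
        if table[slot] is None:
            table[slot] = k
            return slot
    raise OverflowError("no free slot found along the probe sequence")


keys = [2227, 3545, 4537, 8981, 7857, 3433, 6965]
table = [None] * 11
placement = {k: double_hash_insert(table, k, 9) for k in keys}
print(placement)
```

Keys 2227, 4537 and 8981 all start at home address 5, yet their step sizes (5, 2 and 9) differ, so they follow different probe sequences. This is how double hashing avoids the shared sequences seen with linear and quadratic probing.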

Sparse Matrices