CE 221 Data Structures and Algorithms

37
CE 221 Data Structures and Algorithms Chapter 5: Hashing Text: Read Weiss, §5.1- 5.5 1 Izmir University of Economics

description

CE 221 Data Structures and Algorithms. Chapter 5: Hashing T ext : Read Weiss, § 5.1-5.5. Introduction. Search Tree ADT was disccussed. Now, Hash Table ADT Hashing is a technique used for performing insertions and deletions in constant average time. - PowerPoint PPT Presentation

Transcript of CE 221 Data Structures and Algorithms

Page 1: CE 221 Data Structures and Algorithms

CE 221Data Structures and Algorithms

Chapter 5: Hashing

Text: Read Weiss, §5.1-5.5

1Izmir University of Economics

Page 2: CE 221 Data Structures and Algorithms

2

Introduction

• Search Tree ADT was disccussed.• Now, Hash Table ADT• Hashing is a technique used for performing

insertions and deletions in constant average time.

• Thus, findMin, findMax and printAll are not supported.

Izmir University of Economics

Page 3: CE 221 Data Structures and Algorithms

3

Tree Structures

find,insert,delete

worst

caseaverage

case

BST N log N

AVL log N log N

Izmir University of Economics

Page 4: CE 221 Data Structures and Algorithms

4

Goal

• Develop a structure that will allow users to insert / delete / find records in

constant average time

– structure will be a table (relatively small)– table completely contained in memory– implemented by an array– capitalizes on ability to access any element of

the array in constant time

Izmir University of Economics

Page 5: CE 221 Data Structures and Algorithms

5

Hash Function• Determines position of a key in the array.

• Assume table (array) size is N (TableSize).

• Hash function hash(x) maps any key x to an int between 0 and N−1

• For example, assume that N=15, that key x is a non-negative integer between 0 and MAX_INT, and hash function hash(x) = x % 15.

Izmir University of Economics

Page 6: CE 221 Data Structures and Algorithms

6

Choosing the Hash Functions (1)• If keys are integers

key % TableSize ex:all keys=10*i,TableSize=10

So, It’s a good idea to make TableSize a prime• If keys are strings, then ASCII codes of chars

% TableSize, when TableSize=10,007 and

keys are at most 8 chars in length (127*8=1,016)• The base 26+1=27 representation of the first 3

letters of the key

(key[0] +27* key[1] + 729* key[2] ) % TableSize

(263=17,576 but only 2,851 combinations in English)

1

0

][keySize

i

ikey

Izmir University of Economics

Page 7: CE 221 Data Structures and Algorithms

7

Choosing the Hash Functions (2)• Another hash

function involving all characters

• compute this by Horner’s Rule

1

0

37*]1[keySize

i

iikeySizekey

Izmir University of Economics

Page 8: CE 221 Data Structures and Algorithms

8

Hash Function

Let hash(x) = x % 15. Then,if x =25 129 35 2501 47 36 f(x) = 10 9 5 11 2 6

Storing the keys in the array is straightforward:

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14_ _ 47 _ _ 35 36 _ _ 129 25 2501 _ _ _

• Thus, delete and find can be done in O(1), and also insert, except…Izmir University of Economics

Page 9: CE 221 Data Structures and Algorithms

9

Hash FunctionWhat happens when you try to insert: x = 65 ?

x = 65hash(x) = 5

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14_ _ 47 _ _ 35 36 _ _ 129 25 2501 _ _ _ 65(?)

• If, when an element is inserted, it hashes to the same value as an already inserted element, this is called a collision.

Izmir University of Economics

Page 10: CE 221 Data Structures and Algorithms

10

Handling Collisions

• Separate Chaining

• Open Addressing– Linear Probing– Quadratic Probing– Double Hashing

Izmir University of Economics

Page 11: CE 221 Data Structures and Algorithms

11

Handling Collisions

Separate Chaining

Izmir University of Economics

Page 12: CE 221 Data Structures and Algorithms

12

Separate Chaining• Keep a list of elements

that hash to the same value

• New elements can be inserted at the front of the list

• ex: x=i2 and hash(x)=x%10

Izmir University of Economics

Page 13: CE 221 Data Structures and Algorithms

13

Performance of Separate Chaining• Load factor of a hash table, λ• λ = N/M (# of elements in the table/TableSize)

• So the average length of list is λ• Search Time = Time to evaluate hash function +

the time to traverse the list

• Unsuccessful search= λ nodes are examined

• Successful search=1 + ½* (N-1)/M (the node searched + half the expected # of other nodes) =1+1/2 *λ

• Observation: Table size is not important but load factor is.

• For separate chaining make λ 1

Izmir University of Economics

Page 14: CE 221 Data Structures and Algorithms

14

Separate Chaining: Disadvantages

• Parts of the array might never be used.

• As chains get longer, search time increases to O(N) in the worst case.

• Constructing new chain nodes is relatively expensive (still constant time, but the constant is high).

• Is there a way to use the “unused” space in the array instead of using chains to make more space?

Izmir University of Economics

Page 15: CE 221 Data Structures and Algorithms

Handling Collisions

Open Addressing

15Izmir University of Economics

Page 16: CE 221 Data Structures and Algorithms

16

Handling CollisionsLinear Probing

An alternative to resolving collisions with linked lists is to try alternative cells until an empty cell is found.

Alternative Cells are h0(x), h1(x), h2(x),... where

hi(x)=(hash(x)+f(i)) % TableSizewith f(0)=0, f is the collision resolution strategy.

Because all the data go inside the tableFor Linear probing, λ < 0.5

Izmir University of Economics

Page 17: CE 221 Data Structures and Algorithms

17

Linear ProbingIn linear probing f(i)=i (Popular Choice)Let key x be stored in element hash(x)=t of the array

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 47 35 36 129 25 2501 65(?)

What do you do in case of a collision?• If the hash table is not full, attempt to store key in

the next array element (in this case (t+1)%N, (t+2)%N, (t+3)%N …)

until you find an empty slot.

Izmir University of Economics

Page 18: CE 221 Data Structures and Algorithms

18

Linear Probing

Where do you store 65 ?

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 47 35 36 65 129 25 2501 attempts

Izmir University of Economics

Page 19: CE 221 Data Structures and Algorithms

19

Linear Probing Performance (1)• If the table is relatively empty, blocks of

occupied cells start forming (primary clustering)

• Expected # of probes• for insertions and unsuccessful searches is

½(1+1/(1-λ)2)

• for successful searches is ½(1+1/(1-λ))

Izmir University of Economics

Page 20: CE 221 Data Structures and Algorithms

20

Linear Probing Performance (2)• Assumptions: If clustering is not a problem, large table,

probes are independent of each other

• Expected # of probes for unsuccessful searches (=expected # of probes until an empty cell is found)

• Expected # of probes for successful searches (=expected # of probes when an element was inserted)

(=expected # of probes for an unsuccessful search)

• Average cost of an insertion

1

1*)1(*

0

1

i

ii (fraction of empty cells = 1 -λ)

1

1

1

1ln

1

1

11)(

0

dxx

I (earlier insertion are cheaper)

Izmir University of Economics

Page 21: CE 221 Data Structures and Algorithms

21

Linear Probing

• Eliminates need for separate data structures (chains), and the cost of constructing nodes.

• Leads to problem of clustering. Elements tend to cluster in dense intervals in the array.

• Search efficiency problem remains.• Deletion becomes trickier….

Izmir University of Economics

Page 22: CE 221 Data Structures and Algorithms

22

Deletion Problem -- SOLUTION• Standard deletion cannot be performed in a probing hash

table, because the cell might have caused a collison to go past it. “Lazy” deletion

• Each cell is in one of 3 possible states:– active – empty– deleted

• For Find or Delete – only stop search when EMPTY state detected (not DELETED)

Izmir University of Economics

Page 23: CE 221 Data Structures and Algorithms

23

Deletion-Aware Algorithms• Insert

– call Find, if cell = active already there– if Cell = deleted or empty insert, cell = active

• Find– cell empty NOT found– cell deleted if key == key -> NOT FOUND

else H = (H + 1) mod TS– cell active if key == key -> FOUND

else H = (H + 1) mod TS• Delete

– call Find, – cell active DELETE; cell=deleted– cell deleted NOT found– cell empty NOT found

Izmir University of Economics

Page 24: CE 221 Data Structures and Algorithms

24

Handling Collisions

Quadratic Probing

Izmir University of Economics

Page 25: CE 221 Data Structures and Algorithms

25

Quadratic Probing

In quadratic probing f(i)=i2 (Popular Choice)Let key x be stored in element hash(x)=t of the array

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 47 35 36 129 25 2501 65(?)

What do you do in case of a collision?• If the hash table is not full, attempt to store key in

array elements (t+12)%N, (t+22)%N, (t+32)%N …until you find an empty slot.

Izmir University of Economics

Page 26: CE 221 Data Structures and Algorithms

26

Quadratic Probing

Where do you store 65 ? hash(65)=t=5

0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 47 35 36 129 25 2501 65 t t+1 t+4 t+9

attempts

Izmir University of Economics

Page 27: CE 221 Data Structures and Algorithms

27

Quadratic Probing• Theorem: If quadratic probing is used, and the table size is prime,

then a new element can always be inserted if the table is at least half empty.

• Proof: Let TableSize be an odd prime > 3,

-prove first alternative locations are all distinct.

-Assume two of these locations are the same and . Then,

• Since TableSize is prime and i and j are distinct (also less than floor(TableSize)), this is not possible. It follows that the first M/2 alternative are all distinct, and an insertion must succeed if the table is at least half full.

2/TableSize

ji

)(mod0))((

)(mod0

)(mod

)(mod)()(

22

22

22

TableSizejiji

TableSizeji

TableSizeji

TableSizejxhixh

Izmir University of Economics

Page 28: CE 221 Data Structures and Algorithms

28

Quadratic Probing• Tends to distribute keys better than linear probing,

alleviates the problem of clustering (primary clustering.

• There remains the problem of secondary clustering in which elements that hash to the same position will probe the same alternative cells

• Runs the risk of an infinite loop on insertion, unless precautions are taken.

• E.g., consider inserting the key 16 into a table of size 16, with positions 0, 1, 4 and 9 already occupied.

• Therefore, table size should be prime.

Izmir University of Economics

Page 29: CE 221 Data Structures and Algorithms

29

Handling Collisions

Double Hashing

Izmir University of Economics

Page 30: CE 221 Data Structures and Algorithms

30

Double Hashing

• f(i)=i*hash2(x) is a popular choice• hash2(x)should never evaluate to zero• Now the increment is a function of the key

– The slots visited by the hash function will vary even if the initial slot was the same

– Avoids clustering

• Theoretically interesting, but in practice slower than quadratic probing, because of the need to evaluate a second hash function.

Izmir University of Economics

Page 31: CE 221 Data Structures and Algorithms

31

Double Hashing

• Typical second hash function

hash2(x)=R − ( x % R )

where R is a prime number, R < N

Izmir University of Economics

Page 32: CE 221 Data Structures and Algorithms

32

Double Hashing

Where do you store 99 ? hash(99)=t=9Let hash2(x) = 11 − (x % 11), hash2(99)=d=11Note: R=11, N=15• Attempt to store key in array elements (t+d)%N,

(t+2d)%N, (t+3d)%N …Array: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 14 16 47 35 36 65 129 25 2501 99 29 t+22 t+11 t t+33 attempts

Where would you store: 127?

Izmir University of Economics

Page 33: CE 221 Data Structures and Algorithms

33

Double HashingLet f2(x)= 11 − (x % 11) hash2(127)=d=5

Array: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 14 16 47 35 36 65 129 25 2501 99 29 t+10 t t+5 attempts

Infinite loop!

Izmir University of Economics

Page 34: CE 221 Data Structures and Algorithms

34

Rehashing• If the table gets too full, the running times for the

operations will start taking too long.• When the load factor exceeds a threshold, double

the table size (smallest prime > 2 * old table size).• Rehash each record in the old table into the new

table.• Expensive: O(N) work done in copying.• However, if the threshold is large (e.g., ½), then

we need to rehash only once per O(N) insertions, so the cost is “amortized” constant-time.

Izmir University of Economics

Page 35: CE 221 Data Structures and Algorithms

35

Factors affecting efficiency

– Choice of hash function– Collision resolution strategy– Load Factor

• Hashing offers excellent performance for insertion and retrieval of data.

Izmir University of Economics

Page 36: CE 221 Data Structures and Algorithms

36

Comparison of Hash Table & BST

BST HashTable

Average Speed O(log2N) O(1)Find Min/Max Yes NoItems in a range Yes NoSorted Input Very Bad No problems

• Use HashTable if there is any suspicion of SORTED input & NO ordering information is required.

Izmir University of Economics

Page 37: CE 221 Data Structures and Algorithms

Homework Assignments

• 5.1, 5.2, 5.12, 5.14

• You are requested to study and solve the exercises. Note that these are for you to practice only. You are not to deliver the results to me.

Izmir University of Economics 37