Page 1:

© Love Ekenberg

Hashing

Love Ekenberg

Page 2:

In General

These slides provide an overview of different hashing techniques that are used to store data efficiently.

The use of hash tables and hash functions is shown.

The three hashing techniques treated here are: separate linking (also called separate chaining), linear probing, and double hashing.

Page 3:

Storage

Representing the data directly in an array can require too much space.

Suppose we want to represent all words in Swedish with at most 10 letters.

Because the Swedish alphabet contains 29 letters, we would have to reserve space for up to 29^10 + 29^9 + … + 29 words, i.e., all 10-letter words, all 9-letter words, and so on.

Each and every word would then be placed in an array.

There are only about 250,000 words in Swedish, so only a tiny fraction of the entries would be meaningful.

Page 4:

Projection

100 million words can be projected into a number of boxes.

Problem:

There would be some difficulty in addressing the elements. This would become almost as demanding memory-wise as storing the words.

Uneven storage may arise in the boxes.

Solution:

List all the meaningful words in each box. The name of the box then becomes a heading for the list.

Distribute the words as randomly as possible in the boxes in order to achieve even storage.

Page 5:

Hash Tables

Hash tables attempt to find an appropriate path between memory and efficiency requirements. Headings are set for addressing the boxes. Then each box can be searched sequentially. In the example below, there are M boxes, where M is an appropriate number. The number (heading) for an element x is generated from the element x via a hash function h(x).

Heading:
0
1
2
…
h(x): element 1 -> element 2 -> … -> element n
…
M-1

Page 6:

Hash Functions

A hash function takes an argument x and generates a value between 0 and M-1, where M is the number of boxes (headings) in the table. The value h(x) is where element x is put. The idea is to combine direct access with searching in a list, but where the list (in the best case) only has 1/M times as many elements as the original set and where the elements are more or less evenly distributed amongst the boxes.

Given 100 words evenly sorted into 10 boxes, each box holds 10 (100/10) elements.

Page 7:

Example

Let ORD be a function that yields a letter's position in the Swedish alphabet.

If the elements to be stored are words in Swedish, an element x can be written a1a2…ak, where each ai is a letter.

Then f(x) can be defined as ORD(a1) + ORD(a2) + … + ORD(ak).

Lastly let h(x) be the hash function f(x) MOD M.

(x MOD M yields the remainder of x divided by M. For example, 75 MOD 10 = 5 since 75 = 7·10 + 5.)

Page 8:

Example (cont.)

Store the following string.

anyone lived in a pretty how town

word = array [1..10] of characters

function h(x: word): integer
  sum := 0
  for i := 1 to 10 do
    sum := sum + ORD(x[i])
  h := sum MOD M

Page 9:

Example (cont.)

Word Sum Bucket

anyone 778 3

lived 692 2

in 471 1

a 385 0

pretty 808 3

how 558 3

town 648 3

Here the choice of hash function is important. The example displays a certain unevenness because too many elements come under heading 3.
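The table can be reproduced with a short script (a reconstruction, assuming details the slide leaves implicit: ORD is the ASCII character code, words are padded with spaces to the 10-character array of the previous page, and M = 5):

```python
# Reconstructing the Word/Sum/Bucket table. Assumptions (not stated on
# the slide): ORD(c) is the ASCII code of c, each word is padded with
# spaces to the 10-character array, and the number of boxes M is 5.
M = 5

def h(word, m=M):
    padded = word.ljust(10)                 # array [1..10] of characters
    total = sum(ord(c) for c in padded)     # f(x) = ORD(a1) + ... + ORD(ak)
    return total, total % m                 # (sum, bucket)

for w in "anyone lived in a pretty how town".split():
    total, bucket = h(w)
    print(f"{w:8} {total:4} {bucket}")      # e.g. anyone 778 3
```

With these assumptions the script prints exactly the seven rows of the table above.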

Page 10:

Operations on Hash Tables

We can now operate on hash tables in various ways. Common such operations are:

– inserting elements

– deleting elements

– checking elements (Look up)

The algorithm for performing one of these operations is:

1. Calculate h(x).

2. Use the array of pointers to find the list under heading h(x).

3. Carry out the operation.

Page 11:

Example

Proc bucketInsert(x: Word; L: List)   Proc inserts element x
  if L = NIL then   NIL is the end of the list
    new(L)   here a new element is created in list L
    L.element := x   the element is x
    L.next := NIL   nothing follows x, so the list ends here
  else if L.element <> x then
    bucketInsert(x, L.next)   if element x differs from the current element in the list, the procedure is called again with the next element

Suppose we want to delete ‘pretty’. Calculate h(pretty), which is 3. The second cell in that list contains pretty and is deleted.
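The same operations can be sketched in Python, using a list of buckets in place of the Pascal pointer lists (a sketch; the bucket index 3 for ‘pretty’ is taken from the table on page 9):

```python
# Separate-linking buckets as Python lists: insert walks the bucket and
# appends x only if it is absent; delete removes x from its bucket.
M = 5
table = [[] for _ in range(M)]

def bucket_insert(x, bucket):
    if x not in bucket:          # corresponds to the recursive walk in bucketInsert
        bucket.append(x)

def bucket_delete(x, bucket):
    if x in bucket:
        bucket.remove(x)

bucket_insert("pretty", table[3])    # h(pretty) = 3
bucket_delete("pretty", table[3])    # the cell containing pretty is removed
```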

Page 12:

Complexity of Operations on Hash Tables

Finding the hash number is O(1)

Naturally this assumes the hash function is not too complicated

Furthermore O(N/M) is required, where N is the number of elements stored.

This holds since, for example, bucketInsert requires time proportional to the number of elements in the list, which on average is the total number of elements N divided by the number of boxes M.

Page 13:

Separate linking

The technique described here is called separate linking (also known as separate chaining). It divides the elements among boxes; within each box the elements are sequentially linked to each other.

It should now be easy to accept the following theorem.

Theorem

Separate linking reduces the number of comparisons for a sequential search by a factor of M (on average).

Page 14:

Some Observations

Let N be the total number of elements and M be the number of headings.

If N and M are close then the result is about O(1).

If M > N, then O(1) still holds, with at most one element under each heading. It is therefore pointless to extend the table further.

If N is much larger than M, then a larger M can (and should) be chosen and all the elements moved to the new table. This takes O(N) time, which is no more than the time it took to insert the elements into the original table (N insertions at O(1) each).
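The move can be sketched as follows (a sketch; the hash function and the growth from M = 3 to M = 7 are illustrative choices, not from the slides):

```python
# Rehashing: allocate a larger table and reinsert every element.
# Each of the N elements is rehashed exactly once, so the move is O(N).
def rehash(table, h, new_m):
    new_table = [[] for _ in range(new_m)]
    for bucket in table:
        for x in bucket:
            new_table[h(x, new_m)].append(x)
    return new_table

h = lambda word, m: sum(ord(c) for c in word) % m

old = [[] for _ in range(3)]
for w in ["in", "a", "how"]:
    old[h(w, 3)].append(w)

bigger = rehash(old, h, 7)    # M grows from 3 to 7
```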

Page 15:

Linear Probing

If the number of elements in the table can be assessed in advance, then M > N can be chosen and so-called ‘open addressing’ methods used. This means that we know that there is room for an element in each box and therefore do not need linked lists. The advantage of this is direct access to the elements, never requiring a search through linked lists.

A suitable technique in this case is linear probing.

If a collision occurs then the next box is used

If there is free space: insert (or delete, or check) the element and finish

Otherwise continue

Page 16:

Example

Let M = 19.

Sort the string ASEARCHINGEXAMPLE using the hash function ORD(x) MOD 19 as below.

A S E A R C H I N G E X A M P L E

1 0 5 1 18 3 8 9 14 7 5 5 1 13 16 12 5

Clearly several elements come under the same heading, which should be avoided. When such collisions occur, a simple trick is to move the element to the next available space, i.e., test the next box. If there is an element there, then test the next box, and so on. Continue in this way until an empty box is found.

Page 17:

Example (cont.)

The first collision occurs when trying to place the second A, i.e., upon reaching ASEA. The hash function prescribes sorting it under heading 1.

0: S   1: A   5: E

However, heading 1 is taken and since there are no elements under heading 2, the A can be put there.

0: S   1: A   2: A   5: E

Page 18:

Example (cont.)

Continuing like this will gradually yield the following table.

0: S   1: A   2: A   3: C   5: E   7: G   8: H   9: I   14: N   18: R

The next element is a new E. Heading 5 is taken, but heading 6 is free. So E can be put under heading 6.

0: S   1: A   2: A   3: C   5: E   6: E   7: G   8: H   9: I   14: N   18: R

The next element is X. The hash function ORD(x) MOD 19 projects X onto 5, which is taken, so the algorithm tries 6, which is also taken, and then 7, which is also taken. Continuing in this way, 10 is finally found to be free and X is placed there.

0: S   1: A   2: A   3: C   5: E   6: E   7: G   8: H   9: I   10: X   14: N   18: R

Page 19:

Theorem

The following holds but need not be proved.

Linear probing uses 1/2 + 1/(2(1 - N/M)^2) operations in the worst case and 1/2 + 1/(2(1 - N/M)) on average.

Page 20:

Double Hashing

As can be seen from the example above, linear probing is inefficient when nearby boxes begin to fill up. This is termed clustering. An alternative is double hashing.

Double hashing is used to avoid clustering, and uses a second hash function h2 to shunt elements along.

Instead of moving one step to (h1(x) + 1) MOD M as in linear probing, the probe moves to (h1(x) + h2(h1(x))) MOD M, where h1(x) is the first hash function.

A good function is h2(h1(x)) = M - 2 - (h1(x) MOD (M-2)).

Another is h2(h1(x)) = 8 - (h1(x) MOD 8). (See the example below.)

Page 21:

Example

The table below shows the projections of the functions h1 and h2.

A S E A R C H I N G E X A M P L E

1 0 5 1 18 3 8 9 14 7 5 5 1 13 16 12 5 h1

7 5 3 7 6 5 8 7 2 1 3 8 7 3 8 4 3 h2

When a collision occurs, the first square to be examined is the one at position (h1(x) + h2(h1(x))) MOD 19, where h2(h1(x)) = 8 - (h1(x) MOD 8).

For example: h2(h1(A)) = h2(1) = 8 - (1 MOD 8) = 8 - 1 = 7, and h2(h1(P)) = h2(16) = 8 - (16 MOD 8) = 8 - 0 = 8.

Page 22:

Example (cont.)

The first collision occurs upon trying to insert the second A, i.e., upon arriving at ASEA. The hash function prescribes placing it under heading 1.

0: S   1: A   5: E

h2(h1(A)) = h2(1) = 8 - (1 MOD 8) = 8 - 1 = 7. Then 1 + 7 = 8, and since there are no elements under heading 8, the new A is put there.

0: S   1: A   5: E   8: A

In this way the elements can be spread out using both functions.
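The insertion of ASEA can be simulated as follows (a sketch; note that to reproduce the h2 row on the previous page, h2 is computed here from the letter's alphabet position ORD(x) rather than from h1(x) — the two coincide for every letter whose ORD is below 19):

```python
# Double hashing: on a collision, jump h2 steps instead of 1.
M = 19

def ORD(c):
    return ord(c) - ord('A') + 1            # position in the alphabet, A = 1

def insert(table, letter):
    i = ORD(letter) % M                     # h1(x)
    step = 8 - (ORD(letter) % 8)            # h2, as in the example
    while table[i] is not None:
        i = (i + step) % M
    table[i] = letter

table = [None] * M
for c in "ASEA":
    insert(table, c)
# the second A collides at heading 1 and lands at (1 + 7) MOD 19 = 8
```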

Page 23:

The Choice of Function

Naturally h2(x) should be chosen wisely. For example, the step h2(x) should be relatively prime to M; in particular, neither should be a divisor of the other.

Example:

Let M be 10 and h2(x) = x MOD 6. Now try to sort the string EEEE. ORD(E) = 5, h2(5) = 5.

The first E is put under heading 5. Because ORD(E) = 5 and h2(5) = 5, the second E comes under heading 5 + 5 = 10. Since 10 MOD 10 = 0, E comes under heading 0. The third E first tries heading 5. This heading is taken, so E is sent on to (5 + 5) MOD 10 = 0. With this heading also taken, E is sent to 5 again. The algorithm has cycled without h2 finding a free space, in spite of several spaces remaining. Similar behaviour occurs whenever the step h2(x) shares a common factor with M.

0: E   5: E
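The cycle can be made visible by tracing the probe sequence (a sketch; the trace is capped so that it terminates, unlike the insertion itself):

```python
# With M = 10 and step h2(E) = 5, the probe sequence starting from
# heading 5 only ever visits headings 5 and 0: the step shares the
# factor 5 with M, so the other eight boxes are never examined.
M, step = 10, 5

def probes(start, limit=6):
    seq, i = [], start
    for _ in range(limit):
        seq.append(i)
        i = (i + step) % M
    return seq

print(probes(5))    # [5, 0, 5, 0, 5, 0]
```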

Page 24:

Theorem

The following holds but need not be proved.

Double hashing uses 1/(1 - N/M) operations in the worst case and ln(1/(1 - N/M))/(N/M) on average.