Page 1:

91.102 - Computing II

Tables and Hashing.

We continue our quest for fast searching and insertion methods.

We observe that indexing in an array has O(1) time complexity - at least for retrieval. The same time complexity holds if we attempt to insert into an EMPTY array position.

Array positions are indexed by integers.

The problem is that most convenient keys are NOT numeric.

Page 2:

Convenient Keys:

Social Security Numbers: numeric, but they cover too wide a range (9 decimal digits) to use directly as array indices, unless you are a major company or the government.

(Last Name, First Name, Middle Initial) triples: not numeric.

Addresses: not numeric (only house number is numeric).

Dictionary entries: not numeric. Etc.

Page 3:

Let’s not worry, for a moment, about indexing mechanisms.

What is it that we want?

We want to store a “chunk” of information, say I, under a much smaller “key”, say K. What we are really talking about is pairs (K, I) and the manipulation thereof.

In the case of indexing into a standard array, K takes only integer values (or the values of an enumerated type), in the general case it can take values in a much larger domain.

We call such collections of pairs, with an appropriate set of operations, TABLES.

Page 4:

The TABLE Abstract Data Type: a table entry is either empty or is a pair of the form (K, I), where K is a key, and I is some information associated with K.

Operations:

1) Initialize the table T to be the empty table. This table is filled with "empty" table entries, that is pairs (K0, I0) where K0 is a special empty key.

2) Determine whether T is empty or full.

3) Insert a new table entry (K, I) into a non-full table.

4) Delete a table entry (K, I) from the table.

5) Given the search key K, retrieve the information I of the pair (K, I).

Page 5:

Two other operations:

6) Update the table entry (K, I) by replacing (K, I) with another pair (K, I’), where K is the same as before.

7) Enumerate the table entries in increasing order of their keys.

It should be clear that the last two operations could be implemented in terms of the others and, possibly, some other (e.g., sorting) mechanisms.
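To make the ADT concrete, here is one way these operations could be declared in C. This is a minimal sketch under assumed names (Table, table_insert, and so on - none of them from the text), with integer keys for simplicity:

    #include <stdbool.h>

    #define TABLE_SIZE 11      /* table capacity; a prime, for reasons that come later */
    #define EMPTY_KEY  (-1)    /* the special "empty" key K0 */

    typedef struct {
        int   key;             /* K: the search key */
        void *info;            /* I: the information associated with K */
    } TableEntry;

    typedef struct {
        TableEntry entries[TABLE_SIZE];
        int        count;      /* number of non-empty entries */
    } Table;

    void  table_init(Table *t);                     /* 1) fill with (K0, I0) */
    bool  table_empty(const Table *t);              /* 2) */
    bool  table_full(const Table *t);               /* 2) */
    bool  table_insert(Table *t, int k, void *i);   /* 3) */
    bool  table_delete(Table *t, int k);            /* 4) */
    void *table_retrieve(const Table *t, int k);    /* 5) */
    bool  table_update(Table *t, int k, void *i);   /* 6) */
    void  table_enumerate(const Table *t);          /* 7) in increasing key order */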

Page 6:

Problem: How do we implement this? How do we use keys that are non-numeric to index information into an array?

Solution: change the keys to numeric values.

How?

Page 7:

There are three problems that have to be overcome:

1) Given any sensible size domain of (key, information) pairs, the range of numbers generated must be sensible in size. For example, expecting to handle about 10,000 pairs should result in expecting numbers in a range from 0 to “not too much” above 10,000.

2) Different keys should give rise to different index values.

3) The mapping Key -> index_value should be FAST.

Like most real-life problems, this one has no neat solution.

Page 8:

How do we construct this transformation?

Most keys can already be read as numerical information: their characters have ASCII values, and ASCII values are numbers between 0 and 255:

Amanda = 65.109.97.110.100.97 =

01000001.01101101.01100001.01101110.01100100.01100001

And you can drop the dots to make a single numeral.

This is, obviously, too large a number to be seriously considered as a potential index. What do we do?

How big an index do we want? 12 bits - 4096 entries?

Page 9:

Easy solution: take the leading twelve bits of the representation.

Downside: all names sharing the same first letter and the leading bits of the second letter give the same result (second letters a-o start with 0110; p-z with 0111)… Too many keys give the same index.

Actually, the problem is that too many “nearby” keys give the same index.

Page 10:

A slightly more complex solution: break the raw “numeral” of the key into 12 bit segments, and perform some operation that uses all the bits - for example, add them up (mod 4096) in 12 bit slices. This is a little better. It is still fast (but less so), mixes things up a little more, gives a reasonable size number, etc.

There are more complicated functions, which are better at separating “nearby” keys, but this is the general idea.
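As a sketch of this slice-and-add idea in C (the function name fold_hash12 and the choice to read the key as a byte string are mine, not the text's):

    #include <stddef.h>

    /* Break the key's bytes into 12-bit slices and add them mod 4096,
       so that every bit of the key participates in the index. */
    unsigned fold_hash12(const unsigned char *key, size_t len)
    {
        unsigned long bits = 0;       /* buffer of not-yet-consumed key bits */
        int nbits = 0;                /* how many bits are buffered */
        unsigned sum = 0;

        for (size_t i = 0; i < len; i++) {
            bits = (bits << 8) | key[i];
            nbits += 8;
            while (nbits >= 12) {     /* a full 12-bit slice is available */
                nbits -= 12;
                sum = (sum + ((bits >> nbits) & 0xFFF)) & 0xFFF;
            }
        }
        if (nbits > 0)                /* pad the leftover bits to a last slice */
            sum = (sum + ((bits << (12 - nbits)) & 0xFFF)) & 0xFFF;
        return sum;                   /* an index in 0..4095 */
    }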

Page 11:

Achieved up to this point:

1) We can convert the key to some kind of “unique” numerical value that is passed to a special-purpose function.

2) We can control the size of the resulting index set.

3) We can control the speed with which we compute the index associated with a key.

4) We can’t really control the result of the index computation: two keys can give the same index… This is called a collision and can’t be avoided - unfortunately.

How do we handle collisions? In several different ways, depending on the application.

Page 12:

When you compile your program, the compiler generates a “symbol table” with all the symbols that were used in your program.

The language you use has “reserved words” (if, then, do, while, for, …) and the compiler writer constructs a function - by hand - that will generate distinct indices for each and every reserved word. This will minimize the amount of work our compiler has to do when it searches the symbol table to figure out what you’re talking about.

The rest of the symbols matter less, EXCEPT that we tend to use similar words to denote similar entities: many of our symbols will be small variants of one another - so, if you are a compiler writer, try to cook up a function that “separates” symbols with near prefixes.
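As a toy illustration of such a hand-built function (the five words are the slide's; the formula is an assumption of mine, chosen only because it happens to separate them):

    #include <string.h>

    static const char *reserved[] = { "if", "then", "do", "while", "for" };

    /* (first letter + length) % 23 gives five distinct indices:
       if -> 15, then -> 5, do -> 10, while -> 9, for -> 13.
       A real compiler writer would tune this over the full keyword set. */
    int keyword_index(const char *w)
    {
        return (int)(((unsigned char)w[0] + strlen(w)) % 23);
    }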

Page 13:

I kept giving “solutions” that pushed the collision problem to a “later slide”. This is that “later slide”.

How do we handle collisions?

By using a systematic way of trying other locations in the table.

The scheme has to be systematic (= repeatable) because you have to be able to recover the (K, I) pair you stored - and the storage and retrieval have to remain as close to O(1) in time cost as you can make them: after all, this is the whole point of the exercise.

Page 14:

Example: the upper case ASCII codes:

A = 1000001, B = 1000010, C = 1000011, D = 1000100,

E = 1000101, F = 1000110, G = 1000111, H = 1001000,

I = 1001001, J = 1001010, K = 1001011, L = 1001100,

M = 1001101, N = 1001110, O = 1001111, P = 1010000,

Q = 1010001, R = 1010010, S = 1010011, T = 1010100,

U = 1010101, V = 1010110, W = 1010111, X = 1011000,

Y = 1011001, Z = 1011010.

Page 15:

Let’s say we use the alphabet as keys, and we expect 7 or so items to be stored, retrieved and manipulated. To add some margin of safety, we will have a table with 8 positions, indexed from 0 to 7.

Index:     000  001  010  011  100  101  110  111
Contents:  (all empty)

Since we don’t know which 7 letters we can expect, there is no way to choose a “perfect” function. If we act without thinking we might say: use the first three binary digits and be done.

The first three digits are 100 or 101 for all 26 of them.

A, P, B, Q, D, S, F will all go either to the position with index 100 (binary) or 101 (binary).

Page 16:

Index:     000  001  010  011  100  101  110  111
Contents:                       A

Index:     000  001  010  011  100  101  110  111
Contents:                       A    P

Index:     000  001  010  011  100  101  110  111
Contents:                       A    P             B?

Where do we put the B? It should go in position 100, but that is already occupied.

Systematic solution: keep moving towards “increasing index” - modulo 8 - until you find an empty position.

Index:     000  001  010  011  100  101  110  111
Contents:                       A    P    B

Page 17:

Unfortunately, we seem to have lost the O(1) requirement, since inserting (or finding) D will require checking nearly every position in the table… and all others (after A and P are inserted) will require a number of “checks” (the correct term is probes) proportional to the number of items already in the table.

Finally, after everybody has been inserted:

Index:     000  001  010  011  100  101  110  111
Contents:   C    R    D          A    P    B    Q

Page 18:

What would happen if we used the last three binary digits?

A 001, B 010, C 011, D 100,

P 000, Q 001, R 010;

A, P, B, Q, C, R, D:

Index:     000  001  010  011  100  101  110  111
Contents:        A

Index:     000  001  010  011  100  101  110  111
Contents:   P    A

Index:     000  001  010  011  100  101  110  111
Contents:   P    A    B

Q hashes to 001 - already occupied.

Page 19:

Index:     000  001  010  011  100  101  110  111
Contents:   P    A    B    Q

This took a total of 3 probes: 001, 010 and 011.

Still to go: C 011, R 010, D 100.

C: 2 probes; R: 4 probes; D: 3 probes.

Index:     000  001  010  011  100  101  110  111
Contents:   P    A    B    Q    C    R    D

This is better - although not quite as neat as we would like it.

Page 20:

Words of wisdom:

A) Choose your “indexing function” (HASH function) carefully, using the information you have about the distribution of the incoming data.

B) Linear probing - the method just introduced - does not work very well, since it tends to build long runs of filled table positions. Collisions require lots of reprobing to be resolved.

C) Be prepared to “waste some space” - if the table gets too close to filled, you’re in trouble: the number of probes to find an empty space will go up, and fast.

Page 21:

Choosing a Table Size and a Hash Function.

For a number of reasons, the prevalence of powers of two in the representation of data (everything is in bytes - 2^3 = 8 bit positions capable of holding 2^8 pieces of information) makes a table whose size is a power of two undesirable: it is just hard to construct a hash function that will “spread the keys” in a fairly uniform manner when all the keys are represented in terms of powers of two.

It turns out that a better (= less troublesome) size for a table is given by a prime number close to a power of two.

Page 22:

Index:     0   1   2   3   4   5   6   7   8   9   10

Table size = 11 (decimal).

Choose the hash function h: ASCII_code mod 11

A = 65, P = 80, B = 66, Q = 81, C = 67, R = 82, D = 68;

65%11 = 10, 80%11 = 3, 66%11 = 0, 81%11 = 4, 67%11 = 1, 82%11 = 5; 68%11 = 2.

Index:     0   1   2   3   4   5   6   7   8   9   10
Contents:  B   C   D   P   Q   R                    A

One probe for each insertion!!! Not bad. On the other hand, if we were now to insert L = 76 = 10 mod 11, we would need 8 probes before we found an empty table location… Oh, well...
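A sketch of this insertion procedure in C, matching the example above (table size 11, h = ASCII code mod 11; the names are mine):

    #define M 11          /* table size: a prime */
    #define EMPTY '\0'    /* marker for an unoccupied position */

    char table[M];        /* each position holds one key character */

    /* Insert c at h(c) = c % M, probing linearly (increasing index,
       mod M) until a free slot is found.  Returns the number of
       probes used, or -1 if the table is full. */
    int insert_linear(char c)
    {
        int i = (unsigned char)c % M;
        for (int probes = 1; probes <= M; probes++) {
            if (table[i] == EMPTY) {      /* free position found */
                table[i] = c;
                return probes;
            }
            i = (i + 1) % M;              /* next index, mod table size */
        }
        return -1;                        /* every position is occupied */
    }

Inserting A, P, B, Q, C, R, D in this order returns 1 probe each; inserting L afterwards returns 8, as counted above.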

Page 23:

This is the difficulty with linear probing - moving in one direction by one position every time we have a collision. This mechanism tends to build “long runs” of occupied positions, which are ever more likely to get further extended.

We need a way to “shake up” the process (making probing non-linear) - and many such ways have been proposed and used. We will now examine one of them.

Page 24:

Double Hashing: add a second hashing function, say p, with the requirement that p(xx) ≠ 0 for every key xx.

If h(xx) = h(yy), then we should have p(xx) ≠ p(yy).

Index for xx: (h(xx) + p(xx)) % size of table;

Index for yy: (h(yy) + p(yy)) % size of table.

The two indices have a reasonable likelihood of being different.

Page 25:

If (h(yy) + p(yy)) % size of table takes you to a location that is already occupied, what do you do?

Try (h(yy) + 2*p(yy)) % size of table,

(h(yy) + 3*p(yy)) % size of table, etc…

And this should tell you another reason why you want the size of the table to be a prime number: p(yy) is never 0, and a prime table size shares no factor with the jump p(yy) (as long as p(yy) is not a multiple of it), so the sequence

{h(yy) + n*p(yy) | n = 0, 1, …, size_of_table - 1}

where the arithmetic is % size of table, will cover the whole table - if there is a free position, you WILL find it.
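The probe sequence as a C sketch (h and p are assumed to be defined elsewhere, for example as on page 49):

    int h(int key);   /* primary hash: assumed defined elsewhere */
    int p(int key);   /* secondary hash: returns 1..M-1, never 0 */

    /* Probe h(k) + n*p(k) (mod M) for n = 0, 1, ..., M-1.  With M
       prime and 0 < p(k) < M, the step is coprime to M, so the M
       probes visit M distinct slots: the whole table is covered. */
    int find_free_slot(const int occupied[], int M, int k)
    {
        int start = h(k), step = p(k);
        for (int n = 0; n < M; n++) {
            int i = (start + n * step) % M;
            if (!occupied[i])
                return i;     /* first free position on the probe path */
        }
        return -1;            /* table completely full */
    }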

Page 26:

What’s the advantage?

A) the first retry is likely to be fairly far from the collision - hopefully the new region will be more sparsely populated.

B) by moving far from the collision - with “long jumps”, and each key having different length jumps - you are less likely to build up “long contiguous runs”.

Page 27:

Index:     0   1   2   3   4   5   6   7   8   9   10
Contents:  B   C   D   P   Q   R                    A

Recall h: ASCII_code mod 11

A = 65, P = 80, B = 66, Q = 81, C = 67, R = 82, D = 68;

65%11 = 10, 80%11 = 3, 66%11 = 0, 81%11 = 4, 67%11 = 1, 82%11 = 5; 68%11 = 2.

Page 28:

p = ? Choose p : ASCII_code mod 7 + 1. No particular reason, other than it will generate numbers mostly different from those of h, and it will not generate 0 - which would not get us moving.

h(L) = 76 % 11 = 10 - collision with A.

p(L) = 76 % 7 + 1 = 6 + 1 = 7.

(10 + 7) % 11 = 6 - Done… we were lucky…

Index:     0   1   2   3   4   5   6   7   8   9   10
Contents:  B   C   D   P   Q   R                    A

Page 29:

Index:     0   1   2   3   4   5   6   7   8   9   10
Contents:  B   C   D   P   Q   R   L                A

1 collision - 2 probes instead of 8. At least some improvement.

It should be clear that there is NO perfect solution - the only thing we can strive for is finding hash functions that will give “acceptable” performance - with no guarantee that a particularly “evil” input sequence will not wreck our plans.

Page 30:

At least intuitively, it should be clear that a nearly filled table will give rise to many collisions and long probe sequences. If we know that our table will be nearly full a fair amount of the time, can we do something else to minimize the number of probes?

As long as our initial hash function is good at spreading the keys uniformly over the index range, the answer is: well… yes.

How?

Collision Resolution with Chaining...

Page 31:

Index:    0   1   2   3   4   5   6   7   8   9   10
Chains:   B   C   D   P   Q   R                     A

h(L) = 76 % 11 = 10 - collision with A.

Index:    0   1   2   3   4   5   6   7   8   9   10
Chains:   B   C   D   P   Q   R                     L -> A

Insertion: one collision, one new node, reset two pointers.

Retrieval: chase down the chain until you find the right node.
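A sketch of chained resolution in C; the node layout and names are assumptions of mine:

    #include <stdlib.h>

    #define M 11

    typedef struct Node {
        int          key;
        struct Node *next;
    } Node;

    Node *bucket[M];          /* one chain head per table position */

    /* Insert: hash to a bucket and link a new node at the head of
       its chain - one new node, two pointers reset, no probing. */
    void chain_insert(int key)
    {
        Node *n = malloc(sizeof *n);
        n->key  = key;
        n->next = bucket[key % M];
        bucket[key % M] = n;
    }

    /* Retrieve: chase down the chain until the right node appears. */
    Node *chain_find(int key)
    {
        for (Node *n = bucket[key % M]; n != NULL; n = n->next)
            if (n->key == key)
                return n;
        return NULL;          /* key not in the table */
    }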

Page 32:

Our second example (the last-three-bits hash) took 15 probes with linear probing. Redone with chaining:

A 001, B 010, C 011, D 100,

P 000, Q 001, R 010;

A, P, B, Q, C, R, D:

Index:    000  001      010      011  100  101  110  111
Chains:    P   Q -> A   R -> B    C    D

Seven probes, two collisions, two extensions to the chains.

Page 33:

Are we getting so many collisions because we are stupid, or what?

Actually, no… they are quite likely, as the “birthday problem” shows.

There are 365 possible birthdays in a year.

Question: How many people do you have to collect in the same room before the probability that two will have the same birthday will exceed 0.5?

Answer: 23.

Since the “load factor” is defined as #occupied/range_of_index, the probability of having had a collision already exceeds 50% at a load factor of 23/365 = 0.063 = 6.3%.

Page 34:

Can we show this?

P(n) = probability that among n people two or more share a birthday.

Q(n) = 1 - P(n) = probability that n people all have different birthdays.

Q(1) = probability that 1 person will not share a birthday = 365/365 = 1: the person can have any birthday and not find anybody else there.

Q(2) = Q(1)*(364/365): after the first one, there are only 364 empty positions.

Q(3) = Q(2)*(363/365), and so on: each new person finds one fewer free position.

Page 35:

Q(n) = Q(n - 1)*((365 - n + 1)/365)
     = (365/365)*(364/365)*…*((365 - n + 1)/365)
     = 365!/((365 - n)! * 365^n)

P(n) = 1 - 365!/((365 - n)! * 365^n)
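The recurrence is easy to check numerically; a minimal C sketch:

    #include <stdio.h>

    /* Iterate Q(n) = Q(n-1) * (365 - n + 1)/365 and report the first
       n at which P(n) = 1 - Q(n) exceeds one half. */
    int main(void)
    {
        double q = 1.0;                    /* Q(1) = 365/365 = 1 */
        for (int n = 2; n <= 365; n++) {
            q *= (365.0 - n + 1) / 365.0;
            if (1.0 - q > 0.5) {
                printf("P(%d) = %.3f\n", n, 1.0 - q);   /* P(23) = 0.507 */
                return 0;
            }
        }
        return 0;
    }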

[Figure: plot of P(n) versus n.]

Page 36:

As the load factor goes up, the probability of collision goes up, which means that the collision resolution strategy becomes important.

A) Linear probing is not used much because it generates long runs of contiguous, occupied table positions. This is not hard to see: if a key is presented, and we assume it is random, it has the same probability of hashing to any index, so a long cluster has a larger probability of capturing the key than a short one. Since the collision is resolved by adding the incoming item at the end of the cluster, long clusters are more likely than short ones to get longer - which in turn increases the probability of a hit on the same cluster on the next pass.

Page 37:

B) Rehashing - with jumps - will generate smaller clusters, since the next try (after a failure) will be more or less random.

C) Chains and buckets will generate short(?) chains… you hope (the worst case, though, is fully O(n) (the correct term is Θ(n))).

Page 38:

Performance Formulae and Empirical Results.

The various examples seem to indicate that “open addressing” - the hashing technique that uses an array and keeps probing according to some method until an empty position is found - can exhibit “many” collisions, while collision resolution via chaining seems to exhibit relatively few.

Is this impression justified, or is it just an artifact of the examples we have chosen? The answer is obviously important, since it can drive our decision of what to implement when asked for a “hash table”...

Page 39:

Expected number of probes per search, for a table of size M with N positions filled, load factor α = N/M:

Method    Successful Search          Unsuccessful Search
Linear    (1/2)(1 + 1/(1 - α))       (1/2)(1 + 1/(1 - α)^2)
Double    (1/α) ln(1/(1 - α))        1/(1 - α)
Chain     1 + α/2                    α

[Figure: expected number of probes for linear probing, successful and unsuccessful search, in a table up to 90% full. Thanks to Maple V Symbolic Computation Software.]
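The plotted curves come straight from these formulas; a quick C sketch to reproduce the numbers:

    #include <math.h>
    #include <stdio.h>

    /* Expected probes per search at load factor a, per the table above. */
    int main(void)
    {
        printf("alpha  lin-s  lin-u  dbl-s  dbl-u  chn-s  chn-u\n");
        for (int i = 1; i <= 9; i++) {
            double a = i / 10.0;
            printf("%.1f  %6.2f %6.2f %6.2f %6.2f %6.2f %6.2f\n", a,
                   0.5 * (1 + 1 / (1 - a)),               /* linear, successful   */
                   0.5 * (1 + 1 / ((1 - a) * (1 - a))),   /* linear, unsuccessful */
                   (1 / a) * log(1 / (1 - a)),            /* double, successful   */
                   1 / (1 - a),                           /* double, unsuccessful */
                   1 + a / 2,                             /* chain,  successful   */
                   a);                                    /* chain,  unsuccessful */
        }
        return 0;
    }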

Page 40:

[Figure: expected number of probes versus load factor for double hashing and chaining, successful and unsuccessful search.]

Page 41:

One of the unfortunate (or maybe interesting?) things about reality is that our mathematical models, however sophisticated they may be, may not model reality very well…

The usual “validation” of a model goes through one or more experiments, where data are collected and the empirical results are compared to the predictions of the model. If the two coincide (within your capacity to compare the two), then the mathematical model can be believed (until you have reason to change your mind) to reflect reality “accurately”. If they don’t, you have to decide what to do. The model predictions may be “optimistic” or “pessimistic” or just plain wrong and “all over the map”.

Page 42:

What is “reality” in this case? How much “reality” can we afford?

Example 1: to test the predictions of various Physics models, we were going to build the “Superconducting Supercollider” under a good chunk of Texas. Then Congress decided we just couldn’t afford that much “reality”.

Page 43:

Example 2: Network protocols and packet management algorithms are obviously quite important to the well-being of the information we exchange over computer networks.

New ideas don’t get tested on “real stuff” at first: it’s just too expensive.

They get tested on simulators and only then will somebody begin the process to design and build boards. Many seemingly good theoretical algorithms turn out to have other undesirable behaviors - and have to be thrown away...

Page 44:

Reality, in our current case, involves running a (pseudo-) random number generator, building keys from random numbers, applying the hash function, and keeping track of the number of collisions and probes. Cheap, and reasonably effective. On the other hand, not very good if you have reason to believe your REAL keys are NOT random…

What are the “pseudo-real” results?

T will denote theoretical, E experimental.

Page 45:

Number of probes per search, theoretical (T) vs. experimental (E); table size = 997:

Successful Search
Load factor α:          0.10  0.25  0.50  0.75  0.90  0.99
Separate chaining    T  1.05  1.12  1.25  1.37  1.45  1.49
                     E  1.04  1.12  1.25  1.36  1.44  1.40
Open/linear probing  T  1.06  1.17  1.50  2.50  5.50  50.5
                     E  1.05  1.16  1.46  2.42  4.94  16.4
Open/double hashing  T  1.05  1.15  1.39  1.85  2.56  4.65
                     E  1.05  1.15  1.37  1.85  2.63  4.79

Unsuccessful Search
Load factor α:          0.10  0.25  0.50  0.75  0.90  0.99
Separate chaining    T  0.10  0.25  0.50  0.75  0.90  0.99
                     E  0.10  0.21  0.47  0.80  0.93  0.97
Open/linear probing  T  1.12  1.39  2.50  8.50  50.5  5000
                     E  1.11  1.37  2.38  8.36  39.1  360.9
Open/double hashing  T  1.11  1.33  2.00  4.00  10.0  100
                     E  1.11  1.33  2.00  4.10  10.9  98.5

Page 46:

How do you choose a hash function?

This is obviously rather important, and we have not really addressed the whole area: we have mentioned a way of using the ASCII representation of non-numeric data to provide input to a hash function, and we have mentioned some of the difficulties encountered if you choose a bad function. The text has walked us through some slightly different choices - still mostly based on variants of the ASCII representation - but we have been silent on the choice of a good one. Why?

Because we have no simple rule for choosing a good hash function.

Page 47:

Things to avoid:

A) If you use the ASCII representation of the key as input to the hash function, do not use a hash function that involves division by a power of 2. The reason is simple: you end up using just the trailing (or leading) x binary digits of the key (where the table size is 2^x), ignoring all the others. This would not be so bad if the input were truly random, but many human-designed identifiers within the same program will share prefixes or suffixes, because they share the human semantics of the variables they stand for, and it is convenient for US to group objects of similar meaning under similar names…

Page 48:

B) Don’t commit to a hash function without some experiments on data sets that you believe to be representative of your application’s needs.

C) Don’t use a hash function that requires a "large" amount of computation - hash functions should be simple, or their advantage - if there is one - will be lost.

Page 49:

Some tested methods:

A) Choose a table size which is a prime number, say M, and divide the ASCII representation of the key by M. Use the remainder of the division as the value of the function h(Key) = ASCII(Key) % M. For a secondary hash function (double hashing) you can use the value of the quotient, p(Key) = max(1, (ASCII(Key)/M) % M). The 1 ensures that you won’t get stuck in case of collision: you must move by at least one position.

This seems reasonably fast, can be made to guarantee that the whole table will be used, and does not seem to have the “power of two” problem inherent in the ASCII representation of keys.
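A sketch in C. The slide's ASCII(Key) numeral can be huge, so both the remainder and the quotient (mod M) are computed incrementally, Horner-style - an implementation detail the slide leaves out; the names are mine:

    #define M 997   /* a prime table size, as in the experiments above */

    /* h(Key) = ASCII(Key) % M, where ASCII(Key) is the key's bytes
       read as one big base-256 numeral, reduced as we go. */
    int h(const char *key)
    {
        unsigned r = 0;
        for (; *key; key++)
            r = (r * 256 + (unsigned char)*key) % M;
        return (int)r;
    }

    /* p(Key) = max(1, (ASCII(Key) / M) % M): the quotient, also kept
       mod M incrementally.  The max ensures we always move. */
    int p(const char *key)
    {
        unsigned r = 0, q = 0;
        for (; *key; key++) {
            unsigned t = r * 256 + (unsigned char)*key;
            q = (q * 256 + t / M) % M;   /* quotient digit carried in */
            r = t % M;                   /* running remainder */
        }
        return q > 0 ? (int)q : 1;
    }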

Page 50:

What’s the downside? The “power of two” problem is still there, but buried in a much trickier way.

How: assume M = r^k ± a, where k and a are small integers. Assume, for the sake of definiteness, that

M = 65537 = 2^16 + 1 = (2^8)^2 + 1,

so r = 256, k = 2, a = +1.

M is a Fermat prime.

Page 51:

Consider three-character keys C3C2C1 in ASCII, where each 0 ≤ Ci ≤ 255. What is h(C3C2C1)?

h(C3C2C1) = C3C2C1 % ((2^8)^2 + 1)
          = (C2C1 - C3) in base 256 (to be shown)
          = (C3*256^2 + C2*256 + C1) % (256^2 + 1).

If C2*256 + C1 > C3 ≥ 1, which is obviously the case at least whenever C2 ≥ 1, the quotient of C3C2C1 divided by (256^2 + 1) is exactly C3. We just assumed that

C3*256^2 + C3 < C3*256^2 + C2*256 + C1 < C3*256^2 + 256^2

Page 52:

Or:

C3*(256^2 + 1) < C3*256^2 + C2*256 + C1 < (C3 + 1)*256^2 < (C3 + 1)*(256^2 + 1)

So dividing the second term by (256^2 + 1) MUST give a quotient of exactly C3. The remainder must then be

C3*256^2 + C2*256 + C1 - C3*(256^2 + 1) = C2*256 + C1 - C3,

or (C2C1 - C3) in base 256, a simple difference of products of characters.

This is NOT likely to be uniformly distributed, and not even “decently” distributed, giving rise to many collisions.
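The conclusion is easy to confirm numerically; a quick sketch using the (hypothetical) key "ABC", i.e. C3 = 65, C2 = 66, C1 = 67:

    #include <stdio.h>

    int main(void)
    {
        long C3 = 65, C2 = 66, C1 = 67;          /* the key "ABC" */
        long M = 256L * 256 + 1;                 /* 65537, a Fermat prime */
        long value = C3 * 256 * 256 + C2 * 256 + C1;
        /* Both lines print 16898: the hash is just C2*256 + C1 - C3. */
        printf("%ld\n", value % M);
        printf("%ld\n", C2 * 256 + C1 - C3);
        return 0;
    }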

Page 53:

B) Folding. Divide the key into sections of “size” = M. Add the sections, modulo M. You could choose subtraction, multiplication, or some combination of all three. This will be fast, and will tend to “blur” the clustering due to repeating sequences of characters.

C) Middle-Squaring. Take the middle digits, square them and compute the remainder after division by M. This assumes the keys are uniformly distributed - so that the choice of the middle digits will NOT result in long runs of equal values. Unfortunately, the squares are NOT uniformly distributed over the range of all squares. More clustering…

Page 54:

D) Truncation. Just use the right number of leading or trailing digits. This is not good for uniform spreading of the keys.

E) A multiplicative method that uses the “golden ratio”.

Page 55:

Comparative Performance of Table ADT Representations.

Representation           Init  Full?  Search/Retrieve/Update  Insert    Delete    Enum
Sorted array of structs  O(n)  O(1)   O(log n)                O(n)      O(n)      O(n)
AVL tree of structs      O(1)  O(1)   O(log n)                O(log n)  O(log n)  O(n)
Hash table               O(1)  O(1)   O(1)                    O(1)      O(1)      ?

We don’t quite know this yet, but sorted enumeration of the entries can be done in O(n log n) for Hash Tables.

Page 56:

Why would you not ALWAYS use a hash table?

Because many of the crucial O(1) operations are so only probabilistically;

Because the choice of a good hash function is hard and too dependent on the actual incoming data…

Because they are difficult to maintain as "dynamic" data structures: they don't grow and shrink easily.

Again, we have been deprived of our free lunch...
