Hashing – Part I

CS 367 – Introduction to Data Structures

Searching

• Up to now the only way to find a key is to search through all or part of the data
  – linked list: O(n)
  – AVL tree: O(log n)
  – binary search of array: O(log n)

• If lots of data and/or searching the data very often, these times can be long
  – given the key, would like to get the data directly

Hashing

• The solution to this problem is to put the key through a function that says exactly where the data is (or where it should be placed)
  – this function is called a hash function

• h(key) = integer

– the integer obtained from a hash function can be used as an index into an array

• if the hash function is perfect – always generates a unique integer for different keys – the time to place and access data is O(1)

Hashing

[Figure: the keys A, M, and X are passed through the hashing function, which maps each key to a slot in an array indexed 0 through 11.]

Hashing Functions

• So what is the hashing function?
  – the simplest hashing function is to use the division remainder
    • assume the array is 1000 elements in size
    • translate the data into a number, n
    • h(n) = n % 1000

Hashing Functions

• simple example
  – consider a small school
  – each student is tracked by a 4 digit ID number
  – each student’s ID# begins with the year they started
    • 2000 -> 0, 2001 -> 1, 2002 -> 2, etc.
  – all student records are stored in an array
    • maximum of 1000 students per year
  – let’s look at records for all sophomores
    • assume they were freshmen in 2001

Hashing Functions

[Figure: an array indexed 0, 1, 2, … 11, … with Mary’s records at index 0, Pete’s records at index 4, John’s records at index 9, and Amy’s records at index 11.]

Mary’s ID #: 1000
Pete’s ID #: 1004
John’s ID #: 1009
Amy’s ID #: 1011

To find John’s record in the array:

1009 % 1000 = 9

Go to index number 9.
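
The lookup above can be sketched in Java. This is only a minimal illustration: it assumes a 1000-slot array of strings stands in for the student records, and the class name StudentIdHash and method name hash are made up for this example.

```java
// Sketch of the student ID example: h(n) = n % 1000 picks the array slot.
class StudentIdHash {
    static final int TABLE_SIZE = 1000; // assumed: at most 1000 students per year

    // Division-remainder hash: drops the leading year digit of a 4 digit ID.
    static int hash(int studentId) {
        return studentId % TABLE_SIZE;
    }

    public static void main(String[] args) {
        String[] records = new String[TABLE_SIZE];
        records[hash(1000)] = "Mary's records";
        records[hash(1004)] = "Pete's records";
        records[hash(1009)] = "John's records";
        records[hash(1011)] = "Amy's records";

        // To find John's record, hash his ID and go straight to that index.
        int index = hash(1009);
        System.out.println("index " + index + ": " + records[index]);
    }
}
```

No searching takes place: one remainder operation and one array access find the record.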

Generating n

• The previous example is rather simplistic in that it is hashing already unique integers
  – seems kind of pointless
  – maybe not if the integers are large
    • consider the UW’s 10 digit ID numbers

• Often it is desirable to hash some other kind of data
  – a person’s name for example

Generating n

• How is a string converted into an integer?
  – the simplest method is to add all of the ASCII values for each character together
  – example
    • convert amy into an integer
      – a = 97; m = 109; y = 121
      – a + m + y = 327
  – there are lots of other ways to convert strings to integers
    • what are a few of them?
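
As a sketch of the conversion, the Java below adds up the character codes and, as one possible answer to the question above, also shows a position-weighted alternative. The class and method names are invented for illustration, and the base-31 polynomial is an assumption rather than something taken from these slides.

```java
// Two hypothetical ways to turn a string into an integer n before hashing.
class StringToNumber {
    // The method from the slide: add the character codes together.
    static int sumOfChars(String s) {
        int sum = 0;
        for (int i = 0; i < s.length(); i++) {
            sum += s.charAt(i);
        }
        return sum;
    }

    // One alternative (an assumption, not from the slides): weight each
    // character by its position using a base-31 polynomial, so that
    // rearranging the letters changes the result.
    static int polynomial(String s) {
        int n = 0;
        for (int i = 0; i < s.length(); i++) {
            n = 31 * n + s.charAt(i);
        }
        return n;
    }

    public static void main(String[] args) {
        System.out.println(sumOfChars("amy")); // 97 + 109 + 121 = 327
        System.out.println(sumOfChars("may")); // same letters, so also 327
        System.out.println(polynomial("amy") != polynomial("may")); // true
    }
}
```

The plain sum gives the same number for any rearrangement of the same letters, which is one reason to consider other conversions. The polynomial version can overflow into negative values on long strings, so a real table would need to handle that before indexing.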

Hashing Functions

• There are millions of possible hashing functions
  – we will not be considering them all
  – basically, anything you can think of to generate an integer could be used as a hashing function

• Mathematicians have spent lots of time and effort to come up with some basic methods that work pretty well

Division

• We have already seen the division method
  – it involves taking the remainder of division
    • h(key) = key % tableSize

• A few notes about making this work better
  – table size should be a prime number
  – usually a good method if very little is known about the keys
  – the remaining methods will all use division as the final step in their calculation
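
To illustrate why a prime table size can help, the sketch below hashes keys that are all multiples of 10 into a table of size 10 and a table of size 11, and counts how many different slots get used. The key set, the two table sizes, and the names DivisionDemo and distinctSlots are assumptions chosen for this example.

```java
import java.util.HashSet;
import java.util.Set;

// Compares a non-prime and a prime table size on keys that share a pattern.
class DivisionDemo {
    // The division method: h(key) = key % tableSize.
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }

    // Counts how many different slots the given keys land in.
    static int distinctSlots(int[] keys, int tableSize) {
        Set<Integer> slots = new HashSet<>();
        for (int key : keys) {
            slots.add(hash(key, tableSize));
        }
        return slots.size();
    }

    public static void main(String[] args) {
        int[] keys = {0, 10, 20, 30, 40, 50, 60, 70, 80, 90};
        // Table size 10 shares a factor with every key: all land in slot 0.
        System.out.println(distinctSlots(keys, 10)); // 1
        // Table size 11 is prime: every key gets its own slot.
        System.out.println(distinctSlots(keys, 11)); // 10
    }
}
```

When the keys follow a pattern that shares a factor with the table size, many of them land in the same slot; a prime table size shares no factor with such patterns.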

Folding

• Separate the key into various equally sized parts and then recombine them
  – usually with addition

• Two kinds of folding
  – shift folding
    • just add the various parts together as they are
  – boundary folding
    • reverse the order of every other part and add them together

Folding

• Consider a SSN as a key
  – break it into 3 parts
    • first 3, second 3, last 3

• Shift folding example
  – SSN = 123-45-6789
  – first = 123; second = 456; third = 789
  – h(key) = (first + second + third) % size
    • h(SSN) = 1368 % tableSize

• Boundary folding example
  – h(key) = (first + R(second) + third) % size, where R reverses the digits of a part
  – h(key) = (123 + 654 + 789) % size = 1566 % size
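
Both folding variants can be sketched in Java as follows. The sketch assumes the SSN is given as a 9 digit integer with the dashes removed; the class name SsnFolding, the helper reverse3, and the table size 1009 used in main are all hypothetical choices for this example.

```java
// Shift folding and boundary folding on a 9 digit SSN split into three parts.
class SsnFolding {
    // Reverses the digits of a three digit part, e.g. 456 -> 654.
    static int reverse3(int part) {
        return (part % 10) * 100 + ((part / 10) % 10) * 10 + part / 100;
    }

    // Shift folding: add the three parts as they are.
    static int shiftFold(int ssn, int tableSize) {
        int first = ssn / 1_000_000;      // 123
        int second = (ssn / 1000) % 1000; // 456
        int third = ssn % 1000;           // 789
        return (first + second + third) % tableSize;
    }

    // Boundary folding: reverse every other part (here the middle one) first.
    static int boundaryFold(int ssn, int tableSize) {
        int first = ssn / 1_000_000;
        int second = (ssn / 1000) % 1000;
        int third = ssn % 1000;
        return (first + reverse3(second) + third) % tableSize;
    }

    public static void main(String[] args) {
        int ssn = 123456789;  // 123-45-6789 without the dashes
        int tableSize = 1009; // an assumed prime table size
        System.out.println(shiftFold(ssn, tableSize));    // 1368 % 1009 = 359
        System.out.println(boundaryFold(ssn, tableSize)); // 1566 % 1009 = 557
    }
}
```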

Increasing Performance

• Consider using shifting and exclusive OR’ing to generate the index from the key
  – exclusive OR parts together to generate index

• Example
  – consider the string abcdefgh
  – if each part is a letter, just exclusive OR them
    • ‘a’ ^ ‘b’ ^ ‘c’ ^ ‘d’ ^ ‘e’ ^ ‘f’ ^ ‘g’ ^ ‘h’
  – often, a character is represented by 8 bits
    • what’s the problem with this?
  – might be better to exclusive OR chunks of the string
    • “abcd” ^ “efgh”
    • why were four characters chosen in this case?
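
One likely answer to the two questions above: XOR-ing 8-bit characters one at a time can never produce more than 8 bits, so at most 256 table slots are reachable no matter how large the table is, whereas four 8-bit characters exactly fill a 32-bit int. The sketch below shows both cases; the names XorParts, xorChars, and pack4 are made up for illustration.

```java
// Contrasts XOR over single characters with XOR over four-character chunks.
class XorParts {
    // XOR the characters one at a time, as in 'a' ^ 'b' ^ ... ^ 'h'.
    static int xorChars(String key) {
        int result = 0;
        for (int i = 0; i < key.length(); i++) {
            result ^= key.charAt(i);
        }
        return result;
    }

    // Pack exactly four 8-bit characters into one 32-bit int,
    // with the first character in the highest byte.
    static int pack4(String fourChars) {
        int chunk = 0;
        for (int i = 0; i < 4; i++) {
            chunk = (chunk << 8) | (fourChars.charAt(i) & 0xFF);
        }
        return chunk;
    }

    public static void main(String[] args) {
        // Character-by-character: the result stays within 8 bits.
        System.out.println(xorChars("abcdefgh")); // 8
        // Chunk-by-chunk: the result spreads over all four bytes.
        System.out.println(Integer.toHexString(pack4("abcd") ^ pack4("efgh"))); // 404040c
    }
}
```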

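Increasing Performance

// Packs the key four bytes at a time into a 32-bit chunk and XORs the chunks together.
int shiftFold(String key, int tableSize) {
    int result = 0;
    byte[] st = key.getBytes();
    for (int i = 0; i < st.length; i += 4) {
        int chunk = 0;
        for (int j = 0; (j < 4) && (j + i < st.length); j++) {
            // shift first, then OR in the next byte, so no byte is pushed out of the int;
            // & 0xFF keeps a negative byte from sign-extending over the earlier bytes
            chunk = (chunk << 8) | (st[j + i] & 0xFF);
        }
        result = result ^ chunk;
    }
    // result can be negative once the top bit is set; floorMod keeps the index in [0, tableSize)
    return Math.floorMod(result, tableSize);
}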

Increasing Performance

• The performance could be increased even more if the table size was a power of 2
  – can get rid of the modulo operation at the end
  – modulo is an expensive calculation
  – could just do a subtraction and an AND operation instead
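
The subtraction-and-AND trick can be sketched as below. It assumes the table size is a power of two and the hash value is non-negative; the class name PowerOfTwoIndex and its method names are hypothetical.

```java
// Replaces hash % tableSize with hash & (tableSize - 1) for power-of-two sizes.
class PowerOfTwoIndex {
    // Assumes tableSize is a power of two, so tableSize - 1 is a run of 1-bits
    // that keeps exactly the low-order bits of the hash.
    static int index(int hash, int tableSize) {
        return hash & (tableSize - 1);
    }

    // Checks that the mask and the modulo agree for every hash in [0, limit).
    static boolean agreesWithModulo(int tableSize, int limit) {
        for (int h = 0; h < limit; h++) {
            if (index(h, tableSize) != h % tableSize) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(index(1368, 1024));              // 344, same as 1368 % 1024
        System.out.println(agreesWithModulo(1024, 100000)); // true
    }
}
```

The mask also yields a non-negative index when the hash value is negative, which the % operator in Java does not.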

Mid-Square Function

• Square the number and take the middle part as the index
  – a string must first be converted to get the number to square

• The entire key gets used to generate the address
  – less chance for conflicts
    • more on this later

• This method works best if the table size is a power of two

Mid-Square Function

• Table size equals 1024 (2¹⁰)

• The key is 3121
  – 3121² = 9740641 = (100101001010000101100001)₂
  – the middle 10 bits of this value are 0101000010 (split as 1001010 0101000010 1100001)

• Index in array is
  – (0101000010)₂ = 322

• This is all very quick and easy to calculate using mask and shift operations

Mid-Square Function

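// table size must be a power of two, e.g. 1024
// number of low-order bits to discard; 7 centers a 10-bit window in a 24-bit square
int shiftBits = 7;

int midSquare(String key, int tableSize) {
    int mask = tableSize - 1;       // ten 1-bits when tableSize is 1024
    long n = stringToNum(key);      // stringToNum: string-to-number helper, defined elsewhere
    long square = n * n;            // long, so the square of a large int does not overflow
    // shift the middle bits down to the bottom, then mask off everything above them
    return (int) ((square >> shiftBits) & mask);
}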
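
To check the worked example (key 3121, table size 1024, index 322), here is a self-contained sketch that takes a numeric key directly. Deriving the shift from the bit length of the square, rather than fixing it at 7, is an assumption of this sketch, as are the names MidSquareCheck and midSquareIndex.

```java
// Self-contained mid-square check for a numeric key and power-of-two table size.
class MidSquareCheck {
    static int midSquareIndex(long key, int tableSize) {
        // log2 of the table size, valid because tableSize is a power of two
        int maskBits = Integer.numberOfTrailingZeros(tableSize);
        long mask = tableSize - 1;
        long square = key * key; // assumes the key is small enough not to overflow a long
        // number of significant bits in the square
        int squareBits = 64 - Long.numberOfLeadingZeros(square);
        // drop the same number of bits from each side of the middle window
        int shiftBits = Math.max(0, (squareBits - maskBits) / 2);
        return (int) ((square >> shiftBits) & mask);
    }

    public static void main(String[] args) {
        System.out.println(Long.toBinaryString(3121L * 3121L)); // 100101001010000101100001
        System.out.println(midSquareIndex(3121, 1024));         // 322
    }
}
```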

Extraction

• Simply pull out a certain part of the key and use it as the index
  – example
    • SSN = 123-45-6789
    • index = middle of key = 456
    • alternative index = first, middle, last = 159

• Should try to choose a part of the key that is most likely unique
  – consider foreign student SSN
    • start with 999
    • probably not a great idea to extract the first three numbers
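
The two extraction choices from the example might look like this in Java. The sketch assumes the SSN arrives as a string in the form 123-45-6789; the class name SsnExtraction and its method names are invented for this example.

```java
// Extraction: use selected digits of the key directly as the index.
class SsnExtraction {
    // Removes the dashes, leaving the nine digits.
    static String digits(String ssn) {
        return ssn.replace("-", "");
    }

    // Middle three digits of the nine, e.g. 456 for 123-45-6789.
    static int middleThree(String ssn) {
        return Integer.parseInt(digits(ssn).substring(3, 6));
    }

    // First, middle, and last digit, e.g. 159 for 123-45-6789.
    static int firstMiddleLast(String ssn) {
        String d = digits(ssn);
        String picked = "" + d.charAt(0) + d.charAt(d.length() / 2) + d.charAt(d.length() - 1);
        return Integer.parseInt(picked);
    }

    public static void main(String[] args) {
        System.out.println(middleThree("123-45-6789"));     // 456
        System.out.println(firstMiddleLast("123-45-6789")); // 159
    }
}
```

Either result would still be reduced modulo the table size if the table has fewer than 1000 slots.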
