Hashing – Part I
CS 367 – Introduction to Data Structures
Searching
• Up to now the only way to find a key is to search through all or part of the data
– linked list: O(n)
– AVL tree: O(log n)
– binary search of array: O(log n)
• If there is a lot of data and/or the data is searched very often, these times can be long
– given the key, we would like to get the data directly
Hashing
• The solution to this problem is to put the key through a function that says exactly where the data is (or where it should be placed)
– this function is called a hash function
• h(key) = integer
– the integer obtained from a hash function can be used as an index into an array
• if the hash function is perfect – always generates a unique integer for different keys – the time to place and access data is O(1)
Hashing
[Figure: the keys A, M, and X pass through the hashing function and are placed in an array with indices 0 through 11]
Hashing Functions
• So what is the hashing function?
– the simplest hashing function is to use the division remainder
  • assume the array is 1000 elements in size
  • translate the data into a number, n
  • h(n) = n % 1000
Hashing Functions
• simple example
– consider a small school
– each student is tracked by a 4 digit ID number
– each student’s ID# begins with a digit for the year they started
  • 2000 -> 0, 2001 -> 1, 2002 -> 2, etc.
– all student records are stored in an array
  • maximum of 1000 students per year
– let’s look at records for all sophomores
  • assume they were freshmen in 2001
Hashing Functions
Array of sophomore records (indices 0 through 11 shown):
index 0: Mary’s records (ID #: 1000)
index 4: Pete’s records (ID #: 1004)
index 9: John’s records (ID #: 1009)
index 11: Amy’s records (ID #: 1011)
…
To find John’s record in the array:
1009 % 1000 = 9
Go to index number 9.
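The lookup above can be sketched in Java. This is a minimal illustration, not the course's code: the class name StudentTable, the constant TABLE_SIZE, and storing each record as a plain String are assumptions made for the example.

```java
// Sketch of the division-remainder hash from the school example.
// StudentTable and TABLE_SIZE are hypothetical names, not from the slides.
final class StudentTable {
    static final int TABLE_SIZE = 1000; // maximum of 1000 students per year

    // h(n) = n % 1000
    static int hash(int id) {
        return id % TABLE_SIZE;
    }

    public static void main(String[] args) {
        String[] records = new String[TABLE_SIZE];

        // Place each record at the index given by the hash function.
        records[hash(1000)] = "Mary's records";
        records[hash(1004)] = "Pete's records";
        records[hash(1009)] = "John's records";
        records[hash(1011)] = "Amy's records";

        // Finding John's record is a single array access at index 9.
        System.out.println(hash(1009) + ": " + records[hash(1009)]);
    }
}
```

No searching takes place: placing and finding a record both cost one hash computation and one array access.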
Generating n
• The previous example is rather simplistic in that it is hashing already unique integers
– seems kind of pointless
– maybe not if the integers are large
  • consider the UW’s 10 digit ID numbers
• Often it is desirable to hash some other kind of data
– a person’s name for example
Generating n
• How is a string converted into an integer?
– the simplest method is to add all of the ASCII values for each character together
– example
  • convert amy into an integer
    – a = 97; m = 109; y = 121
    – a + m + y = 327
– there are lots of other ways to convert strings to integers
  • what are a few of them?
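The "add the ASCII values" conversion might look like the following in Java. The class and method names (AsciiSum, toNumber) are made up for this sketch, and it assumes the key contains only ASCII characters.

```java
// Sketch: convert a string to an integer by summing its character codes.
// AsciiSum and toNumber are hypothetical names, not from the slides.
final class AsciiSum {
    static int toNumber(String key) {
        int sum = 0;
        for (int i = 0; i < key.length(); i++) {
            sum += key.charAt(i); // char widens to its numeric code, e.g. 'a' is 97
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(AsciiSum.toNumber("amy")); // 97 + 109 + 121 = 327
    }
}
```

Because addition ignores order, any rearrangement of the same letters (for example "may" and "amy") produces the same number, which is one reason to look at other conversions.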
Hashing Functions
• There are millions of possible hashing functions
– we will not be considering them all
– basically, anything you can think of to generate an integer could be used as a hashing function
• Mathematicians have spent lots of time and effort to come up with some basic methods that work pretty well
Division
• We have already seen the division method
– it involves taking the remainder of division
• h(key) = key % tableSize
• A few notes about making this work better
– table size should be a prime number
– usually a good method if very little is known about the keys
– the remaining methods will all use division as the final step in their calculation
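One way to see why a prime table size helps is to hash keys that share a common factor with the table size. The sketch below is only an illustration under assumed values: the keys 0, 100, 200, ..., 900 and the table sizes 100 and 97 are chosen for the example, and the class name PrimeTableDemo is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

// Sketch: count how many distinct indices a set of patterned keys produces.
// PrimeTableDemo and distinctIndices are hypothetical names.
final class PrimeTableDemo {
    // Hashes the keys 0, 100, 200, ..., 900 with h(key) = key % tableSize
    // and returns the number of different indices that result.
    static int distinctIndices(int tableSize) {
        Set<Integer> used = new HashSet<>();
        for (int key = 0; key <= 900; key += 100) {
            used.add(key % tableSize);
        }
        return used.size();
    }

    public static void main(String[] args) {
        System.out.println("table size 100: " + distinctIndices(100));
        System.out.println("table size 97:  " + distinctIndices(97));
    }
}
```

With a table size of 100 every one of these keys lands on index 0, while the prime size 97 spreads the same ten keys over ten different indices.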
Folding
• Separate the key into various equally sized parts and then recombine them
– usually with addition
• Two kinds of folding
– shift folding
  • just add the various parts together as they are
– boundary folding
  • reverse the order of every other part and add them together
Folding
• Consider a SSN as a key
– break it into 3 parts
  • first 3 digits, middle 3 digits, last 3 digits
• Shift folding example
– SSN = 123-45-6789
– first = 123; second = 456; third = 789
– h(key) = (first + second + third) % size
  • h(SSN) = 1368 % tableSize
• Boundary folding example
– R(second) reverses the digits of the second part: R(456) = 654
– h(key) = (first + R(second) + third) % size
– h(key) = (123 + 654 + 789) % size = 1566 % size
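Both kinds of folding can be sketched in Java as follows. This is an illustration under assumptions: the key is passed as a string of digits with the dashes already removed, and the class name Folding, the parameter partLen, and the table size 1009 used in main are choices made for the example.

```java
// Sketch of shift folding and boundary folding on a string of digits.
// Folding, partLen, and the table size 1009 are hypothetical choices.
final class Folding {
    // Shift folding: add the parts together as they are.
    static int shiftFold(String digits, int partLen, int tableSize) {
        int sum = 0;
        for (int i = 0; i < digits.length(); i += partLen) {
            int end = Math.min(i + partLen, digits.length());
            sum += Integer.parseInt(digits.substring(i, end));
        }
        return sum % tableSize;
    }

    // Boundary folding: reverse every other part before adding.
    static int boundaryFold(String digits, int partLen, int tableSize) {
        int sum = 0;
        boolean reverse = false; // first part is kept, second reversed, and so on
        for (int i = 0; i < digits.length(); i += partLen) {
            int end = Math.min(i + partLen, digits.length());
            String part = digits.substring(i, end);
            if (reverse) {
                part = new StringBuilder(part).reverse().toString();
            }
            sum += Integer.parseInt(part);
            reverse = !reverse;
        }
        return sum % tableSize;
    }

    public static void main(String[] args) {
        String ssn = "123456789"; // 123-45-6789 without the dashes
        System.out.println(shiftFold(ssn, 3, 1009));    // 1368 % 1009 = 359
        System.out.println(boundaryFold(ssn, 3, 1009)); // 1566 % 1009 = 557
    }
}
```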
Increasing Performance
• Consider using shifting and exclusive OR’ing to generate the key
– exclusive OR parts together to generate index
• Example
– consider the string abcdefgh
– if each part is a letter, just exclusive OR them
  • ‘a’ ^ ‘b’ ^ ‘c’ ^ ‘d’ ^ ‘e’ ^ ‘f’ ^ ‘g’ ^ ‘h’
– often, a character is represented by 8 bits
  • what’s the problem with this?
– might be better to exclusive OR chunks of the string
  • “abcd” ^ “efgh”
  • why were four characters chosen in this case?
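One likely answer to the first question can be seen by running the character-by-character version. The sketch below assumes 8-bit (ASCII) characters, and the names XorChars and xorChars are made up for the example.

```java
// Sketch: exclusive OR the characters of a key one at a time.
// XorChars and xorChars are hypothetical names, not from the slides.
final class XorChars {
    static int xorChars(String key) {
        int result = 0;
        for (int i = 0; i < key.length(); i++) {
            result ^= key.charAt(i); // combines only the low 8 bits for ASCII input
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(xorChars("abcdefgh")); // 8
        // Exclusive OR ignores order, so rearranged keys collide.
        System.out.println(xorChars("abc") == xorChars("cab")); // true
    }
}
```

For 8-bit characters the result can never exceed 255, so no more than 256 table slots are ever used no matter how large the table is. Combining four characters at a time fills a 32-bit int instead, which is a plausible reason for the chunk size of four.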
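Increasing Performance
int shiftFold(String key, int tableSize) {
    int result = 0;
    byte[] st = key.getBytes();
    // process the key four bytes at a time; four 8-bit characters fill a 32-bit int
    for (int i = 0; i < st.length; i += 4) {
        int chunk = 0;
        for (int j = 0; (j < 4) && (j + i < st.length); j++) {
            // shift first, then OR in the next byte, so no byte is pushed out of the int;
            // & 0xFF keeps a negative byte from sign-extending over the earlier bytes
            chunk = (chunk << 8) | (st[j + i] & 0xFF);
        }
        result = result ^ chunk;
    }
    // floorMod keeps the index non-negative even when result is negative
    return Math.floorMod(result, tableSize);
}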
Increasing Performance
• The performance could be increased even more if the table size was a power of 2
– can get rid of the modulo operation at the end
– modulo is an expensive calculation
– could just do a subtraction and an AND operation instead
  • index = result & (tableSize - 1)
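The subtraction-and-AND replacement for modulo can be sketched as below. The class name PowerOfTwoIndex and the power-of-two check are assumptions added for the example; the trick is only valid when the table size really is a power of 2.

```java
// Sketch: replace hash % tableSize with a subtraction and an AND.
// PowerOfTwoIndex is a hypothetical name, not from the slides.
final class PowerOfTwoIndex {
    static boolean isPowerOfTwo(int n) {
        return n > 0 && (n & (n - 1)) == 0;
    }

    static int index(int hash, int tableSize) {
        if (!isPowerOfTwo(tableSize)) {
            throw new IllegalArgumentException("table size must be a power of 2");
        }
        // tableSize - 1 has all of the low bits set, e.g. 1023 = 1111111111 in binary,
        // so the AND keeps exactly the bits that make up a valid index.
        return hash & (tableSize - 1);
    }

    public static void main(String[] args) {
        System.out.println(index(1368, 1024)); // same as 1368 % 1024 = 344
        System.out.println(index(-1, 1024));   // 1023: never negative, unlike -1 % 1024
    }
}
```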
Mid-Square Function
• Square the number and take the middle part as the index
– a string must first be converted to get the number to square
• The entire key gets used to generate the address
– less chance for conflicts
• This method works best if the table size is a power of two
Mid-Square Function
• Table size equals 1024 (2^10)
• The key is 3121
– 3121^2 = 9740641 = (100101001010000101100001)2
– the middle 10 bits of this value are 0101000010
• Index in array is
– (0101000010)2 = 322
• This is all very quick and easy to calculate using mask and shift operations
Mid-Square Function
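int shiftBits = 7;    // low bits to drop: (24 - 10) / 2 for a 24-bit square and a 1024-entry table

// table size must be a power of two
int midSquare(String key, int tableSize) {
    int mask = tableSize - 1;    // e.g. 1023, ten 1 bits, for tableSize = 1024
    int n = stringToNum(key);    // stringToNum converts the key to an integer
    int square = n * n;
    // move the middle bits down to the low end, then mask off everything above them
    return (square >> shiftBits) & mask;
}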
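The worked example with the key 3121 can be checked with a self-contained sketch. It takes the key as an integer rather than a string, and the class name MidSquare and the squareBits parameter (how many bits the squared value is assumed to occupy) are inventions for this example, not part of the slides.

```java
// Sketch of the mid-square method for a table whose size is a power of two.
// MidSquare and squareBits are hypothetical names, not from the slides.
final class MidSquare {
    static int midSquare(int key, int squareBits, int tableSize) {
        // For a power of two, the number of trailing zero bits is log2(tableSize).
        int indexBits = Integer.numberOfTrailingZeros(tableSize);
        // Drop the same number of bits on each side of the middle.
        int shiftBits = (squareBits - indexBits) / 2;
        long square = (long) key * key; // long avoids overflow for larger keys
        return (int) ((square >> shiftBits) & (tableSize - 1));
    }

    public static void main(String[] args) {
        // 3121^2 = 9740641, a 24-bit value; the middle 10 bits are 0101000010.
        System.out.println(midSquare(3121, 24, 1024)); // 322
    }
}
```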
Extraction
• Simply pull out a certain part of the key and use it as the index
– example
  • SSN = 123-45-6789
  • index = middle of key = 456
  • alternative index = first, middle, and last digits = 159
• Should try to choose a part of the key that is most likely unique
– consider foreign student SSNs
– they start with 999
  • probably not a great idea to extract the first three numbers
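Both extraction choices from the example can be sketched in Java. The class and method names (Extraction, middleThree, firstMiddleLast) are hypothetical, and the code assumes a 9 digit SSN written with or without dashes.

```java
// Sketch of extraction: use selected digits of the key as the index.
// Extraction, middleThree, and firstMiddleLast are hypothetical names.
final class Extraction {
    // Strip the dashes so the digits can be addressed by position.
    static String digitsOnly(String ssn) {
        return ssn.replaceAll("[^0-9]", "");
    }

    // Index = middle three digits of the key.
    static int middleThree(String ssn) {
        String d = digitsOnly(ssn);
        return Integer.parseInt(d.substring(3, 6));
    }

    // Alternative index = first, middle, and last digit of the key.
    static int firstMiddleLast(String ssn) {
        String d = digitsOnly(ssn);
        int mid = d.length() / 2;
        return Integer.parseInt("" + d.charAt(0) + d.charAt(mid) + d.charAt(d.length() - 1));
    }

    public static void main(String[] args) {
        System.out.println(middleThree("123-45-6789"));     // 456
        System.out.println(firstMiddleLast("123-45-6789")); // 159
    }
}
```

As with the other methods, the extracted value would still be reduced with % tableSize if it can exceed the table size.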