Chapter 5: Hashing
• Hash Table ADT
• Hash Functions
• Collision Resolution
• Rehashing
CS 340 Page 1
Hashing
Hashing is a technique for performing searches, insertions, and deletions on a list in constant time on average.
A particular component of each data element being stored is used as a key, which is mapped to a particular cell in a hash table.
Problems arise when collisions occur, i.e., when multiple data elements are mapped to the same cell.
The term “hash” was coined to illustrate the analogy between hashing and the culinary practice of chopping and mixing ingredients to make a hash. Essentially, the input domain is “chopped” into several subdomains, which are then “mixed” into the output range to improve the uniformity of their distribution.
The Hash Table Abstract Data Type
A hash table is a list of keys, mapped to particular cells via a hash function.
• The table is implemented as a fixed-size array.
• The table size and hash function are strategically chosen to avoid collisions.
Example: A hash table to hold the CS Department faculty and staff
Hash function: length of last name
Table size: 11
Collision-free cells: 9
Ave. # comparisons per name: 1.50
Occupied cells (names sharing a cell are joined by hyphens): Yu; Wang; Klein-Mayer-White; Stefik; Bouvier-Cummins-Ehlmann; Fujinoki; Tornaritis; Bartholomew

Hash function: (Office room #) % 15
Table size: 15
Collision-free cells: 12
Ave. # comparisons per name: 1.42
Occupied cells: Klein; Stefik; Fujinoki-Yu; Ehlmann-Tornaritis; White; Mayer; Bouvier; Bartholomew-Cummins-Wang

Hash function: ((Sum of office room # digits) * (# of vowels in last name) + (Last 2 digits of office phone #)) % 25
Table size: 25
Collision-free cells: 24
Ave. # comparisons per name: 1.08
Occupied cells: Wang; Stefik; Klein; Tornaritis; Ehlmann; Bouvier-Mayer; Yu; Cummins; Fujinoki; White; Bartholomew
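The statistics for the first hash function (length of last name, table size 11) can be checked with a short sketch. It assumes the name length is reduced modulo the table size, that the k-th name placed in a cell costs k comparisons to find, and that a cell counts as collision-free when it holds at most one name; the function names are illustrative, not part of the slides.

```python
from collections import defaultdict

# Last names taken from the first example table.
NAMES = ["Yu", "Wang", "Klein", "Mayer", "White", "Stefik", "Bouvier",
         "Cummins", "Ehlmann", "Fujinoki", "Tornaritis", "Bartholomew"]
TABLE_SIZE = 11


def name_length_hash(name, table_size):
    # Assumption: the length is wrapped into the table with a modulo.
    return len(name) % table_size


def build_cells(names, table_size):
    # Map each cell index to the list of names that hash to it.
    cells = defaultdict(list)
    for name in names:
        cells[name_length_hash(name, table_size)].append(name)
    return cells


def average_comparisons(cells):
    # The k-th name stored in a cell needs k comparisons to be found.
    total = sum(k for chain in cells.values()
                for k in range(1, len(chain) + 1))
    count = sum(len(chain) for chain in cells.values())
    return total / count


def collision_free_cells(cells, table_size):
    # Cells that are empty or hold exactly one name.
    return sum(1 for i in range(table_size) if len(cells.get(i, [])) <= 1)


cells = build_cells(NAMES, TABLE_SIZE)
print(average_comparisons(cells))               # average comparisons per name
print(collision_free_cells(cells, TABLE_SIZE))  # cells without a collision
```

Under these assumptions the sketch reproduces the figures quoted for the first table: 1.50 comparisons per name and 9 cells without a collision.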
Choosing The Hash Table Size
Define the load factor, λ, of a hash table to be the ratio of the number of elements in the hash table to the table size.
• If λ > 1, then collisions are inevitable, so it is wise to choose a table size greater than the number of anticipated elements.
• If λ << 1, then there will be a very large number of empty slots, lessening the probability of a collision, but wasting a lot of memory.
Example: A hash table to hold the 2011 holidays
Table size: 12
Hash function: Month #
Load factor: 1.5
Cell contents (holidays sharing a cell are joined by hyphens):
1: New Year’s Day - Martin Luther King, Jr. Day
2: Lincoln’s Birthday - Valentine’s Day - Washington’s Birthday
3: St. Patrick’s Day
4: Easter Sunday
5: Mother’s Day - Memorial Day
6: Flag Day - Father’s Day
7: Independence Day
8: (empty)
9: Labor Day
10: Columbus Day - Halloween
11: Veterans Day - Thanksgiving
12: Christmas

Table size: 43
Hash function: (Month #) + (Day #)
Load factor: 0.419
Occupied cells: New Year’s Day; Independence Day; Labor Day; Mother’s Day; Lincoln’s Birthday; Valentine’s Day; Martin Luther King, Jr. Day; St. Patrick’s Day - Flag Day - Columbus Day; Veterans Day; Washington’s Birthday; Father’s Day; Easter Sunday; Memorial Day - Thanksgiving; Christmas; Halloween
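The two load factors in the holiday example follow directly from the definition. A minimal sketch, assuming the 18 holidays shown in the example tables (the function name is illustrative):

```python
def load_factor(num_elements, table_size):
    """Ratio of stored elements to table cells."""
    return num_elements / table_size


HOLIDAYS = 18  # number of 2011 holidays in the example

for size in (12, 43):
    # Prints the table size and its load factor, rounded to 3 places.
    print(size, round(load_factor(HOLIDAYS, size), 3))
```

With 12 cells the load factor exceeds 1, so at least one cell must hold more than one holiday no matter which hash function is used; with 43 cells most of the table stays empty.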
Choosing The Hash Function
Given a particular hash table size and a particular type of data, the hash function should be chosen to minimize the number of collisions.
This usually requires an in-depth analysis of the keys expected to go in the table.
Example: A hash table to hold CS undergraduate course enrollment statistics
There are 24 active courses: 108, 140, 145, 150, 240, 275, 312, 314, 321, 325, 330, 340, 390, 423, 425, 434, 438, 447, 454, 456, 482, 490, 495, and 499.
Summing the digits yields: 9, 5, 10, 6, 6, 14, 6, 8, 6, 10, 6, 7, 12, 9, 11, 11, 15, 15, 13, 15, 14, 13, 18, and 22. (Table size: 18; 6 non-collisions.)
Summing (one’s digit) + 2*(ten’s digit) + 4*(100’s digit) yields: 12, 12, 17, 14, 16, 27, 16, 18, 17, 21, 18, 20, 30, 23, 25, 26, 30, 31, 30, 32, 34, 34, 39, and 43. (Table size: 32; 11 non-collisions.)
Summing (one’s digit) + 3*(ten’s digit) + 9*(100’s digit) yields: 17, 21, 26, 24, 30, 44, 32, 34, 34, 38, 36, 39, 54, 45, 47, 49, 53, 55, 55, 57, 62, 63, 68, and 72. (Table size: 56; 20 non-collisions.)
Summing (100’s digit) + 3*(ten’s digit) + 9*(one’s digit) yields: 73, 13, 58, 16, 14, 68, 24, 42, 18, 54, 12, 15, 30, 37, 55, 49, 85, 79, 55, 73, 46, 31, 76, and 112. (Table size: 100; 20 non-collisions.)
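The four candidate hash functions can be compared mechanically. The sketch below assumes that a "non-collision" is a course whose hash value is shared with no other course; the helper names are illustrative.

```python
from collections import Counter

COURSES = [108, 140, 145, 150, 240, 275, 312, 314, 321, 325, 330, 340,
           390, 423, 425, 434, 438, 447, 454, 456, 482, 490, 495, 499]


def digits(course):
    # Split a three-digit course number into (hundreds, tens, ones).
    return course // 100, (course // 10) % 10, course % 10


# Each candidate takes the hundreds, tens, and ones digits.
HASH_FUNCTIONS = {
    "digit sum": lambda h, t, o: h + t + o,
    "ones + 2*tens + 4*hundreds": lambda h, t, o: o + 2 * t + 4 * h,
    "ones + 3*tens + 9*hundreds": lambda h, t, o: o + 3 * t + 9 * h,
    "hundreds + 3*tens + 9*ones": lambda h, t, o: h + 3 * t + 9 * o,
}


def hash_values(hash_fn, courses=COURSES):
    return [hash_fn(*digits(course)) for course in courses]


def non_collisions(hash_fn, courses=COURSES):
    # Count the courses whose hash value occurs exactly once.
    values = hash_values(hash_fn, courses)
    counts = Counter(values)
    return sum(1 for value in values if counts[value] == 1)


for label, fn in HASH_FUNCTIONS.items():
    print(label, non_collisions(fn))
```

For this course list, both weightings that use the factors 3 and 9 leave 20 of the 24 keys without a collision, far more than the plain digit sum.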
Collision Resolution
What should be done when a collision does occur?
There are two main strategies: separate chaining and probing.
Separate Chaining
With separate chaining, the hash table is an array of linked lists, with each linked list containing all of the elements that map to the same value.
Disadvantages (λ is the load factor):
• Average successful search: 1 + (λ/2) comparisons
• Average unsuccessful search: λ comparisons
• Worst case search: n comparisons (for a bad hash function)
[Figure: the 2011 holidays stored in a hash table with separate chaining; holidays that hash to the same cell are linked into one list.]
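Separate chaining can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: Python lists stand in for the linked lists, keys are assumed to be integers hashed with key mod table size, and the class and method names are hypothetical.

```python
class ChainedHashTable:
    """Hash table in which every cell holds a chain of colliding keys."""

    def __init__(self, table_size):
        # One (initially empty) chain per cell; a list stands in
        # for the linked list described in the text.
        self.buckets = [[] for _ in range(table_size)]
        self.count = 0

    def _index(self, key):
        # Assumption: integer keys, hashed by key mod table size.
        return key % len(self.buckets)

    def insert(self, key):
        chain = self.buckets[self._index(key)]
        if key not in chain:
            chain.append(key)
            self.count += 1

    def contains(self, key):
        # Only the chain for the key's cell has to be searched.
        return key in self.buckets[self._index(key)]

    def load_factor(self):
        return self.count / len(self.buckets)


table = ChainedHashTable(10)
for year in (1492, 1812, 1992, 1776):
    table.insert(year)
print(table.buckets[2])      # keys that collided in cell 2
print(table.load_factor())
```

Collisions never fail here; they only lengthen a chain, which is why the search cost grows with the load factor.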
Collision Resolution (Continued)
Probing
With probing, the hash table is an array of values, with a whole series of cells probed until no collision occurs (i.e., cells h₀(x), h₁(x), h₂(x), … are tried, where hᵢ(x) = (Hash(x) + f(i)) mod tablesize, with f(0) = 0).
Linear Probing: f(i) is a linear function
Example: f(i) = 3i and Hash(x) = x, with a table size of 10
insert 1492 (slot 2)
insert 1776 (slot 6)
insert 1812 (slot 2 → slot 5)
insert 1945 (slot 5 → slot 8)
insert 1968 (slot 8 → slot 1)
insert 1992 (slot 2 → slot 5 → slot 8 → slot 1 → slot 4)
Problems With Linear Probing:
• Coefficient and table size must be relatively prime or free cells may not be found.
• Bad tendency to experience primary clustering, resulting in many collisions.
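The linear probing example can be traced with a short sketch. It assumes a table of 10 cells (inferred from the slot numbers in the example) and uses illustrative function names; `reachable_cells` is added only to show why the coefficient and the table size should be relatively prime.

```python
TABLE_SIZE = 10  # inferred from the slots shown in the example


def linear_probe_insert(table, key, step=3):
    """Insert key with h_i(x) = (x + step*i) mod tablesize.

    Returns the list of slots tried, ending with the slot used.
    """
    size = len(table)
    tried = []
    for i in range(size):
        slot = (key + step * i) % size
        tried.append(slot)
        if table[slot] is None:
            table[slot] = key
            return tried
    raise RuntimeError("probe sequence found no free cell")


def reachable_cells(start, step, size):
    # Every cell the probe sequence can ever visit from a given start.
    return {(start + step * i) % size for i in range(size)}


table = [None] * TABLE_SIZE
history = {key: linear_probe_insert(table, key)
           for key in (1492, 1776, 1812, 1945, 1968, 1992)}
print(history[1992])                      # slots tried for the last insert
print(len(reachable_cells(2, 3, 10)))     # step 3 and size 10 share no factor
print(len(reachable_cells(2, 2, 10)))     # step 2 and size 10 share a factor
```

With a step of 3 every one of the 10 cells is reachable, whereas a step of 2 would only ever visit 5 of them, so half of the table could be free and still never be found.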
Collision Resolution (Continued)
Quadratic Probing: f(i) is a quadratic function
Example: f(i) = 2i² and Hash(x) = x, with a table size of 10
insert 1492 (slot 2)
insert 1776 (slot 6)
insert 1812 (slot 2 → slot 4)
insert 1945 (slot 5)
insert 1968 (slot 8)
insert 1992 (slot 2 → slot 4 → slot 0)
Problems With Quadratic Probing:
• Coefficient and table size must be carefully chosen or free cells may be ignored.
• Bad tendency to experience secondary clustering, since keys with the same original hashed value will follow the same sequence of cells through the table.
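The quadratic probing example, and the warning that free cells may be ignored, can both be seen in a small sketch. It again assumes a table of 10 cells, and the function names are illustrative.

```python
TABLE_SIZE = 10  # inferred from the slots shown in the example


def quadratic_probe_insert(table, key, coefficient=2):
    """Insert key with h_i(x) = (x + coefficient*i*i) mod tablesize.

    Returns the list of slots tried, ending with the slot used.
    """
    size = len(table)
    tried = []
    for i in range(size):
        slot = (key + coefficient * i * i) % size
        tried.append(slot)
        if table[slot] is None:
            table[slot] = key
            return tried
    raise RuntimeError("probe sequence found no free cell")


def reachable_cells(start, coefficient, size):
    # Cells visited by the first `size` probes from a given start.
    return {(start + coefficient * i * i) % size for i in range(size)}


table = [None] * TABLE_SIZE
history = {key: quadratic_probe_insert(table, key)
           for key in (1492, 1776, 1812, 1945, 1968, 1992)}
print(history[1992])                       # slots tried for the last insert
print(sorted(reachable_cells(2, 2, 10)))   # cells a key hashing to 2 can reach
```

For this choice of coefficient and table size, a key that hashes to slot 2 can only ever land in three distinct cells, which is the sense in which free cells may be ignored.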
Collision Resolution (Continued)
Double Hashing: f(i) is a second hash function, multiplied by an iterative value
Example: f(i) = i·Hash2(x), where Hash2(x) = 7 - (x mod 7), and Hash(x) = x, with a table size of 10
insert 1492 (slot 2)
insert 1776 (slot 6)
insert 1812 (slot 2 → slot 3)
insert 1945 (slot 5)
insert 1968 (slot 8)
insert 1992 (slot 2 → slot 5 → slot 8 → slot 1)
Problems With Double Hashing:
• A strategic choice must be made for both hashing functions.
• Calculation will be much more expensive in the event of a collision.
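The double hashing example can be traced the same way. The sketch assumes a table of 10 cells and uses illustrative function names; the second hash function is the one given in the example.

```python
TABLE_SIZE = 10  # inferred from the slots shown in the example


def second_hash(key):
    # Hash2(x) = 7 - (x mod 7); the result is always between 1 and 7,
    # so the probe step is never zero.
    return 7 - key % 7


def double_hash_insert(table, key):
    """Insert key with h_i(x) = (x + i*Hash2(x)) mod tablesize.

    Returns the list of slots tried, ending with the slot used.
    """
    size = len(table)
    step = second_hash(key)
    tried = []
    for i in range(size):
        slot = (key + i * step) % size
        tried.append(slot)
        if table[slot] is None:
            table[slot] = key
            return tried
    raise RuntimeError("probe sequence found no free cell")


table = [None] * TABLE_SIZE
history = {key: double_hash_insert(table, key)
           for key in (1492, 1776, 1812, 1945, 1968, 1992)}
print(second_hash(1812), history[1812])   # step and slots for 1812
print(second_hash(1992), history[1992])   # step and slots for 1992
```

Although 1812 and 1992 both hash to slot 2, their second hash values differ, so they follow different probe sequences; this is what separates double hashing from the secondary clustering seen with quadratic probing.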
Rehashing
When a hash table starts getting too full, with many delays caused by repeated collisions, rehashing the values into a new, larger table with a new hash function may alleviate the problem.
Insert: Bush2 (2001), Clinton (1993), Bush (1989), Reagan (1981), Carter (1977), Ford (1974), Nixon (1969), Johnson (1963), Kennedy (1961), Eisenhower (1953)
Hash(president) = first_year_in_office mod 11
[Figure: the ten presidents stored in the 11-cell table.]
Inserting Obama (2009) would cause a collision in slot 7, so…
REHASH
Hash(president) = first_year_in_office mod 23
[Figure: all eleven presidents, Obama included, stored in the 23-cell table.]
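A rehash amounts to allocating a larger array and reinserting every stored element with the new hash function. The sketch below is one possible arrangement, not the slides' own: it assumes collisions are resolved by linear probing with f(i) = i (the example does not name the strategy), and the class and method names are hypothetical.

```python
class ProbingTable:
    """Open-addressing table keyed by year, using first_year mod tablesize."""

    def __init__(self, size):
        self.slots = [None] * size
        self.count = 0

    def insert(self, year, name):
        # Assumption: linear probing with f(i) = i.
        size = len(self.slots)
        for i in range(size):
            slot = (year + i) % size
            if self.slots[slot] is None:
                self.slots[slot] = (year, name)
                self.count += 1
                return slot
        raise RuntimeError("table is full")

    def rehash(self, new_size):
        # Collect the stored entries, then reinsert them into a larger
        # array so each is placed by the new "mod new_size" hash.
        entries = [entry for entry in self.slots if entry is not None]
        self.slots = [None] * new_size
        self.count = 0
        for year, name in entries:
            self.insert(year, name)


PRESIDENTS = [(2001, "Bush2"), (1993, "Clinton"), (1989, "Bush"),
              (1981, "Reagan"), (1977, "Carter"), (1974, "Ford"),
              (1969, "Nixon"), (1963, "Johnson"), (1961, "Kennedy"),
              (1953, "Eisenhower")]

table = ProbingTable(11)
for year, name in PRESIDENTS:
    table.insert(year, name)

# Slot 2009 mod 11 = 7 is already occupied, so rehash before inserting.
occupied_before = table.slots[2009 % 11] is not None
table.rehash(23)
table.insert(2009, "Obama")
print(occupied_before, table.count, len(table.slots))
```

Doubling the size to roughly twice the old one (here 11 to 23) is a common choice because it drops the load factor to about one half while keeping the table size prime.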
Associative Arrays
Hashing is used extensively in modern programming, particularly in database management, network security, and operating systems.
Consequently, most modern programming languages have built-in mechanisms for implementing associative arrays, i.e., dictionaries based on the key-value concept of hash tables.

C++ “map”:
    #include <map>
    #include <string>
    using namespace std;

    int main()
    {
        map<string, string> phone_book;
        phone_book["Sally Smart"] = "555-9999";
        phone_book["John Doe"] = "555-1212";
        phone_book["J. Random Hacker"] = "553-1337";
    }

Java “map”:
    import java.util.HashMap;
    import java.util.Map;

    Map<String, String> phoneBook = new HashMap<String, String>();
    phoneBook.put("Sally Smart", "555-9999");
    phoneBook.put("John Doe", "555-1212");
    phoneBook.put("J. Random Hacker", "555-1337");

Lua “table”:
    phone_book = {
        ["Sally Smart"] = "555-9999",
        ["John Doe"] = "555-1212",
        ["J. Random Hacker"] = "553-1337", -- Trailing comma is OK
    }

    aTable = {
        -- Table as value
        subTable = { 5, 7.5, k = true }, -- key is "subTable"
        -- Function as value
        ['John Doe'] = function (age)
            if age < 18 then return "Young" else return "Old!" end
        end,
        -- Table and function (and other types) can also be used as keys
    }

Perl “hash”:
    %phone_book = (
        'Sally Smart' => '555-9999',
        'John Doe' => '555-1212',
        'J. Random Hacker' => '553-1337',
    );

Python “dictionary”:
    phonebook = {
        'Sally Smart' : '555-9999',
        'John Doe' : '555-1212',
        'J. Random Hacker' : '553-1337'
    }

Ruby “hash”:
    phonebook = {
        'Sally Smart' => '555-9999',
        'John Doe' => '555-1212',
        'J. Random Hacker' => '553-1337'
    }