
Chapter 5: Hashing

CS 340

• Hash Table ADT
• Hash Functions
• Collision Resolution
• Rehashing


Hashing

Hashing is a technique for performing searches, insertions, and deletions on a list in constant average time.

A particular component of each data element being stored is used as a key, which is mapped to a particular cell in a hash table.

Problems arise when collisions occur, i.e., when multiple data elements are mapped to the same cell.

The term “hash” was coined to illustrate the analogy between hashing and the culinary practice of chopping and mixing ingredients to make a hash. Essentially, the input domain is “chopped” into several subdomains, which are then “mixed” into the output range to improve the uniformity of their distribution.


The Hash Table Abstract Data Type

A hash table is a list of keys, mapped to particular cells via a hash function.
• The table is implemented as a fixed-size array.
• The table size and hash function are strategically chosen to avoid collisions.

Example: A hash table to hold the CS Department faculty and staff

Hash function: length of last name
Table size: 11
Collision-free keys: 9
Ave. # comparisons per name: 1.50
Occupied cells (names sharing a cell are joined by hyphens):
• Yu
• Wang
• Klein-Mayer-White
• Stefik
• Bouvier-Cummins-Ehlmann
• Fujinoki
• Tornaritis
• Bartholomew

Hash function: (Office room #) % 15
Table size: 15
Occupied cells (names sharing a cell are joined by hyphens):
• Klein
• Stefik
• Fujinoki-Yu
• Ehlmann-Tornaritis
• White
• Mayer
• Bouvier
• Bartholomew-Cummins-Wang

Hash function: ((Sum of office room # digits) * (# of vowels in last name) + (Last 2 digits of office phone #)) % 25
Table size: 25
Occupied cells (names sharing a cell are joined by hyphens):
• Wang
• Stefik
• Klein
• Tornaritis
• Ehlmann
• Bouvier-Mayer
• Yu
• Cummins
• Fujinoki
• White
• Bartholomew
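The "Ave. # comparisons per name" figure for the length-of-last-name example can be checked with a short sketch. It assumes that names sharing a cell are searched in the order listed, and that the length is reduced modulo the table size (so an 11-letter name lands in cell 0); neither detail is stated on the slide.

```python
from collections import defaultdict

# Last names as they appear in the example table.
NAMES = ["Yu", "Wang", "Klein", "Mayer", "White", "Stefik", "Bouvier",
         "Cummins", "Ehlmann", "Fujinoki", "Tornaritis", "Bartholomew"]
TABLE_SIZE = 11

def name_length_hash(name, table_size=TABLE_SIZE):
    # Assumption: wrap the length around the table size.
    return len(name) % table_size

def average_comparisons(names, hash_fn):
    cells = defaultdict(list)
    for name in names:
        cells[hash_fn(name)].append(name)
    # The k-th name stored in a cell costs k comparisons to find.
    total = sum(k for chain in cells.values() for k in range(1, len(chain) + 1))
    return total / len(names)

print(average_comparisons(NAMES, name_length_hash))  # 1.5
```

The two three-name cells (lengths 5 and 7) account for 12 of the 18 total comparisons, which is what pushes the average above 1.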


Choosing The Hash Table Size

Define the load factor, λ, of a hash table to be the ratio of the number of elements in the hash table to the table size.

• If λ > 1, then collisions are inevitable, so it is wise to choose a table size greater than the number of anticipated elements.

• If λ << 1, then there will be a very large number of empty slots, lessening the probability of a collision, but wasting a lot of memory.

Example: A hash table to hold the 2011 holidays

Table size: 12
Hash function: Month #
Load factor: 1.5
Occupied cells (holidays sharing a cell are joined by dashes):
• New Year’s Day - Martin Luther King, Jr. Day
• Lincoln’s Birthday - Valentine’s Day - Washington’s Birthday
• St. Patrick’s Day
• Easter Sunday
• Mother’s Day - Memorial Day
• Flag Day - Father’s Day
• Independence Day
• Labor Day
• Columbus Day - Halloween
• Veterans Day - Thanksgiving
• Christmas

Table size: 43
Hash function: (Month #) + (Day #)
Load factor: 0.419
Occupied cells (holidays sharing a cell are joined by dashes):
• New Year’s Day
• Independence Day
• Labor Day
• Mother’s Day
• Lincoln’s Birthday
• Valentine’s Day
• Martin Luther King, Jr. Day
• St. Patrick’s Day - Flag Day - Columbus Day
• Veterans Day
• Washington’s Birthday
• Father’s Day
• Easter Sunday
• Memorial Day - Thanksgiving
• Christmas
• Halloween
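The two load factors quoted for the holiday tables follow directly from the definition. A minimal sketch, assuming the 18 holidays named in the example are the only entries:

```python
def load_factor(num_elements, table_size):
    """Ratio of stored elements to available cells."""
    return num_elements / table_size

NUM_HOLIDAYS = 18  # holidays named in the example

print(round(load_factor(NUM_HOLIDAYS, 12), 3))  # 1.5
print(round(load_factor(NUM_HOLIDAYS, 43), 3))  # 0.419
```

With 12 cells the load factor exceeds 1, so some cell must hold more than one holiday no matter which hash function is used.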


Choosing The Hash Function

Given a particular hash table size and a particular type of data, the hash function should be chosen to minimize the number of collisions.

This usually requires an in-depth analysis of the keys expected to go in the table.

Example: A hash table to hold CS undergraduate course enrollment statistics

There are 24 active courses: 108, 140, 145, 150, 240, 275, 312, 314, 321, 325, 330, 340, 390, 423, 425, 434, 438, 447, 454, 456, 482, 490, 495, and 499.

In each case below, a value is reduced modulo the table size, and a key counts as a non-collision when no other key maps to its cell.

• Summing the digits yields: 9, 5, 10, 6, 6, 14, 6, 8, 6, 10, 6, 7, 12, 9, 11, 11, 15, 15, 13, 15, 14, 13, 18, and 22. (Table size: 18; 6 non-collisions.)

• Summing (one’s digit) + 2*(ten’s digit) + 4*(100’s digit) yields: 12, 12, 17, 14, 16, 27, 16, 18, 17, 21, 18, 20, 30, 23, 25, 26, 30, 31, 30, 32, 34, 34, 39, and 43. (Table size: 32; 11 non-collisions.)

• Summing (one’s digit) + 3*(ten’s digit) + 9*(100’s digit) yields: 17, 21, 26, 24, 30, 44, 32, 34, 34, 38, 36, 39, 54, 45, 47, 49, 53, 55, 55, 57, 62, 63, 68, and 72. (Table size: 56; 20 non-collisions.)

• Summing (100’s digit) + 3*(ten’s digit) + 9*(one’s digit) yields: 73, 13, 58, 16, 14, 68, 24, 42, 18, 54, 12, 15, 30, 37, 55, 49, 85, 79, 55, 73, 46, 31, 76, and 112. (Table size: 100; 18 non-collisions, since 112 wraps around to cell 12 and collides with the value for course 330.)
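The non-collision count for a weighted-digit hash function can be checked mechanically. The sketch below is one way to do it, assuming that values are reduced modulo the table size and that a key is a non-collision when it is alone in its cell; the function names are illustrative, not from the slides.

```python
from collections import Counter

COURSES = [108, 140, 145, 150, 240, 275, 312, 314, 321, 325, 330, 340,
           390, 423, 425, 434, 438, 447, 454, 456, 482, 490, 495, 499]

def weighted_digit_hash(course, weights, table_size):
    """Weighted sum of the (hundreds, tens, ones) digits, mod table size."""
    hundreds, tens, ones = course // 100, (course // 10) % 10, course % 10
    w_hundreds, w_tens, w_ones = weights
    return (w_hundreds * hundreds + w_tens * tens + w_ones * ones) % table_size

def non_collisions(courses, weights, table_size):
    """Count the keys that end up alone in their cell."""
    cells = Counter(weighted_digit_hash(c, weights, table_size) for c in courses)
    return sum(1 for count in cells.values() if count == 1)

# (one's digit) + 3*(ten's digit) + 9*(100's digit), table size 56.
print(non_collisions(COURSES, weights=(9, 3, 1), table_size=56))  # 20
```

Changing `weights` and `table_size` lets the other candidate functions be compared on the same key set.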


Collision Resolution

What should be done when a collision does occur? There are two main strategies: separate chaining and probing.

Separate Chaining

With separate chaining, the hash table is an array of linked lists, with each linked list containing all of the elements that map to the same value.

Disadvantages:
• Average successful search: 1 + (λ/2) comparisons
• Average unsuccessful search: λ comparisons
• Worst case search: n comparisons (for a bad hash function)

[Figure: a separate-chaining hash table holding the 2011 holidays]
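A minimal sketch of separate chaining is shown below. Python lists stand in for the linked lists, and the keys, table size, and identity hash function are hypothetical values chosen only to force a few collisions.

```python
class ChainedHashTable:
    """Minimal separate-chaining table; lists stand in for linked lists."""

    def __init__(self, table_size, hash_fn):
        self.table_size = table_size
        self.hash_fn = hash_fn
        self.cells = [[] for _ in range(table_size)]
        self.count = 0

    def insert(self, key):
        chain = self.cells[self.hash_fn(key) % self.table_size]
        if key not in chain:
            chain.append(key)
            self.count += 1

    def comparisons_to_find(self, key):
        """Return (found, number of keys examined in the chain)."""
        chain = self.cells[self.hash_fn(key) % self.table_size]
        for examined, stored in enumerate(chain, start=1):
            if stored == key:
                return True, examined
        return False, len(chain)

    def load_factor(self):
        return self.count / self.table_size

# Hypothetical keys and hash function, for illustration only.
table = ChainedHashTable(table_size=5, hash_fn=lambda key: key)
for key in [10, 15, 20, 7, 12]:
    table.insert(key)

print(table.cells)                    # [[10, 15, 20], [], [7, 12], [], []]
print(table.comparisons_to_find(20))  # (True, 3)
print(table.comparisons_to_find(25))  # (False, 3)
print(table.load_factor())            # 1.0
```

An unsuccessful search has to walk the whole chain in the key's cell, which is why its average cost tracks the load factor.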


Collision Resolution (Continued)

Probing

With probing, the hash table is an array of values, with a whole series of cells probed until no collision occurs (i.e., cells h0(x), h1(x), h2(x), … are tried, where hi(x) = (Hash(x) + f(i)) mod tablesize, with f(0) = 0).

Linear Probing: f(i) is a linear function

Example: f(i) = 3i and Hash(x) = x (the slot numbers below imply a table size of 10)

• insert 1492 (slot 2)
• insert 1776 (slot 6)
• insert 1812 (slot 2 → slot 5)
• insert 1945 (slot 5 → slot 8)
• insert 1968 (slot 8 → slot 1)
• insert 1992 (slot 2 → slot 5 → slot 8 → slot 1 → slot 4)

Problems With Linear Probing:
• Coefficient and table size must be relatively prime or free cells may not be found.
• Bad tendency to experience primary clustering, resulting in many collisions.
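The linear probing example can be traced with a short sketch. The table size of 10 is an assumption inferred from the slot numbers in the example, and the function name is illustrative.

```python
def linear_probe_insert(table, key, step=3):
    """Insert key using h_i(x) = (x + step*i) mod table size.

    Returns the list of slots tried, ending with the slot used.
    """
    size = len(table)
    tried = []
    for i in range(size):
        slot = (key + step * i) % size
        tried.append(slot)
        if table[slot] is None:
            table[slot] = key
            return tried
    raise RuntimeError("no free cell found along the probe sequence")

# Table size 10 is inferred from the example, not stated in it.
table = [None] * 10
for year in [1492, 1776, 1812, 1945, 1968, 1992]:
    print(year, linear_probe_insert(table, year))
# 1992 tries slots 2, 5, 8, 1 and finally lands in slot 4.
```

Because 3 and 10 are relatively prime, the probe sequence visits every cell before repeating, so a free cell is always found while one exists.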


Quadratic Probing: f(i) is a quadratic function

Example: f(i) = 2i^2 and Hash(x) = x (the slot numbers below imply a table size of 10)

• insert 1492 (slot 2)
• insert 1776 (slot 6)
• insert 1812 (slot 2 → slot 4)
• insert 1945 (slot 5)
• insert 1968 (slot 8)
• insert 1992 (slot 2 → slot 4 → slot 0)

Problems With Quadratic Probing:
• Coefficient and table size must be carefully chosen or free cells may be ignored.
• Bad tendency to experience secondary clustering, since keys with the same original hashed value will follow the same sequence of cells through the table.
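The quadratic probing example, and the "free cells may be ignored" problem, can both be seen in a short sketch. The table size of 10 is an assumption inferred from the slot numbers, and `reachable_slots` is a hypothetical helper added for illustration.

```python
def quadratic_probe_insert(table, key, coefficient=2):
    """Insert key using h_i(x) = (x + coefficient*i^2) mod table size.

    Returns the list of slots tried, ending with the slot used.
    """
    size = len(table)
    tried = []
    for i in range(size):
        slot = (key + coefficient * i * i) % size
        tried.append(slot)
        if table[slot] is None:
            table[slot] = key
            return tried
    raise RuntimeError("probe sequence found no free cell")

def reachable_slots(home, size, coefficient=2):
    """Distinct cells the probe sequence can ever visit from a home slot."""
    return sorted({(home + coefficient * i * i) % size for i in range(size)})

# Table size 10 is inferred from the example, not stated in it.
table = [None] * 10
for year in [1492, 1776, 1812, 1945, 1968, 1992]:
    print(year, quadratic_probe_insert(table, year))

print(reachable_slots(home=2, size=10))  # [0, 2, 4]
```

With this coefficient and table size, a key whose home slot is 2 can only ever reach cells 0, 2, and 4. After the six insertions those three cells are full, so another key hashing to slot 2 would fail even though four cells are still empty.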


Double Hashing: f(i) is a second hash function, multiplied by an iterative value

Example: f(i) = i * Hash2(x), where Hash2(x) = 7 - (x mod 7), and Hash(x) = x (the slot numbers below imply a table size of 10)

• insert 1492 (slot 2)
• insert 1776 (slot 6)
• insert 1812 (slot 2 → slot 3)
• insert 1945 (slot 5)
• insert 1968 (slot 8)
• insert 1992 (slot 2 → slot 5 → slot 8 → slot 1)

Problems With Double Hashing:
• A strategic choice must be made for both hashing functions.
• Calculation will be much more expensive in the event of a collision.
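The double hashing example can be traced the same way. This sketch assumes a table size of 10 (inferred from the slot numbers) and reads Hash2(x) as 7 - (x mod 7); the function names are illustrative.

```python
def second_hash(key):
    """Hash2(x) = 7 - (x mod 7); always in the range 1..7, never 0."""
    return 7 - (key % 7)

def double_hash_insert(table, key):
    """Insert key using h_i(x) = (x + i*Hash2(x)) mod table size.

    Returns the list of slots tried, ending with the slot used.
    """
    size = len(table)
    step = second_hash(key)
    tried = []
    for i in range(size):
        slot = (key + i * step) % size
        tried.append(slot)
        if table[slot] is None:
            table[slot] = key
            return tried
    raise RuntimeError("probe sequence found no free cell")

# Table size 10 is inferred from the example, not stated in it.
table = [None] * 10
for year in [1492, 1776, 1812, 1945, 1968, 1992]:
    print(year, double_hash_insert(table, year))
# 1812 and 1992 both start at slot 2 but use steps 1 and 3 respectively.
```

Unlike quadratic probing, two keys with the same home slot generally follow different probe sequences, because the step size depends on the key itself.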


Rehashing

When a hash table starts getting too full, with many delays caused by repeated collisions, rehashing the values into a new, larger table with a new hash function may alleviate the problem.

Insert: Bush2 (2001), Clinton (1993), Bush (1989), Reagan (1981), Carter (1977), Ford (1974), Nixon (1969), Johnson (1963), Kennedy (1961), Eisenhower (1953)

Hash (president) = first_year_in_office mod 11
Table contents: Nixon, Reagan, Clinton, Kennedy, Ford, Johnson, Eisenhower, Carter, Bush, Bush2

Inserting Obama (2009) would cause a collision in slot 7, so…

REHASH

Hash (president) = first_year_in_office mod 23
Table contents: Bush2, Reagan, Obama, Kennedy, Johnson, Bush, Nixon, Clinton, Ford, Eisenhower, Carter
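The presidents example can be sketched as follows. The slides do not say how collisions are resolved within each table; this sketch assumes linear probing with a step of 1, which reproduces the slot-7 collision for Obama in the 11-cell table. The helper names are illustrative.

```python
PRESIDENTS = [("Bush2", 2001), ("Clinton", 1993), ("Bush", 1989),
              ("Reagan", 1981), ("Carter", 1977), ("Ford", 1974),
              ("Nixon", 1969), ("Johnson", 1963), ("Kennedy", 1961),
              ("Eisenhower", 1953)]

def insert(table, name, year):
    """Hash is first_year_in_office mod table size.

    Assumption: collisions are resolved by linear probing (step 1).
    """
    size = len(table)
    for i in range(size):
        slot = (year + i) % size
        if table[slot] is None:
            table[slot] = (name, year)
            return slot
    raise RuntimeError("table is full")

def rehash(old_table, new_size):
    """Re-insert every stored entry into a new, larger table."""
    new_table = [None] * new_size
    for entry in old_table:
        if entry is not None:
            insert(new_table, *entry)
    return new_table

def load_factor(table):
    return sum(entry is not None for entry in table) / len(table)

table = [None] * 11
for name, year in PRESIDENTS:
    insert(table, name, year)

print(table[2009 % 11])              # ('Eisenhower', 1953): slot 7 is taken
print(round(load_factor(table), 3))  # 0.909

table = rehash(table, 23)
insert(table, "Obama", 2009)
print(round(load_factor(table), 3))  # 0.478
```

Every entry has to be re-inserted because its cell depends on the table size, so the old positions cannot simply be copied across.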


Associative Arrays

Hashing is used extensively in modern programming, particularly in database management, network security, and operating systems. Consequently, most modern programming languages have built-in mechanisms for implementing associative arrays, i.e., dictionaries based on the key-value concept of hash tables.

C++ “map”:
#include <map>
#include <string>
using namespace std;

// Note: std::map is an ordered, tree-based container;
// std::unordered_map is the hash-based counterpart with the same [] syntax.
int main()
{
    map<string, string> phone_book;
    phone_book["Sally Smart"] = "555-9999";
    phone_book["John Doe"] = "555-1212";
    phone_book["J. Random Hacker"] = "553-1337";
}

Java “map”:
Map<String, String> phoneBook = new HashMap<String, String>();
phoneBook.put("Sally Smart", "555-9999");
phoneBook.put("John Doe", "555-1212");
phoneBook.put("J. Random Hacker", "555-1337");

Lua “table”:
phone_book = {
    ["Sally Smart"] = "555-9999",
    ["John Doe"] = "555-1212",
    ["J. Random Hacker"] = "553-1337", -- Trailing comma is OK
}

aTable = {
    -- Table as value
    subTable = { 5, 7.5, k = true }, -- key is "subTable"
    -- Function as value
    ['John Doe'] = function (age) if age < 18 then return "Young" else return "Old!" end end,
    -- Table and function (and other types) can also be used as keys
}

Perl “hash”:
%phone_book = (
    'Sally Smart' => '555-9999',
    'John Doe' => '555-1212',
    'J. Random Hacker' => '553-1337',
);

Python “dictionary”:
phonebook = {
    'Sally Smart' : '555-9999',
    'John Doe' : '555-1212',
    'J. Random Hacker' : '553-1337'
}

Ruby “hash”:
phonebook = {
    'Sally Smart' => '555-9999',
    'John Doe' => '555-1212',
    'J. Random Hacker' => '553-1337'
}