CS2420: Lecture 42 Vladimir Kulyukin Computer Science Department Utah State University.
CS2420: Lecture 33 Vladimir Kulyukin Computer Science Department Utah State University.
Motivation
• Recall Big Question 4:
– How can I retrieve/search data efficiently?
• After investigating the balanced binary search trees (AVL, Red-Black), we can ask:
– Is it possible to break the log(n) barrier for insertion and deletion?
Hash Tables
• A hash table is a data structure that was invented as an attempt to break the log(N) insertion and deletion barrier of the balanced binary search trees.
• Conceptually, a hash table is an array of items plus a hash function that maps arbitrary objects to indices of the array.
• A hash function first extracts a key from a given object and then maps the key into a legal array index.
• For example, if an object is an employee record, the key could be the employee’s SSN or the employee’s first and last names.
• Typical keys are numbers and strings.
Hash Functions
• It is impossible to find a hash function that computes distinct indices (two different array cells) for every pair of distinct keys. Why? Because there are infinitely many possible keys, but only finitely many cells in the table.
• Question: What are we to do?
• Answer: Look for hash functions that distribute keys evenly among the cells.
Three Hashing Problems
• Choose a hash function:
– Simple and fast;
– Distributes keys evenly.
• Choose a table size.
• Choose a collision resolution strategy (what to do when several keys are mapped to the same index).
Choosing a Hash Function
• If keys are integers, Key Mod TableSize is a sensible strategy.
• Caveat: Keys should be roughly random and should not share a pattern that interacts badly with TableSize.
• For example, if TableSize = 10 and all keys end in 0, Key Mod TableSize is not a sensible strategy.
Choosing a Table Size
• To avoid uneven key distributions, TableSize is typically chosen to be a prime number.
• When keys are random integers, Key Mod TableSize works fairly well.
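The effect of the table size can be checked with a small experiment. The sketch below is illustrative only: the helper names modHash and distinctCells are hypothetical and do not come from the lecture. It counts how many distinct cells a set of keys occupies under Key Mod TableSize.

```cpp
#include <cassert>
#include <set>
#include <vector>

// Maps an integer key to a cell index in [0, tableSize).
// Assumes tableSize > 0; a negative remainder is shifted into range.
int modHash(int key, int tableSize) {
    assert(tableSize > 0);
    int index = key % tableSize;
    return index < 0 ? index + tableSize : index;
}

// Counts how many distinct cells the given keys occupy.
int distinctCells(const std::vector<int>& keys, int tableSize) {
    std::set<int> cells;
    for (int key : keys) {
        cells.insert(modHash(key, tableSize));
    }
    return static_cast<int>(cells.size());
}
```

For the keys 10, 20, ..., 100 (all ending in 0), distinctCells returns 1 when TableSize = 10, because every key lands in cell 0, and 10 when TableSize = 11, because each key gets its own cell.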
A Hash Function: Example 1
// Adds up the character codes of the key, then reduces the sum modulo tableSize.
int hash(const string& key, int tableSize) {
    int hashVal = 0;
    for (size_t i = 0; i < key.length(); i++) {
        hashVal += key[i];
    }
    return hashVal % tableSize;
}
Comments on hash1
• Easy to compute and fast.
• If the TableSize is large, the function may not distribute keys well.
• Why?
• Suppose TableSize = 10,007 (a prime) and all keys are ASCII strings of length 8 or smaller.
• hash1's range is [0, 127*8 = 1016].
• This is NOT an acceptable distribution: at most 1,017 of the 10,007 cells can ever be used.
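The bound on hash1's range can be checked directly. This is a minimal sketch, assuming the same summing scheme as Example 1; the helper maxHash1Index is hypothetical and exists only for this illustration.

```cpp
#include <cassert>
#include <string>

// Same scheme as Example 1: add up the character codes, then reduce
// the sum modulo tableSize.
int hash1(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    int hashVal = 0;
    for (char c : key) {
        hashVal += c;
    }
    return hashVal % tableSize;
}

// Index returned for a key of maxLen characters that all carry the
// highest ASCII code, 127. For ASCII keys of at most maxLen characters
// the character sum can never exceed this key's sum.
int maxHash1Index(int maxLen, int tableSize) {
    return hash1(std::string(maxLen, static_cast<char>(127)), tableSize);
}
```

With maxLen = 8 and TableSize = 10,007 the helper returns 1016, so cells 1017 through 10,006 are never used. Because addition ignores character order, anagrams such as "ab" and "ba" also hash to the same cell.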
Hash Function: Example 2
$$\mathrm{hash}(Key) = \sum_{i=0}^{KeyLength-1} Key[KeyLength-i-1] \cdot 37^{i}$$

$$= Key[KeyLength-1] + 37 \cdot Key[KeyLength-2] + 37^{2} \cdot Key[KeyLength-3] + \dots + 37^{KeyLength-1} \cdot Key[0]$$
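Hash Function: Example 2
int hash2(const string &key, int tableSize) {
    int hashVal = 0;
    // Horner's rule: evaluates the base-37 polynomial one character at a time.
    for (size_t j = 0; j < key.length(); j++) {
        hashVal = 37 * hashVal + key[j];
    }
    hashVal %= tableSize;
    // Integer overflow can leave hashVal negative; shift it back into range.
    if (hashVal < 0) {
        hashVal += tableSize;
    }
    return hashVal;
}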
Comments On Hash2
• Easy to compute.
• Fast on relatively short keys.
• Distributes keys fairly well.
• Potential problems with very long keys, because there will be lots of integer overflows and collisions.
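For long keys the accumulator in hash2 eventually exceeds what an int can hold. One common way to keep the result well defined is an unsigned accumulator, which wraps around on overflow instead of going negative. The sketch below is a variant written for illustration, not the lecture's code; the name hash2Unsigned is hypothetical, and the cast to unsigned char is an added assumption so that non-ASCII characters are not treated as negative numbers.

```cpp
#include <cassert>
#include <string>

// Base-37 polynomial hash with an unsigned accumulator. Unsigned
// arithmetic wraps around on overflow, so no negative check is needed.
int hash2Unsigned(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    unsigned int hashVal = 0;
    for (char c : key) {
        hashVal = 37 * hashVal + static_cast<unsigned char>(c);
    }
    return static_cast<int>(hashVal % static_cast<unsigned int>(tableSize));
}
```

For short ASCII keys this gives the same index as the polynomial above, for example 37*97 + 98 = 3687 for "ab" with TableSize = 10,007. For keys long enough to overflow, the index still falls in [0, TableSize).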
Collision Resolution
A collision occurs when an element is inserted under a key that hashes to a cell that is already occupied by a different element.
Separate Chaining
• Separate chaining keeps a list of all elements whose keys hash to the same index.
• What does it mean?
• Under separate chaining, a hash table is an array of lists.
• The term “lists” is used rather loosely in the previous statement. It can be an array of AVL search trees or an array of hash tables. But the linked list remains the most common choice.
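The idea of an array of lists can be sketched as follows. This is a minimal illustration for string keys only; the class name ChainedStringTable, its member functions, and the choice of a base-37 hash with an unsigned accumulator are assumptions of the sketch rather than the lecture's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <vector>

// Minimal separate-chaining table: one linked list (chain) per cell.
class ChainedStringTable {
public:
    explicit ChainedStringTable(std::size_t tableSize) : m_Lists(tableSize) {
        assert(tableSize > 0);
    }

    // Inserts key unless it is already present; returns true if inserted.
    bool insert(const std::string& key) {
        std::list<std::string>& chain = m_Lists[indexOf(key)];
        if (std::find(chain.begin(), chain.end(), key) != chain.end()) {
            return false;
        }
        chain.push_back(key);
        return true;
    }

    // Searches only the one chain that the key hashes to.
    bool contains(const std::string& key) const {
        const std::list<std::string>& chain = m_Lists[indexOf(key)];
        return std::find(chain.begin(), chain.end(), key) != chain.end();
    }

    // Removes key if present; returns true if something was removed.
    bool remove(const std::string& key) {
        std::list<std::string>& chain = m_Lists[indexOf(key)];
        auto it = std::find(chain.begin(), chain.end(), key);
        if (it == chain.end()) {
            return false;
        }
        chain.erase(it);
        return true;
    }

private:
    // Base-37 polynomial hash reduced to a cell index.
    std::size_t indexOf(const std::string& key) const {
        unsigned int hashVal = 0;
        for (char c : key) {
            hashVal = 37 * hashVal + static_cast<unsigned char>(c);
        }
        return hashVal % m_Lists.size();
    }

    std::vector<std::list<std::string>> m_Lists;
};
```

Constructing the table with a single cell forces every key into the same chain, which is a convenient way to see that colliding keys coexist and are still found, inserted, and removed correctly.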
Hash Table: Implementation
template <class T>
class CHashTable {
    …
private:
    vector<list<T> > m_Lists;
    int m_Size;
    …
};

int hash(const string &key) { … }
Hash Table: Implementation
class CEmployee {
private:
    string m_Name;
    double m_Salary;
    …
};

// Hashes an employee record by hashing its key, the employee's name.
int hash(const CEmployee &x) {
    return hash(x.GetName());
}
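The pattern above, extracting a key from the record and then hashing the key, can be sketched as a complete example. Everything here is hypothetical and chosen for illustration: the class EmployeeRecord, the accessor names, the function name hashKey, and the explicit tableSize parameter are not taken from the lecture's elided code.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical employee record; the name serves as the key.
class EmployeeRecord {
public:
    EmployeeRecord(std::string name, double salary)
        : m_Name(std::move(name)), m_Salary(salary) {}
    const std::string& GetName() const { return m_Name; }
    double GetSalary() const { return m_Salary; }

private:
    std::string m_Name;
    double m_Salary;
};

// Hashes a string key (base-37 polynomial, unsigned accumulator).
int hashKey(const std::string& key, int tableSize) {
    assert(tableSize > 0);
    unsigned int hashVal = 0;
    for (char c : key) {
        hashVal = 37 * hashVal + static_cast<unsigned char>(c);
    }
    return static_cast<int>(hashVal % static_cast<unsigned int>(tableSize));
}

// Hashes a record by extracting its key and delegating to the string hash.
int hashKey(const EmployeeRecord& x, int tableSize) {
    return hashKey(x.GetName(), tableSize);
}
```

Two records with the same name hash to the same cell regardless of salary, because only the key takes part in the computation.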