Chapter 11 Hash Anshuman Razdan Div of Computing Studies razdan@asu.edu http://dcst2.east.asu.edu/~razdan/cst230/


Chapter 11: Hash

Anshuman Razdan
Div of Computing Studies
razdan@asu.edu
http://dcst2.east.asu.edu/~razdan/cst230/


CST 230 - Razdan et al. 2

Searching

• Searching for a specific value among a collection of values is a common operation.

• Complexity of search/find using:
  – array
  – linked list
  – ordered list
  – binary tree
  – BST


Linear Search

• search an array A of n elements for a specified element target

i = 0; found = false;
while( (i < n) && !found )
    if( A[ i ] == (or equals) target )
        found = true;
    else
        i++;
if( found ) target is at position i
else target is not in array
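The pseudocode above can be sketched in Java roughly as follows. This is a minimal illustration, not the course's official code: the class name LinearSearchDemo and the choice to return -1 for "not in array" are assumptions made for the example.

```java
public class LinearSearchDemo {
    // Returns the index of target in a, or -1 if target is not present.
    // (Returning -1 is an assumption; the slide only says "not in array".)
    static int linearSearch(int[] a, int target) {
        for (int i = 0; i < a.length; i++) {
            if (a[i] == target) {
                return i;   // found: target is at position i
            }
        }
        return -1;          // examined all n elements without a match
    }

    public static void main(String[] args) {
        int[] a = {432, 321, 17, 65};
        System.out.println(linearSearch(a, 17));  // prints 2
        System.out.println(linearSearch(a, 99));  // prints -1
    }
}
```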


Complexity of Linear Search

• Count the number of comparisons that must be done.

• Worst Case

• Average Case
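One way to explore the two cases is to count comparisons directly. The sketch below is illustrative only: the class and method names are hypothetical, and the average assumes distinct elements with the target equally likely to be at any position. Under those assumptions an unsuccessful search of n elements makes n comparisons, and a successful one makes (n + 1) / 2 on average.

```java
public class LinearSearchCost {
    // Number of element comparisons a linear search makes before it stops.
    static int comparisons(int[] a, int target) {
        int count = 0;
        for (int i = 0; i < a.length; i++) {
            count++;                 // one comparison of a[i] with target
            if (a[i] == target) {
                break;
            }
        }
        return count;
    }

    // Average comparisons over one successful search per stored element
    // (assumes the elements of a are distinct).
    static double averageSuccessful(int[] a) {
        long total = 0;
        for (int x : a) {
            total += comparisons(a, x);
        }
        return (double) total / a.length;
    }

    public static void main(String[] args) {
        int[] a = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(comparisons(a, -1));     // worst case: prints 10
        System.out.println(averageSuccessful(a));   // prints 5.5
    }
}
```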


Binary Search

• search a sorted array A of n elements for a specified element target

public static int BinarySearch( int[] A, int first, int n, int target ){
    int found;    // index of target, or -1 if it is absent
    int middle;
    if( n <= 0 )
        found = -1;    // empty range: target is not here
    else{
        middle = first + n/2;
        if( target == A[middle] )
            found = middle;
        else if( target < A[middle] )
            found = BinarySearch( A, first, n/2, target );          // left half
        else
            found = BinarySearch( A, middle+1, (n-1)/2, target );   // right half
    }
    return found;
}


Complexity of BinarySearch

• BinarySearch body has constant time – so we need to count the number of calls made to BinarySearch

• Find the depth of recursive calls – the length of the longest chain of recursive calls in the execution of an algorithm.
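To see how the depth grows, one can instrument the recursion with a call counter. The sketch below is an illustration under assumptions: the static counter calls and the helper callsFor are hypothetical names added for the experiment, and the search is for a value smaller than every element so the recursion always takes the left half. Each call halves n, so the chain has about log2(n) calls plus the final call on an empty range.

```java
public class BinarySearchDepth {
    static int calls;   // hypothetical counter, reset before each experiment

    static int binarySearch(int[] a, int first, int n, int target) {
        calls++;
        if (n <= 0) {
            return -1;
        }
        int middle = first + n / 2;
        if (target == a[middle]) {
            return middle;
        }
        if (target < a[middle]) {
            return binarySearch(a, first, n / 2, target);
        }
        return binarySearch(a, middle + 1, (n - 1) / 2, target);
    }

    // Builds the sorted array 0..n-1, searches it, and returns the call count.
    static int callsFor(int n, int target) {
        int[] a = new int[n];
        for (int i = 0; i < n; i++) {
            a[i] = i;
        }
        calls = 0;
        binarySearch(a, 0, n, target);
        return calls;
    }

    public static void main(String[] args) {
        // n shrinks 1000, 500, 250, 125, 62, 31, 15, 7, 3, 1, 0
        System.out.println(callsFor(1000, -1));   // prints 11
    }
}
```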


Motivation: Direct Access is Fast

• Suppose we have a large number of products to store and that each product has a unique product ID.

• If n products have IDs in the range 0..n-1, we can store each product in an array at index prodID.
  – time to find product?

• If the number of IDs is much smaller than the range of IDs, storing each product at index prodID is VERY space inefficient.


Hashing

• Each element has a unique key that identifies the element.
• We have: large range of keys
• We want: index of elements to be 0..numElem-1

[Figure: a hash function maps keys key1, key2, key3, key4, ..., keyn to array indices 0, 1, 2, 3, ..., n-1]


Common hashing function: Mod

• The mod function is a natural choice for hashing because x mod n always results in a number in the range 0 .. n-1.

• E.g., Insert the following numbers into a hash table of size 10: 432, 321, 17, 65, 9388, 200, 83, 564
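The exercise can be checked with a short program. This is a sketch, not the course's code: the class name ModHashDemo and its helpers are hypothetical, keys are assumed non-negative, and collision handling is deliberately left out because these particular keys all land on different indices (432 at 2, 321 at 1, 17 at 7, 65 at 5, 9388 at 8, 200 at 0, 83 at 3, 564 at 4).

```java
public class ModHashDemo {
    // Mod hash; assumes key >= 0 so the result is in 0..tableSize-1.
    static int hash(int key, int tableSize) {
        return key % tableSize;
    }

    // Places each key at its hash index. The sketch assumes no two keys
    // collide and throws if that assumption is violated.
    static Integer[] insertAll(int[] keys, int tableSize) {
        Integer[] table = new Integer[tableSize];
        for (int key : keys) {
            int i = hash(key, tableSize);
            if (table[i] != null) {
                throw new IllegalStateException("collision at index " + i);
            }
            table[i] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        int[] keys = {432, 321, 17, 65, 9388, 200, 83, 564};
        Integer[] table = insertAll(keys, 10);
        for (int i = 0; i < table.length; i++) {
            System.out.println(i + ": " + table[i]);   // null marks an empty slot
        }
    }
}
```

Indices 6 and 9 stay empty for this input.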


Collisions

• A perfect hashing function will produce a different index for every key.
• Unfortunately, mod is NOT perfect.
  – 20 mod 10 = 0
  – 520 mod 10 = 0
  – 1030 mod 10 = 0
  – etc.

• When two (or more) distinct keys hash to the same index, we have a collision.

• There are various methods used to deal with collisions.


Open-address Hashing

• One method to deal with collisions is open-addressing:
  – compute hash(key)
  – if data[hash(key)] is not occupied, insert key.
  – else search forward starting at index hash(key) + 1 until a vacant position is found and insert key. (Note: array is circular, so that after the last index of the array is tried, index 0 is tried next.)

• This method is also called “linear probing”


Example

• Insert keys 89, 18, 49, 58, and 9 into a hash table of size 10.
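A small program can trace this example. The sketch assumes hash(key) = key mod capacity, non-negative keys, and fewer keys than slots (otherwise the probe loop would never find a vacancy); the class name LinearProbingDemo is hypothetical.

```java
import java.util.Arrays;

public class LinearProbingDemo {
    // Inserts each key with linear probing; null marks a vacant slot.
    // Assumes keys are non-negative and keys.length <= capacity.
    static Integer[] insertAll(int[] keys, int capacity) {
        Integer[] table = new Integer[capacity];
        for (int key : keys) {
            int i = key % capacity;          // home index
            while (table[i] != null) {       // occupied: try the next slot
                i = (i + 1) % capacity;      // wrap around to index 0
            }
            table[i] = key;
        }
        return table;
    }

    public static void main(String[] args) {
        Integer[] table = insertAll(new int[]{89, 18, 49, 58, 9}, 10);
        System.out.println(Arrays.toString(table));
        // prints [49, 58, 9, null, null, null, null, null, 18, 89]
    }
}
```

Here 89 and 18 go to their home indices 9 and 8. 49 collides at 9 and wraps to 0, 58 probes 8, 9, 0 and lands at 1, and 9 probes 9, 0, 1 and lands at 2.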


Hashing non-integer keys

• Many applications require collections of objects with non-integer keys (often Strings).

• an encoding function converts the key to an integer, and the hash function is performed on the encoding.

• all Java classes (objects) include a method called hashCode.

• Note: keys must be unique – so encoding of keys must be unique as well. This is very important when designing an encoding scheme.
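As an illustration of using hashCode as the encoding, the sketch below maps String keys to table indices. Two caveats are worth seeing in practice: Java's built-in String.hashCode does not guarantee distinct values for distinct strings (for example "Aa" and "BB" both encode to 2112), so a table that relies on it still has to compare keys with equals; and Math.abs(Integer.MIN_VALUE) is still negative, so this sketch uses Math.floorMod (Java 8 and later) instead of the Math.abs-then-% pattern. The class and method names are hypothetical.

```java
public class StringHashDemo {
    // Maps any key to an index in 0..tableSize-1, even when hashCode()
    // is negative. floorMod is a substitution for Math.abs(...) % size.
    static int index(Object key, int tableSize) {
        return Math.floorMod(key.hashCode(), tableSize);
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());   // prints 2112
        System.out.println("BB".hashCode());   // prints 2112 (same encoding)
        System.out.println(index("Aa", 10));   // prints 2
        System.out.println(index("BB", 10));   // prints 2 (a collision)
    }
}
```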


Hashtable methods

• Common Hashtable methods are:
  – put: put a new object into the table
  – containsKey: search for object with specified key (returns boolean)
  – get: retrieve an object for a specified key
  – remove: removes an object with a specified key
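For comparison, the standard library class java.util.Hashtable offers the same four operations. The snippet below is a usage sketch only; the class name HashtableMethodsDemo and the sample data are made up for illustration.

```java
import java.util.Hashtable;

public class HashtableMethodsDemo {
    // Builds a small table with two entries (sample data, for illustration).
    static Hashtable<Integer, String> sample() {
        Hashtable<Integer, String> table = new Hashtable<>();
        table.put(29, "Barb");
        table.put(19, "Mateo");
        return table;
    }

    public static void main(String[] args) {
        Hashtable<Integer, String> table = sample();
        System.out.println(table.containsKey(29));  // prints true
        System.out.println(table.get(19));          // prints Mateo
        System.out.println(table.remove(19));       // prints Mateo (the removed value)
        System.out.println(table.containsKey(19));  // prints false
        System.out.println(table.get(7));           // prints null (no such key)
    }
}
```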


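Example Implementation

public class Hashtable{

    private int manyItems;           // number of elements currently stored
    private Object[] keys;           // keys[i] is the key stored at index i
    private Object[] data;           // data[i] is the element stored with keys[i]
    private boolean[] hasBeenUsed;   // true once index i has ever held a key

    // home index for a key
    private int hash(Object key){
        return Math.abs(key.hashCode()) % data.length;
    }

    // next index to probe, wrapping around to 0
    private int nextIndex(int i){
        return (i+1) % data.length;
    }
    ...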


Constructor

public Hashtable( int capacity ){
    if( capacity <= 0 )
        throw new IllegalArgumentException("Capacity must be positive.");
    keys = new Object[capacity];
    data = new Object[capacity];
    hasBeenUsed = new boolean[capacity];
}


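findIndex

private int findIndex( Object key ){
    int count = 0;        // number of positions examined so far
    int i = hash(key);    // start at the key's home index
    int retVal = -1;      // -1 means "not found"
    // stop after a full cycle, at a never-used slot, or when the key is found
    while( (count < data.length) && (hasBeenUsed[i]) && (retVal == -1) ){
        if( key.equals(keys[i]) )
            retVal = i;
        count++;
        i = nextIndex(i);
    }
    return retVal;
}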


put

public Object put(Object key, Object element){
    int index = findIndex(key);
    Object answer = null;
    if( index != -1 ){
        // key already present: replace its element and return the old one
        answer = data[index];
        data[index] = element;
    }
    else if( manyItems < data.length ){
        // new key: probe forward from the home index to a vacant position
        index = hash(key);
        while( keys[index] != null )
            index = nextIndex(index);
        keys[index] = key;
        data[index] = element;
        hasBeenUsed[index] = true;
        manyItems++;
    }
    else
        throw new IllegalStateException("Table is full");
    return answer;
}


remove

public Object remove( Object key ){
    int index = findIndex( key );
    Object answer = null;
    if( index != -1 ){
        answer = data[index];
        keys[index] = null;
        data[index] = null;
        // hasBeenUsed[index] stays true so later searches probe past this slot
        manyItems--;
    }
    return answer;
}


get

public Object get( Object key ){
    int index = findIndex( key );
    Object answer = null;

    if( index != -1 ){
        answer = data[index];
    }

    return answer;
}


containsKey

public boolean containsKey( Object key ){

}


Example

• Show state of Hashtable after the following are performed (assume hashCode of an integer is the integer itself):
  – construct Hashtable with capacity 10
  – put( new Integer(29), "Barb" )
  – put( new Integer(19), "Mateo" )
  – put( new Integer(9), "Eddie" )
  – remove( new Integer(19) )
  – containsKey( new Integer(9) )
  – put( new Integer(30), "Jerry" )
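One way to check an answer is to assemble the slide methods into a single runnable class and replay the operations. The sketch below follows the slides' logic, but several things are assumptions made for the example: the class is renamed ProbingTable to avoid clashing with java.util.Hashtable, keyAt and runSlideExample are hypothetical helpers for inspection, the containsKey body is just one possible completion, and Integer.valueOf replaces the deprecated new Integer(...).

```java
public class ProbingTable {
    private int manyItems;
    private final Object[] keys;
    private final Object[] data;
    private final boolean[] hasBeenUsed;

    public ProbingTable(int capacity) {
        if (capacity <= 0) throw new IllegalArgumentException("Capacity must be positive.");
        keys = new Object[capacity];
        data = new Object[capacity];
        hasBeenUsed = new boolean[capacity];
    }

    private int hash(Object key) { return Math.abs(key.hashCode()) % data.length; }
    private int nextIndex(int i) { return (i + 1) % data.length; }

    // Index of key, or -1; stops at the first slot that was never used.
    private int findIndex(Object key) {
        int i = hash(key);
        for (int count = 0; count < data.length && hasBeenUsed[i]; count++) {
            if (key.equals(keys[i])) return i;
            i = nextIndex(i);
        }
        return -1;
    }

    public Object put(Object key, Object element) {
        int index = findIndex(key);
        if (index != -1) {                        // existing key: replace element
            Object old = data[index];
            data[index] = element;
            return old;
        }
        if (manyItems >= data.length) throw new IllegalStateException("Table is full");
        index = hash(key);
        while (keys[index] != null) index = nextIndex(index);   // linear probing
        keys[index] = key;
        data[index] = element;
        hasBeenUsed[index] = true;
        manyItems++;
        return null;
    }

    public Object remove(Object key) {
        int index = findIndex(key);
        if (index == -1) return null;
        Object answer = data[index];
        keys[index] = null;                       // hasBeenUsed stays true
        data[index] = null;
        manyItems--;
        return answer;
    }

    // One possible completion of the method left blank on the slide.
    public boolean containsKey(Object key) { return findIndex(key) != -1; }

    public Object get(Object key) {
        int index = findIndex(key);
        return (index == -1) ? null : data[index];
    }

    // Hypothetical helper so the final layout can be inspected.
    public Object keyAt(int i) { return keys[i]; }

    static ProbingTable runSlideExample() {
        ProbingTable t = new ProbingTable(10);
        t.put(Integer.valueOf(29), "Barb");       // home index 9
        t.put(Integer.valueOf(19), "Mateo");      // 9 occupied, wraps to 0
        t.put(Integer.valueOf(9), "Eddie");       // 9 and 0 occupied, goes to 1
        t.remove(Integer.valueOf(19));            // index 0 vacated, still marked used
        t.containsKey(Integer.valueOf(9));        // true: search probes past index 0
        t.put(Integer.valueOf(30), "Jerry");      // home index 0 is vacant again
        return t;
    }

    public static void main(String[] args) {
        ProbingTable t = runSlideExample();
        for (int i = 0; i < 10; i++) {
            System.out.println(i + ": " + t.keyAt(i) + " -> " + t.get(t.keyAt(i) == null ? "" : t.keyAt(i)));
        }
    }
}
```

Under these assumptions the final table holds 30/"Jerry" at index 0, 9/"Eddie" at index 1 and 29/"Barb" at index 9, with every other index empty.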


Linear probing and clustering

• In linear probing, when several keys hash to same index a “cluster” of values forms around the index.

• elements take longer to find/add because we must move linearly through entire cluster.

• elements are put farther and farther away from desired index.

• need other methods that avoid clustering.


Double Hashing

• The most common technique to avoid clustering is double hashing:
  – use hash function hash1 to determine desired index of element.
  – if collision occurs, use hash function hash2 to determine next index to search for open spot.

• In particular, if index i is occupied, the next index to examine is: (i + hash2(key) ) % data.length


choosing hash2

• as we step through the array, we must ensure that every array position is examined.
• we must choose hash2 to prevent returning to original hash index before visiting entire array.
• Array capacity & hash2 value should be relatively prime. One way to accomplish this:
  – choose data.length as a prime number and have hash2 return values from range 1 .. data.length - 1

• Donald Knuth’s suggestion:
  – both data.length and data.length - 2 should be prime numbers (called twin primes) e.g. 1231 and 1229
  – hash1(key) = Math.abs(key.hashCode()) % data.length
  – hash2(key) = 1 + (Math.abs(key.hashCode()) % (data.length - 2))
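The two hash functions above can be tried out on a small table. The sketch below uses the twin primes 13 and 11 instead of 1231 and 1229 so the whole probe sequence fits on one line; the class name and the helpers probeSequence and visitsEverySlot are hypothetical. Because 13 is prime and hash2 returns a step between 1 and 11, the step and the table length are relatively prime, so the sequence should reach every index before repeating.

```java
public class DoubleHashDemo {
    static int hash1(Object key, int length) {
        return Math.abs(key.hashCode()) % length;
    }

    // Step size in 1..length-2, never 0, so probing always moves.
    static int hash2(Object key, int length) {
        return 1 + (Math.abs(key.hashCode()) % (length - 2));
    }

    // The first `length` indices examined for key, starting at hash1(key).
    static int[] probeSequence(Object key, int length) {
        int[] seq = new int[length];
        int i = hash1(key, length);
        int step = hash2(key, length);
        for (int k = 0; k < length; k++) {
            seq[k] = i;
            i = (i + step) % length;
        }
        return seq;
    }

    // True if the probe sequence touches every index exactly once.
    static boolean visitsEverySlot(Object key, int length) {
        boolean[] seen = new boolean[length];
        for (int i : probeSequence(key, length)) {
            if (seen[i]) return false;
            seen[i] = true;
        }
        return true;
    }

    public static void main(String[] args) {
        // Key 29 in a table of length 13: hash1 = 3, hash2 = 1 + 29 % 11 = 8
        for (int i : probeSequence(Integer.valueOf(29), 13)) {
            System.out.print(i + " ");   // prints 3 11 6 1 9 4 12 7 2 10 5 0 8
        }
        System.out.println();
        System.out.println(visitsEverySlot(Integer.valueOf(29), 13));   // prints true
    }
}
```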


Chained Hashing

• In chaining, we essentially allow collisions to occur, and store more than one element at a given array index.

• How can we store more than one element?
  – list
  – ordered list
  – bst

• If the hash function equally distributes keys over the array, the chains at each index should be relatively short.
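A minimal sketch of the "list" option is shown below: each array index holds a linked list of the keys that hash there. The class name ChainedTable, its method names, and the restriction to int keys are assumptions made to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

public class ChainedTable {
    private final List<List<Integer>> chains = new ArrayList<>();

    public ChainedTable(int capacity) {
        for (int i = 0; i < capacity; i++) {
            chains.add(new LinkedList<>());   // one (initially empty) chain per index
        }
    }

    private List<Integer> chainFor(int key) {
        return chains.get(Math.floorMod(key, chains.size()));
    }

    // Adds key to its chain unless it is already there.
    public void add(int key) {
        List<Integer> chain = chainFor(key);
        if (!chain.contains(key)) {
            chain.add(key);
        }
    }

    // Only the one chain at the key's index has to be searched.
    public boolean contains(int key) {
        return chainFor(key).contains(key);
    }

    public int chainLength(int index) {
        return chains.get(index).size();
    }

    public static void main(String[] args) {
        ChainedTable table = new ChainedTable(10);
        for (int key : new int[]{20, 520, 1030, 17}) {
            table.add(key);
        }
        System.out.println(table.chainLength(0));   // prints 3 (20, 520, 1030 collide)
        System.out.println(table.chainLength(7));   // prints 1
        System.out.println(table.contains(520));    // prints true
    }
}
```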


Time Analysis

• Worst case for hashing is when all keys hash to same index (linear)

• Best case for hashing is when all keys hash to different indices (constant)

• Average case analysis gives a better picture of what happens in reality.


Load Factor

• The load factor \alpha for a hash table is defined as:

\[ \alpha = \frac{\text{number of elements in the table}}{\text{size of the table's array}} \]

• For open-address hashing, \alpha <= 1.

• For chaining, \alpha could be larger than 1.


Average Time (Linear Probing)

• In open-address hashing with linear probing, a nonfull hash table and no removals, the average number of table elements examined is about

\[ \frac{1}{2}\left(1 + \frac{1}{1 - \alpha}\right) \]

where \alpha is the load factor.

• For example, suppose we have 800 items in a table of capacity 1000. How many entries will we examine on average?


Average Time (Double Hashing)

• In open-address hashing with double hashing, a nonfull hash table, and no removals, the average number of elements examined is about:

\[ \frac{-\ln(1 - \alpha)}{\alpha} \]

where \alpha is the load factor.

• How many comparisons for previous example?


Average Time (Chaining)

• In chained hashing, the average number of table elements examined is about:

\[ 1 + \frac{\alpha}{2} \]

where \alpha is the load factor.

• How many for previous example?
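The three example questions can be answered numerically. The sketch below assumes the usual textbook estimates in terms of the load factor alpha: (1/2)(1 + 1/(1 - alpha)) for linear probing, -ln(1 - alpha)/alpha for double hashing, and 1 + alpha/2 for chaining. The class and method names are hypothetical.

```java
public class AverageProbes {
    // Linear probing: (1/2) * (1 + 1 / (1 - alpha)), valid for alpha < 1.
    static double linearProbing(double alpha) {
        return 0.5 * (1.0 + 1.0 / (1.0 - alpha));
    }

    // Double hashing: -ln(1 - alpha) / alpha, valid for 0 < alpha < 1.
    static double doubleHashing(double alpha) {
        return -Math.log(1.0 - alpha) / alpha;
    }

    // Chaining: 1 + alpha / 2; alpha may exceed 1.
    static double chaining(double alpha) {
        return 1.0 + alpha / 2.0;
    }

    public static void main(String[] args) {
        double alpha = 800.0 / 1000.0;   // 800 items, capacity 1000
        System.out.printf("linear probing: %.2f%n", linearProbing(alpha));
        System.out.printf("double hashing: %.2f%n", doubleHashing(alpha));
        System.out.printf("chaining:       %.2f%n", chaining(alpha));
    }
}
```

With 800 items in a table of capacity 1000 (alpha = 0.8), these estimates give about 3.0 entries for linear probing, about 2.0 for double hashing, and 1.4 for chaining.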


Java Data Structures

• the java.util package includes the following classes (see http://java.sun.com/j2se/1.4.2/docs/api/ )
  – HashMap
  – Hashtable
  – LinkedList

• as well as interfaces:
  – Iterator
  – ListIterator