Hash Tables, Binary Search Treescs.boisestate.edu/.../March06_hashtables_S19.pdf · Key things in...

Hash Tables CS 321 Spring 2019

Hash Tables

CS 321 Spring 2019

Today’s Topics

• Hash Tables

Trouble with Arrays

• What are some troubles with arrays?
– Arrays can only store data by a numeric index.
– Arrays can waste a lot of space but are fast.
– How can I store data by a more general ‘key’?
– What is a real-world example of an object sorted by a non-numeric key with associated data?
• Why?
– Keep track of a player’s equipment in a game:
• Shirt: diamond armor.
• Legs: chainmail.
• Head: gold helmet.
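One way to store data under a non-numeric key in Java is the standard library's `java.util.HashMap`, which is itself a hash table. The sketch below stores the player equipment from the slide under `String` keys. The class name `PlayerEquipment` and the method `sample()` are made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical example: equipment slots looked up by a String key.
final class PlayerEquipment {
    // Builds the equipment table from the slide.
    static Map<String, String> sample() {
        Map<String, String> equipment = new HashMap<>();
        equipment.put("Shirt", "diamond armor");
        equipment.put("Legs", "chainmail");
        equipment.put("Head", "gold helmet");
        return equipment;
    }

    public static void main(String[] args) {
        Map<String, String> equipment = sample();
        System.out.println(equipment.get("Head"));         // gold helmet
        System.out.println(equipment.containsKey("Feet")); // false
    }
}
```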

Specific Goals for our Solution.

• Lookups should be very quick:

– O(1) if at all possible or as close as possible.

– As few steps as possible to find.

• Inserts and deletes should be fast, like arrays.

• We will assume that objects use unique keys:

– A key may be a single value.

– Or may be created from multiple values.

– We will only consider single value keys.

Common Solution: Hash Table

• A data structure that holds ‘values’ indexed by ‘keys’. Keys are usually strings.

• The location of the ‘value’ for a given ‘key’ is found by passing the ‘key’ to a ‘hash function’ that returns an index to the correct ‘value’.

• Hash Tables can be seen as a form of dictionary.

• Generically, tables of key/value pairs.

• Standard implementations are extremely efficient: close to O(1) for all operations.

What About Other Data Structures?

• Must have: Insert(), Delete() and Find():

• Arrays:

– can accomplish in O(1) time

– but are not space efficient (assumes we leave empty space for keys not currently in dictionary)

• Binary search trees

– can accomplish in O(log n) time; we want faster.

– are space efficient.

• Hash Tables:

– With constraints, are ~O(1) for Insert/Delete/Find

Example Array

• Use SSN for the key.

• Use an Array to hold:

– Use an array with range 0 - 999,999,999

– Using the SSN as a key, you have O(1) access to any person object

• Unfortunately, the number of active keys (Social Security Numbers) is much less than the array size (1 billion entries)

– Est. US population, Oct. 20th 2004: 294,564,209

– About 70% of the array would be unused (294,564,209 of 1,000,000,000 entries in use)

– But would be fast and fit in memory.

Hash Table Solution

• Hashing your SSN yields an index into a table.

– The hash function must choose a good index.

• Very useful:

– When ID numbers are widely spread out

– When you don’t need access in ID order

– Fits our SSN example.

Hash Table – abstract data type.

• Core methods for a Hash Table:

– Insert(key,value) – ~O(1), add key and value.

– Delete(key) – ~O(1), remove key and value.

– Search/Find(key) – ~O(1), find key and value in table.

• Internal method – critical method:

– Hash(key) – O(1), compute an index for the given key.

Hash Tables – Conceptual View

[Figure: a table whose buckets are indexed 0 to 7 by hash value/index, holding Obj1 (key=15), Obj2 (key=30), Obj3 (key=4), Obj4 (key=2), and Obj5 (key=1).]

Index = hash(key); 7 = hash(15);

Hash index/value

• A hash value or hash index is used to index the hash table (array)

• A hash function takes a key and returns a hash value/index

– The hash index is an integer (to index an array)

• The key is a specific value associated with a specific object being stored in the hash table

– It is important that the key remain constant for the lifetime of the object

Hash Functions & insert(…)

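• Usage summary:

int hashValue = hashFunction(key);   // key is an int

– Or hashValue = hashFunction(key);  // key is a String

– Or hashValue = hashFunction(item); // item is an itemType

• Insert method:

public void insert(int key, itemType item)
{
    // Compute the table index for this key.
    int hashValue = hashFunction(key);
    // Store the item; this overwrites whatever was already at that index.
    table[hashValue] = item;
}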

Hash Function Requirements

• You want a hash function/algorithm that is:

– Fast

– Distributes keys throughout the table.

• Hash functions can use as input

– Integer key values

– String key values

– Multipart key values

• Multipart fields, and/or

• Multiple fields

Simple Hash Function: Mod

• Stands for modulo: Remainder of X/Y in integer arithmetic.

• Example Mod results.

– 8 mod 5 = 3

– 9 mod 5 = 4

– 10 mod 5 = 0

– 15 mod 5 = 0

• key mod M = 0 whenever key = M*c for some integer c

• What if M is prime and keys != M*c?
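The mod results above can be checked in Java with the `%` operator. One caveat: Java's `%` keeps the sign of the dividend, so a negative key would give a negative index. The sketch below uses `Math.floorMod` to stay in the range 0 to M-1. The class name `ModDemo` and the method `modIndex` are illustrative only.

```java
final class ModDemo {
    // Maps any int key to an index in [0, m), even when the key is negative.
    static int modIndex(int key, int m) {
        return Math.floorMod(key, m);
    }

    public static void main(String[] args) {
        System.out.println(8 % 5);            // 3
        System.out.println(9 % 5);            // 4
        System.out.println(10 % 5);           // 0
        System.out.println(15 % 5);           // 0
        System.out.println(-7 % 5);           // -2: % keeps the sign of the dividend
        System.out.println(modIndex(-7, 5));  // 3
    }
}
```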

Hash Tables: Insert Example

For example, if we hash keys 0…1000 into a hash table with 5 entries and use h(key) = key mod 5, we get the following sequence of events:

Insert 2
index  key  data
0
1
2      2    …
3
4

Insert 21
index  key  data
0
1      21   …
2      2    …
3
4

Insert 34
index  key  data
0
1      21   …
2      2    …
3
4      34   …

Insert 54
There is a collision at array entry #4 ???

Dealing with Collisions

• A problem arises when two keys hash to the same array entry – this is called a collision.

• There are two ways to resolve collisions:

– Hashing with Chaining (a.k.a. “Separate Chaining”): every hash table entry contains a pointer to a linked list of keys that hash to the same entry

– Hashing with Open Addressing: every hash table entry contains only one key. If a new key hashes to a table entry that is filled, systematically examine other table entries until you find an empty entry in which to place the new key

Hashing with Chaining

The problem is that keys 34 and 54 hash to the same entry (4). We solve this collision by placing all keys that hash to the same hash table entry in a chain (linked list) or bucket (array) pointed to by this entry:

Insert 54
index  keys
0
1      21
2      2
3
4      34 and 54 (CHAIN)

Insert 101
index  keys
0
1      21 and 101 (CHAIN)
2      2
3
4      34 and 54 (CHAIN)
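A minimal sketch of chaining for the example above, assuming int keys, h(key) = key mod table size, and new keys added at the head of each chain. The class `ChainedHashTable` and its method names are hypothetical. It stores keys only, with no associated data, and assumes the caller never inserts the same key twice.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Sketch of separate chaining for int keys; names are illustrative.
final class ChainedHashTable {
    private final List<LinkedList<Integer>> buckets = new ArrayList<>();

    ChainedHashTable(int size) {
        for (int i = 0; i < size; i++) {
            buckets.add(new LinkedList<>());
        }
    }

    private int hash(int key) {
        return Math.floorMod(key, buckets.size());
    }

    // Adds the key at the head of its chain; assumes the key is not already present.
    void insert(int key) {
        buckets.get(hash(key)).addFirst(key);
    }

    // Walks only the one chain the key hashes to.
    boolean contains(int key) {
        return buckets.get(hash(key)).contains(key);
    }

    // Removes the key from its chain; returns false if it was not there.
    boolean delete(int key) {
        return buckets.get(hash(key)).remove(Integer.valueOf(key));
    }

    // Returns a copy of the chain stored at the given table index.
    List<Integer> chainAt(int index) {
        return new ArrayList<>(buckets.get(index));
    }
}
```

With a table of size 5, inserting 2, 21, 34, 54, and 101 in that order leaves 54 and 34 chained at index 4 and 101 and 21 chained at index 1.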

Hashing with Chaining

• What is the running time for insert/search/delete?

– Insert: It takes O(1) time to compute the hash function and insert at head of linked list

– Search: In the worst case, it is proportional to the length of the longest linked list

– Delete: Same as search

• Therefore, in the unfortunate event that we have a “bad” hash function, all n keys may hash to the same table entry, giving an O(n) run-time!

So how can we create a “good” hash function?

Choosing a Hash Function – 1

• Uniform Hashing = keys distributed throughout table.

• Choosing a good hash function requires taking into account the kind of data that will be used.

– The statistics of the key distribution need to be accounted for

– E.g., Choosing the first letter of a last name will likely cause lots of collisions depending on the nationality of the population

• Many programming systems have hash functions built in

Choosing a Hash Function – 2

• Division/modulo method

– key mod m

– m is the array size; in general, it should be prime.

• Multiplication method

– floor(((key * someFraction) mod 1) * arraySize)

– Where someFraction is typically 0.618

• Java Hash Map method

– Create a “hash” by performing a series of shifts, adds, and xors on the key

– index = hash mod arraySize
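The division and multiplication methods can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the class `SimpleHashes` and its method names are invented, the fraction is taken as (√5 − 1)/2 ≈ 0.618, and the multiplication method assumes a non-negative key. The Java HashMap method is not sketched here because its details depend on the Java version.

```java
final class SimpleHashes {
    // (sqrt(5) - 1) / 2, approximately 0.618
    static final double A = (Math.sqrt(5.0) - 1.0) / 2.0;

    // Division method: key mod m, kept non-negative.
    static int division(int key, int m) {
        return Math.floorMod(key, m);
    }

    // Multiplication method: floor(((key * A) mod 1) * m). Assumes key >= 0.
    static int multiplication(int key, int m) {
        double product = key * A;
        double fraction = product - Math.floor(product); // (key * A) mod 1
        return (int) (fraction * m);
    }

    public static void main(String[] args) {
        System.out.println(division(54, 5));
        System.out.println(multiplication(123456, 16384));
    }
}
```

Unlike the division method, the multiplication method does not depend on the table size being prime, which is one reason it is sometimes preferred when the size is a power of two.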

Prime Number Distribution

• For example, assume:
– Keys (key values) are multiples of 5: 5, 10, 15, 20, 25…
– The keys are evenly distributed from 5 to 245
– An M (the divisor) of 7
• Then the hash values will be evenly distributed from 0 to 6 for the keys (see table)
• If M were 5, what kind of distribution would you have?

hash value = key mod m (m is typically the table size)

Key mod M    Total
0            7
1            7
2            7
3            7
4            7
5            7
6            7
Grand Total  49
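The table can be reproduced with a short count, and the same code answers the M = 5 question. The class name `ModDistribution` and the method `countSlots` are illustrative.

```java
import java.util.Arrays;

final class ModDistribution {
    // Counts how many of the keys 5, 10, ..., 245 land in each slot for divisor m.
    static int[] countSlots(int m) {
        int[] counts = new int[m];
        for (int key = 5; key <= 245; key += 5) {
            counts[key % m]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(countSlots(7))); // [7, 7, 7, 7, 7, 7, 7]
        System.out.println(Arrays.toString(countSlots(5))); // [49, 0, 0, 0, 0]
    }
}
```

With M = 5 every key is a multiple of the divisor, so all 49 keys collide in slot 0.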

Choosing Hash Function – 3

• If keys are non-random – e.g. part numbers

– Use all data to contribute to the hash function to get a better distribution

– Consider folding – sum the natural (or arbitrary) groups of digits in key

– Don’t use redundant or non-data values (e.g., checksum values)

– Do not use information that might change!

• Analyze your expected key values (or some representative subset) to make sure your hash function gives a good distribution!
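As a sketch of folding, the code below splits a non-negative key into groups of decimal digits (taken from the right), sums the groups, and reduces the sum modulo the table size. The class `FoldingHash`, the group width, and the sample key are all assumptions chosen for illustration.

```java
final class FoldingHash {
    // Sums groups of groupDigits decimal digits, taken from the right. Assumes key >= 0.
    static long fold(long key, int groupDigits) {
        if (groupDigits < 1) {
            throw new IllegalArgumentException("groupDigits must be at least 1");
        }
        long divisor = 1;
        for (int i = 0; i < groupDigits; i++) {
            divisor *= 10;
        }
        long sum = 0;
        while (key > 0) {
            sum += key % divisor; // lowest remaining group of digits
            key /= divisor;       // drop that group
        }
        return sum;
    }

    // Folds the key, then maps the sum to a table index.
    static int hash(long key, int groupDigits, int tableSize) {
        return (int) (fold(key, groupDigits) % tableSize);
    }

    public static void main(String[] args) {
        System.out.println(fold(123456789L, 3));      // 1368 = 789 + 456 + 123
        System.out.println(hash(123456789L, 3, 101)); // 55
    }
}
```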

Hashing with Open Addressing

• So far we have studied hashing with chaining, using a list to store the items that hash to the same location

• Another option is to store all the items (references to single items) directly in the table.

• Open addressing

– collisions are resolved by systematically examining other table indexes, i0 , i1 , i2 , … until an empty slot is located.

Hash Tables – Open Addressing

[Figure: a table indexed 0 to 7 by hash value/index, holding Obj1 (key=15), Obj2 (key=28), Obj3 (key=4), Obj4 (key=2), and Obj5 (key=1). Obj3 (key=4) and Obj2 (key=28) both map to Index=4.]

I = key mod 8

Open Addressing

• The key is first mapped to an array cell using the hash function (e.g. key % array-size)

• If there is a collision find an available array cell

• There are different algorithms to find (to probe for) the next array cell

– Linear – H+1,H+2,H+3,… until empty slot.

– Quadratic – H+1*1, H+2*2, H+3*3, H+4*4,…

– Double Hashing – hash again with a different hash function.
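The linear and quadratic probe sequences above can be written as functions of the home index H and the probe number i, with wrap-around added. This is a sketch: the class `ProbeSequences` and its method names are hypothetical, and i is assumed small enough that i * i does not overflow an int.

```java
final class ProbeSequences {
    // i-th linear probe (i = 0, 1, 2, ...) from home index h: H, H+1, H+2, ...
    static int linear(int h, int i, int size) {
        return (h + i) % size;
    }

    // i-th quadratic probe from home index h: H, H+1*1, H+2*2, H+3*3, ...
    static int quadratic(int h, int i, int size) {
        return (h + i * i) % size;
    }

    public static void main(String[] args) {
        for (int i = 0; i < 4; i++) {
            System.out.println(linear(4, i, 8) + " " + quadratic(4, i, 8));
        }
    }
}
```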

Probe Algorithms (Collision Resolution)

• Linear Probing – Choose the next available array cell
– First try arrayIndex = hash value + 1
– Then try arrayIndex = hash value + 2
– Be sure to wrap around the end of the array!
– arrayIndex = (arrayIndex + 1) % arraySize
– Stop when you have tried all possible array indices
– If the array is full, you need to throw an exception or, better yet, resize the array

• Quadratic Probing – Variation of linear probing that uses a more complex function to calculate the next cell to try
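A minimal sketch of linear probing, assuming int keys, h(key) = key mod array size, and an exception when the array is full rather than a resize. The class `LinearProbingTable` and its methods are hypothetical. Deletion is left out because it needs extra bookkeeping that the slides do not cover.

```java
final class LinearProbingTable {
    private final Integer[] slots; // null marks an empty slot
    private int count = 0;

    LinearProbingTable(int size) {
        slots = new Integer[size];
    }

    // Stores the key in the first empty slot at or after its home index; returns that index.
    int insert(int key) {
        if (count == slots.length) {
            throw new IllegalStateException("table is full");
        }
        int index = Math.floorMod(key, slots.length);
        while (slots[index] != null) {
            index = (index + 1) % slots.length; // wrap around the end of the array
        }
        slots[index] = key;
        count++;
        return index;
    }

    // Follows the same probe sequence; stops at an empty slot or after trying every index.
    boolean contains(int key) {
        int index = Math.floorMod(key, slots.length);
        for (int tried = 0; tried < slots.length; tried++) {
            if (slots[index] == null) {
                return false;
            }
            if (slots[index] == key) {
                return true;
            }
            index = (index + 1) % slots.length;
        }
        return false;
    }
}
```

With a table of size 5, inserting 2, 21, and 34 uses indexes 2, 1, and 4. Inserting 54 then collides at index 4 and wraps around to index 0.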

Double Hashing

• Apply a second hash function after the first

• The second hash function, like the first, is dependent on the key

• Secondary hash function must

– Be different from the first

– And, obviously, not generate a zero

• Good algorithm:

– arrayIndex = (arrayIndex + stepSize) % arraySize;

– Where stepSize = constant - (key % constant)

– And constant is a prime less than the array size
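The step-size formula above can be sketched as follows. The class `DoubleHashingTable` and its method names are invented for illustration. The sketch assumes int keys and a prime array size, so that every step size reaches every slot. It gives up with an exception after trying as many probes as there are slots.

```java
final class DoubleHashingTable {
    private final Integer[] slots; // null marks an empty slot
    private final int constant;    // assumed to be a prime smaller than the array size

    DoubleHashingTable(int size, int constant) {
        this.slots = new Integer[size];
        this.constant = constant;
    }

    // Secondary hash: always in the range 1..constant, so it is never zero.
    int stepSize(int key) {
        return constant - Math.floorMod(key, constant);
    }

    // Probes home, home + step, home + 2*step, ... (with wrap-around); returns the index used.
    int insert(int key) {
        int index = Math.floorMod(key, slots.length);
        int step = stepSize(key);
        for (int tried = 0; tried < slots.length; tried++) {
            if (slots[index] == null) {
                slots[index] = key;
                return index;
            }
            index = (index + step) % slots.length;
        }
        throw new IllegalStateException("no empty slot found for key " + key);
    }
}
```

For a table of size 7 with constant 5, keys 4, 11, and 18 all have home index 4, but their step sizes differ (1, 4, and 2), so they end up at indexes 4, 1, and 6 instead of forming one cluster.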

Problems

• Linear Probing yields clusters.

• Quadratic Probing yields secondary clusters.

• Double hashing can avoid both. Depends on secondary hash function.

The End

Load Factor

• Understanding the expected load factor will help you determine the efficiency of your hash table implementation and hash functions

• Load factor = number of items in hash table / array size

• For Open Addressing:

– If < 0.5, wasting space

– If > 0.8, overflows (collisions) become significant

• For Chaining:

– If < 1.0, wasting space

– If > 2.0, then search time to find a specific item may factor in significantly to the [relative] performance
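The load factor is a single division, and the thresholds above can be turned into simple checks. The class `LoadFactor` and its method names are hypothetical, and the 0.8 and 2.0 cut-offs are the slide's rules of thumb, not fixed constants of any library.

```java
final class LoadFactor {
    // Load factor = number of items in the hash table / array size.
    static double of(int itemCount, int arraySize) {
        return (double) itemCount / arraySize;
    }

    // Rule of thumb from the slide: open addressing degrades above 0.8.
    static boolean openAddressingTooFull(int itemCount, int arraySize) {
        return of(itemCount, arraySize) > 0.8;
    }

    // Rule of thumb from the slide: chaining search time grows noticeably above 2.0.
    static boolean chainingTooFull(int itemCount, int arraySize) {
        return of(itemCount, arraySize) > 2.0;
    }
}
```

For example, 9 items in an array of 10 gives a load factor of 0.9, which is past the open-addressing threshold but well within the range suggested for chaining.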

[Figure: Successful search. Average # of probes (2 to 20) versus load factor (0 to 1) for linear probing, double hashing, and separate chaining.]

[Figure: Unsuccessful search. Average # of probes (2 to 20) versus load factor (0 to 1) for linear probing, double hashing, and separate chaining.]

Open Addressing vs. Separate Chaining

• When should you be concerned about Open Addressing and Separate Chaining implementations?

• Note that there are Hash libraries… Java supports Hashtable, HashMap, LinkedHashMap, HashSet,…

• But, if you are implementing your own hash table, consider:

– Do you know the total number of items to be inserted into the table?

– Do you have plenty of memory?

– Do you know the expected load factor?

Key things in Hash Tables

1. The Quality of the Hash Function.

2. The conflict resolution method:

1. Separate chaining

2. Open Addressing.

3. Double hashing.

3. These two decisions determine the run time and memory performance of the Hash Tables.

Hash Tables in Java

• Java supports a number of hash table classes

– Hashtable, HashMap, LinkedHashMap, HashSet, …

– See Sun Java API Documentation http://java.sun.com/j2se/1.4.1/docs/api/

– Note that, like Vector and ArrayList, the items that are put into the hash tables are Objects

– Use Java casting when you remove items! (With generics, available since Java 5, a typed collection such as HashMap<K, V> makes the cast unnecessary.)

• As a programmer, you don’t see the collision detection, chaining, etc.

• You can set

– The initial table size

– The load factor (Default is .75)

– hashCode() – hash function (also need to override equals()) for the item to be hashed
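The three settings above can be seen together in a short sketch using `java.util.HashMap`. The key class `PartNumber`, its fields, and the stored values are made up for illustration. The constructor `HashMap(int initialCapacity, float loadFactor)` and `Objects.hash` are standard library calls.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical key type with two fields; both take part in equals() and hashCode().
final class PartNumber {
    private final String prefix;
    private final int serial;

    PartNumber(String prefix, int serial) {
        this.prefix = prefix;
        this.serial = serial;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) return true;
        if (!(other instanceof PartNumber)) return false;
        PartNumber that = (PartNumber) other;
        return serial == that.serial && prefix.equals(that.prefix);
    }

    @Override
    public int hashCode() {
        // Equal keys must produce equal hash codes.
        return Objects.hash(prefix, serial);
    }
}

final class JavaHashMapDemo {
    static Map<PartNumber, String> build() {
        // Initial table size 64, load factor 0.75 (the default, written out here).
        Map<PartNumber, String> parts = new HashMap<>(64, 0.75f);
        parts.put(new PartNumber("AX", 1001), "bracket");
        parts.put(new PartNumber("AX", 1002), "hinge");
        return parts;
    }

    public static void main(String[] args) {
        // A separately constructed but equal key finds the stored value.
        System.out.println(build().get(new PartNumber("AX", 1001))); // bracket
    }
}
```

If `equals()` were overridden without `hashCode()`, the lookup with a freshly constructed key would usually fail, because the two equal keys would typically hash to different buckets.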

The End