םינותנ ינבמ - cs.bgu.ac.ilds122/wiki.files/Presentation_hash.pdf · • Hashing by...

Data Structures

Hash Tables

Tzachi (Isaac) Rosen

Motivation

• Many applications require a dynamic set that supports only the dictionary operations:

– insert

– search

– delete

• Example:

– A symbol table in a compiler.

• Keys

• Satellite data


Keys

• We will consider all keys to be (possibly large) natural numbers.

• How can we convert floats or ASCII strings to natural numbers?

• Example:

– Consider “CLRS”.

• ASCII values: C = 67, L = 76, R = 82, S = 83.

– There are 128 basic ASCII values.

– So interpret “CLRS” as a radix-128 number:

• (67 · 128^3) + (76 · 128^2) + (82 · 128^1) + (83 · 128^0) = 141,764,947.
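The radix-128 interpretation above can be sketched in a few lines of Python. The function name `string_to_key` and the `radix` parameter are illustrative choices, not part of the slides.

```python
def string_to_key(s, radix=128):
    """Interpret an ASCII string as a natural number written in the given radix."""
    key = 0
    for ch in s:
        code = ord(ch)
        if code >= radix:
            raise ValueError(f"character {ch!r} is outside the basic ASCII range")
        # Horner's rule: shift the digits seen so far and append the new one.
        key = key * radix + code
    return key


print(string_to_key("CLRS"))
```

Python integers are unbounded, so long strings simply give large keys; in a fixed-width language the accumulation would usually be reduced mod m at each step.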


Direct Addressing

• Suppose:

– The range of keys is 0 .. m-1

– Keys are distinct

• The idea:

– Set up an array T[0..m-1] in which

• T[i] = x if x ∈ T and key[x] = i

• T[i] = null otherwise

– Operations take O(1) time!

• search(T, k): return T[k]

• insert(T, x): T[key[x]] = x

• delete(T, x): T[key[x]] = null

• So, what’s the problem?
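A minimal sketch of a direct-address table, assuming each element x is a `(key, satellite_data)` tuple so that key[x] is `x[0]`. The class name and the tuple layout are assumptions made for illustration.

```python
class DirectAddressTable:
    """Direct addressing: slot i holds the element whose key is i, or None."""

    def __init__(self, m):
        # Allocating and clearing the array already costs time proportional to m.
        self.T = [None] * m

    def search(self, k):
        return self.T[k]

    def insert(self, x):
        key = x[0]          # assumption: elements are (key, satellite_data) tuples
        self.T[key] = x

    def delete(self, x):
        key = x[0]
        self.T[key] = None


table = DirectAddressTable(10)
table.insert((3, "three"))
print(table.search(3))   # the stored tuple
print(table.search(4))   # None: no element with key 4
```

Every operation is a single array access, but the constructor touches all m slots, which is where the approach becomes expensive when the key range is large.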


Direct Addressing

• Direct addressing works well when the range m of keys is relatively small

• But what if the keys are 32-bit integers?

– The table will have 2^32 entries

– Even if memory is not an issue, it takes a lot of time to initialize it

Hashing

• Solution:

– map keys to smaller range 0 .. m-1

• This mapping is called hashing: a hash function h sends each key k to a slot h(k)


Collisions

• Two keys may hash to the same slot, i.e. collide with one another

• Solution:

– chaining

– open addressing


Chaining

• Chaining puts elements that hash to the same slot in a linked list.


Search, Insert and Delete

• search(T, k)

– search for an element with key k in list T[h(k)]

• insert(T, x)

– insert x at the head of list T[h(key[x])]

• delete(T, x)

– delete x from the list T[h(key[x])]
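The three chaining operations can be sketched as follows. This is an illustration under stated assumptions: Python lists stand in for the linked lists, the division method stands in for the hash function h, and `delete` takes a key rather than a pointer to the element.

```python
class ChainedHashTable:
    """Hash table with chaining; each slot holds a list used as a chain."""

    def __init__(self, m=8):
        self.m = m
        self.slots = [[] for _ in range(m)]

    def _h(self, key):
        # Assumption: h(k) = k mod m stands in for a generic hash function.
        return key % self.m

    def insert(self, key, value):
        # Insert at the head of the chain, as on the slide (no duplicate check).
        self.slots[self._h(key)].insert(0, (key, value))

    def search(self, key):
        for k, v in self.slots[self._h(key)]:
            if k == key:
                return v
        return None

    def delete(self, key):
        chain = self.slots[self._h(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                return True
        return False


t = ChainedHashTable(4)
t.insert(1, "a")
t.insert(5, "b")      # 5 mod 4 == 1, so it shares a chain with key 1
print(t.search(5))
```

With a real doubly linked list, inserting at the head and deleting a given element are both O(1); the Python list used here keeps the sketch short at the cost of that guarantee.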


Analysis of Chaining

• Assume simple uniform hashing:

– each key is equally likely to be hashed to any slot.

• Given n keys and m slots in the table: the load factor α = n/m is the average number of keys per slot

• We will show that the average cost of an unsuccessful search for a key is Θ(1+α).

• We will show that the average cost of a successful search is Θ(1+α/2) = Θ(1+α).

• Hence, the average cost is Θ(1+α).

• Thus, if n = O(m), then α = n/m = O(m)/m = O(1), and the average cost is O(1)



• Theorem:

– An unsuccessful search takes expected time Θ(1+α).

• Proof:

– Simple uniform hashing ⇒ any key not already in the table is equally likely to hash to any of the m slots.

– To search unsuccessfully for any key k, need to search to the end of the list T[h(k)].

– This list has expected length E[length of T[h(k)]] = α.

– Therefore, the expected number of elements examined in an unsuccessful search is α.

– Adding in the time to compute the hash function, the total time required is Θ(1 + α).



• Theorem:

– A successful search takes expected time Θ(1+α).

• Proof:

– Assume that the element x being searched for is equally likely to be any of the n elements stored in the table.

– The number of elements examined during a successful search for x is 1 more than the number of elements that appear before x in x’s list.

– These are the elements inserted after x was inserted (because we insert at the head of the list).

– So we need to find the average, over the n elements x in the table, of how many elements were inserted into x’s list after x was inserted.

– For i = 1, 2, . . . , n, let x_i be the i-th element inserted into the table, and let k_i = key[x_i].

– For all i and j, define the indicator random variable X_ij = I{h(k_i) = h(k_j)}.
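The slides stop at the definition of the indicator variables. One standard way to finish the argument (the route taken in CLRS) is sketched below. It assumes that simple uniform hashing gives E[X_ij] = 1/m for i ≠ j, and it uses linearity of expectation.

```latex
\begin{align*}
\mathrm{E}\!\left[\frac{1}{n}\sum_{i=1}^{n}\Bigl(1+\sum_{j=i+1}^{n}X_{ij}\Bigr)\right]
  &= \frac{1}{n}\sum_{i=1}^{n}\Bigl(1+\sum_{j=i+1}^{n}\mathrm{E}[X_{ij}]\Bigr) \\
  &= 1+\frac{1}{nm}\sum_{i=1}^{n}(n-i) \\
  &= 1+\frac{n-1}{2m} \\
  &= 1+\frac{\alpha}{2}-\frac{\alpha}{2n}.
\end{align*}
```

Adding the time to compute the hash function, the total is Θ(2 + α/2 − α/(2n)) = Θ(1 + α).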


Choosing A Hash Function

• Clearly choosing the hash function well is crucial

– What will a worst-case hash function do?

– What will be the time to search in this case?

• What are desirable features of the hash function?

– Should distribute keys uniformly into slots

– Should not depend on patterns in the data


Choosing A Hash Function

• Unfortunately, it is typically not possible to check these conditions

– One rarely knows the probability distribution according to which the keys are drawn

– The keys may not be drawn independently.

• Occasionally we do know the distribution

– If the keys are known to be random real numbers k independently and uniformly distributed in the range 0 ≤ k < 1,

– then the hash function h(k) = ⌊km⌋ satisfies the condition of simple uniform hashing.


The Division Method

h(k) = k mod m

• For example, if the hash table has size m = 12 and the key is k = 100, then h(k) = 4.

• Hashing by division is quite fast.

• What happens if m is a power of 2 (say 2^p)?

• What about prime numbers?

– Claim:

• If m is a prime and x mod m ≠ 0 then for all i = 1 .. m-1, the values ix mod m are distinct and nonzero

– Proof:

• If ix mod m = 0 then, since m is prime, m divides either i or x.

• So ix mod m ≠ 0, since x mod m ≠ 0 and 0 < i < m.

• For i > j, ix mod m ≠ jx mod m, since otherwise (i-j)x mod m = 0, but 0 < i-j < m and x mod m ≠ 0.
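A short sketch that makes both questions on this slide concrete. The helper names and the sample keys are illustrative assumptions.

```python
def h_div(k, m):
    """Division-method hash: h(k) = k mod m."""
    return k % m


def multiples_mod(x, m):
    """The values i*x mod m for i = 1 .. m-1."""
    return [(i * x) % m for i in range(1, m)]


# With m = 2**p the hash is just the p low-order bits of k,
# so keys that differ only in their higher bits all collide.
p = 4
keys = [0x10, 0x20, 0x30, 0xF0]
low_bits_only = [h_div(k, 2 ** p) for k in keys]
print(low_bits_only)

# With a prime m and x mod m != 0, the multiples of x hit every
# nonzero residue exactly once, as the claim states.
print(sorted(multiples_mod(5, 13)))
```

The first list shows every sample key landing in the same slot; the second shows the residues 1 .. 12 each appearing once.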


The Division Method

• Conclusion:

– Pick m to be a prime not too close to a power of 2

• For example

– Suppose n = 2000 and we don’t mind a load factor of α = 3

– We can pick h(k) = k mod 701

• 701 is a prime near 2000/3 but not near any power of 2


The Multiplication Method

For a constant 0 < A < 1: h(k) = ⌊m (kA − ⌊kA⌋)⌋

• Disadvantage:

– Slower than division method.

• Advantage:

– Value of m is not critical.

– (Relatively) easy implementation:

• Choose m = 2^p, 0 < p < w.

• Choose A = s/2^w, where w is k’s number of bits, and s is an integer in the range 0 < s < 2^w

• Knuth:

– Good choice for A ≈ (√5 − 1)/2


Example

• Let k = 123456, p = 14, m = 2^14 = 16384, and w = 32.

• Choose A = 2654435769/2^32 (of the form s/2^32, close to (√5 − 1)/2).

• Then, k · s = 327706022297664 = (76300 · 2^32) + 17612864

• So, r1 = 76300 and r0 = 17612864.

• Hence, the 14 most significant bits of r0 yield the value h(k) = 67
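The worked example can be reproduced with integer arithmetic only. This is a sketch that assumes the same parameters as the slide (p = 14, w = 32, s = 2654435769); the function name `h_mul` is illustrative.

```python
def h_mul(k, p=14, w=32, s=2654435769):
    """Multiplication-method hash with m = 2**p and A = s / 2**w."""
    product = k * s
    r0 = product & ((1 << w) - 1)   # low-order w bits of k*s (the fractional part of k*A)
    return r0 >> (w - p)            # the p most significant bits of r0


k = 123456
print((k * 2654435769) >> 32)            # r1, the high-order word
print((k * 2654435769) & (2**32 - 1))    # r0, the low-order word
print(h_mul(k))
```

Masking to w bits plays the role of kA − ⌊kA⌋, and the final shift multiplies by m = 2^p and takes the floor.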


Universal Hashing

• Pick a hash function at random when the algorithm begins

– This randomizes the algorithm itself, so its behavior no longer depends on the input distribution.

– Need a good family of hash functions to choose from


Universal Set

• A finite collection H of hash functions is universal

– if for each k, l ∈ U, where k ≠ l, the number of hash functions h ∈ H for which h(k) = h(l) is ≤|H|/m, where m is the size of the hash table.

• Alternatively, H is universal

– if, with a hash function h chosen randomly from H, the probability of a collision between two different keys is no more than 1/m.


Universal Hashing

• Theorem:

– Choose h from a universal family of hash functions

– Hash n keys into a table T of m slots, n ≤ m

– Then the expected number of collisions involving a particular key x is less than 1

• Proof:

– For each pair of keys x, y, let I_xy be the indicator that y and x collide.

– E[I_xy] ≤ 1/m

– Let C_x be the total number of collisions involving key x

– $$E[C_x] = E\Bigl[\sum_{y \in T,\, y \ne x} I_{xy}\Bigr] = \sum_{y \in T,\, y \ne x} E[I_{xy}] \le \frac{n-1}{m}$$

– Since n ≤ m, we have E[C_x] < 1

• Corollary

– Using chaining and universal hashing, the expected time for each search operation is O(1).

A Universal Hash Set of Functions

• Let U = {1, …, u}

• Fix a prime p > u.

• For every a ∈ {1, …, p-1}, b ∈ {0, …, p-1}, define

h_ab(k) = ((ak + b) mod p) mod m, where m is the size of the table

• H = {h_ab} is universal
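A sketch of drawing one function from this family, plus a brute-force check of the universality bound on a tiny instance. The helper names and the parameter values p = 17, m = 6 are assumptions chosen for illustration.

```python
import random


def make_universal_hash(p, m, rng=random):
    """Draw h_ab(k) = ((a*k + b) mod p) mod m at random from the family H."""
    a = rng.randrange(1, p)   # a in {1, ..., p-1}
    b = rng.randrange(0, p)   # b in {0, ..., p-1}
    return lambda k: ((a * k + b) % p) % m


def collisions(k, l, p, m):
    """Number of pairs (a, b) for which h_ab(k) == h_ab(l)."""
    return sum(
        ((a * k + b) % p) % m == ((a * l + b) % p) % m
        for a in range(1, p)
        for b in range(p)
    )


p, m = 17, 6
h = make_universal_hash(p, m)
print([h(k) for k in range(1, 6)])
# |H| = p * (p - 1); universality says collisions(k, l) <= |H| / m for k != l.
print(collisions(3, 11, p, m), p * (p - 1) / m)
```

The exhaustive count is only feasible for small p; it is there to show what the definition of universality asserts, not as something to run in practice.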


Open Addressing


• Basic idea:

– If slot is full, try another slot, etc., until an open slot is found (probing)

– To search, follow same sequence of probes as would be used when inserting the element

• If reach element with correct key, return it

• If reach a null pointer, element is not in table

• Good for fixed sets (adding but no deletion)

– Example: spell checking

• Table needn’t be much bigger than n

– But, compared to chaining, no memory goes to pointers, so more space is available for slots.

Hash Function

h : U × {0, 1, ..., m - 1} → {0, 1, ..., m - 1}

such that

for each k ∈ U

{h(k,0), h(k,1), ..., h(k, m-1)}

is a permutation of

{0, 1, ..., m - 1}


Insert & Search
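The slide's pseudocode did not survive extraction. The following is a minimal sketch of open-address insert and search, following the description of probing given earlier. The probe function `probe` (linear probing over 7 slots) is a hypothetical stand-in for h(k, i), and the table stores bare keys.

```python
def hash_insert(T, k, h):
    """Insert key k into open-address table T; return the slot used."""
    for i in range(len(T)):
        j = h(k, i)              # i-th slot in the probe sequence of k
        if T[j] is None:
            T[j] = k
            return j
    raise OverflowError("hash table overflow")


def hash_search(T, k, h):
    """Return the slot holding k, or None if k is not in T."""
    for i in range(len(T)):
        j = h(k, i)
        if T[j] is None:         # reached an empty slot: k was never inserted
            return None
        if T[j] == k:
            return j
    return None


def probe(k, i, m=7):
    # Hypothetical probe function for the demo: h(k, i) = (k mod m + i) mod m.
    return (k % m + i) % m


T = [None] * 7
hash_insert(T, 10, probe)
hash_insert(T, 17, probe)        # collides with 10 and moves one slot on
print(T)
print(hash_search(T, 17, probe))
```

Search follows exactly the same probe sequence as insert, which is why it can stop at the first empty slot.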


Deletion

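• Cannot just put null into the slot containing the key we want to delete: a search for any key whose insertion probed past that slot would stop there and miss it.

• Solution:

– Store a special value DELETED in the slot instead of null; search skips over it, and insert may reuse it.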

• The disadvantage:

– Search time is no longer dependent on the load factor α.
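A sketch of deletion with a DELETED marker, assuming linear probing and a table that stores bare, distinct keys. The sentinel object and the function names are illustrative.

```python
DELETED = object()   # sentinel marking a slot whose key was removed


def probe(k, i, m):
    # Assumption: linear probing, h(k, i) = (k mod m + i) mod m.
    return (k % m + i) % m


def insert(T, k):
    for i in range(len(T)):
        j = probe(k, i, len(T))
        if T[j] is None or T[j] is DELETED:   # DELETED slots can be reused
            T[j] = k
            return j
    raise OverflowError("hash table overflow")


def search(T, k):
    for i in range(len(T)):
        j = probe(k, i, len(T))
        if T[j] is None:                      # only a true null ends the search
            return None
        if T[j] is not DELETED and T[j] == k:
            return j
    return None


def delete(T, k):
    j = search(T, k)
    if j is not None:
        T[j] = DELETED
    return j


T = [None] * 7
insert(T, 10)
insert(T, 17)          # probes past the slot of 10
delete(T, 10)
print(search(T, 17))   # still found, because the search skips the DELETED slot
```

Because DELETED slots still have to be probed, a table that has seen many deletions can be slow to search even when few keys remain, which is the disadvantage noted above.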


Linear Probing

Let

h' : U → {0, 1, ..., m - 1} be a hash function

Define

h(k, i) = (h'(k) + i) mod m

• Easy to implement

• Suffers from primary clustering

– Long runs of occupied slots build up.


Quadratic Probing

h(k, i) = (h'(k) + c1·i + c2·i^2) mod m,

where h' is a hash function,

c1 and c2 ≠ 0 are constants, and i = 0, 1, ..., m - 1

• Works much better than linear probing.

• Suffers from secondary clustering.

– Two distinct keys with the same h' value have the same probe sequence.


Example

• h(k, i) = (h'(k) + i + i^2) mod m

– Probe sequence: {h'(k), h'(k) + 2, h'(k) + 6, ...}

• For m = 2^n, a good choice for the constants is c1 = c2 = 1/2

– The values h(k, i) for i ∈ [0, m − 1] are then all distinct.

– Probe sequence: {h'(k), h'(k) + 1, h'(k) + 3, h'(k) + 6, ...}

• For prime m > 2, most choices of c1 and c2 will make h(k, i) distinct for i ∈ [0, (m − 1) / 2].

– Such choices include c1 = c2 = 1/2, c1 = c2 = 1, and c1 = 0, c2 = 1.

– Difficult to guarantee that insertions will succeed when the load factor is > 1/2.
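A small sketch of the c1 = c2 = 1/2 probe sequence, which can be used to check the claims above on concrete table sizes. The function name and the chosen sizes are illustrative.

```python
def quad_probe_sequence(h0, m):
    """Probe sequence (h0 + (i + i*i)/2) mod m for i = 0 .. m-1, i.e. c1 = c2 = 1/2."""
    # i + i*i is always even, so the integer division is exact.
    return [(h0 + (i + i * i) // 2) % m for i in range(m)]


print(quad_probe_sequence(0, 8))    # power of 2: every slot is visited once
print(quad_probe_sequence(0, 7))    # prime: only the first few probes are distinct
print(quad_probe_sequence(0, 12))   # neither: some slots are never reached
```

Comparing the three outputs shows why the table size matters for quadratic probing: only the power-of-2 case yields a full permutation of the slots.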


Double hashing

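h(k, i) = (h1(k) + i·h2(k)) mod m

where

h1 and h2 are hash functions

• Advantage:

– Two keys with the same h1 value may have different step sizes.

– Still only Θ(m^2) different probe sequences out of the m! possible, but far more than the m of linear or quadratic probing.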


Example

• h1(k) = k mod 13

• h2(k) = 1 + (k mod 11)
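A sketch of the probe sequences these two functions generate, assuming a table of size m = 13 (the slide does not state m, but h1 maps into 0 .. 12). The function name and the sample keys are illustrative.

```python
def double_hash_sequence(k, m=13):
    """Probe sequence (h1(k) + i*h2(k)) mod m for the example hash functions."""
    h1 = k % 13
    h2 = 1 + (k % 11)     # always in 1 .. 11, so never 0 and coprime to 13
    return [(h1 + i * h2) % m for i in range(m)]


print(double_hash_sequence(14))   # starts at slot 1 with step 4
print(double_hash_sequence(27))   # also starts at slot 1, but with step 6
```

Keys 14 and 27 share the same first slot yet follow different sequences, which is how double hashing avoids the secondary clustering of quadratic probing. Since 13 is prime, every step size produces a full permutation of the slots.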


Analysis of Open-Address Hashing

• Theorem:

– Given an open-address hash table with load factor α < 1,

– The expected number of probes in a successful search is at most (1/α) ln(1/(1 − α)),

• assuming uniform hashing

– Each key is equally likely to have any of the m! permutations of 0, 1, . . . , m − 1 as its probe sequence

• assuming that each key in the table is equally likely to be searched for.


Analysis of Open-Address Hashing

• Theorem:

– Given an open-address hash table with load factor α < 1,

– The expected number of probes in an unsuccessful search is at most 1/(1 − α),

• assuming uniform hashing.

• If α is a constant, a search runs in O(1) time.

• Theorem:

– The expected number of probes to insert is at most 1/(1 − α).
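To get a feel for these bounds, the sketch below evaluates them for a few load factors. The successful-search bound used here, (1/α) ln(1/(1 − α)), is the standard CLRS result under uniform hashing; the function names are illustrative.

```python
import math


def unsuccessful_bound(alpha):
    """Upper bound on expected probes in an unsuccessful search (or an insert)."""
    return 1.0 / (1.0 - alpha)


def successful_bound(alpha):
    """Upper bound on expected probes in a successful search."""
    return (1.0 / alpha) * math.log(1.0 / (1.0 - alpha))


for alpha in (0.5, 0.75, 0.9):
    print(alpha, round(unsuccessful_bound(alpha), 3), round(successful_bound(alpha), 3))
```

A half-full table needs at most 2 expected probes for an unsuccessful search, while at 90% full the bound grows to 10, which is why open-address tables are usually kept well below full.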


Fast Spell Checker

• Suppose we have a dictionary of 10,000 entries, a hash table T of 1M bits, and a hash function h.

• For each word w in the dictionary, compute h(w) and set the entry T[h(w)] to one.

– Note that h may give the same value for several words.

• An entry T[i] for which there is no word w with h(w) = i is set to zero.

• When we write a word w’ that belongs to the dictionary, we can check whether we spelled it correctly by testing whether T[h(w’)] = 1.

• Clearly, if we spelled the word correctly we get a positive answer.

• What is the probability that we misspelled the word and still got a positive answer (false positive)?

– It is the probability that h maps some word other than w’ to h(w’).

– By the uniform distribution assumption, this probability is 10,000/1,000,000 = 1/100

• Can this be improved?

– Use 10 hash tables of 100,000 bits each, and accept a spelling only if it is accepted by all of the tables.

– The probability of a false positive in each of the tables is 10,000/100,000 = 1/10

– The probability of a false positive in all 10 tables together is 1/10^10.
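A sketch of the multi-table spell checker described above. Several choices here are assumptions made for illustration: SHA-256 salted with the table index stands in for the 10 independent hash functions, each "bit" is stored as one byte of a `bytearray` for simplicity, and the class and function names are hypothetical.

```python
import hashlib


def slot(word, table_index, size):
    # Assumption: a salted SHA-256 digest plays the role of the table's hash function.
    digest = hashlib.sha256(f"{table_index}:{word}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % size


class SpellChecker:
    def __init__(self, words, tables=10, bits_per_table=100_000):
        self.size = bits_per_table
        # One byte per bit keeps the sketch short; a real bit array would be 8x smaller.
        self.tables = [bytearray(bits_per_table) for _ in range(tables)]
        for w in words:
            for t, table in enumerate(self.tables):
                table[slot(w, t, self.size)] = 1

    def accepts(self, word):
        # A word is accepted only if every table has its bit set.
        return all(table[slot(word, t, self.size)]
                   for t, table in enumerate(self.tables))


checker = SpellChecker(["hash", "table", "probe"])
print(checker.accepts("hash"))
print(checker.accepts("hsah"))
```

Words from the dictionary are always accepted; a misspelled word is rejected unless it happens to hit a set bit in all tables at once, which is the false-positive event analysed above.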

