Data Structures
Hash Tables
Tzachi (Isaac) Rosen 1
Motivation
• Many applications require a dynamic set that supports only the dictionary operations:
– insert
– search
– delete
• Example:
– A symbol table in a compiler.
• Keys
• Satellite data
Keys
• We will consider all keys to be (possibly large) natural numbers.
• How can we convert floats or ASCII strings to natural numbers?
• Example:
– Consider “CLRS”.
• ASCII values: C = 67, L = 76, R = 82, S = 83.
– There are 128 basic ASCII values.
– So interpret “CLRS” as a radix-128 integer:
• (67 · 128^3) + (76 · 128^2) + (82 · 128^1) + (83 · 128^0) = 141,764,947.
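The string-to-number conversion can be sketched in a few lines of Python. This is only an illustration: the function name `string_to_natural` and the error handling are assumptions, not part of the slides.

```python
def string_to_natural(s, radix=128):
    """Interpret an ASCII string as a natural number written in the given radix."""
    value = 0
    for ch in s:
        code = ord(ch)
        if code >= radix:
            raise ValueError(f"{ch!r} is outside the {radix}-character alphabet")
        # Horner's rule: shift the digits read so far one place left, add the next.
        value = value * radix + code
    return value


print(string_to_natural("CLRS"))  # 141764947
```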
Direct Addressing
• Suppose:
– The range of keys is 0 .. m-1
– Keys are distinct
• The idea:
– Set up an array T[0..m-1] in which
• T[i] = x if x ∈ T and key[x] = i
• T[i] = null otherwise
– Operations take O(1) time!
• search(T, k): return T[k]
• insert(T, x): T[key[x]] = x
• delete(T, x): T[key[x]] = null
• So – what’s the problem?
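A minimal sketch of a direct-address table, assuming each element is a `(key, data)` tuple with an integer key in 0 .. m-1. The class name and the tuple representation are illustrative choices.

```python
class DirectAddressTable:
    """Array T[0..m-1] indexed directly by key; every operation is O(1)."""

    def __init__(self, m):
        self.slots = [None] * m  # None plays the role of null

    def search(self, k):
        return self.slots[k]

    def insert(self, x):
        key, _data = x
        self.slots[key] = x

    def delete(self, x):
        key, _data = x
        self.slots[key] = None


table = DirectAddressTable(8)
table.insert((3, "three"))
print(table.search(3))  # (3, 'three')
```

Note that the constructor already touches all m slots, which is the cost the next slide asks about.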
Direct Addressing
• Direct addressing works well when the range m of keys is relatively small.
• But what if the keys are 32-bit integers?
– The table will have 2^32 entries.
– Even if memory is not an issue, it takes a lot of time to initialize it.
Hashing
• Solution:
– Map keys to a smaller range 0 .. m-1.
• This mapping is called hashing; the function h that performs it is a hash function.
Collisions
• Two keys may hash to the same slot; this is called a collision.
• Solution:
– chaining
– open addressing
Chaining
• Chaining puts elements that hash to the same slot in a linked list.
Search, Insert and Delete
• search(T, k)
– search for an element with key k in list T[h(k)]
• insert(T, x)
– insert x at the head of list T[h(key[x])]
• delete(T, x)
– delete x from the list T[h(key[x])]
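The three operations can be sketched as follows. The assumptions here are illustrative: elements are `(key, data)` tuples, the default hash function is `k mod m`, and each chain is a `collections.deque` standing in for a linked list (so inserting at the head is O(1)).

```python
from collections import deque


class ChainedHashTable:
    def __init__(self, m, h=None):
        self.m = m
        # Assumed default hash function; any h mapping keys to 0..m-1 works.
        self.h = h if h is not None else (lambda k: k % m)
        self.slots = [deque() for _ in range(m)]  # one chain per slot

    def search(self, k):
        # Walk the chain T[h(k)] looking for key k.
        for x in self.slots[self.h(k)]:
            if x[0] == k:
                return x
        return None

    def insert(self, x):
        # Insert at the head of the chain T[h(key[x])].
        self.slots[self.h(x[0])].appendleft(x)

    def delete(self, x):
        # Remove x from its chain; raises ValueError if x is not present.
        self.slots[self.h(x[0])].remove(x)


table = ChainedHashTable(4)
table.insert((1, "a"))
table.insert((5, "b"))  # 5 mod 4 == 1, so it chains with key 1
print(table.search(5))  # (5, 'b')
```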
Analysis of Chaining
• Assume simple uniform hashing:
– each key is equally likely to be hashed to any slot.
• Given n keys and m slots in the table, the load factor α = n/m is the average number of keys per slot.
• We will show that the average cost of an unsuccessful search for a key is Θ(1+α).
• We will show that the average cost of a successful search is Θ(1+α/2) = Θ(1+α).
• Hence, the average cost is Θ(1+α).
• Thus, if n = O(m), then α = n/m = O(m)/m = O(1), and the average cost is O(1).
Analysis of Chaining
• Theorem:
– An unsuccessful search takes expected time Θ(1+α).
• Proof:
– Simple uniform hashing ⇒ any key not already in the table is equally likely to hash to any of the m slots.
– To search unsuccessfully for any key k, we need to search to the end of the list T[h(k)].
– This list has expected length E[length of T[h(k)]] = α.
– Therefore, the expected number of elements examined in an unsuccessful search is α.
– Adding in the time to compute the hash function, the total time required is Θ(1 + α).
Analysis of Chaining
• Theorem:
– A successful search takes expected time Θ(1+α).
• Proof:
– Assume that the element x being searched for is equally likely to be any of the n elements stored in the table.
– The number of elements examined during a successful search for x is 1 more than the number of elements that appear before x in x’s list.
– These are the elements inserted after x was inserted (because we insert at the head of the list).
– So we need to find the average, over the n elements x in the table, of how many elements were inserted into x’s list after x was inserted.
– For i = 1, 2, ..., n, let xi be the i-th element inserted into the table, and let ki = key[xi].
– For all i and j, define the indicator random variable Xij = I{h(ki) = h(kj)}.
Analysis of Chaining
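The remaining step of the successful-search proof can be written out as below. This is a sketch of the standard textbook computation, assuming simple uniform hashing (so that E[X_ij] = 1/m for i ≠ j) and reusing the indicator variables X_ij defined in the proof; the exact layout on the slide may have differed.

```latex
\begin{align*}
E\left[\frac{1}{n}\sum_{i=1}^{n}\left(1 + \sum_{j=i+1}^{n} X_{ij}\right)\right]
  &= \frac{1}{n}\sum_{i=1}^{n}\left(1 + \sum_{j=i+1}^{n} E[X_{ij}]\right) \\
  &= \frac{1}{n}\sum_{i=1}^{n}\left(1 + \sum_{j=i+1}^{n} \frac{1}{m}\right) \\
  &= 1 + \frac{1}{nm}\sum_{i=1}^{n}(n - i) \\
  &= 1 + \frac{n-1}{2m} \\
  &= 1 + \frac{\alpha}{2} - \frac{\alpha}{2n}.
\end{align*}
```

Adding the Θ(1) time to compute the hash function gives Θ(2 + α/2 − α/(2n)) = Θ(1 + α).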
Choosing A Hash Function
• Clearly choosing the hash function well is crucial
– What will a worst-case hash function do?
– What will be the time to search in this case?
• What are desirable features of the hash function?
– Should distribute keys uniformly into slots
– Should not depend on patterns in the data
Choosing A Hash Function
• Unfortunately, it is typically not possible to check these conditions:
– One rarely knows the probability distribution according to which the keys are drawn.
– The keys may not be drawn independently.
• Occasionally we do know the distribution:
– If the keys are known to be random real numbers k, independently and uniformly distributed in the range 0 ≤ k < 1,
– then the hash function h(k) = ⌊km⌋ satisfies the condition of simple uniform hashing.
The Division Method
h(k) = k mod m
• For example, if the hash table has size m = 12 and the key is k = 100, then h(k) = 4.
• Hashing by division is quite fast.
• What happens if m is a power of 2 (say 2^p)?
• What about prime numbers?
– Claim:
• If m is a prime and x mod m ≠ 0, then for all i = 1 .. m-1, the values ix mod m are different and nonzero.
– Proof:
• If ix mod m = 0 then, since m is prime, m divides either i or x.
• Neither is possible, since x mod m ≠ 0 and 0 < i < m; hence ix mod m ≠ 0.
• For i > j, ix mod m ≠ jx mod m, since otherwise (i-j)x mod m = 0, but 0 < i-j < m and x mod m ≠ 0.
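The division method and the claim about prime m can be checked with a short script. The helper names `division_hash` and `multiples_mod` and the sample values (x = 5, m = 13, and the composite m = 12) are illustrative assumptions.

```python
def division_hash(k, m):
    """h(k) = k mod m."""
    return k % m


def multiples_mod(x, m):
    """The values ix mod m for i = 1 .. m-1."""
    return [(i * x) % m for i in range(1, m)]


print(division_hash(100, 12))  # 4

# m = 2**p: h(k) is just the p low-order bits of k; the higher bits are ignored.
assert division_hash(0b10110110, 2**4) == 0b0110

# Prime m: the multiples x, 2x, ..., (m-1)x are distinct and nonzero mod m.
assert sorted(multiples_mod(5, 13)) == list(range(1, 13))

# Composite m: the claim can fail, e.g. multiples of 4 mod 12 only hit 0, 4 and 8.
assert set(multiples_mod(4, 12)) == {0, 4, 8}
```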
The Division Method
• Conclusion:
– Pick m to be a prime not too close to a power of 2
• For example
– Suppose n = 2000 and we don’t mind a load factor of α = 3
– We can pick h(k) = k mod 701
• 701 is a prime near 2000/3 but not near any power of 2
The Multiplication Method
For a constant 0 < A < 1: h(k) = ⌊m (kA − ⌊kA⌋)⌋
• Disadvantage:
– Slower than the division method.
• Advantage:
– The value of m is not critical.
– (Relatively) easy implementation:
• Choose m = 2^p, 0 < p < w.
• Choose A = s/2^w, where w is the number of bits in k (the word size) and s is an integer in the range 0 < s < 2^w.
• Knuth:
– A good choice is A ≈ (√5 − 1)/2 ≈ 0.618.
Example
• Let k = 123456, p = 14, m = 2^14 = 16384, and w = 32.
• Choose A = 2654435769/2^32 (of the form s/2^32, close to (√5 − 1)/2).
• Then k · s = 327706022297664 = (76300 · 2^32) + 17612864.
• So r1 = 76300 and r0 = 17612864.
• Hence, the 14 most significant bits of r0 yield the value h(k) = 67.
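The same computation can be sketched with integer arithmetic only. The function name `multiplication_hash` is an assumption; the default parameters are the values from the example (p = 14, w = 32, s = 2654435769).

```python
def multiplication_hash(k, p=14, w=32, s=2654435769):
    """Multiplication method with m = 2**p and A = s / 2**w."""
    product = k * s                # up to 2w bits: r1 * 2**w + r0
    r0 = product & ((1 << w) - 1)  # low-order w bits, i.e. the fractional part of kA
    return r0 >> (w - p)           # the p most significant bits of r0


print(multiplication_hash(123456))  # 67
```

Working with s instead of the real number A avoids floating-point rounding: multiplying the fractional part by m = 2^p and taking the floor is the same as keeping the top p bits of r0.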
Universal Hashing
• Pick a hash function at random when the algorithm begins.
– A way to randomize the algorithm so that its expected performance does not depend on the input distribution.
– Needs a good family of hash functions to choose from.
Universal Set
• A finite collection H of hash functions is universal
– if for each k, l ∈ U, where k ≠ l, the number of hash functions h ∈ H for which h(k) = h(l) is ≤|H|/m, where m is the size of the hash table.
• Alternatively, H is universal
– if, with a hash function h chosen randomly from H, the probability of a collision between two different keys is no more than 1/m.
Universal Hashing
• Theorem:
– Choose h from a universal family of hash functions.
– Hash n keys into a table T of m slots, n ≤ m.
– Then the expected number of collisions involving a particular key x is less than 1.
• Proof:
– For each pair of keys x, y, let Ixy be the indicator that y and x collide.
– E[Ixy] ≤ 1/m, by universality.
– Let Cx be the total number of collisions involving key x.
– $E[C_x] = E\left[\sum_{y \in T,\, y \ne x} I_{xy}\right] = \sum_{y \in T,\, y \ne x} E[I_{xy}] \le \frac{n-1}{m}$
– Since n ≤ m, we have E[Cx] < 1.
• Corollary:
– Using chaining and universal hashing, the expected time for each search operation is O(1).
A Universal Hash Set of Functions
• Let U = {1, …, u}
• Fix a prime p > u.
• For every a ∈ {1, …, p-1}, b ∈ {0, …, p-1}, define
h_ab(k) = ((ak + b) mod p) mod m, where m is the size of the table
• H = {h_ab} is universal
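A minimal sketch of drawing a function from this family. The helper names are assumptions, and the caller is assumed to supply a prime p larger than every key; the code does not check primality. The values p = 17 and m = 6 are chosen only for illustration.

```python
import random


def make_h_ab(a, b, p, m):
    """Return h_ab(k) = ((a*k + b) mod p) mod m for fixed a, b."""
    if not (1 <= a < p and 0 <= b < p):
        raise ValueError("need 1 <= a <= p-1 and 0 <= b <= p-1")
    return lambda k: ((a * k + b) % p) % m


def random_universal_hash(p, m, rng=random):
    """Pick a and b uniformly at random, i.e. pick h_ab at random from H."""
    a = rng.randrange(1, p)  # a in {1, ..., p-1}
    b = rng.randrange(0, p)  # b in {0, ..., p-1}
    return make_h_ab(a, b, p, m)


h = make_h_ab(3, 4, p=17, m=6)
print(h(8))  # ((3*8 + 4) mod 17) mod 6 = 11 mod 6 = 5
```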
Open Addressing
• Basic idea:
– If a slot is full, try another slot, and so on, until an open slot is found (probing).
– To search, follow the same sequence of probes as would be used when inserting the element.
• If we reach an element with the correct key, return it.
• If we reach a null pointer, the element is not in the table.
• Good for fixed sets (adding but no deletion).
– Example: spell checking
• The table needn’t be much bigger than n.
– But, compared to chaining, no memory is spent on pointers, so the same memory holds more slots.
Hash Function
h : U × {0, 1, ..., m − 1} → {0, 1, ..., m − 1}
such that
for each k ∈ U
{h(k, 0), h(k, 1), ..., h(k, m − 1)}
is a permutation of
{0, 1, ..., m − 1}
Insert & Search
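A minimal sketch of open-address insert and search. It assumes integer keys stored directly in the table and, as a default, a linear probe function built on h'(k) = k mod m; the names `hash_insert`, `hash_search` and `linear_probe` are illustrative, not taken from the slides.

```python
def linear_probe(k, i, m):
    """Assumed default probe function: h(k, i) = (k mod m + i) mod m."""
    return (k % m + i) % m


def hash_insert(table, k, probe=linear_probe):
    """Probe until an empty slot is found; return the slot index used."""
    m = len(table)
    for i in range(m):
        j = probe(k, i, m)
        if table[j] is None:
            table[j] = k
            return j
    raise OverflowError("hash table overflow")


def hash_search(table, k, probe=linear_probe):
    """Follow the same probe sequence as insert; return the slot index or None."""
    m = len(table)
    for i in range(m):
        j = probe(k, i, m)
        if table[j] is None:  # reached an empty slot: k is not in the table
            return None
        if table[j] == k:
            return j
    return None


table = [None] * 7
print(hash_insert(table, 10))  # 3
print(hash_insert(table, 17))  # 4, because slot 3 is already taken
print(hash_search(table, 17))  # 4
print(hash_search(table, 24))  # None
```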
Deletion
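• We cannot just put null into the slot containing the key we want to delete.
– A later search would stop at that slot and miss any key whose probe sequence passed through it.
• Solution:
– Mark the slot with a special value DELETED instead of null.
– Search treats DELETED as occupied and keeps probing; insert may reuse the slot.
• The disadvantage:
– Search time no longer depends only on the load factor α, since DELETED slots also lengthen probe sequences.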
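The DELETED marker can be sketched as follows, assuming integer keys, linear probing with h'(k) = k mod m, and a sentinel object distinct from both `None` (empty) and every key. All function names are illustrative.

```python
DELETED = object()  # sentinel: slot was occupied once, then deleted


def probe(k, i, m):
    """Assumed probe function: linear probing with h'(k) = k mod m."""
    return (k % m + i) % m


def insert(table, k):
    m = len(table)
    for i in range(m):
        j = probe(k, i, m)
        if table[j] is None or table[j] is DELETED:  # DELETED slots are reusable
            table[j] = k
            return j
    raise OverflowError("hash table overflow")


def search(table, k):
    m = len(table)
    for i in range(m):
        j = probe(k, i, m)
        if table[j] is None:  # only a truly empty slot ends the search
            return None
        if table[j] is not DELETED and table[j] == k:
            return j
    return None


def delete(table, k):
    j = search(table, k)
    if j is not None:
        table[j] = DELETED  # not None, so later searches keep probing past it


table = [None] * 7
insert(table, 10)         # goes to slot 3
insert(table, 17)         # collides at 3, goes to slot 4
delete(table, 10)
print(search(table, 17))  # 4: still found, because slot 3 is DELETED, not empty
```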
Linear Probing
Let
h' : U → {0, 1, ..., m - 1} be a hash function.
Define
h(k, i) = (h'(k) + i) mod m
• Easy to implement.
• Suffers from primary clustering:
– Long runs of occupied slots build up.
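A small sketch that lists the full linear probe sequence of a key, assuming h'(k) = k mod m when no auxiliary hash function is supplied (the function name is illustrative).

```python
def linear_probe_sequence(k, m, h_prime=None):
    """Return [h(k, 0), ..., h(k, m-1)] for h(k, i) = (h'(k) + i) mod m."""
    if h_prime is None:
        h_prime = lambda key: key % m  # assumed auxiliary hash function
    start = h_prime(k)
    return [(start + i) % m for i in range(m)]


print(linear_probe_sequence(10, 7))  # [3, 4, 5, 6, 0, 1, 2]
```

The sequence is fully determined by its first slot, so any key that lands anywhere inside a run of occupied slots walks to the end of that run and extends it.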
Quadratic Probing
h(k, i) = (h'(k) + c1·i + c2·i^2) mod m,
where h' is a hash function,
c1 and c2 ≠ 0 are constants, and i = 0, 1, ..., m - 1
• Works much better than linear probing.
• Suffers from secondary clustering:
– Two distinct keys with the same h' value have the same probe sequence.
Example
• h(k, i) = (h'(k) + i + i^2) mod m
– Probe sequence: {h'(k), h'(k) + 2, h'(k) + 6, ...}
• For m = 2^n, a good choice for the constants is c1 = c2 = 1/2,
– as the values h(k, i) for i ∈ [0, m − 1] are all distinct.
– Probe sequence: {h'(k), h'(k) + 1, h'(k) + 3, h'(k) + 6, ...}
• For prime m > 2, most choices of c1 and c2 will make h(k, i) distinct for i ∈ [0, (m − 1) / 2].
– Such choices include c1 = c2 = 1/2, c1 = c2 = 1, and c1 = 0, c2 = 1.
– It is difficult to guarantee that insertions will succeed when the load factor is > 1/2.
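The c1 = c2 = 1/2 case can be checked numerically. This sketch assumes h'(k) = k mod m and uses illustrative table sizes m = 8 (a power of 2) and m = 7 (a prime).

```python
def quadratic_probe_sequence(k, m, h_prime=None):
    """h(k, i) = (h'(k) + i/2 + i^2/2) mod m = (h'(k) + i(i+1)/2) mod m."""
    if h_prime is None:
        h_prime = lambda key: key % m  # assumed auxiliary hash function
    start = h_prime(k)
    return [(start + i * (i + 1) // 2) % m for i in range(m)]


# m = 8 = 2**3: all m probes are distinct, so every slot is eventually visited.
print(quadratic_probe_sequence(3, 8))  # [3, 4, 6, 1, 5, 2, 0, 7]

# m = 7 (prime): only the first (m + 1) / 2 probes are guaranteed distinct.
print(quadratic_probe_sequence(3, 7))  # [3, 4, 6, 2, 6, 4, 3]
```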
Double hashing
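h(k, i) = (h1(k) + i·h2(k)) mod m
where
h1 and h2 are hash functions
• Advantage:
– Two keys with the same h1 value may have different step sizes.
– Θ(m^2) different probe sequences are used: more than the m of linear or quadratic probing, though still far fewer than the m! of ideal uniform hashing.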
Example
• h1(k) = k mod 13
• h2(k) = 1 + (k mod 11)
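A short sketch of the probe sequences these two functions produce, assuming a table of size m = 13 (suggested by h1, but not stated on the slide) and illustrative keys 14 and 27.

```python
def h1(k):
    return k % 13


def h2(k):
    return 1 + (k % 11)


def double_hash_probe_sequence(k, m=13):
    """Return [h(k, 0), ..., h(k, m-1)] for h(k, i) = (h1(k) + i*h2(k)) mod m."""
    return [(h1(k) + i * h2(k)) % m for i in range(m)]


# Keys 14 and 27 share h1 = 1 but step by 4 and 6 respectively.
print(double_hash_probe_sequence(14)[:3])  # [1, 5, 9]
print(double_hash_probe_sequence(27)[:3])  # [1, 7, 0]
```

Because 13 is prime and h2(k) is always between 1 and 11, the step size is relatively prime to m, so each sequence visits every slot exactly once.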
Analysis of Open-Address Hashing
• Theorem:
– Given an open-address hash table with load factor α < 1,
– the expected number of probes in a successful search is at most (1/α) ln(1/(1 − α)),
• assuming uniform hashing:
– each key is equally likely to have any of the m! permutations of 0, 1, ..., m − 1 as its probe sequence,
• and assuming that each key in the table is equally likely to be searched for.
Analysis of Open-Address Hashing
• Theorem:
– Given an open-address hash table with load factor α < 1,
– the expected number of probes in an unsuccessful search is at most 1/(1 − α),
• assuming uniform hashing.
• If α is a constant, a search runs in O(1) time.
• Theorem:
– The expected number of probes to insert is at most 1/(1 − α).
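To get a feel for these bounds, the sketch below evaluates 1/(1 − α) for unsuccessful search and insertion, together with the standard successful-search bound (1/α) ln(1/(1 − α)), at a few load factors. The function names and the sample values of α are illustrative.

```python
import math


def unsuccessful_bound(alpha):
    """Upper bound on expected probes for an unsuccessful search or an insert."""
    return 1 / (1 - alpha)


def successful_bound(alpha):
    """Upper bound on expected probes for a successful search."""
    return (1 / alpha) * math.log(1 / (1 - alpha))


for alpha in (0.5, 0.75, 0.9):
    print(alpha, round(unsuccessful_bound(alpha), 3), round(successful_bound(alpha), 3))
```

At α = 0.5 the bounds are 2 probes and about 1.39 probes; at α = 0.9 they rise to about 10 and about 2.56.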
A Fast Spell Checker
• Suppose we have a dictionary of 10,000 entries and a hash table T of 1M bits, with hash function h.
• For each word w in the dictionary, compute h(w) and set the entry T[h(w)] to one.
– Note that h may give the same value for several words.
• An entry T[i] for which no word w satisfies h(w) = i is set to zero.
• When we write a word w’ that belongs to the dictionary, we can check whether we spelled it correctly by testing whether T[h(w’)] = 1.
• Clearly, if we spelled the word correctly we get a positive answer.
• What is the probability that we misspelled the word and still got a positive answer (false positive)?
– It is the probability that h maps some dictionary word other than w’ to h(w’).
– Under the uniform-distribution assumption, this probability is at most 10,000/1,000,000 = 1/100.
• Can we improve this?
– Use 10 hash tables of 100,000 bits each, and accept a spelling only if it is accepted by all the tables.
– The false-positive probability in each table is at most 10,000/100,000 = 1/10.
– The false-positive probability across all 10 tables together is at most 1/10^10, assuming the tables use independent hash functions.
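The multi-table scheme can be sketched as below. Several choices here are assumptions made for the illustration: the hash functions are derived from SHA-256 with the table index as a prefix, each "bit" is stored as one byte of a `bytearray` for readability, and the three-word dictionary stands in for the 10,000 entries.

```python
import hashlib


def make_hash(table_index, table_bits):
    """Build one hash function per table by salting SHA-256 with the table index."""
    def h(word):
        digest = hashlib.sha256(f"{table_index}:{word}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % table_bits
    return h


class BitTableSpellChecker:
    def __init__(self, words, num_tables=10, table_bits=100_000):
        self.hashes = [make_hash(i, table_bits) for i in range(num_tables)]
        # One byte per bit, for clarity; a real bit array would be 8x smaller.
        self.tables = [bytearray(table_bits) for _ in range(num_tables)]
        for w in words:
            for h, table in zip(self.hashes, self.tables):
                table[h(w)] = 1  # mark T[h(w)] with one

    def probably_correct(self, word):
        """True if every table accepts the word; false positives are possible."""
        return all(table[h(word)] == 1 for h, table in zip(self.hashes, self.tables))


checker = BitTableSpellChecker(["hash", "table", "probe"])
print(checker.probably_correct("hash"))  # True: dictionary words are always accepted
```

A word that is in the dictionary is never rejected; only the opposite error (accepting a misspelling) can occur, with the small probability estimated above.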