Bloom Filters and their Applications - Texas A&M...
Transcript of Bloom Filters and their Applications - Texas A&M...
1
Bloom Filters and their Applications
These slides were developed by -- and used with permission
from -- Shengquan Wang.
CPSC 662
Introduction
• Membership Query
Given a set S={x1, x2, …, xn} on a universe U, want to answer the query of
the form:
Is y!S ?
– Spell check
• Data structure
– Space
– Search time
• Hashing is one of the good candidates (randomized)
xi can be a long string
n can be a very large number
2
Hash Function
• It converts an input from a (typically) large domain into an output in a
(typically) smaller range
0
11
22
33
44
5
6
77
XXXXXXXXXXX
XXXXXXXXXXX
XXXXXXXXXXX
XXXXXXXXXXX
XXXXXXXXXXX
collision
H(x)
y ! H(y) ?
false positive
Examples of Simple Hash Functions
• Truncation: If students have an 9-digit identification number, take the last 3
digits as the table position
– e.g. 925371622 becomes 622
• Folding: Split a 9-digit number into three 3-digit numbers, and add them
– e.g. 925371622 becomes 925 + 376 + 622 = 1923
• Modular arithmetic: If the table size is 1000, the first example always
keeps within the table range, but the second example does not (it should be
mod 1000)
– e.g. 1923 mod 1000 = 923 (1923 % 1000)
3
Hashing Performance
• Hash each element of the set to b number of bits,
with b = 2 log2 n
– The probability that two elements collide is 1/n2.
– False positive probability = 1/n
(Asymptotically vanishing probability of error)
– Binary search time = O(log2 n)
– Space = !(n log2 n)
Bloom Filters
• Generalized randomized data structure
• Invented by Burton Bloom in 1970
• Basic idea: Use m-bit array to represent a set with n elements with k hashing
functions
• Bloom filter provides a answer in
– “Constant” search time (time to hash).
– Small amount of space.
– But with some probability of being wrong
B. Bloom, “Space/time tradeoffs in hash coding with allowable errors,”CACM 13 (1970).
4
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B
0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B
0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B
0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B
Example
• Start with an m bit array, filled with 0s
• Hash each item xj ! S into [1,…,m], k number of times.
If Hi(xj) = a ! [1,…,m], then set B[a] = 1
• To check if y ! S, check if all Hi(y) are ones
• False positive: All Hi(y) are ones, but y not in S
Example
1 2 3 4 5 6 7 8 9 10 11 12 =m
X1X2
h1 h2 h3
Y1Y3 False PositiveY2
x1 -> {2, 5, 9} x2 -> {5, 7, 11}
5
• Notation:
– n = number of elements in the set to be represented
– m = size of the bloom filter
– k = number of hash functions
• Probability that a bit is still zero after all elements are hashed into the Bloom
filter
• Probability of a false positive
Probabilities
1 0 1 0 0 1 1 1 0 1 1 0
Determining the value of k
• Goal: Optimize k that minimizes false positive rate
• Optimal result: k = (ln 2)m/n ! f = (0.6185)m/n
– m = number of bits in bloom filter
– n = number of elements in the set
6
Example
0
0.01
0.02
0.03
0.04
0.05
0.06
0.07
0.08
0.09
0.1
0 1 2 3 4 5 6 7 8 9 10
Hash functions
Fa
lse p
osi
tiv
e r
ate
m/n = 8
Opt k = 8 ln 2 = 5.45...
Tradeoffs
• Three parameters.
– Size m/n : bits per item.
– Time k : number of hash functions.
– Error f : false positive probability.
False positive probability decreases
exponentially with linear increase in the
number of hash functions & space
7
Comparison
n * (m/n)!(n log2 n)spacespace
(1-e–k n/m)k (! 0.02)1/nfalse false postivepostive rate (f) rate (f)
O(k)O(log2 n)Lookup timeLookup time
tradeoff between m/n
and f
k = 1
m/n (m/n = 8)2 log2 nbit per elementbit per element
Bloom filtersHashing
Application: Distributed Caching
• Send Bloom filters of URLs
• False positives do not hurt much
– Get errors from cache
changes anyway
L. Fan, P. Cao, J. Almeida and A.Z. Broder “Summary Cache: A scalable wide-areaWeb cache sharing protocol” IEEE/ACM Transactions on Networking 2000
Web Cache 1 Web Cache 2 Web Cache 3
Web Cache 6Web Cache 5Web Cache 4
8
Example
http://www.perl.com/pub/a/2004/04/08/bloom_filters.html
http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html
http://www.flipcode.com/articles/article_bloomfilters.shtml
http://loaf.cantbedone.org/about.htm
http://www.cap-lore.com/code/BloomTheory.html
http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf
http://lemonodor.com/archives/000881.html
http://citeseer.ist.psu.edu/mitzenmacher01compressed.html
J. Byers, J. Considine, M. Mitzenmacher, S. Rost, “Informed ContentDelivery Across Adaptive Overlay Networks” SIGCOMM 2002
Application: Set Reconciliation for Content Delivery
• Suppose two hosts A and B have SA and SB
• A wants to know SA-SB so that it can send those documents to B, that B does
not have
• B sends Bloom filter corresponding to SB
• A sends its documents which are not in that bloom filter
• False positives: approximate
9
• Let HA, HB be hosts responsible for keywords A and B respectively
• Suppose we want documents having both keywords A and B ! FIND
SA∩SB
• Steps:
– HA sends Bloom filter corresponding to SA to HB
– HB computes approximate SA∩SB and sends back to HA
• False positives : HA can find out, so no problem
P. Reynolds and A. Vahdat, “Efficient Peer-to-peer keyword searching”
Application: Set Intersection for Keyword Search
Application: Moderate-sized P2P networks
• Distributed hash tables for scalability
• For moderate sized P2P network – per-node Bloom filter
– Use 8 or 16 bits per object instead of 64 bit identifiers
– False positives : Not much problem
F. M. Cuena-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen,“PlanetP: Using gossiping to build content addressablepeer-to-peer information sharing communities.”
10
Application: Resource Routing
A
CB D E
GF H J KI
L M N
Sb , Sf , Sg , Sh
• Network has tree topology.
• B has bloom filters for all children
sub -trees collectively and also for
each child sub-tree individually.
S. Rhea and J. Kubiatowicz, “Probabilistic Location and Routing”INFOCOMM 2002
B. Gronvall “Scalable Multicast Forwarding” SIGCOMM 2002
Application: Multicast
• Typically routers maintain a list of interfaces for each multicast address
• An Efficient Solution: Keep list of addresses for each interface and useBloom filter to represent these addresses
– Parallelizable
• False Positives: Not bad, just wastes some resources
11
Application: Detecting Routing Loops
• Current mechanism: TTL
• Each packet contain a small Bloom filter to track the nodes visited
– If filter does not change at a node, then a possible loop !!
• False positives: Problematic
A. Whitaker and D. Wetherall “Forwarding without Loops in Icarus”OPENARCH 2002
Application: IP Traceback
• Use Bloom filters to record the packets seen by each router
• False positives:
– Router mistakenly identifies packet as having been seen
– Multiple possible paths
A.C. Snoeren, C. Partridge, L.A. Sanchez, C.E. Jones, F. Tchakountio,S.T.Kent and W.T. Strayer “Hash-based IP traceback” SIGCOMM 2001
12
Summary
• The Bloom Filter Principle:
Wherever a list or set is used, and space is a consideration, a Bloom filter
should be considered. When using a Bloom filter, consider the potential
effects of false positives.
• Space/time tradeoffs in hash coding with allowable errors. B. Bloom.CACM 13 (1970).
• Network Applications of Bloom Filters: A Survey. A. Broder and M.Mitzenmacher. Allerton Conference 2002.
• Compressed Bloom Filters. M. Mitzenmacher. PODC 2001.• Spectral Bloom Filters. S. Cohen and Y. Matias. SIGMOD 2003.• The Bloomier Filter: An Efficient Data Structure for Static Support
Lookup Tables. B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. SODA 2004
References