Bloom Filters and their Applications - Texas A&M...

1

Bloom Filters and their Applications

These slides were developed by -- and used with permission

from -- Shengquan Wang.

CPSC 662

Introduction

• Membership Query

Given a set S={x1, x2, …, xn} on a universe U, want to answer the query of

the form:

Is y!S ?

– Spell check

• Data structure

– Space

– Search time

• Hashing is one of the good candidates (randomized)

xi can be a long string

n can be a very large number

2

Hash Function

• It converts an input from a (typically) large domain into an output in a

(typically) smaller range

0

11

22

33

44

5

6

77

XXXXXXXXXXX

XXXXXXXXXXX

XXXXXXXXXXX

XXXXXXXXXXX

XXXXXXXXXXX

collision

H(x)

y ! H(y) ?

false positive

Examples of Simple Hash Functions

• Truncation: If students have an 9-digit identification number, take the last 3

digits as the table position

– e.g. 925371622 becomes 622

• Folding: Split a 9-digit number into three 3-digit numbers, and add them

– e.g. 925371622 becomes 925 + 376 + 622 = 1923

• Modular arithmetic: If the table size is 1000, the first example always

keeps within the table range, but the second example does not (it should be

mod 1000)

– e.g. 1923 mod 1000 = 923 (1923 % 1000)

3

Hashing Performance

• Hash each element of the set to b number of bits,

with b = 2 log2 n

– The probability that two elements collide is 1/n2.

– False positive probability = 1/n

(Asymptotically vanishing probability of error)

– Binary search time = O(log2 n)

– Space = !(n log2 n)

Bloom Filters

• Generalized randomized data structure

• Invented by Burton Bloom in 1970

• Basic idea: Use m-bit array to represent a set with n elements with k hashing

functions

• Bloom filter provides a answer in

– “Constant” search time (time to hash).

– Small amount of space.

– But with some probability of being wrong

B. Bloom, “Space/time tradeoffs in hash coding with allowable errors,”CACM 13 (1970).

4

0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0B

0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B

0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B

0 1 0 0 1 0 1 0 0 1 1 1 0 1 1 0B

Example

• Start with an m bit array, filled with 0s

• Hash each item xj ! S into [1,…,m], k number of times.

If Hi(xj) = a ! [1,…,m], then set B[a] = 1

• To check if y ! S, check if all Hi(y) are ones

• False positive: All Hi(y) are ones, but y not in S

Example

1 2 3 4 5 6 7 8 9 10 11 12 =m

X1X2

h1 h2 h3

Y1Y3 False PositiveY2

x1 -> {2, 5, 9} x2 -> {5, 7, 11}

5

• Notation:

– n = number of elements in the set to be represented

– m = size of the bloom filter

– k = number of hash functions

• Probability that a bit is still zero after all elements are hashed into the Bloom

filter

• Probability of a false positive

Probabilities

1 0 1 0 0 1 1 1 0 1 1 0

Determining the value of k

• Goal: Optimize k that minimizes false positive rate

• Optimal result: k = (ln 2)m/n ! f = (0.6185)m/n

– m = number of bits in bloom filter

– n = number of elements in the set

6

Example

0

0.01

0.02

0.03

0.04

0.05

0.06

0.07

0.08

0.09

0.1

0 1 2 3 4 5 6 7 8 9 10

Hash functions

Fa

lse p

osi

tiv

e r

ate

m/n = 8

Opt k = 8 ln 2 = 5.45...

Tradeoffs

• Three parameters.

– Size m/n : bits per item.

– Time k : number of hash functions.

– Error f : false positive probability.

False positive probability decreases

exponentially with linear increase in the

number of hash functions & space

7

Comparison

n * (m/n)!(n log2 n)spacespace

(1-e–k n/m)k (! 0.02)1/nfalse false postivepostive rate (f) rate (f)

O(k)O(log2 n)Lookup timeLookup time

tradeoff between m/n

and f

k = 1

m/n (m/n = 8)2 log2 nbit per elementbit per element

Bloom filtersHashing

Application: Distributed Caching

• Send Bloom filters of URLs

• False positives do not hurt much

– Get errors from cache

changes anyway

L. Fan, P. Cao, J. Almeida and A.Z. Broder “Summary Cache: A scalable wide-areaWeb cache sharing protocol” IEEE/ACM Transactions on Networking 2000

Web Cache 1 Web Cache 2 Web Cache 3

Web Cache 6Web Cache 5Web Cache 4

8

Example

http://www.perl.com/pub/a/2004/04/08/bloom_filters.html

http://www.cs.wisc.edu/~cao/papers/summary-cache/node8.html

http://www.flipcode.com/articles/article_bloomfilters.shtml

http://loaf.cantbedone.org/about.htm

http://www.cap-lore.com/code/BloomTheory.html

http://www.eecs.harvard.edu/~michaelm/NEWWORK/postscripts/cbf2.pdf

http://lemonodor.com/archives/000881.html

http://citeseer.ist.psu.edu/mitzenmacher01compressed.html

J. Byers, J. Considine, M. Mitzenmacher, S. Rost, “Informed ContentDelivery Across Adaptive Overlay Networks” SIGCOMM 2002

Application: Set Reconciliation for Content Delivery

• Suppose two hosts A and B have SA and SB

• A wants to know SA-SB so that it can send those documents to B, that B does

not have

• B sends Bloom filter corresponding to SB

• A sends its documents which are not in that bloom filter

• False positives: approximate

9

• Let HA, HB be hosts responsible for keywords A and B respectively

• Suppose we want documents having both keywords A and B ! FIND

SA∩SB

• Steps:

– HA sends Bloom filter corresponding to SA to HB

– HB computes approximate SA∩SB and sends back to HA

• False positives : HA can find out, so no problem

P. Reynolds and A. Vahdat, “Efficient Peer-to-peer keyword searching”

Application: Set Intersection for Keyword Search

Application: Moderate-sized P2P networks

• Distributed hash tables for scalability

• For moderate sized P2P network – per-node Bloom filter

– Use 8 or 16 bits per object instead of 64 bit identifiers

– False positives : Not much problem

F. M. Cuena-Acuna, C. Peery, R. P. Martin, and T. D. Nguyen,“PlanetP: Using gossiping to build content addressablepeer-to-peer information sharing communities.”

10

Application: Resource Routing

A

CB D E

GF H J KI

L M N

Sb , Sf , Sg , Sh

• Network has tree topology.

• B has bloom filters for all children

sub -trees collectively and also for

each child sub-tree individually.

S. Rhea and J. Kubiatowicz, “Probabilistic Location and Routing”INFOCOMM 2002

B. Gronvall “Scalable Multicast Forwarding” SIGCOMM 2002

Application: Multicast

• Typically routers maintain a list of interfaces for each multicast address

• An Efficient Solution: Keep list of addresses for each interface and useBloom filter to represent these addresses

– Parallelizable

• False Positives: Not bad, just wastes some resources

11

Application: Detecting Routing Loops

• Current mechanism: TTL

• Each packet contain a small Bloom filter to track the nodes visited

– If filter does not change at a node, then a possible loop !!

• False positives: Problematic

A. Whitaker and D. Wetherall “Forwarding without Loops in Icarus”OPENARCH 2002

Application: IP Traceback

• Use Bloom filters to record the packets seen by each router

• False positives:

– Router mistakenly identifies packet as having been seen

– Multiple possible paths

A.C. Snoeren, C. Partridge, L.A. Sanchez, C.E. Jones, F. Tchakountio,S.T.Kent and W.T. Strayer “Hash-based IP traceback” SIGCOMM 2001

12

Summary

• The Bloom Filter Principle:

Wherever a list or set is used, and space is a consideration, a Bloom filter

should be considered. When using a Bloom filter, consider the potential

effects of false positives.

• Space/time tradeoffs in hash coding with allowable errors. B. Bloom.CACM 13 (1970).

• Network Applications of Bloom Filters: A Survey. A. Broder and M.Mitzenmacher. Allerton Conference 2002.

• Compressed Bloom Filters. M. Mitzenmacher. PODC 2001.• Spectral Bloom Filters. S. Cohen and Y. Matias. SIGMOD 2003.• The Bloomier Filter: An Efficient Data Structure for Static Support

Lookup Tables. B. Chazelle, J. Kilian, R. Rubinfeld, and A. Tal. SODA 2004

References

Bloom Filters and their Applications - Texas A&M...

Documents

Transcript of Bloom Filters and their Applications - Texas A&M...