Dictionary problem -...

Prof. Paolo Ferragina, Algoritmi per "Information Retrieval"

1

Data Sketching

The Bloom Filter

(membership with controlled-error)

Dictionary problem

What data structures do you know for storing a set of keys and supporting exact searches and insert operations over them?

Hashing

What about false positives?
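
As a point of reference for the question above, here is a minimal sketch of a Bloom filter: a bit array that answers membership queries with no false negatives and a tunable false-positive rate. The slides' own construction is not in this transcript, so the sizing formulas (the standard m = -n ln p / (ln 2)^2 bits and k = (m/n) ln 2 hash functions), the double-hashing trick, and all names below are assumptions chosen for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Set-membership sketch: no false negatives, tunable false-positive rate."""

    def __init__(self, expected_items: int, fp_rate: float = 0.01):
        # Assumed standard sizing: m = -n ln(p) / (ln 2)^2 bits, k = (m/n) ln 2 hashes.
        self.m = max(1, math.ceil(-expected_items * math.log(fp_rate) / math.log(2) ** 2))
        self.k = max(1, round(self.m / expected_items * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)

    def _positions(self, key: str):
        # Derive k bit positions from one digest via double hashing (h1 + i*h2).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        # True may be a false positive; False is always correct.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


bf = BloomFilter(expected_items=1000, fp_rate=0.01)
bf.add("ferragina")
print("ferragina" in bf)
```

A key that was inserted is always reported as present; a key that was never inserted may occasionally be reported as present too, which is the controlled error the slide title refers to.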

Not perfectly true but...

Upper bounds

Recurring minimum for improving the estimate

+ 2 SBF
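
The figures for this slide did not survive in the transcript, so the following is only a sketch under stated assumptions: that "SBF" stands for Spectral Bloom Filter (a Bloom filter with counters instead of bits, where each of a key's counters is an upper bound on its frequency and the minimum is the estimate), and that "2 SBF" refers to a secondary SBF holding keys whose minimum counter is not repeated. The class names, the salts, and the choice of a half-size secondary filter are hypothetical.

```python
import hashlib


class SpectralBloomFilter:
    """Counting sketch: each key maps to k counters, each an upper bound on its count."""

    def __init__(self, m: int, k: int, salt: str = ""):
        self.m, self.k, self.salt = m, k, salt
        self.counters = [0] * m

    def positions(self, key: str):
        digest = hashlib.sha256((self.salt + key).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        # Deduplicate so a key never bumps the same counter twice per insert.
        return sorted({(h1 + i * h2) % self.m for i in range(self.k)})

    def values(self, key: str):
        return [self.counters[p] for p in self.positions(key)]

    def add(self, key: str, amount: int = 1) -> None:
        for p in self.positions(key):
            self.counters[p] += amount


class RecurringMinimumSBF:
    """Primary SBF plus a secondary SBF for keys with a single (non-recurring) minimum."""

    def __init__(self, m: int, k: int):
        self.primary = SpectralBloomFilter(m, k, salt="primary:")
        # Assumption: the secondary filter is smaller; half size is an arbitrary choice.
        self.secondary = SpectralBloomFilter(max(1, m // 2), k, salt="secondary:")

    @staticmethod
    def _has_recurring_min(values) -> bool:
        return values.count(min(values)) >= 2

    def add(self, key: str) -> None:
        self.primary.add(key)
        values = self.primary.values(key)
        if self._has_recurring_min(values):
            return
        if min(self.secondary.values(key)) > 0:
            self.secondary.add(key)               # already tracked: count this insert
        else:
            self.secondary.add(key, min(values))  # start from the primary's minimum

    def estimate(self, key: str) -> int:
        values = self.primary.values(key)
        if self._has_recurring_min(values):
            return min(values)
        in_secondary = min(self.secondary.values(key))
        return in_secondary if in_secondary > 0 else min(values)


sbf = RecurringMinimumSBF(m=1024, k=4)
for _ in range(3):
    sbf.add("bloom")
print(sbf.estimate("bloom"))
```

The intuition behind the heuristic is that a minimum shared by two or more counters is less likely to be inflated by collisions than a minimum that appears only once, so only the suspicious keys are handed to the second filter.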

Web Algorithmics

File Synchronization

File synch: The problem

• client wants to update an out-dated file

• server has new file but does not know the old file

• update without sending entire f_new (using similarity)

• rsync: file synch tool, distributed with Linux

[Figure: the client, holding f_old, sends a request to the server, holding f_new; the server replies with an update.]

The rsync algorithm

[Figure: the client, holding f_old, sends hashes to the server, holding f_new; the server replies with the encoded file.]
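
The round trip in the figure can be sketched as three steps: the client hashes the blocks of f_old, the server encodes f_new as references to matching blocks plus literal bytes, and the client rebuilds f_new from that encoding. This is a simplified illustration, not rsync's actual wire format: the function names and the op tuples are hypothetical, full (untruncated) hashes are used, and the weak hash is recomputed at every offset for clarity instead of being rolled.

```python
import hashlib
import zlib


def signatures(old: bytes, block: int):
    """Client side: map weak hash -> {strong hash: block index} for the blocks of f_old."""
    table = {}
    for index, start in enumerate(range(0, len(old), block)):
        chunk = old[start:start + block]
        table.setdefault(zlib.adler32(chunk), {})[hashlib.md5(chunk).digest()] = index
    return table


def delta(new: bytes, table, block: int):
    """Server side: encode f_new as ('copy', block index) and ('literal', bytes) ops."""
    ops, literal, i = [], bytearray(), 0
    while i < len(new):
        chunk = new[i:i + block]
        candidates = table.get(zlib.adler32(chunk))          # cheap filter first
        index = candidates.get(hashlib.md5(chunk).digest()) if candidates else None
        if index is not None:
            if literal:                                      # flush pending literal bytes
                ops.append(("literal", bytes(literal)))
                literal.clear()
            ops.append(("copy", index))
            i += len(chunk)
        else:
            literal.append(new[i])                           # no match: slide by one byte
            i += 1
    if literal:
        ops.append(("literal", bytes(literal)))
    return ops


def patch(old: bytes, ops, block: int) -> bytes:
    """Client side: rebuild f_new from its own blocks and the received literals."""
    out = bytearray()
    for kind, arg in ops:
        out += old[arg * block:(arg + 1) * block] if kind == "copy" else arg
    return bytes(out)


f_old = b"the quick brown fox jumps over the lazy dog. " * 20
f_new = f_old[:100] + b"INSERTED" + f_old[100:]
encoded = delta(f_new, signatures(f_old, 16), 16)
print(patch(f_old, encoded, 16) == f_new)
```

Only the hashes travel from client to server and only the encoded file travels back, so the amount of data sent shrinks with the similarity between f_old and f_new.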

The rsync algorithm (contd)

• simple, widely used, single roundtrip

• optimizations: 4-byte rolling hash + 2-byte MD5, gzip for literals

• choice of block size problematic (default: max{700, √n} bytes)

• not good in theory: granularity of changes may disrupt use of blocks
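
To make the rolling hash and the block-size default concrete, here is a small sketch. It assumes an rsync-style weak checksum made of two 16-bit sums (4 bytes in total) that can be updated in constant time when the window slides by one byte; the function names are hypothetical, and rounding √n down is an assumption, since the slide only states max{700, √n}.

```python
import math

MOD = 1 << 16


def default_block_size(n: int) -> int:
    """Block size from the slide: max{700, sqrt(n)} bytes for an n-byte file."""
    return max(700, math.isqrt(n))  # assumption: sqrt(n) is rounded down


def weak_checksum(window: bytes):
    """Return (a, b): a = byte sum, b = position-weighted byte sum, both mod 2^16."""
    a = sum(window) % MOD
    b = sum((len(window) - i) * byte for i, byte in enumerate(window)) % MOD
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, size: int):
    """Slide the window one byte to the right in O(1) instead of rehashing it."""
    a = (a - out_byte + in_byte) % MOD
    b = (b - size * out_byte + a) % MOD
    return a, b


def pack(a: int, b: int) -> int:
    """Combine the two 16-bit halves into a single 4-byte value."""
    return a | (b << 16)


data = bytes(range(256)) * 4
size = 32
a, b = weak_checksum(data[:size])
a, b = roll(a, b, data[0], data[size], size)
print((a, b) == weak_checksum(data[1:size + 1]))
```

The constant-time update is what lets the server test every byte offset of f_new against the client's block hashes, and the fixed block size is why a small insertion or deletion that shifts block boundaries can reduce the number of matches.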

Gzip