What’s the Difference? Efficient Set Reconciliation without Prior Context

Frank Uyeda, University of California, San Diego; David Eppstein, Michael T. Goodrich & George Varghese


Page 1: What’s the Difference? Efficient Set Reconciliation without Prior Context

What’s the Difference? Efficient Set Reconciliation without Prior Context

Frank Uyeda, University of California, San Diego

David Eppstein, Michael T. Goodrich & George Varghese

Page 2: What’s the Difference? Efficient Set Reconciliation without Prior Context

Motivation

• Distributed applications often need to compare remote state.

[Figure: replicas R1 and R2; the partition heals.]

Must solve the Set-Difference Problem!

Page 3: What’s the Difference? Efficient Set Reconciliation without Prior Context

What is the Set-Difference problem?

• What objects are unique to host 1?
• What objects are unique to host 2?

[Figure: Host 1 holds objects {A, B, E, F}; Host 2 holds objects {A, C, D, F}.]

Page 4: What’s the Difference? Efficient Set Reconciliation without Prior Context

Example 1: Data Synchronization

• Identify missing data blocks
• Transfer blocks to synchronize sets

[Figure: Host 1 holds {A, B, E, F}; Host 2 holds {A, C, D, F}. Blocks C and D go to Host 1; blocks B and E go to Host 2.]

Page 5: What’s the Difference? Efficient Set Reconciliation without Prior Context

Example 2: Data De-duplication

• Identify all unique blocks.
• Replace duplicate data with pointers

[Figure: Host 1 holds {A, B, E, F}; Host 2 holds {A, C, D, F}.]

Page 6: What’s the Difference? Efficient Set Reconciliation without Prior Context

Set-Difference Solutions

• Trade a sorted list of objects.
  – O(n) communication, O(n log n) computation
• Approximate Solutions:
  – Approximate Reconciliation Tree (Byers)
    • O(n) communication, O(n log n) computation
• Polynomial Encodings (Minsky & Trachtenberg)
  – Let “d” be the size of the difference
  – O(d) communication, O(dn + d^3) computation
• Invertible Bloom Filter
  – O(d) communication, O(n+d) computation
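As a point of reference for the first bullet, here is a minimal sketch of the sorted-list baseline: one host receives the other's full list and walks both in order. The function name `sorted_list_diff` is illustrative and not from the talk, and the sketch assumes each input holds distinct, comparable IDs.

```python
def sorted_list_diff(a, b):
    """Return (only_in_a, only_in_b) by merging two sorted lists of distinct IDs."""
    a, b = sorted(a), sorted(b)          # O(n log n) computation
    i = j = 0
    only_a, only_b = [], []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:                 # common element: skip on both sides
            i += 1
            j += 1
        elif a[i] < b[j]:                # a[i] cannot appear later in b
            only_a.append(a[i])
            i += 1
        else:                            # b[j] cannot appear later in a
            only_b.append(b[j])
            j += 1
    only_a.extend(a[i:])                 # leftovers are unique to their side
    only_b.extend(b[j:])
    return only_a, only_b
```

The whole list crosses the network, which is the O(n) communication cost that the later schemes try to reduce to O(d).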

Page 7: What’s the Difference? Efficient Set Reconciliation without Prior Context

Difference Digests

• Efficiently solves the set-difference problem.
• Consists of two data structures:
  – Invertible Bloom Filter (IBF)
    • Efficiently computes the set difference.
    • Needs the size of the difference
  – Strata Estimator
    • Approximates the size of the set difference.
    • Uses IBFs as a building block.

Page 8: What’s the Difference? Efficient Set Reconciliation without Prior Context

Invertible Bloom Filters (IBF)

• Encode local object identifiers into an IBF.

[Figure: Host 1 encodes {A, B, E, F} into IBF 1; Host 2 encodes {A, C, D, F} into IBF 2.]

Page 9: What’s the Difference? Efficient Set Reconciliation without Prior Context

IBF Data Structure

• Array of IBF cells
  – For a set difference of size d, require αd cells (α > 1)
• Each ID is assigned to many IBF cells
• Each IBF cell contains:
  – idSum: XOR of all IDs in the cell
  – hashSum: XOR of hash(ID) for all IDs in the cell
  – count: number of IDs assigned to the cell
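The cell layout above could be written down as follows. This is only a sketch: the names `Cell` and `new_ibf` and the default `alpha = 2.0` are assumptions for illustration, and IDs are assumed to be integers.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class Cell:
    id_sum: int = 0     # XOR of all IDs assigned to this cell
    hash_sum: int = 0   # XOR of hash(ID) for all IDs assigned to this cell
    count: int = 0      # number of IDs assigned to this cell


def new_ibf(d: int, alpha: float = 2.0) -> List[Cell]:
    """Allocate ceil(alpha * d) empty cells for an expected difference of size d."""
    if alpha <= 1:
        raise ValueError("alpha must be greater than 1")
    return [Cell() for _ in range(math.ceil(alpha * d))]
```

Note that the table size depends only on the expected difference d, not on the size of the set being encoded.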

Page 10: What’s the Difference? Efficient Set Reconciliation without Prior Context

IBF Encode

[Figure: ID A is hashed by Hash1, Hash2, and Hash3 into three cells of the IBF; IDs B and C follow.]

• Assign ID to many cells
• “Add” ID to cell: idSum ⊕ A, hashSum ⊕ H(A), count++
• IBF size is αd cells, not O(n) as in Bloom Filters!
• All hosts use the same hash functions
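A minimal sketch of the encode step, under several assumptions that are not from the talk: IDs are 64-bit non-negative integers, the hash functions are salted SHA-256 truncated to 64 bits, `K = 3`, and the table is split into `K` sub-tables so that one ID never lands in the same cell twice. `n_cells` should be a multiple of `K` and at least `K`.

```python
import hashlib

K = 3  # number of hash functions; an assumed value


def h(x: int, salt: int) -> int:
    """64-bit hash of a 64-bit ID; `salt` selects one of several hash functions."""
    data = bytes([salt]) + x.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def cell_indices(x: int, n_cells: int) -> list:
    """One cell per hash function, each in its own sub-table so they never coincide."""
    sub = n_cells // K
    return [i * sub + h(x, i) % sub for i in range(K)]


def encode(ids, n_cells: int):
    """Build an IBF as three parallel arrays: idSum, hashSum, count."""
    id_sum, hash_sum, count = [0] * n_cells, [0] * n_cells, [0] * n_cells
    for x in ids:
        for c in cell_indices(x, n_cells):
            id_sum[c] ^= x
            hash_sum[c] ^= h(x, K)  # salt K is reserved for the checksum hash H
            count[c] += 1
    return id_sum, hash_sum, count
```

Because every host uses the same `h` and the same `n_cells`, an ID that both hosts hold lands in the same cells on both sides.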

Page 11: What’s the Difference? Efficient Set Reconciliation without Prior Context

Invertible Bloom Filters (IBF)

• Trade IBFs with remote host

[Figure: Host 1 and Host 2 exchange IBF 1 and IBF 2.]

Page 12: What’s the Difference? Efficient Set Reconciliation without Prior Context

Invertible Bloom Filters (IBF)

• “Subtract” IBF structures
  – Produces a new IBF containing only unique objects

[Figure: IBF (2 - 1) is computed from IBF 2 and IBF 1.]

Page 13: What’s the Difference? Efficient Set Reconciliation without Prior Context


IBF Subtract

Page 14: What’s the Difference? Efficient Set Reconciliation without Prior Context

Timeout for Intuition

• After subtraction, all elements common to both sets have disappeared. Why?
  – Any common element (e.g., W) is assigned to the same cells on both hosts (assume same hash functions on both sides)
  – On subtraction, W XOR W = 0. Thus, W vanishes.
• While elements in the set difference remain, they may be randomly mixed, so we need a decode procedure.
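The cancellation argument can be checked with a small sketch. The representation is an assumption for illustration: each cell is a list `[idSum, hashSum, count]`, IDs are 64-bit integers, hashes are salted SHA-256, and `K = 3` sub-tables are used. Subtraction XORs the two sums and subtracts the counts, cell by cell.

```python
import hashlib

K = 3  # number of hash functions; an assumed value


def h(x: int, salt: int) -> int:
    """64-bit salted hash of a 64-bit ID."""
    data = bytes([salt]) + x.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def encode(ids, n_cells: int):
    """Each cell is [idSum, hashSum, count]; n_cells should be a multiple of K."""
    cells = [[0, 0, 0] for _ in range(n_cells)]
    sub = n_cells // K
    for x in ids:
        for i in range(K):
            cell = cells[i * sub + h(x, i) % sub]
            cell[0] ^= x
            cell[1] ^= h(x, K)
            cell[2] += 1
    return cells


def subtract(a, b):
    """Cell-wise a - b: XOR the sums, subtract the counts."""
    return [[ca[0] ^ cb[0], ca[1] ^ cb[1], ca[2] - cb[2]] for ca, cb in zip(a, b)]
```

For example, `subtract(encode([W, X], 12), encode([W], 12))` equals `encode([X], 12)`: W cancels in every cell it touched, and only X remains.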

Page 15: What’s the Difference? Efficient Set Reconciliation without Prior Context

Invertible Bloom Filters (IBF)

• Decode resulting IBF
  – Recover object identifiers from IBF structure.

[Figure: decoding IBF (2 - 1) recovers B and E as unique to Host 1, and C and D as unique to Host 2.]

Page 16: What’s the Difference? Efficient Set Reconciliation without Prior Context

IBF Decode

Test for Purity: H( idSum ) = hashSum

• Pure cell (holds only V): H(V) = H(V)
• Impure cell (holds V, X, Z): H(V ⊕ X ⊕ Z) ≠ H(V) ⊕ H(X) ⊕ H(Z)
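A sketch of the purity test, assuming 64-bit integer IDs and SHA-256 truncated to 64 bits as the hash H (both assumptions, not from the talk). The count alone is not enough after subtraction: a cell holding V and X from one host and Z from the other has count 2 - 1 = 1 but is not pure, and the hash check is what catches it.

```python
import hashlib


def H(x: int) -> int:
    """Checksum hash of a 64-bit ID (assumed: SHA-256 truncated to 64 bits)."""
    return int.from_bytes(hashlib.sha256(x.to_bytes(8, "big")).digest()[:8], "big")


def is_pure(id_sum: int, hash_sum: int, count: int) -> bool:
    """A cell is pure when it holds exactly one ID from one side."""
    return count in (1, -1) and H(id_sum) == hash_sum


V, X, Z = 0x56, 0x58, 0x5A

# Cell holding only V: H(idSum) matches hashSum.
single = (V, H(V), 1)

# Cell holding V and X from one host and Z from the other: count is 1,
# but H(V ^ X ^ Z) differs from H(V) ^ H(X) ^ H(Z) except for a hash collision.
mixed = (V ^ X ^ Z, H(V) ^ H(X) ^ H(Z), 1)
```

With a 64-bit hash, a mixed cell passes the test only with negligible probability.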

Page 17: What’s the Difference? Efficient Set Reconciliation without Prior Context


IBF Decode

Page 18: What’s the Difference? Efficient Set Reconciliation without Prior Context


IBF Decode

Page 19: What’s the Difference? Efficient Set Reconciliation without Prior Context


IBF Decode
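The decode slides are image-only in this transcript, so the following is a sketch of the usual peeling procedure rather than a reproduction of them: find a pure cell, record its ID, remove that ID from every cell it was assigned to, and repeat. The cell layout `[idSum, hashSum, count]`, the salted SHA-256 hashes, `K = 3`, and all function names are assumptions for illustration.

```python
import hashlib

K = 3  # number of hash functions; an assumed value


def h(x: int, salt: int) -> int:
    data = bytes([salt]) + x.to_bytes(8, "big")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def indices(x: int, n_cells: int):
    sub = n_cells // K                      # one sub-table per hash function
    return [i * sub + h(x, i) % sub for i in range(K)]


def encode(ids, n_cells: int):
    cells = [[0, 0, 0] for _ in range(n_cells)]   # [idSum, hashSum, count]
    for x in ids:
        for c in indices(x, n_cells):
            cells[c][0] ^= x
            cells[c][1] ^= h(x, K)
            cells[c][2] += 1
    return cells


def subtract(a, b):
    return [[ca[0] ^ cb[0], ca[1] ^ cb[1], ca[2] - cb[2]] for ca, cb in zip(a, b)]


def is_pure(cell) -> bool:
    return cell[2] in (1, -1) and h(cell[0], K) == cell[1]


def decode(diff):
    """Peel pure cells from (A - B). Returns (ok, only_in_a, only_in_b)."""
    cells = [c[:] for c in diff]            # work on a copy
    only_a, only_b = set(), set()
    queue = [i for i, c in enumerate(cells) if is_pure(c)]
    while queue:
        i = queue.pop()
        if not is_pure(cells[i]):           # cell changed since it was queued
            continue
        x, sign = cells[i][0], cells[i][2]
        (only_a if sign == 1 else only_b).add(x)
        for j in indices(x, len(cells)):    # remove x everywhere it was added
            cells[j][0] ^= x
            cells[j][1] ^= h(x, K)
            cells[j][2] -= sign
            if is_pure(cells[j]):
                queue.append(j)
    ok = all(c == [0, 0, 0] for c in cells) # leftovers mean decoding failed
    return ok, only_a, only_b


if __name__ == "__main__":
    host1 = [ord(c) for c in "ABEF"]
    host2 = [ord(c) for c in "ACDF"]
    print(decode(subtract(encode(host1, 30), encode(host2, 30))))
```

When `ok` is false the table was too small for the actual difference, and the hosts would need to retry with more cells.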

Page 20: What’s the Difference? Efficient Set Reconciliation without Prior Context

How many IBF cells?

[Figure: space overhead vs. set difference, for hash counts 3 and 4; overhead needed to decode at >99%.]

• Small differences: 1.4x – 2.3x
• Large differences: 1.25x – 1.4x

Page 21: What’s the Difference? Efficient Set Reconciliation without Prior Context

How many hash functions?

• 1 hash function produces many pure cells initially but nothing to undo when an element is removed.

[Figure: IDs A, B, and C each assigned to a single cell.]

Page 22: What’s the Difference? Efficient Set Reconciliation without Prior Context

How many hash functions?

• 1 hash function produces many pure cells initially but nothing to undo when an element is removed.

• Many (say 10) hash functions: too many collisions.

[Figure: IDs A, B, and C each assigned to many cells, with collisions.]

Page 23: What’s the Difference? Efficient Set Reconciliation without Prior Context

How many hash functions?

• 1 hash function produces many pure cells initially but nothing to undo when an element is removed.

• Many (say 10) hash functions: too many collisions.
• We find by experiment that 3 or 4 hash functions work well. Is there some theoretical reason?

[Figure: IDs A, B, and C each assigned to a few cells.]

Page 24: What’s the Difference? Efficient Set Reconciliation without Prior Context

Theory

• Let d = difference size, k = # hash functions.
• Theorem 1: With (k + 1) d cells, failure probability falls exponentially.
  – For k = 3, implies a 4x tax on storage, a bit weak.
• [Goodrich, Mitzenmacher]: Failure is equivalent to finding a 2-core (loop) in a random hypergraph
• Theorem 2: With c_k d cells, failure probability falls exponentially
  – c_4 = 1.3x tax, agrees with experiments
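To make the two storage bounds concrete, here is a small worked calculation. The function names are illustrative; the only constants taken from the slide are the (k + 1) factor and c_4 = 1.3, and d = 1000 is an arbitrary example value.

```python
import math


def cells_theorem1(d: int, k: int) -> int:
    """Cells required by the (k + 1) * d bound."""
    return (k + 1) * d


def cells_theorem2(d: int, c_k: float) -> int:
    """Cells required by the c_k * d bound, rounded up to a whole cell."""
    return math.ceil(c_k * d)


d = 1000
weak = cells_theorem1(d, k=3)        # 4x tax for k = 3
tight = cells_theorem2(d, c_k=1.3)   # 1.3x tax for k = 4
```

For a difference of 1000 IDs, the first bound asks for 4000 cells and the second for 1300, which is the gap the slide calls "a bit weak".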

Page 25: What’s the Difference? Efficient Set Reconciliation without Prior Context

How many IBF cells?

[Figure: space overhead vs. set difference, for hash counts 3 and 4; overhead needed to decode at >99%.]

• Large differences: 1.25x – 1.4x

Page 26: What’s the Difference? Efficient Set Reconciliation without Prior Context

Connection to Coding

• Mystery: IBF decode similar to peeling procedure used to decode Tornado codes. Why?

• Explanation: Set Difference is equivalent to coding with insert-delete channels

• Intuition: Given a code for set A, send codewords only to B. Think of B’s set as a corrupted form of A’s.

• Reduction: If code can correct D insertions/deletions, then B can recover A and the set difference.

Reed Solomon <---> Polynomial Methods
LDPC (Tornado) <---> Difference Digest

Page 27: What’s the Difference? Efficient Set Reconciliation without Prior Context

Difference Digests

• Consists of two data structures:
  – Invertible Bloom Filter (IBF)
    • Efficiently computes the set difference.
    • Needs the size of the difference
  – Strata Estimator
    • Approximates the size of the set difference.
    • Uses IBFs as a building block.

Page 28: What’s the Difference? Efficient Set Reconciliation without Prior Context

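Strata Estimator

[Figure: keys A, B, C pass through consistent partitioning into strata holding ~1/2, ~1/4, ~1/8, and 1/16 of the keys; each stratum is encoded into its own IBF (IBF 1 to IBF 4), and together they form the estimator.]

• Divide keys into partitions, with partition k containing ~1/2^k of the keys
• Encode each partition into an IBF of fixed size
  – log(n) IBFs of ~80 cells each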
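One common way to get a consistent partitioning with these proportions is to hash each key and count the trailing zero bits of the hash. The sketch below assumes that approach, 64-bit integer keys, SHA-256 as the hash, and 32 strata; none of these specifics are stated on the slide. Strata are numbered from 0 here, so stratum i corresponds to the slide's partition k = i + 1.

```python
import hashlib


def trailing_zeros(v: int) -> int:
    """Number of trailing zero bits of a non-zero integer."""
    return (v & -v).bit_length() - 1


def stratum(key: int, n_strata: int = 32) -> int:
    """Stratum i receives about 1/2^(i+1) of all keys, on every host alike."""
    hv = int.from_bytes(hashlib.sha256(key.to_bytes(8, "big")).digest()[:8], "big")
    if hv == 0:                      # no set bit at all: put it in the last stratum
        return n_strata - 1
    return min(trailing_zeros(hv), n_strata - 1)
```

Because the stratum depends only on the key's hash, both hosts place a shared key in the same stratum, so it still cancels when the per-stratum IBFs are subtracted.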

Page 29: What’s the Difference? Efficient Set Reconciliation without Prior Context

Strata Estimator

• Attempt to subtract & decode IBFs at each level.
• If level k decodes, then return: 2^k × (the number of IDs recovered)

[Figure: Estimator 1 (Host 1) and Estimator 2 (Host 2), each with IBF 1 to IBF 4; one level is decoded and its count is scaled by 4x.]

Page 30: What’s the Difference? Efficient Set Reconciliation without Prior Context

Strata Estimator

• Attempt to subtract & decode IBFs at each level.
• If level k decodes, then return: 2^k × (the number of IDs recovered)

[Figure: Estimator 1 (Host 1) and Estimator 2 (Host 2), each with IBF 1 to IBF 4; one level is decoded and its count is scaled by 4x.]

What about the other strata?

Page 31: What’s the Difference? Efficient Set Reconciliation without Prior Context

Strata Estimator

• Observation: Extra partitions hold useful data
• Sum elements from all decoded strata & return: 2^(k-1) × (the number of IDs recovered)

[Figure: several levels of Estimator 1 (Host 1) and Estimator 2 (Host 2) are decoded; the summed count is scaled by 2x.]
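The scaling rule can be sketched without the IBF machinery by taking the per-stratum decode results as input. Here `decoded_counts` is a hypothetical list, not something named in the talk: entry i is the number of IDs recovered from stratum i, or `None` if that stratum failed to decode. Strata are numbered from 0, with stratum i holding about 1/2^(i+1) of the keys, so the slide's level k corresponds to i + 1.

```python
from typing import List, Optional


def estimate_difference(decoded_counts: List[Optional[int]]) -> int:
    """Walk from the sparsest stratum to the densest, summing recovered IDs.

    Strata i..last together hold about 1/2^i of all keys, so if stratum i - 1
    is the first to fail, the sum is scaled by 2^i.
    """
    total = 0
    for i in range(len(decoded_counts) - 1, -1, -1):
        recovered = decoded_counts[i]
        if recovered is None:            # stratum i is too dense to decode
            return (2 ** (i + 1)) * total
        total += recovered
    return total                         # every stratum decoded: exact count


# Strata 2..5 decode with 5, 3, 1, 0 IDs; strata 0 and 1 fail.
example = estimate_difference([None, None, 5, 3, 1, 0])
```

In the example the decoded strata cover about a quarter of the key space and yield 9 IDs, so the estimate is 4 × 9 = 36.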

Page 32: What’s the Difference? Efficient Set Reconciliation without Prior Context

Estimation Accuracy

[Figure: Average Estimation Error (15.3 KBytes); relative error in estimation (%) vs. set difference.]

• Strata good for small differences.
• Min-Wise good for large differences.

Page 33: What’s the Difference? Efficient Set Reconciliation without Prior Context

Hybrid Estimator

• Combine Strata and Min-Wise Estimators.
  – Use IBF Strata for small differences.
  – Use Min-Wise for large differences.

[Figure: Strata estimator (IBF 1, IBF 2, IBF 3, IBF 4, …) and Hybrid estimator (Min-Wise, IBF 1, IBF 2, IBF 3).]

Page 34: What’s the Difference? Efficient Set Reconciliation without Prior Context

Hybrid Estimator Accuracy

[Figure: Average Estimation Error (15.3 KBytes); relative error in estimation (%) vs. set difference.]

• Hybrid matches Strata for small differences.
• Converges with Min-Wise for large differences

Page 35: What’s the Difference? Efficient Set Reconciliation without Prior Context

Application: KeyDiff Service

• Promising Applications:
  – File Synchronization
  – P2P file sharing
  – Failure Recovery

[Figure: applications, each attached to a Key Service instance.]

Add( key )
Remove( key )
Diff( host1, host2 )

Page 36: What’s the Difference? Efficient Set Reconciliation without Prior Context

Difference Digests Summary

• Strata & Hybrid Estimators
  – Estimate the size of the Set Difference.
  – For 100K sets, 15KB estimator has <15% error
  – O(log n) communication, O(log n) computation.
• Invertible Bloom Filter
  – Identifies all IDs in the Set Difference.
  – 16 to 28 Bytes per ID in Set Difference.
  – O(d) communication, O(n+d) computation.
• Implemented in KeyDiff Service

Page 37: What’s the Difference? Efficient Set Reconciliation without Prior Context

Conclusions: Got Diffs?

• New randomized algorithm (difference digests) for set difference or insertion/deletion coding

• Could it be useful for your system? Need:
  – Large but roughly equal size sets
  – Small set differences (less than 10% of set size)

Page 38: What’s the Difference? Efficient Set Reconciliation without Prior Context


Page 39: What’s the Difference? Efficient Set Reconciliation without Prior Context


Extra Slides

Page 40: What’s the Difference? Efficient Set Reconciliation without Prior Context

Comparison to Logs

• IBFs work with no prior context.
• Logs work with prior context, BUT
  – Redundant information when sync’ing with multiple parties.
  – Logging must be built into system for each write.
  – Logging adds overhead at runtime.
  – Logging requires non-volatile storage.
    • Often not present in network devices.

IBFs may out-perform logs when:
• Synchronizing multiple parties
• Synchronizations happen infrequently