Brahms

1 April 2008 · 40 slides · Byzantine-Resilient Random Membership Sampling · Bortnikov, Gurevich, Keidar, Kliot, and Shraer

description

Brahms: Byzantine-Resilient Random Membership Sampling. Bortnikov, Gurevich, Keidar, Kliot, and Shraer. Edward (Eddie) Bortnikov, Maxim (Max) Gurevich, Idit Keidar, Gabriel (Gabi) Kliot, Alexander (Alex) Shraer.

Transcript of Brahms

Page 1: Brahms


Brahms

Byzantine-Resilient Random Membership Sampling

Bortnikov, Gurevich, Keidar, Kliot, and Shraer

Page 2: Brahms


Edward (Eddie) Bortnikov Maxim (Max) Gurevich Idit Keidar

Gabriel (Gabi) Kliot Alexander (Alex) Shraer

Page 3: Brahms


Why Random Node Sampling

Gossip partners: random choices make gossip protocols work

Unstructured overlay networks (e.g., among super-peers): random links provide robustness and expansion

Gathering statistics: probe random nodes

Choosing cache locations

Page 4: Brahms


The Setting

Many nodes: n in the 10,000s, 100,000s, 1,000,000s, …

Nodes come and go: churn

Every joining node knows some others: connectivity

Full network, like the Internet

Byzantine failures

Page 5: Brahms


Byzantine Fault Tolerance (BFT)

Faulty nodes (a portion f of the system): arbitrary behavior due to bugs, intrusions, or selfishness; the faulty ids may be chosen arbitrarily

No CA; and no panacea for Sybil attacks

Faulty nodes may want to bias samples: isolate or DoS nodes, promote themselves, bias statistics

Page 6: Brahms


Previous Work

Benign gossip membership: small (logarithmic) views; robust to churn and benign failures; studied empirically [Lpbcast, Scamp, Cyclon, PSS] and analytically [Allavena et al.]; never proven to give uniform samples; spatial correlation among neighbors' views [PSS]

Byzantine-resilient gossip: full views [MMR, MS, Fireflies, Drum, BAR]; small views with some resilience [SPSS]; we are not aware of any analytical work

Page 7: Brahms


Our Contributions

1. Gossip-based BFT membership

- Tolerates a linear portion f of Byzantine failures
- O(n^(1/3))-size partial views
- Correct nodes remain connected ("the view is not all bad")
- Mathematically analyzed, validated in simulations

2. Random sampling

- Novel memory-efficient approach
- Converges to provably independent uniform samples (better than benign gossip)

Page 8: Brahms


Brahms

1. Sampling: the local component (Sampler)

2. Gossip: the distributed component

[Diagram: the Gossip component maintains the view; the resulting id stream feeds the Sampler, which outputs the sample.]

Page 9: Brahms


Sampler Building Block

Input: a data stream, one element at a time; the stream is biased (some values appear more than others); used with the stream of gossiped ids

Output: a uniform random sample of the unique elements seen thus far; independent of other Samplers; one element at a time (converging)

[Diagram: a Sampler consumes the stream via next and exposes sample.]

Page 10: Brahms


Sampler Implementation

Memory: stores one element at a time

Uses a random hash function h from a min-wise independent family [Broder et al.]: for each set X and every x ∈ X,

Pr[ min{h(X)} = h(x) ] = 1 / |X|

[Diagram: init chooses a random hash function; next keeps the id with the smallest hash so far; sample returns it. A minimal sketch follows.]
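A minimal Python sketch of this building block. The salted SHA-256 below is a practical stand-in for a min-wise independent hash family, not the construction of [Broder et al.]:

```python
import hashlib
import os

class Sampler:
    """Keeps the id whose hash is smallest among all ids seen so far."""

    def init(self):
        # "Choose random hash function": key SHA-256 with a fresh random
        # salt (an approximation of a min-wise independent family).
        self.salt = os.urandom(16)
        self.best_id = None
        self.best_hash = None

    def _h(self, node_id):
        return hashlib.sha256(self.salt + node_id.encode()).digest()

    def next(self, node_id):
        # "Keep id with smallest hash so far."
        h = self._h(node_id)
        if self.best_hash is None or h < self.best_hash:
            self.best_id, self.best_hash = node_id, h

    def sample(self):
        # Current sample; converges as more of the stream is seen.
        return self.best_id
```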

Page 11: Brahms


Component S: Sampling and Validation

[Diagram: component S is an array of Samplers. The id stream from gossip drives every Sampler via next; each Sampler's output is checked by a Validator using pings, and a failed validation re-initializes that Sampler via init.]
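Continuing the sketch above, component S might look as follows; the ping argument is a hypothetical callable, since the slide only says validation "uses pings":

```python
class SamplingComponent:
    """Component S (sketch): an array of Samplers plus validation."""

    def __init__(self, size):
        self.samplers = [Sampler() for _ in range(size)]
        for s in self.samplers:
            s.init()

    def next(self, node_id):
        # The gossiped id stream drives every Sampler.
        for s in self.samplers:
            s.next(node_id)

    def validate(self, ping):
        # `ping(id)` is a hypothetical probe returning False for
        # unresponsive ids; a failed check re-initializes the Sampler.
        for s in self.samplers:
            sampled = s.sample()
            if sampled is not None and not ping(sampled):
                s.init()

    def sample(self):
        # Current (still converging) samples, one per Sampler.
        return [s.sample() for s in self.samplers]
```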

Page 12: Brahms


Gossip Process

Provides the stream of ids for S

Needs to ensure connectivity

Uses a bag of tricks to overcome attacks

Page 13: Brahms


Gossip-Based Membership Primer

Small (sub-linear) local view V; V constantly changes, which is essential under churn

Typically evolves in (unsynchronized) rounds

Push: send my id to some node in V (reinforces underrepresented nodes)

Pull: retrieve the view of some node in V (spreads knowledge within the network)

[Allavena et al. '05]: both are essential for a low probability of partitions and star topologies

Page 14: Brahms


Brahms Gossip Rounds

Each round: send pushes and pulls to random nodes from V; wait to receive pulls and pushes; update S with all received ids; (sometimes) re-compute V (sketched below)

Tricky! Beware of adversary attacks
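A simplified sketch of one round in Python (one push and one pull per round; net, send_push, send_pull, and receive are a hypothetical messaging layer; the weights follow the α = β = 0.45, γ = 0.1 example used later in the deck):

```python
import random

ALPHA, BETA, GAMMA = 0.45, 0.45, 0.1  # example weights; they sum to 1

def gossip_round(my_id, V, S, net):
    # `net` is a hypothetical messaging layer; S is the sampling
    # component sketched earlier.
    net.send_push(random.choice(V), my_id)  # push my id to a random view member
    net.send_pull(random.choice(V))         # ask a random view member for its view

    pushed_ids, pulled_ids = net.receive()  # ids received this round

    for nid in pushed_ids + pulled_ids:     # update S with all received ids
        S.next(nid)

    # (Sometimes) re-compute V: the guard is Trick 2 below (skip the
    # update when suspiciously many pushes arrive); recompute_view is
    # sketched under "View and Sample Maintenance".
    if pushed_ids and pulled_ids and len(pushed_ids) <= ALPHA * len(V):
        V = recompute_view(V, pushed_ids, pulled_ids, S)
    return V
```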

Page 15: Brahms


Problem 1: Push Drowning

[Cartoon: the attacked node's view fills with pushes from faulty ids (Mallory, M&M, Malfoy) that drown out the pushes from correct nodes (Alice, Bob, Carol, Dana, Ed).]

Page 16: Brahms


Trick 1: Rate-Limit Pushes

Use limited messages to bound faulty pushes system-wide, e.g., via computational puzzles or virtual currency

Faulty nodes can then send at most a portion p of all pushes, so views won't be all bad
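The slide names computational puzzles only as an example; here is a hashcash-style sketch of how such rate-limiting could look (DIFFICULTY, solve_puzzle, and verify_puzzle are illustrative, not from the paper):

```python
import hashlib
from itertools import count

DIFFICULTY = 20  # required leading zero bits; tune to the target push rate

def solve_puzzle(push_msg):
    # Sender: find a nonce whose hash has DIFFICULTY leading zero bits.
    # Expected cost is ~2**DIFFICULTY hashes, which limits the push rate.
    for nonce in count():
        digest = hashlib.sha256(push_msg + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0:
            return nonce

def verify_puzzle(push_msg, nonce):
    # Receiver: one cheap hash before accepting the push.
    digest = hashlib.sha256(push_msg + nonce.to_bytes(8, "big")).digest()
    return int.from_bytes(digest, "big") >> (256 - DIFFICULTY) == 0
```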

Page 17: Brahms


Problem 2: Quick Isolation

[Cartoon: faulty nodes concentrate their pushes on one victim until its view holds only faulty ids, cutting it off: "Ha! She's out! Now let's move on to the next guy!"]

Page 18: Brahms


Trick 2: Detection & Recovery

Do not re-compute V in rounds when too many pushes are received

Slows down isolation; does not prevent it

[Cartoon: a node swamped by pushes from Mallory, M&M, and Malfoy ignores them all that round: "Hey! I'm swamped! I better ignore all of 'em pushes…"]

Page 19: Brahms


Trick 3: Balance Pulls & Pushes

Control the contribution of push (α|V| ids) versus pull (β|V| ids) via parameters α, β

Pull-only ⇒ eventually all faulty ids: a pull from a faulty node yields only faulty ids, and a pull from a correct node yields some faulty ids

Push-only ⇒ quick isolation of an attacked node

Push ensures that, system-wide, not all ids are bad; pull slows down (but does not prevent) isolation

Page 20: Brahms


Trick 4: History Samples

Attacker influences both push and pull

Feed back γ|V| random ids from S into the view; the parameters satisfy α + β + γ = 1

The attacker loses control: samples are eventually perfectly uniform

[Cartoon: a lonely node calls out "Yoo-hoo, is there any good process out there?"]

Page 21: Brahms


View and Sample Maintenance

[Diagram: each round, the new view V is assembled from α|V| pushed ids, β|V| pulled ids, and γ|V| ids drawn from the sample S; see the sketch below.]
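Continuing the Python sketches above, a minimal version of this re-computation rule (ALPHA, BETA, GAMMA as defined in the round sketch):

```python
import random

def recompute_view(V, pushed_ids, pulled_ids, S):
    # Blend ALPHA*|V| pushed ids, BETA*|V| pulled ids, and GAMMA*|V|
    # history samples fed back from S (Trick 4).
    k = len(V)
    history = [s for s in S.sample() if s is not None]

    def take(pool, frac):
        return random.sample(pool, min(len(pool), round(frac * k)))

    return take(pushed_ids, ALPHA) + take(pulled_ids, BETA) + take(history, GAMMA)
```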

Page 22: Brahms


Key Property

Samples take time to help; assume the attack starts when the samples are empty

With appropriate parameters, e.g., |V| = |S| = Θ(n^(1/3)):

Time to isolation > time to convergence

- Lower bound on isolation time proved using Tricks 1, 2, 3 (not using samples yet)
- Upper bound on convergence time, until some good sample persists forever

Result: self-healing from partitions

Page 23: Brahms


History Samples: Rationale

Judicious use is essential: bootstrap, avoid slow convergence, deal with churn

With a little bit of history samples (10%) we can cope with any adversary: amplification!

Page 24: Brahms


Analysis

1. Sampling - mathematical analysis

2. Connectivity - analysis and simulation

3. Full system simulation

Page 25: Brahms


Connectivity ⇒ Sampling

Theorem: If overlay remains connected indefinitely, samples are eventually uniform

Page 26: Brahms


Sampling ⇒ Connectivity Ever After

Perfect sample of a sampler with hash h: the id with the lowest h(id) system-wide

If that id is correct, it sticks once the sampler sees it

A correct perfect sample ⇒ self-healing from partitions ever after

We analyze PSP(t), the probability of a perfect sample at time t

Page 27: Brahms


Convergence to 1st Perfect Sample

[Plot: convergence to the first perfect sample; n = 1000, f = 0.2, 40% unique ids in the stream.]

Page 28: Brahms


Scalability

Analysis says:

PSP(t) → 1 − e^(−|V|·|S| / 2n)

For scalability, we want small, constant convergence time independent of system size, e.g., when

|V| = |S| = Θ(n^(1/3)), or |V| = Θ(log n), |S| = Θ(n / log n)

Page 29: Brahms


Connectivity Analysis 1: Balanced Attacks

Attack all nodes the same: this maximizes the faulty ids in views system-wide in any single round

If repeated, the system converges to a fixed-point ratio of faulty ids in views, which is < 1 if γ = 0 (no history) and p < 1/3, or if history samples are used, for any p

There are always good ids in views!

Page 30: Brahms


Fixed Point Analysis: Push

[Diagram: pushes sent at round t land in local views at round t+1; a correct node's push may be lost, while pushes from faulty nodes arrive.]

x(t): the portion of faulty ids in views at round t. The portion of faulty pushes arriving at correct nodes is then

p / ( p + (1 − p)(1 − x(t)) )

Page 31: Brahms


Fixed Point Analysis: Pull

[Diagram: a pull from correct node i at round t returns a faulty id with probability x(t); a pull from a faulty node returns only faulty ids.]

Combining the push, pull, and history contributions:

E[x(t+1)] = α · p / ( p + (1 − p)(1 − x(t)) ) + β · ( x(t) + (1 − x(t)) x(t) ) + γ·f
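To make the fixed point concrete, a few lines of Python iterating the recurrence above; this sketches the analysis, not the protocol:

```python
def fixed_point(p, alpha, beta, gamma, f, rounds=100):
    """Iterate E[x(t+1)] from x(0) = f until it settles."""
    x = f
    for _ in range(rounds):
        push = p / (p + (1 - p) * (1 - x))  # faulty share of incoming pushes
        pull = x + (1 - x) * x              # faulty share of pulled ids
        x = alpha * push + beta * pull + gamma * f
    return x

# Parameters from the convergence plot: p = 0.2, alpha = beta = 0.5, gamma = 0.
# Converges to about 0.64, i.e., strictly below 1: good ids survive in views.
print(fixed_point(p=0.2, alpha=0.5, beta=0.5, gamma=0.0, f=0.2))
```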

Page 32: Brahms


Faulty Ids in Fixed Point

With a few history samples, any portion of bad nodes can be tolerated

Fixed points and convergence perfectly validated

History samples assumed perfect in the analysis; real history used in the simulations

Page 33: Brahms


Convergence to Fixed Point

[Plot: convergence of the ratio of faulty ids in views to the fixed point; n = 1000, p = 0.2, α = β = 0.5, γ = 0.]

Page 34: Brahms


Connectivity Analysis 2: Targeted Attack – Roadmap

Step 1: analysis without history samples; isolation happens in logarithmic time, but not too fast, thanks to Tricks 1, 2, 3

Step 2: analysis of history-sample convergence; time-to-perfect-sample < time-to-isolation

Step 3: putting it all together; empirical evaluation shows no isolation happens

Page 35: Brahms


Targeted Attack – Step 1

Q: How fast (lower bound) can an attacker isolate one node from the rest?

Worst-case assumptions:

- No use of history samples (γ = 0)
- Unrealistically strong adversary: observes the exact number of correct pushes and complements it to α|V|
- Attacked node not represented in any view initially
- Balanced attack on the rest of the system

Page 36: Brahms


Isolation w/out History Samples

[Plot: expected degree of the attacked node over time; n = 1000, p = 0.2, α = β = 0.5, γ = 0; isolation time marked for |V| = 60.]

The analysis tracks a 2×2 linear recurrence whose entries A_{i,j} depend on α, β, p:

( E[indegree(t+1)] ; E[outdegree(t+1)] ) = A · ( E[indegree(t)] ; E[outdegree(t)] )

Page 37: Brahms


Step 2: Sample Convergence

Perfect sample obtained within 2-3 rounds; empirically verified

[Plot: PSP(t); n = 1000, p = 0.2, α = β = 0.5, γ = 0, 40% unique ids.]

Page 38: Brahms


Step 3: Putting It All Together – No Isolation with History Samples

Works well despite a small PSP

[Plot: degree of the attacked node over time; n = 1000, p = 0.2, α = β = 0.45, γ = 0.1.]

Page 39: Brahms


Sample Convergence (Balanced)

[Plot: sample convergence under a balanced attack; p = 0.2, α = β = 0.45, γ = 0.1; curves for |V| = |S| = 2·n^(1/3) and |V| = |S| = 3·n^(1/3).]

Convergence is twice as fast with |V| = |S| = 3·n^(1/3)

Page 40: Brahms


Summary

O(n^(1/3))-size views

Resist Byzantine failures of a linear portion of nodes

Converge to provably uniform samples

Precise analysis of the impact of failures