Sabotage-Tolerance Mechanisms for Volunteer Computing Systems
Project Bayanihan, MIT LCS and Ateneo de Manila University
Luis F. G. Sarmenta. CCGrid 2001, 5/15/2001. slide 1
Sabotage-Tolerance Mechanisms for Volunteer Computing Systems

Luis F. G. Sarmenta
Ateneo de Manila University, Philippines
(formerly MIT LCS)
slide 2
Volunteer Computing

• Idea: Make it very easy for even non-expert users to join a NOW (network of workstations) by themselves
• Minimal setup requirements → maximum participation
• Very large NOWs very quickly!
  – just invite people
  – SETI@home, distributed.net, others
• The Dream: Electronic Bayanihan
  – achieving the impossible through cooperation

“Bayanihan” Mural by Carlos “Botong” Francisco, commissioned by Unilab, Philippines. Used with permission.
slide 3
The Problem

• Allowing anyone to join means the possibility of malicious attacks
• Sabotage
  – bad data from malicious volunteers
• Traditional Approach
  – Encryption works against spoofing by outsiders
    • but not against registered volunteers
  – Checksums guard against random faults
    • but not against saboteurs who disassemble code
• Another Approach: Obfuscation
  – prevent saboteurs from disassembling code
  – periodically re-obfuscate to avoid disassembly
  – Promising. But what if we can’t do it, or it doesn’t work?
slide 4
Voting and Spot-checking

• Assume the worst case
  – we can’t trust workers, so we need to double-check
• Voting
  – Everything must be done at least m times
  – Majority wins (like elections)
  – e.g., Triple Modular Redundancy, NMR, etc.
  – Problem: not so efficient.
• Spot-checking
  – Don’t check all the time. Only sometimes.
  – But if you’re caught, you’re “dead”
    • Backtrack – all results of a caught saboteur are invalidated
    • Blacklist – the saboteur’s results are ignored
  – Scare people into compliance (like Customs at the airport!)
  – More efficient (?)
slide 5
Theoretical Analysis: Assumptions

• Master-worker model
  – eager scheduling lets us redo, undo work
  – several batches, each batch N works
  – P workers, fraction f are saboteurs
  – same-speed saboteurs, so work is roughly evenly distributed
  – no spare workers, so higher redundancy (# of works given out) means worse slowdown (time)
• Non-zero acceptable error rate, err_acc
  – error rate (err) =
    • average fraction of bad final results in a batch
    • probability of error of an individual final result
  – relatively high for naturally fault-tolerant apps
    • e.g., image rendering, genetic algorithms, etc.
  – correspondingly small for most other apps
    • e.g., to guarantee a 1% failure rate for 1000 works, err_acc = 1% / 1000 = 1e-5
slide 6
Theoretical Analysis: Assumptions

• Assume saboteurs are Bernoulli processes with independent, identical, constant sabotage rate s
  – implies that saboteurs do not agree on when to give bad answers (unless they always give them)
  – simplifying assumption, may not be realistic
  – but may be OK if we assume saboteurs receive works at different times and cannot distinguish them
• Assume saboteurs’ bad answers agree
  – allows them to vote (if they happen to give bad answers at the same time)
  – pessimistic assumption
  – we can use crypto and checksums to make it hard to generate agreeing answers
  – implies that there are only 2 kinds of answers: bad and good
slide 7
Majority Voting

• m-majority voting
  – m out of 2m-1 must match
  – used in hardware, and in systems with spare processors
  – redundancy = 2m-1
• m-first voting
  – accept as soon as m match
  – same error rate but faster
  – redundancy = m/(1-f)
• Exponential error rate
  – err = (cf)^m
  – where c is between 1.7 and 4
• Good for small f, but bad for large f
• Minimum redundancy & slowdown of 2
[Plot: err (log scale, 1e-22 to 1) vs. m = 1..6, one curve per f = 0.2, 0.1, 0.01, 0.001, 0.0001]
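The voting estimates above can be sketched numerically. This is a hedged illustration of the slide's formulas, not code from the talk; the function names are mine, and c = 2.0 is just one value in the stated 1.7–4 range.

```python
# Numeric sketch of the m-first majority voting estimates (illustrative).
# c is the empirical constant the slide bounds between 1.7 and 4.

def mfirst_err(f: float, m: int, c: float = 2.0) -> float:
    """Error-rate estimate for m-first voting: err = (c*f)^m."""
    return (c * f) ** m

def mfirst_redundancy(f: float, m: int) -> float:
    """Expected redundancy of m-first voting: m / (1 - f)."""
    return m / (1.0 - f)

# Exponential in m: great for small f, but barely helps for large f,
# where c*f approaches 1 and extra voting rounds buy little.
assert mfirst_err(0.01, 3) < 1e-5      # small f: error plummets
assert mfirst_err(0.45, 3) > 0.7       # large f: c*f = 0.9, error stays high
```

Note the trade-off the slide states: error shrinks exponentially with m, but redundancy (and hence slowdown) grows linearly, and is at least 2.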
slide 8
Spot-Checking w/ Blacklisting

• Lower redundancy
  – 1/(1-q), where q is the spot-check rate
• Good error rates due to backtracking
  – no error as long as the saboteur is caught by the end of the batch
• err = sf(1-qs)^n / ((1-f) + f(1-qs)^n)
  – where n is the number of works received in the batch (related to, but a bit more than, N/P)
• Saboteur’s strategy: only give a few bad answers
  – s* = 1/(q(n+1))
• Max error: err* < (f/(1-f))(1/(qne))
• Linear error reduction with n
  – larger batches, better error rates
[Plot: err (0 to 0.05) vs. sabotage rate s (0 to 1), one curve per f = 0.7, 0.5, 0.2, 0.1, 0.05, 0.01]
Simulator results. Note that it works even if f > 0.5.
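The error-rate expressions above can be checked numerically. A minimal sketch of the slide's formulas (function name and parameter values are mine, chosen for illustration):

```python
import math

def spotcheck_err(s: float, f: float, q: float, n: int) -> float:
    """Batch error rate with spot-checking + blacklisting, per the slide:
    err = s*f*(1-q*s)^n / ((1-f) + f*(1-q*s)^n), where q is the
    spot-check rate and n the works each worker receives in a batch."""
    surv = (1.0 - q * s) ** n          # chance a saboteur is never caught
    return s * f * surv / ((1.0 - f) + f * surv)

f, q, n = 0.2, 0.1, 100
s_star = 1.0 / (q * (n + 1))                       # saboteur's best rate
err_bound = (f / (1.0 - f)) / (q * n * math.e)     # slide's upper bound

# The bound holds even at the saboteur's optimal sabotage rate:
assert spotcheck_err(s_star, f, q, n) <= err_bound
```

A saboteur who always cheats (s = 1) is caught almost surely and backtracked, so the worst case is a saboteur who cheats only rarely, at rate s* = 1/(q(n+1)).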
slide 9
Spot-Checking w/o Blacklisting

• What if saboteurs can come back under a new identity?
• Saboteur’s strategy: stay for L turns only
• Max error
  – err* < f/(qLe), if L << n
  – err* < f/(qL), as L → n
  – err* < f/(qL) in all cases
• Linear error reduction with L, not n
  – larger batches don’t guarantee better error rates anymore
  – L = 1 gives the worst error, err = f(1-q)
• Try to force larger L’s
  – make forging a new ID difficult; impose sign-on delays
  – batch-limited blacklisting
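To see why L, rather than n, becomes the controlling quantity, here is a small hedged comparison of the two bounds above (the parameter values are illustrative, not from the talk):

```python
import math

f, q, n = 0.2, 0.1, 100   # saboteur fraction, spot-check rate, works/batch

# With blacklisting, the max-error bound shrinks linearly with n:
err_with_bl = (f / (1.0 - f)) / (q * n * math.e)

# Without blacklisting, it shrinks only with L, the saboteur's stay;
# the slide's all-cases bound is err* < f / (q*L):
def err_without_bl(L: int) -> float:
    return f / (q * L)

# A short-staying saboteur (small L) does far more damage than any
# saboteur faces under blacklisting, no matter how large the batch:
assert err_without_bl(10) > err_with_bl
```

This is exactly why the slide suggests forcing larger L's (sign-on delays, hard-to-forge IDs, batch-limited blacklisting).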
slide 10
Voting and Spot-Checking

• Simply running them together works!
• With blacklisting, we can use the spot-checking err rate in place of f
  – exponentially reduces the linearly-reduced error rate
  – (qne(1-f))^m improvement
  – big difference! (esp. for large f)
• Unfortunately, doesn’t work as well w/o blacklisting
  – bad err rate to begin with
  – substituting err for f doesn’t work
• The problem is saboteurs who come back near the end of a batch
[Plots: err (log scale) vs. l (0 to 250) for m = 2 and m = 3, comparing no SC, l-stay, stay, theoretical UB, and SC w/ BL; and err (log scale) vs. s (0 to 0.25) for f = 0.5, 0.2, 0.1 with m = 2 and m = 3]
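A hedged numeric sketch of the combination (with blacklisting): substitute the spot-checking error bound for f in the voting estimate err = (cf)^m, and the ratio to plain voting reproduces the (qne(1-f))^m improvement factor stated above. Variable names and parameter values are mine.

```python
import math

f, q, n, m, c = 0.2, 0.1, 100, 2, 2.0   # illustrative; c in [1.7, 4]

err_sc = (f / (1.0 - f)) / (q * n * math.e)   # spot-checking bound
err_vote_only = (c * f) ** m                  # voting alone: (c*f)^m
err_combined = (c * err_sc) ** m              # vote over spot-checked workers

# The improvement over plain voting is (q*n*e*(1-f))^m, as on the slide:
improvement = err_vote_only / err_combined
assert abs(improvement - (q * n * math.e * (1.0 - f)) ** m) < 1e-6
```

Even at this modest q and n, the combined scheme is hundreds of times better than voting alone, and the gap widens exponentially with m.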
slide 11
Credibility-Based Fault-Tolerance

• Problem: errors come from saboteurs who have not yet been spot-checked enough
• Idea: give workers credibility depending on the number of spot-checks passed, k
• General Idea: attach credibility values to objects in the system
• Credibility of X, Cr(X) = probability that X is, or will give, a good result
[Diagram: CredWorkPool with work entries Work0–Work999 and their credibilities CrW (e.g., Work0: CrW = 0.8; Work998: CrW = 0.9992, done, res = Z; Work999: CrW = 0.999, done, res = J); workers P1–P9 with spot-checks passed k and worker credibility CrP (e.g., P1: k = 0, CrP = 0.8; P9: k = 200, CrP = 0.999); results with group credibility CrG and result credibility CrR. θ = 0.999, assuming f ≤ 0.2]
slide 12
Credibility-Based Fault-Tolerance

• 4 types of credibility (in this implementation)
  – worker, result, result group, work entry
• Credibility Threshold Principle: if we only accept a final result when the conditional probability of it being correct is at least θ, then the overall average error rate will be at most (1-θ)
• Wait until credibility is high enough
[Diagram: same CredWorkPool example as on slide 11]
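The Credibility Threshold Principle can be illustrated with a tiny Monte Carlo sketch. This is entirely illustrative: the distribution of credibilities is made up, and the only point is that thresholding correct conditional probabilities bounds the accepted-result error rate by 1-θ.

```python
import random

random.seed(0)
theta = 0.999
accepted = bad = 0
for _ in range(1_000_000):
    p_good = random.uniform(0.99, 1.0)   # this result's (correct) credibility
    if p_good >= theta:                  # only accept past the threshold...
        accepted += 1
        if random.random() > p_good:     # ...yet the result may still be bad
            bad += 1

# Error rate among accepted results is bounded by 1 - theta:
assert accepted > 0 and bad / accepted <= 1.0 - theta
```

The guarantee holds only if the credibilities really are correct conditional probabilities, which is why the next slide is devoted to computing them.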
slide 13
Computing Credibility

• Worker, CrP(P)
  – dubiosity (1-Cr) decreases linearly with the # of spot-checks passed, k
  – CrP(P) = 1 - f, without spot-checking
  – CrP(P) = 1 - f/(ke(1-f)), with spot-checking and blacklisting
  – CrP(P) = 1 - f/k, with spot-checking without blacklisting
• Result, CrR(R)
  – taken from CrP(R.solver)
• Result Group, CrG(G)
  – generally increases as the # of matching good-credibility results increases
  – conditional probability given the other groups and the CrR of their results
  – CrG(Ga) = P(Ga good) P(all others bad) / P(getting the groups we got)
  – e.g., if CrR(R) = 1 - f for all R, and there are only 2 groups:
    CrG = (1-f)^m1 f^m2 / ((1-f)^m1 f^m2 + f^m1 (1-f)^m2 + f^m1 f^m2)
• Work Entry, CrW(W)
  – CrG(G) of the best group
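A hedged sketch of these formulas in code (function names are mine; k is the number of spot-checks passed, and m1, m2 are the sizes of the two result groups in the slide's example):

```python
import math

def cr_worker(f: float, k: int, blacklisting: bool = True) -> float:
    """Worker credibility CrP: dubiosity 1 - CrP falls linearly with k."""
    if k == 0:
        return 1.0 - f                             # no spot-checks passed yet
    if blacklisting:
        return 1.0 - f / (k * math.e * (1.0 - f))  # w/ spot-checking and BL
    return 1.0 - f / k                             # spot-checking, no BL

def cr_group_two(f: float, m1: int, m2: int) -> float:
    """CrG of group 1 when there are exactly 2 groups and CrR = 1 - f
    for every result (the slide's example formula)."""
    a = (1 - f) ** m1 * f ** m2    # group 1 good, group 2 bad
    b = f ** m1 * (1 - f) ** m2    # group 1 bad, group 2 good
    c = f ** m1 * f ** m2          # both groups bad
    return a / (a + b + c)

# More matching results -> higher group credibility, as the slide says:
assert cr_group_two(0.2, 3, 1) > cr_group_two(0.2, 2, 1)
```

Note how cr_group_two conditions on everything observed: the numerator is the one way group 1 can be right, the denominator all ways of getting exactly these two groups.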
slide 14
Results: Credibility w/ Blacklisting

• N = 10000, P = 200, f = 0.2, 0.1, 0.05, q = 0.1, batch-limited blacklisting
• Note that the error never goes above the threshold
• The trade-off is in slowdown
• Slowdown / err ratio is very good
  – each additional repetition gives > 100x improvement in error rate
[Plots: err (log scale) vs. s (0 to 1), one curve per threshold θ = 0.99, 0.999, 0.9999, 0.99999, 0.999999; and slowdown (1 to 5) vs. err for f = 0.2, 0.1, 0.05]
slide 15
Results: Credibility w/o Blacklisting

• Error still never goes above the threshold
• A bit slower
• Immune to short-staying saboteurs
  – encourages longer stays
[Plots: err (log scale) vs. s, one curve per threshold θ = 0.99 to 0.999999; slowdown (1 to 5) vs. err for f = 0.2, 0.1, 0.05, 0.01; and err vs. l (0 to 250) comparing “stay for l or until caught” with “stay until caught”]
slide 16
Results: Using Voting to Spot-Check

• Normally, the spot-check rate is low because spot-checking implies overhead
• We can use cred-based voting to spot-check, since cred-based voting has guaranteed low err
• If redundancy >= 2, then effectively q = 1
• Saboteurs get caught quickly → low error rates
• Good workers gain high credibility by passing many checks → reach the threshold faster
• Very good slowdown-to-err slope
  – about 3 orders of magnitude per extra redundancy
  – good for non-fault-tolerant apps
[Plots: err (log scale) vs. s for θ = 0.99, 0.9999, 0.999999; and slowdown (1 to 5) vs. err for f = 0.5, 0.2, 0.1]
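The q = 1 effect can be sketched with the earlier spot-checking bound (an illustration under my own parameter choices, using err* < (f/(1-f))(1/(qne)) from the spot-checking slide):

```python
import math

f, n = 0.2, 100   # illustrative saboteur fraction and works per batch

def err_bound(q: float) -> float:
    """Spot-checking error bound err* < (f/(1-f)) * 1/(q*n*e)."""
    return (f / (1.0 - f)) / (q * n * math.e)

# Using cred-based voting as the spot-check effectively raises q to 1
# (every redundantly-assigned work doubles as a check), improving the
# bound by a factor of 1/q over a low explicit spot-check rate:
assert err_bound(1.0) < err_bound(0.1)
```

This is why voting-as-spot-checking catches saboteurs quickly without paying any extra spot-check overhead beyond the redundancy already being spent.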
slide 17
Slowdown vs. Err
[Four slowdown (1 to 5) vs. err (log scale) plots, one per mechanism: voting only; cred. w/ SC & BL; cred. w/ SC, w/o BL; cred. using voting for SC, w/o BL — each with curves for several values of f]
At f = 20%, for the same slowdown, credibility with voting-as-spot-checking (w/o blacklisting) gets a 10^5 times better err rate than m-first majority voting!
slide 18
Variations

• Credibility-based fault-tolerance is highly generalizable
• The Credibility Threshold Principle holds in all cases
  – provided that we compute the conditional probability correctly
• Changes in assumptions and implementations lead to changes in credibility metrics, e.g.,
  – if we assume saboteurs communicate, then change the result-group credibility
  – if we have trustable hosts, or untrustable domains, adjust worker credibility accordingly
  – if we can use checksums, encryption, obfuscation, etc., then adjust CrP, CrG, etc.
  – time-varying credibility
  – compute credibility of batches or work pools
slide 19
Summary of Mechanisms

• Voting
  – error reduction exponential with redundancy
  – but bad for large f
  – minimum redundancy of 2
• Spot-checking with backtracking and blacklisting
  – error reduction linear with work done by each volunteer
  – lower redundancy
  – good for large f
• Voting and Spot-checking
  – exponentially reduces the linearly-reduced error rate
• Credibility-based Fault-Tolerance
  – guarantees a limit on error by watching conditional probabilities
  – automatically combines voting and spot-checking as necessary
  – more efficient than simple voting and spot-checking
  – open to variations
slide 20
For more information

• Recently finished Ph.D. thesis
  – Volunteer Computing by Luis F. G. Sarmenta, MIT
• This and other papers available from:
  – http://www.cag.lcs.mit.edu/bayanihan/
• Paper at IC 2001 (w/ PDPTA 2001), Las Vegas, June 25-28
  – more on how we parallelized the simulation
  – details are also in the thesis
• Email:
  – [email protected] or [email protected]