Post on 06-Jan-2016
description
CS514: Intermediate Course in Operating SystemsProfessor Ken BirmanBen Atkin: TALecture 21 Nov. 7
Bimodal MulticastWork is joint: Ken Birman, Mark Hayden, Oznur Ozkasap, Zhen Xiao, Mihai Budiu, Yaron MinskyPaper appeared in ACM Transactions on Computer SystemsMay, 1999
Reliable MulticastFamily of protocols for 1-n communicationIncreasingly important in modern networksReliability tends to be at one of two extremes on the spectrum:Very strong guarantees, for example to replicate a controller position in an ATC systemBest effort guarantees, for applications like Internet conferencing or radio
Reliable MulticastCommon to try and reduce problems such as this to theoryIssue is to specify the goal of the multicast protocol and then demonstrate that a particular implementation satisfies its spec.Each form of reliability has its down specification and its own pros and cons
Classical multicast problemsRelated to coordinated attack, consensusThe two schools of reliable multicastVirtual synchrony model (Isis, Horus, Ensemble)All or nothing message delivery with orderingMembership managed on behalf of groupState transfer to joining memberScalable reliable multicastLocal repair of problems but no end-to-end guarantees
Virtual Synchrony ModelcrashG0={p,q} G1={p,q,r,s} G2={q,r,s} G3={q,r,s,t}pqrstr, s request to joinr,s added; state xfert added, state xfert requests to joinp fails... to date, the only widely adopted model for consistency andfault-tolerance in highly available networked applications
Virtual Synchrony ToolsVarious forms of replication:Replicated data, replicate an object, state transfer for starting new replicas...1-many event streams (network news)Load-balanced and fault-tolerant request executionManagement of groups of nodes or machines in a network setting
ATC system (simplified)ControllersAir Traffic Database (flight plans, etc)X.500 DirectoryRadarOnboard
Roles for reliable multicast?Replicate the state of controller consoles so that a group of consoles can support a team of controllers fault-tolerantly
Roles for reliable multicast?Distribute routine updates from the radar system, every 3-5 seconds (two kinds: images and digital tracking info)
Roles for reliable multicast?Distribute flight plan updates and decisions rarely (very critical information)
Other settings that use reliable multicastBasis of several stock exchangesCould be used to implement enterprise database systems and wide-area network servicesCould be the basis of web caching tools
But vsync protocols dont scale wellFor small systems, performance is great.For large systems can demonstrate very good performance but only if applications cooperateIn large systems, under heavy load, with applications that may act a little flakey, performance is very hard to maintain
Stock Exchange Problem: Vsync. multicast is too fragileMost members are healthy.
but one is slow
Throughput as one member of a multicast group is "perturbed" by forcing it to sleep for varying amounts of time.
Why does this happen?Superficially, because data for the slow process piles up in the senders buffer, causing flow control to kick in (prematurely)But why does the problem grow worse as a function of group size, with just one red process?Our theory is that small perturbations happen all the time (but more study is needed)
Dilemma confronting developersApplication is extremely critical: stock market, air traffic control, medical systemHence need a strong model, guaranteesSRM, RMTP, etc: model is too weakAnyhow, weve seen that they dont scale either!But these applications often have a soft-realtime subsystemSteady data generationMay need to deliver over a large scale
Today introduce a new design pt.Bimodal multicast is reliable in a sense that can be formalized, at least for some networksGeneralization for larger class of networks should be possible but maybe not easyProtocol is also very stable under steady load even if 25% of processes are perturbedScalable in much the same way as SRM
Bimodal AttackGeneral commands soldiersIf almost all attack victory is certainIf very few attack, they will perish but Empire survivesIf many attack, Empire is lostAttack!Consensus isimpossible!
Worse and worse...General sends orders via couriers, who might perish or get lost in the forest (no spies or traitors)Some of the armies are sick with flu. They wont fight and may not even have healthy couriers to participate in protocol(Message omission, crash and timing faults.)
Properties DesiredBimodal distribution of multicast deliveryOrdering: FIFO on sender basisStability: lets us garbage collect without violating bimodal distributionDetection of lost messages
More desired propertiesFormal model allows us to reason about how the primitive will behave in real systemsScalable protocol: costs (background messages, throughput) do not degrade as a function of system sizeExtremely stable throughput
Naming our new protocolBimodal doesnt lend itself to traditional contractions (bcast is taken, bicast and bimcast sound pretty awful)Focus on probabilistic nature of guaranteesCall it pbcast for sureHistorical footnote: these are multicast protocols but everyone uses bcast in names
EnvironmentWill assume that most links have known throughput and loss propertiesAlso assume that most processes are responsive to messages in bounded timeBut can tolerate some flakey links and some crashed or slow processes.
Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So initial state involves partial distribution of multicast(s)
Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesnt include them.
Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip
Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
Delivery? Garbage Collection?Deliver a message when it is in FIFO orderGarbage collect a message when you believe that no healthy process could still need a copy (we used to wait 10 rounds, but now are using gossip to detect this condition)Match parameters to intended environment
Need to bound costsWorries:Someone could fall behind and never catch up, endlessly loading everyone elseWhat if some process has lots of stuff others want and they bombard him with requests?What about scalability in buffering and in list of members of the system, or costs of updating that list?
OptimizationsRequest retransmissions most recent multicast firstIdea is to catch up quickly leaving at most one gap in the retrieved sequence
OptimizationsParticipants bound the amount of data they will retransmit during any given round of gossip. If too much is solicited they ignore the excess requests
OptimizationsLabel each gossip message with senders gossip round numberIgnore solicitations that have expired round number, reasoning that they arrived very late hence are probably no longer correct
OptimizationsDont retransmit same message twice in a row to any given destination (the copy may still be in transit hence request may be redundant)
OptimizationsUse IP multicast when retransmitting a message if several processes lack a copyFor example, if solicited twiceAlso, if a retransmission is received from far awayTradeoff: excess messages versus low latencyUse regional TTL to restrict multicast scope
ScalabilityProtocol is scalable except for its use of the membership of the full process groupUpdates could be costlySize of list could be costlyIn large groups, would also prefer not to gossip over long high-latency links
Router overload problemRandom gossip can overload a central routerYet information flowing through this router is of diminishing quality as rate of gossip risesInsight: constant rate of gossip is achievable and adequate
Hierarchical GossipWeight gossip so that probability of gossip to a remote cluster is smallerCan adjust weight to have constant load on routerNow propagation delays rise but just increase rate of gossip to compensate
Hierarchical Gossip
Remainder of talkShow results of formal analysisWe developed a model (wont do the math here -- nothing very fancy)Used model to solve for expected reliabilityThen show more experimental dataReal question: what would pbcast do in the Internet? Our experience: it works!
Idea behind analysisCan use the mathematics of epidemic theory to predict reliability of the protocolAssume an initial stateNow look at result of running B rounds of gossip: converges exponentially quickly towards atomic delivery
Chart1
0
0.0291968046
0.000581897
0.0000102291
0.0000001783
0.0000000033
0.0000000001
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0.0000000005
0.0000000364
0.0000035333
0.0006171974
0
number of processes to deliver pbcast
p{#processes=k}
Pbcast bimodal delivery distribution
Sheet1
00.00E+00
206.6298838.34570312.19E-036.43E-01101.78E-081.77E-0612.92E-02
255.9189458.653321.53.02E-031.02E+00152.18E-111.62E-0525.82E-04
305.440437.3408221.47E-039.52E-01203.42E-135.71E-0731.02E-05
355.0712897.5048832.53.92E-045.38E-01257.87E-161.66E-0641.78E-07
404.7636726.77343837.05E-052.07E-01308.89E-184.96E-0853.29E-09
454.531256.8759773.59.48E-066.13E-02352.60E-209.97E-0866.67E-11
504.3193366.33593841.01E-061.51E-02402.17E-223.10E-0971.53E-12
4.59.00E-083.26E-03458.26E-255.38E-0984.05E-14
56.85E-096.40E-04505.80E-271.76E-1091.26E-15
5.54.60E-101.16E-04553.67E-292.82E-10104.63E-17
62.88E-111.98E-05601.65E-311.56E-11112.04E-18
6.51.91E-123.20E-06121.08E-19
71.94E-135.07E-07136.95E-21
7.54.57E-148.72E-08145.45E-22
82.09E-142.03E-08155.23E-23
8.51.35E-147.74E-09166.16E-24
91.07E-144.37E-09178.93E-25
9.59.55E-153.08E-09181.60E-25
109.01E-152.50E-09193.53E-26
209.65E-27
213.27E-27
221.37E-27
237.13E-28
244.59E-28
253.67E-28
263.62E-28
274.43E-28
286.68E-28
291.24E-27
302.85E-27
318.00E-27
322.75E-26
331.16E-25
345.91E-25
353.67E-24
362.76E-23
372.51E-22
382.75E-21
393.62E-20
405.74E-19
411.10E-17
422.53E-16
437.06E-15
442.40E-13
451.01E-11
465.30E-10
473.64E-08
483.53E-06
496.17E-04
500.00E+00
Sheet1
Predicate I for 1E-9 reliability
Predicate II for 1E-12 reliability
#processes in system
fanout
Fanout required for a specified reliability
Sheet2
Predicate I
Predicate II
fanout
P{failure}
Effects of fanout on reliability
Sheet3
Predicate I
Predicate II
#processes in system
P{failure}
Scalability of Pbcast reliability
number of processes to deliver pbcast
p{#processes=k}
Pbcast bimodal delivery distribution
Failure analysisSuppose someone tells me what they hope to avoidModel as a predicate on final system stateCan compute the probability that pbcast would terminate in that state, again from the model
Two predicatesPredicate I: A faulty outcome is one where more than 10% but less than 90% of the processes get the multicast call this the Generals predicate because he views it as a disaster if many but not most of his troops attack
Two predicatesPredicate II: A faulty outcome is one where roughly half get the multicast and failures might conceal true outcome this would make sense if using pbcast to distribute quorum-style updates to replicated data. The costly hence undesired outcome is the one where we need to rollback because outcome is uncertain
Two predicatesPredicate I: More than 10% but less than 90% of the processes get the multicast Predicate II: Roughly half get the multicast but crash failures might conceal outcomeEasy to add your own predicate. Our methodology supports any predicate over final system state
Chart2
0.0000017650.0000000178
0.00001619960
0.00000057090
0.00000165980
0.00000004960
0.00000009970
0.00000000310
0.00000000540
0.00000000020
0.00000000030
00
Predicate I
Predicate II
#processes in system
P{failure}
Scalability of Pbcast reliability
Sheet1
00.00E+00
206.6298838.34570312.19E-036.43E-01101.78E-081.77E-0612.92E-02
255.9189458.653321.53.02E-031.02E+00152.18E-111.62E-0525.82E-04
305.440437.3408221.47E-039.52E-01203.42E-135.71E-0731.02E-05
355.0712897.5048832.53.92E-045.38E-01257.87E-161.66E-0641.78E-07
404.7636726.77343837.05E-052.07E-01308.89E-184.96E-0853.29E-09
454.531256.8759773.59.48E-066.13E-02352.60E-209.97E-0866.67E-11
504.3193366.33593841.01E-061.51E-02402.17E-223.10E-0971.53E-12
4.59.00E-083.26E-03458.26E-255.38E-0984.05E-14
56.85E-096.40E-04505.80E-271.76E-1091.26E-15
5.54.60E-101.16E-04553.67E-292.82E-10104.63E-17
62.88E-111.98E-05601.65E-311.56E-11112.04E-18
6.51.91E-123.20E-06121.08E-19
71.94E-135.07E-07136.95E-21
7.54.57E-148.72E-08145.45E-22
82.09E-142.03E-08155.23E-23
8.51.35E-147.74E-09166.16E-24
91.07E-144.37E-09178.93E-25
9.59.55E-153.08E-09181.60E-25
109.01E-152.50E-09193.53E-26
209.65E-27
213.27E-27
221.37E-27
237.13E-28
244.59E-28
253.67E-28
263.62E-28
274.43E-28
286.68E-28
291.24E-27
302.85E-27
318.00E-27
322.75E-26
331.16E-25
345.91E-25
353.67E-24
362.76E-23
372.51E-22
382.75E-21
393.62E-20
405.74E-19
411.10E-17
422.53E-16
437.06E-15
442.40E-13
451.01E-11
465.30E-10
473.64E-08
483.53E-06
496.17E-04
500.00E+00
Sheet1
00
00
00
00
00
00
00
Predicate I for 1E-9 reliability
Predicate II for 1E-12 reliability
#processes in system
fanout
Fanout required for a specified reliability
Sheet2
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
Predicate I
Predicate II
fanout
P{failure}
Effects of fanout on reliability
Sheet3
00
00
00
00
00
00
00
00
00
00
00
Predicate I
Predicate II
#processes in system
P{failure}
Scalability of Pbcast reliability
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
number of processes to deliver pbcast
p{#processes=k}
Pbcast bimodal delivery distribution
Chart3
0.6430.002185996
1.020.003018233
0.9520.001467333
0.5380.0003924257
0.2070.0000705303
0.06130.0000094769
0.01510.0000010141
0.003260.00000009
0.000640.0000000068
0.0001160.0000000005
0.00001980
0.00000320
0.0000005070
0.00000008720
0.00000002030
0.00000000770
0.00000000440
0.00000000310
0.00000000250
Predicate I
Predicate II
fanout
P{failure}
Effects of fanout on reliability
Sheet1
00.00E+00
206.6298838.34570312.19E-036.43E-01101.78E-081.77E-0612.92E-02
255.9189458.653321.53.02E-031.02E+00152.18E-111.62E-0525.82E-04
305.440437.3408221.47E-039.52E-01203.42E-135.71E-0731.02E-05
355.0712897.5048832.53.92E-045.38E-01257.87E-161.66E-0641.78E-07
404.7636726.77343837.05E-052.07E-01308.89E-184.96E-0853.29E-09
454.531256.8759773.59.48E-066.13E-02352.60E-209.97E-0866.67E-11
504.3193366.33593841.01E-061.51E-02402.17E-223.10E-0971.53E-12
4.59.00E-083.26E-03458.26E-255.38E-0984.05E-14
56.85E-096.40E-04505.80E-271.76E-1091.26E-15
5.54.60E-101.16E-04553.67E-292.82E-10104.63E-17
62.88E-111.98E-05601.65E-311.56E-11112.04E-18
6.51.91E-123.20E-06121.08E-19
71.94E-135.07E-07136.95E-21
7.54.57E-148.72E-08145.45E-22
82.09E-142.03E-08155.23E-23
8.51.35E-147.74E-09166.16E-24
91.07E-144.37E-09178.93E-25
9.59.55E-153.08E-09181.60E-25
109.01E-152.50E-09193.53E-26
209.65E-27
213.27E-27
221.37E-27
237.13E-28
244.59E-28
253.67E-28
263.62E-28
274.43E-28
286.68E-28
291.24E-27
302.85E-27
318.00E-27
322.75E-26
331.16E-25
345.91E-25
353.67E-24
362.76E-23
372.51E-22
382.75E-21
393.62E-20
405.74E-19
411.10E-17
422.53E-16
437.06E-15
442.40E-13
451.01E-11
465.30E-10
473.64E-08
483.53E-06
496.17E-04
500.00E+00
Sheet1
00
00
00
00
00
00
00
Predicate I for 1E-9 reliability
Predicate II for 1E-12 reliability
#processes in system
fanout
Fanout required for a specified reliability
Sheet2
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
Predicate I
Predicate II
fanout
P{failure}
Effects of fanout on reliability
Sheet3
00
00
00
00
00
00
00
00
00
00
00
Predicate I
Predicate II
#processes in system
P{failure}
Scalability of Pbcast reliability
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
number of processes to deliver pbcast
p{#processes=k}
Pbcast bimodal delivery distribution
Chart5
8.3457036.629883
8.653325.918945
7.340825.44043
7.5048835.071289
6.7734384.763672
6.8759774.53125
6.3359384.319336
Predicate I for 1E-8 reliability
Predicate II for 1E-12 reliability
#processes in system
fanout
Fanout required for a specified reliability
Sheet1
00.00E+00
206.6298838.34570312.19E-036.43E-01101.78E-081.77E-0612.92E-02
255.9189458.653321.53.02E-031.02E+00152.18E-111.62E-0525.82E-04
305.440437.3408221.47E-039.52E-01203.42E-135.71E-0731.02E-05
355.0712897.5048832.53.92E-045.38E-01257.87E-161.66E-0641.78E-07
404.7636726.77343837.05E-052.07E-01308.89E-184.96E-0853.29E-09
454.531256.8759773.59.48E-066.13E-02352.60E-209.97E-0866.67E-11
504.3193366.33593841.01E-061.51E-02402.17E-223.10E-0971.53E-12
4.59.00E-083.26E-03458.26E-255.38E-0984.05E-14
56.85E-096.40E-04505.80E-271.76E-1091.26E-15
5.54.60E-101.16E-04553.67E-292.82E-10104.63E-17
62.88E-111.98E-05601.65E-311.56E-11112.04E-18
6.51.91E-123.20E-06121.08E-19
71.94E-135.07E-07136.95E-21
7.54.57E-148.72E-08145.45E-22
82.09E-142.03E-08155.23E-23
8.51.35E-147.74E-09166.16E-24
91.07E-144.37E-09178.93E-25
9.59.55E-153.08E-09181.60E-25
109.01E-152.50E-09193.53E-26
209.65E-27
213.27E-27
221.37E-27
237.13E-28
244.59E-28
253.67E-28
263.62E-28
274.43E-28
286.68E-28
291.24E-27
302.85E-27
318.00E-27
322.75E-26
331.16E-25
345.91E-25
353.67E-24
362.76E-23
372.51E-22
382.75E-21
393.62E-20
405.74E-19
411.10E-17
422.53E-16
437.06E-15
442.40E-13
451.01E-11
465.30E-10
473.64E-08
483.53E-06
496.17E-04
500.00E+00
Sheet1
00
00
00
00
00
00
00
Predicate I for 1E-8 reliability
Predicate II for 1E-12 reliability
#processes in system
fanout
Fanout required for a specified reliability
Sheet2
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
00
Predicate I
Predicate II
fanout
P{failure}
Effects of fanout on reliability
Sheet3
00
00
00
00
00
00
00
00
00
00
00
Predicate I
Predicate II
#processes in system
P{failure}
Scalability of Pbcast reliability
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
0
number of processes to deliver pbcast
p{#processes=k}
Pbcast bimodal delivery distribution
DiscussionWe see that pbcast is indeed bimodal even in worst case, when initial multicast failsCan easily tune parameters to obtain desired guarantees of reliabilityProtocol is suitable for use in applications where bounded risk of undesired outcome is sufficient
Model makes assumptions...These are rather simplisticYet the model seems to predict behavior in real networks, anyhowIn effect, the protocol is not merely robust to process perturbation and message loss, but also to perturbation of the model itselfSpeculate that this is due to the incredible power of exponential convergence...
Experimental workSP2 is a large networkNodes are basically UNIX workstationsInterconnect is basically an ATM networkSoftware is standard Internet stack (TCP, UDP)We obtained access to as many as 128 nodes on Cornell SP2 in Theory CenterRan pbcast on this, and also ran a second implementation on a real network
Example of a questionCreate a group of 8 membersPerturb one member in style of Figure 1Now look at stability of throughputMeasure rate of received messages during periods of 100ms eachPlot histogram over life of experiment
Chart4
00
0.97019867550.9735099338
0.02317880790.0198675497
00.0066225166
0.00331125830
0.00331125830
00
00
00
00
00
00
00
00
&A
Page &P
Pbcast with .05 sleep probability
Pbcast with .45 sleep probability
Inter-arrival spacing (ms)
Probability of occurence
Histogram of throughput for pbcast
Histograms
fifo/.05BinFrequencyfifo/.45BinFrequencyPbcast/.05BinFrequencyPbcast/.45BinFrequency
Traditional Protocol with .05 sleep probabilityTraditional Protocol with .45 sleep probabilityPbcast with .05 sleep probabilityPbcast with .45 sleep probability
0.001870.005680.0046280.00530.0059370.00500.0060.0050
0.0031150.012130.0047730.01310.0059480.012930.0060080.01294
0.003410.015170.0049740.015290.0060080.01570.0060210.0156
0.0037610.0200.0051870.02230.0060250.0200.0060470.022
0.004010.02500.005270.02580.0060290.02510.0060660.0250
0.0042520.0300.0054040.0350.006030.0310.0060660.030
0.0042830.03500.0055680.03530.0060370.03500.0060660.0350
0.0045610.0400.0057090.0410.0060370.0400.0060660.040
0.004570.04500.0058120.04520.0060470.04500.0060670.0450
0.0045740.0500.0059050.0510.006050.0500.0060680.050
0.0045790.05500.0059310.05500.0060520.05500.0060690.0550
0.0045980.0610.0059450.0600.0060530.0600.0060740.060
0.0046010.06500.0060270.06520.0060540.06500.0060760.0650
0.0046040.0700.0062470.0700.0060540.0700.0060770.070
0.004611More00.006373More00.006057More00.006079More0
0.0046160.0064670.0060570.006079
0.0046220.0066860.0060580.006083
0.0046370.006760.0060590.006083
0.0046420.0070870.006060.006084
0.0046420.0074380.0060620.006085Traditional Protocol with .05 sleep probabilityTraditional Protocol with .45 sleep probabilityPbcast with .05 sleep probabilityPbcast with .45 sleep probability
0.0046630.0079020.0060640.0060860.00568300
0.0046640.0080030.0060650.0060870.0121331293294
0.0046730.0080320.006070.0060890.015172976
0.0046740.0080760.0060710.0060910.0202302
0.0046740.0080930.0060710.0060920.0250810
0.0046790.0088210.0060740.0060970.030510
0.0046820.0091610.0060740.0060970.0350300
0.0046840.0094390.0060760.0061020.040100
0.0046840.0095980.0060760.0061020.0450200
0.004690.0096980.0060770.0061060.050100
0.0046950.0097020.0060770.0061070.0550000
0.0046990.0098620.0060780.0061070.061000
0.0047180.009930.0060790.0061110.0650200
0.0047240.0099570.0060790.0061110.070000
0.0047340.010050.0060830.006112
0.0047350.0101240.0060830.006113299108302302
0.0047390.0106250.0060840.006113
0.0047630.0106520.0060850.006116
0.0047650.0108220.0060850.006117
0.0047660.0112220.0060850.006117Traditional Protocol with .05 sleep probabilityTraditional Protocol with .45 sleep probabilityPbcast with .05 sleep probabilityPbcast with .45 sleep probability
0.0047690.0112510.0060860.0061180.00568300
0.0047710.0115180.0060870.0061180.0121331293294
0.0047750.0117170.0060870.006120.015172976
0.004780.01190.0060880.006120.0202302
0.0047810.0120050.0060880.0061210.0250810
0.0047970.01210.0060890.0061210.030510
0.0048040.0121780.0060890.0061220.0350300
0.0048210.0123390.0060890.0061240.040100
0.0048240.0124610.006090.0061240.0450200
0.0048270.0126670.006090.0061240.050100
0.0048290.0127050.0060910.0061240.0550000
0.0048380.0127760.0060910.0061250.061000
0.0048470.012880.0060920.0061260.0650200
0.004850.0128810.0060920.0061260.070000
0.0048620.0129590.0060920.006126
0.0048630.013270.0060930.006126299108302302
0.0049160.0135170.0060950.006127
0.0049190.0135390.0060950.006129Traditional Protocol with .05 sleep probabilityTraditional Protocol with .45 sleep probabilityPbcast with .05 sleep probabilityPbcast with .45 sleep probability
0.0049210.013810.0060970.0061310.0050.22742474920.027777777800
0.0049280.0142010.0060980.0061330.010.71237458190.2870370370.97019867550.9735099338
0.0049290.0143540.0060990.0061330.0150.05685618730.26851851850.02317880790.0198675497
0.0049310.0144220.00610.0061330.0200.21296296300.0066225166
0.0049380.0148380.00610.0061340.02500.07407407410.00331125830
0.0049420.0152750.0061010.0061360.0300.04629629630.00331125830
0.0049440.0153410.0061010.0061370.03500.027777777800
0.0049570.0156270.0061020.0061380.0400.009259259300
0.004960.0156410.0061030.006140.04500.018518518500
0.0049610.0157230.0061030.0061410.0500.009259259300
0.0050030.0157680.0061030.0061410.0550000
0.005010.0158320.0061040.0061420.060.0033444816000
0.0050160.0159540.0061040.0061430.06500.018518518500
0.0050320.0162180.0061050.0061430.070000
0.0050390.0162650.0061050.006144
0.0050460.0164180.0061050.006144
0.0050460.0168060.0061050.006144
0.0050590.0168380.0061060.006145
0.0050680.016980.0061060.006145
0.0050830.0174190.0061060.006145
0.0051180.0174540.0061070.006146
0.0051290.0176560.0061070.006146
0.0051310.0182410.0061070.006146
0.0051510.0186210.0061090.006147
0.0051710.0186810.0061090.006148
0.0051720.0189150.006110.006148
0.0051950.0191280.0061110.006149
0.0051960.0199140.0061110.006149
0.0051970.0206180.0061110.00615
0.0052340.0206230.0061110.00615
0.0052530.0209220.0061110.00615
0.005260.0209340.0061120.006151
0.0052820.0211980.0061130.006151
0.0052880.0213060.0061130.006152
0.0052940.0219240.0061140.006153
0.0053230.0222670.0061140.006153
0.0053360.0254720.0061140.006153
0.0053520.0264850.0061140.006154
0.0053610.0266770.0061150.006154
0.0054680.0284970.0061150.006154
0.0055610.0294520.0061150.006155
0.0055940.0316370.0061150.006155
0.0056340.0317270.0061150.006156
0.0056470.0320650.0061160.006157
0.0056550.0369850.0061170.006158
0.0056560.0417480.0061180.006158
0.0056690.0439830.0061180.006158
0.0056810.0450930.0061180.006159
0.0056930.0625090.0061180.006159
0.0057060.0628730.006120.00616
0.005720.006120.006162
0.0057250.006120.006163
0.0057270.0061210.006163
0.0057320.0061220.006164
0.0057410.0061220.006166
0.0057440.0061230.006166
0.0058090.0061230.006166
0.0058320.0061230.006166
0.005840.0061230.006166
0.0058430.0061240.006168
0.0058530.0061240.00617
0.0058550.0061250.00617
0.005870.0061250.00617
0.0058720.0061250.006171
0.0058840.0061250.006172
0.0058940.0061260.006172
0.0059070.0061270.006173
0.0059360.0061280.006173
0.005940.0061280.006173
0.0059580.0061290.006174
0.0059690.0061290.006175
0.0059860.0061290.006176
0.0060.0061290.006176
0.0060080.0061290.006177
0.0060150.0061290.006179
0.0060250.0061290.00618
0.0060340.006130.006181
0.0060440.006130.006181
0.0060510.0061310.006185
0.0060660.0061320.006186
0.0060820.0061320.006187
0.0060910.0061320.006187
0.0060960.0061340.006187
0.0061060.0061340.006188
0.0061140.0061360.006189
0.0061310.0061370.006189
0.0061490.0061380.00619
0.0061590.0061380.006191
0.0061710.0061380.006192
0.0061950.0061390.006193
0.0062010.006140.006193
0.0062160.006140.006193
0.0062190.0061410.006195
0.0062350.0061420.006196
0.0062420.0061420.006199
0.0062540.0061420.006201
0.0062630.0061420.006202
0.0062730.0061430.006202
0.0062790.0061430.006203
0.006280.0061430.006204
0.0062910.0061430.006204
0.0063210.0061450.006205
0.0063320.0061450.006205
0.0063390.0061460.006206
0.0063450.0061470.006206
0.0063470.0061470.006207
0.0063480.0061470.006208
0.006350.0061490.006208
0.0063550.006150.006208
0.0063560.0061510.006211
0.006370.0061510.006211
0.006380.0061520.006214
0.0063870.0061530.006216
0.0063880.0061530.006218
0.0063910.0061540.006221
0.0063960.0061540.006222
0.0064090.0061550.006223
0.0064180.0061550.006223
0.0064210.0061550.006224
0.0064220.0061550.006224
0.0064340.0061560.006224
0.0064450.0061560.006227
0.0064470.0061560.00623
0.0064470.0061570.006231
0.0064510.0061570.006234
0.0064690.0061570.006234
0.0064690.0061580.006235
0.0064740.0061590.006235
0.0064810.0061590.006236
0.0064810.0061610.006237
0.0064860.0061620.006238
0.0064940.0061620.00624
0.0064970.0061630.00624
0.0065090.0061640.006241
0.0065170.0061640.006242
0.006520.0061650.006247
0.0065280.0061650.006247
0.0065630.0061650.00625
0.0065770.0061670.00625
0.0065970.0061680.006252
0.006630.0061690.006252
0.0066380.0061690.006254
0.0066550.0061720.006254
0.0066640.0061720.006254
0.0066650.0061730.006255
0.0066650.0061740.006256
0.0066810.0061740.006258
0.006690.0061750.00626
0.0067040.0061760.006261
0.0067380.0061760.006262
0.006760.0061770.006263
0.0067620.0061770.006263
0.0067660.0061790.006264
0.0067850.006180.006266
0.0067960.006180.006268
0.0068020.0061810.006268
0.0068370.0061810.006268
0.0068770.0061820.006271
0.0068820.0061820.006272
0.0069320.0061820.006274
0.0069440.0061830.006274
0.0069620.0061840.006274
0.006970.0061840.006276
0.0069780.0061850.006276
0.0069840.0061860.006279
0.0070170.0061940.006281
0.0070320.0061950.006283
0.0070570.0061980.006283
0.0070910.0061980.006289
0.0070950.0061990.006289
0.0071230.0062030.006291
0.0071390.0062050.006292
0.0071480.0062060.006293
0.007150.0062070.006295
0.0071780.0062070.006299
0.0071780.0062080.006301
0.0071950.0062080.006302
0.0072520.0062090.006302
0.0072660.0062120.006303
0.0073430.0062120.006303
0.0073430.0062170.006308
0.0073470.0062180.006313
0.0073490.0062180.006318
0.0073570.0062210.006322
0.0074040.0062210.006323
0.0076160.0062250.006328
0.0076390.0062250.006335
0.0077020.0062250.006338
0.0077290.0062250.006342
0.0078370.0062280.006345
0.0079060.0062320.006348
0.0079070.0062330.006356
0.0079720.0062340.006367
0.0080220.0062340.006375
0.0080290.0062360.006381
0.0081680.0062490.006381
0.0081950.0062640.006384
0.0082360.006270.006386
0.0082380.0062770.006387
0.0082420.0062890.006389
0.0082580.0062980.006392
0.0082750.0063080.006435
0.0082990.0063270.00645
0.0083840.0063340.006461
0.0084050.0063370.006463
0.0084890.0063370.006477
0.0085030.0063420.006483
0.0085470.0063670.006513
0.0085970.0063740.006566
0.0085980.0063940.006603
0.0086010.0064260.006642
0.0087170.0064260.006692
0.008720.0064780.006711
0.008890.0064920.006847
0.0089120.006510.00686
0.0089650.0065450.00686
0.009020.0066230.006887
0.009260.006710.006957
0.0093070.0067880.007104
0.009590.0069060.007113
0.009640.0070740.007295
0.0099740.0070970.007297
0.0099940.0071120.007391
0.0100220.0072970.007416
0.010040.0073110.007757
0.0100790.0074690.00821
0.0101070.0074820.008302
0.010240.0075290.008333
0.010530.0079130.008349
0.0109420.007930.0085
0.0113920.007980.00876
0.0115780.0082320.008958
0.0116620.0086610.00903
0.0116820.0087670.009121
0.0118010.0092540.009175
0.0119930.0101640.009281
0.0137070.0104310.010412
0.0138190.0110130.011389
0.0140370.0127690.011573
0.0144880.0129210.012047
0.0566960.0135720.013404
0.0139560.0143
0.021560.018217
0.0280.01829
&A
Page &P
Histograms
00
00
00
00
00
00
00
00
00
00
00
00
00
00
&A
Page &P
Traditional Protocol with .05 sleep probability
Traditional Protocol with .45 sleep probability
Time to receive 100 messages
Probability of occurence
Histogram of throughput for Traditional Protocol
g10.003fifo
00
00
00
00
00
00
00
00
00
00
00
00
00
00
&A
Page &P
Pbcast with .05 sleep probability
Pbcast with .45 sleep probability
Time to receive 100 messages
Probability of occurence
Histogram of throughput for PBCast
1\1\g1\5\g2\5\g
Fifo/highPbcast/highFifo/lowPbcast/lowFIFO/hPbcast/hFIFO/lPbcast/lFIFO/hPbcast/hFIFO/lPbcast/l
Traditional w/1 sleeperPbcast w/1 sleeperTraditional w/1 sleeperPbcast w/1 sleeperTraditional w/5 sleepersPbcast w/5 sleepersTraditional w/5 sleepersPbcast w/5 sleepers
0.05151.262153.82599.999699.9981123.82150.75799.999999.999277.5588265.736161.412199.997
0.15126.037153.98699.999799.998174.0949153.16100.00299.997771.1499264.722136.632199.995
0.25101.96155.231100.00499.640765.026150.35796.038399.99967.6259262.719116.164199.997
0.3577.0642145.21999.999799.999750.6761150.62573.306199.995559.6272267.606106.311199.992
0.4563.9061153.02996.421199.994439.7611153.33158.5499.995752.2691260.94587.0828199.996
0.5550.3154152.36775.08999.998431.1254151.62743.980499.996431.1254151.62743.980499.9964
0.6539.4076153.88853.555599.99821.6599153.26533.497899.996921.6599153.26533.497899.9969
0.7526.2399149.92739.227299.999114.7746153.56323.037596.84914.7746153.56323.037596.849
0.8516.0008153.21722.373599.99799.07249152.90212.206499.99879.07249152.90212.206499.9987
0.955.67649153.628.4509999.99772.9353156.2562.935399.99922.9353156.2562.935399.9992
1/1/b
FIFO/hPbcast/hFIFO/lPbcast/l
1\3\gThroughput for traditional protocol, measured at faulty hostThroughput for Pbcast, measured at faulty host
Traditional w/3 sleepersPbcast w 3/sleepers151.261153.83199.9998100.003
154.991153.64399.9998153.25799.8381100.001
112.362151.774102.277149.679100.00499.2005
80.445151.8778.8386126.51799.999498.7813
64.5427149.9563.8418116.21895.878116.218
50.2844155.49250.027899.75174.765880.7126Throughput for traditional protocol, measured at correct host
41.3491151.82839.078177.962453.128863.896
25.3831153.22625.773953.385638.951343.8917Throughput for PBCast, measured at correct host
19.4144150.58415.333730.327821.954330.3278
9.07342153.4564.879068.210887.879528.21088
3.4324152.63
&A
Page &P
0000
0000
0000
0000
0000
0000
0000
0000
0000
0000
&A
Page &P
Throughput for traditional protocol, measured at correct host
Throughput for PBCast, measured at correct host
Throughput for traditional protocol, measured at faulty host
Throughput for Pbcast, measured at faulty host
Probability of Sleep Event
Average Throughput
High Bandwidth comparison of PBCast performance atfaulty and correct hosts
000000
000000
000000
000000
000000
000000
000000
000000
000000
000000
&A
Page &P
Traditional w/1 sleeper
Pbcast w/1 sleeper
Traditional w/3 sleepers
Pbcast w 3/sleepers
Traditional w/5 sleepers
Pbcast w/5 sleepers
Probability of sleep event
Throughput measured at unperturbed process
High Bandwidth measurements with varying numbers of sleepers
0000
0000
0000
0000
0000
0000
0000
0000
0000
0000
&A
Page &P
Traditional w/1 sleeper
Pbcast w/1 sleeper
Traditional w/5 sleepers
Pbcast w/5 sleepers
Probability of Sleep Event
Average Throughput
Low Bandwidth measurements with varying numbers of sleepers
Now revisit Figure 1 in detailTake 8 machinesPerturb 1Pump data in at varying rates, look at rate of received messages
Throughput with perturbed processes
Throughput variation as a function of scale
Impact of packet loss on reliability and retransmission rateNotice that when network becomes overloaded, healthy processes experience packet loss!
What about growth of overhead?Look at messages other than original data distribution multicastMeasure worst case scenario: costs at main generator of multicastsSide remark: all of these graphs look identical with multiple senders or if overhead is measured elsewhere.
Growth of Overhead?Clearly, overhead does growWe know it will be bounded except for probabilistic phenomenaAt peak, load is still fairly low
Pbcast versus SRM, 0.1% packet loss rate on all linksTree networks
Star networks
Pbcast versus SRM: link utilization
Pbcast versus SRM: 300 members on a 1000-node tree, 0.1% packet loss rate
Pbcast Versus SRM: Interarrival Spacing
Pbcast versus SRM: Interarrival spacing (500 nodes, 300 members, 1.0% packet loss)
Real Data: Spinglass on a 10Mbit ethernet (35 Ultrasparcs)Injected noise, retransmission limit disabledInjected noise, retransmission limit re-enabled
0
10
20
30
40
50
60
70
80
90
100
0
20
40
60
80
100
120
140
160
180
200
sec
Networks structured as clusters
Delivery latency in a 2-cluster LAN, 50% noise between clusters, 1% elsewhere
Requests/repairs and latencies with bounded router bandwidth
DiscussionSaw that stability of protocol is exceptional even under heavy perturbationOverhead is low and stays low with system size, bounded even for heavy perturbationThroughput is extremely steadyIn contrast, virtual synchrony and SRM both are fragile under this sort of attack
Programming with pbcast?Most often would want to split application into multiple subsystemsUse pbcast for subsystems that generate regular flow of data and can tolerate infrequent loss if risk is boundedUse stronger properties for subsystems with less load and that need high availability and consistency at all times
Programming with pbcast?In stock exchange, use pbcast for pricing but abcast for control operationsIn hospital use pbcast for telemetry data but use abcast when changing medicationIn air traffic system use pbcast for routine radar track updates but abcast when pilot registers a flight plan change
Our vision: One protocol side-by-side with the otherUse virtual synchrony for replicated data and control actions, where strong guarantees are needed for safetyUse pbcast for high data rates, steady flows of information, where longer term properties are critical but individual multicast is of less critical importance
Next stepsCurrently exploring memory use patterns and scheduling issues for the protocolWill study:How much memory is really needed? Can we configure some processes with small buffers?When to use IP multicast and what TTL to useExploiting network topology informationApplication feedback options
SummaryNew data point in spectrumVirtual synchrony protocols (atomic multicast as normally implemented)Bimodal probabilistic multicastScalable reliable multicastDemonstrated that pbcast is suitable for analytic workSaw that it has exceptional stability
EpilogueBaron: Although successful, your attack was reckless! It is impossible to achieve consensus on attack under such conditions!General: I merely took a calculated risk.