Scalable Group Communication for the Internet
Idit Keidar
MIT Lab for Computer Science
Theory of Distributed Systems Group
The main part of this talk is joint work with Sussman, Marzullo, and Dolev
Collaborators
• Tal Anker
• Ziv Bar-Joseph
• Gregory Chockler
• Danny Dolev
• Alan Fekete
• Nabil Huleihel
• Kyle Ingols
• Roger Khazan
• Carl Livadas
• Nancy Lynch
• Keith Marzullo
• Yoav Sasson
• Jeremy Sussman
• Alex Shvartsman
• Igor Tarashchanskiy
• Roman Vitenberg
• Esti Yeger-Lotem
Outline
• Motivation
• Group communication - background
• A novel architecture for scalable group communication services in WAN
• A new scalable group membership algorithm
– Specification
– Algorithm
– Implementation
– Performance
• Conclusions
Modern Distributed Applications (in WANs)
• Highly available servers
– Web
– Video-on-Demand
• Collaborative computing
– Shared white-board, shared editor, etc.
– Military command and control
– On-line strategy games
• Stock market
Important Issues in Building Distributed Applications
• Consistency of view
– Same picture of game, same shared file
• Fault tolerance, high availability
• Performance
– Conflicts with consistency?
• Scalability
– Topology - WAN, long unpredictable delays
– Number of participants
Generic Primitives - Middleware, “Building Blocks”
• E.g., total order, group communication
• Abstract away difficulties, e.g.,
– Total order - a basis for replication (see the sketch below)
– Mask failures
• Important issues:
– Well-specified semantics - complete
– Performance
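To make the total-order point concrete, here is a minimal sketch (my illustration, not from the talk): if every replica applies the same deterministic operations in the order chosen by the group communication layer, all replicas stay in the same state.

#include <map>
#include <string>

// Illustrative replicated key-value store (assumed names, not from the talk).
// The group communication layer delivers operations in the same total order
// at every replica; applying them deterministically keeps replicas identical.
struct Replica {
    std::map<std::string, std::string> state;

    // Called once per operation, in the agreed total order.
    void apply(const std::string& key, const std::string& value) {
        state[key] = value;  // deterministic update
    }
};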
Research Approach
• Rigorous modeling, specification, proofs, performance analysis
• Implementation and performance tuning
• Services ↔ Applications
• Specific examples → General observations
Group Communication
[Diagram: a process issues Send(G) to multicast to the group G.]
• Group abstraction - a group of processes is one logical entity
• Dynamic Groups (join, leave, crash)
• Systems: Ensemble, Horus, ISIS, Newtop, Psync, Sphynx, Relacs, RMP, Totem, Transis
Virtual Synchrony [Birman, Joseph 87]
• Group members all see events in same order
– Events: messages, process crash/join
• Powerful abstraction for replication
• Framework for fault tolerance, high availability
• Basic component: group membership
– Reports changes in set of group members
Example: Highly Available VoD [Anker, Dolev, Keidar, ICDCS 1999]
• Dynamic set of servers
• Clients talk to “abstract” service
• Server can crash, client shouldn’t know
[Diagram: clients connect to the abstract VoD Service, which is implemented by server 1 and server 2.]
VoD Service: Exploiting Group Communication
• Group abstraction for connection establishment and transparent migration (with simple clients)
• Membership services detect conditions for migration - fault tolerance and load balancing
• Reliable group multicast among servers for consistently sharing information
• Virtual Synchrony allows servers to agree upon migration immediately (no message exchange) - see the sketch below
• Reliable messages for control
• Server: ~2500 C++ lines
– All fault tolerance logic at server
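A minimal sketch of how such immediate agreement can work (my illustration with assumed names, not the actual VoD server code): every surviving server applies the same deterministic rule to the same agreed view, so all servers pick the same new server for each client without exchanging any messages.

#include <functional>
#include <string>
#include <vector>

// Illustrative only: an agreed view as delivered by the membership service.
struct View {
    std::vector<std::string> servers;  // surviving VoD servers, identical at all members
    long id;                           // view identifier
};

// Deterministic rule run by every server on the same (non-empty) view.
// All servers compute the same owner, so a crashed server's clients can be
// migrated with no extra message exchange.
std::string ownerOf(const std::string& clientId, const View& v) {
    std::size_t h = std::hash<std::string>{}(clientId);
    return v.servers[h % v.servers.size()];
}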
Related Projects
[Project map, spanning group communication services and applications: Moshe: Group Membership (ICDCS 00); Architecture for Group Membership in WAN (DIMACS 98); Specification Survey (99); Virtual Synchrony (ICDCS 00); Inheritance-based Modeling (ICSE 00); Object Replication (PODC 96); CSCW (NGITS 97); Highly Available VoD (ICDCS 99); Dynamic Voting (PODC 97); QoS Support (TINA 96, OPODIS 00); Optimistic VS (SRDS 00).]
A Scalable Architecture for Group Membership in WANs
Tal Anker, Gregory Chockler, Danny Dolev, Idit Keidar
DIMACS Workshop 1998
Scalable Membership Architecture
• Dedicated distributed membership servers - “divide and conquer”
– Servers involved only in membership changes
– Members communicate with each other directly (implement “virtual synchrony”)
• Two levels of membership
– Notification Service NSView - “who is around”
– Agreed membership views
Architecture
[Diagram: processes A-E spread across a WAN, each attached to a membership server. The Notification Service (NS) reports the NSView ("who is around": failures, joins, leaves); the membership servers deliver agreed views, e.g., members {A,B,C,D,E} with identifier 7.]
The Notification Service (NS)
• Group members send requests directly to the (local) NS:
– join(Group G)
– leave(Group G)
• NS detects faults (member / domain)
• Information propagated to all NS servers
• NS servers notify membership servers of the new NSView
The NS Communication: Reliable FIFO links
• Membership servers can send each other messages using NS
• FIFO order: if S1 sends m1 and later m2, then any server which receives both receives m1 first.
• Reliable links: if S1 sends m to S2, then eventually either S2 receives m or S1 suspects S2 (and all of its clients).
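An illustrative C++ interface for these properties (the names are my assumptions, not CONGRESS's or Moshe's actual API): the NS reports NSViews, carries server-to-server messages in per-sender FIFO order, and turns undeliverable messages into suspicions.

#include <set>
#include <string>
#include <vector>

using ProcId = std::string;
using ServerId = std::string;

// "Who is around" for one group, as reported by the NS.
struct NSView {
    std::string group;
    std::set<ProcId> members;
};

// Sketch of the NS as seen by a membership server; upcalls and the send
// downcall are folded into one interface for brevity.
class NotificationService {
public:
    virtual ~NotificationService() = default;

    // Downcall: send m to another membership server over the NS.
    virtual void send(const ServerId& to, const std::vector<char>& m) = 0;

    // Upcall: a membership change (join/leave/failure) produced a new NSView.
    virtual void onNSView(const NSView& view) = 0;

    // Upcall: message from another server; per sender, delivered in FIFO order.
    virtual void onMessage(const ServerId& from, const std::vector<char>& m) = 0;

    // Upcall: reliable links - if a sent message is not delivered, the sender
    // eventually suspects the destination (and all of its clients).
    virtual void onSuspect(const ServerId& server) = 0;
};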
Moshe: A Group Membership Algorithm for WANs
Idit Keidar, Jeremy Sussman, Keith Marzullo, Danny Dolev
ICDCS 2000
Membership in WAN: the Challenge
• Message latency is large and unpredictable
• Frequent message loss - time-out failure detection is inaccurate
We use a notification service (NS) for WANs
• The number of communication rounds matters
• Algorithms may change views frequently
• View changes require communication for state transfer, which is costly in WAN
Moshe’s Novel Concepts
• Designed for WANs from the ground up
– Previous systems emerged from LANs
• Avoids delivery of “obsolete” views
– Views that are known to be changing
– Not always terminating (but the NS is)
• Runs in a single round (“typically”)
Member-Server Interaction
[Diagram: a member interacts with its Moshe/NS server; the server sends startChange to the member and later delivers the new view.]
Moshe Guarantees
• View - <members, identifier>
– Identifier is monotonically increasing
• Conditional liveness property: agreement on views
– If “all” eventually have the same last NSView, then “all” eventually agree on the last view
• No delivery of obsolete views
• Composable
– Allows reasoning about individual components
– Useful for applications
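As a small formalization of the view guarantee (mine, not Moshe's code): a view is a member set plus an identifier, and a member can assert that identifiers of delivered views only increase.

#include <cassert>
#include <set>
#include <string>

// A delivered view: member set and identifier (illustrative types).
struct MosheView {
    std::set<std::string> members;
    long identifier;
};

// A member-side check of the monotonicity guarantee.
struct ViewWatcher {
    long lastId = -1;
    void onView(const MosheView& v) {
        assert(v.identifier > lastId);  // identifiers are monotonically increasing
        lastId = v.identifier;
    }
};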
Moshe Operation: Typical Case
• In response to a new NSView (members):
– send a proposal, with the NSView, to the other servers
– send startChange to local members (clients)
• Once proposals from all servers of the NSView's members arrive, deliver the view to local members:
– members - the NSView
– identifier - higher than all previous
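A rough per-server sketch of this one-round flow (my reconstruction from the bullets above, with assumed names and the communication layer stubbed out by prints):

#include <algorithm>
#include <iostream>
#include <map>
#include <set>
#include <string>

// Illustrative only - not Moshe's actual code.
struct TypicalCaseServer {
    std::string me;
    long lastViewId = 0;
    std::set<std::string> nsView;            // servers involved in the current NSView
    std::map<std::string, long> proposals;   // proposal received from each server

    // A new NSView arrives from the notification service.
    void onNSView(const std::set<std::string>& servers) {
        nsView = servers;
        proposals.clear();
        std::cout << "send startChange to local members\n";                  // stub
        std::cout << "send proposal " << lastViewId + 1 << " to servers\n";  // stub
        onProposal(me, lastViewId + 1);      // our own proposal counts too
    }

    // A proposal arrives from some server (possibly ourselves).
    void onProposal(const std::string& from, long proposedId) {
        if (nsView.count(from) == 0) return;            // not a server of this NSView
        proposals[from] = proposedId;
        if (proposals.size() == nsView.size()) {        // full house of proposals
            for (const auto& p : proposals)
                lastViewId = std::max(lastViewId, p.second);  // id higher than all previous
            std::cout << "deliver view <NSView, " << lastViewId << "> to local members\n";
        }
    }
};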
Goal: Self Stabilizing
• Once the same last NSView is received by all servers:
– All send proposals for this NSView
– All the proposals reach all the servers
– All servers use these proposals to deliver the same view
And they live happily ever after!
Out-of-Sync Case: unexpected proposal
[Diagram: message flow among servers A, B, and C in which some messages are lost and a server receives an unexpected proposal.]
• To avoid deadlock: A must respond
• Extra proposals are redundant; responding with a proposal may cause live-lock
Out-of-Sync Case: unexpected proposal
[Diagram: servers A, B, and C handle the membership events -C, +C, and +AB+C and deliver views.]
Out-of-Sync Case: missing proposal
[Diagram: servers A, B, and C handle the membership events -C, +C, and +AB+C; one server is left waiting for a proposal that never arrives while the others deliver views.]
• This case was exposed by the correctness proof
Missing Proposal Detection
[Diagram: views carry the proposal numbers used to deliver them (Used: 1,1,1); a server compares Used[C] with its recorded PropNum (Used[C] = 1 = PropNum) and detects that a fresh proposal is missing.]
Handling Out-of-Sync Cases: “Slow Agreement”
• Also sends proposals, tagged “SA”
• Invoked upon blocking detection or upon receipt of an “SA” proposal
• Upon receipt of an “SA” proposal with a bigger number than PropNum, respond with the same number
• Deliver a view only with a “full house” of same-number proposals
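A rough sketch of these rules (my reading of the bullets above; the data types and message plumbing are assumptions, not the actual implementation):

#include <map>
#include <set>
#include <string>

// Illustrative only - not Moshe's actual code.
struct SlowAgreement {
    std::string me;
    std::set<std::string> servers;           // servers of the current NSView (including me)
    long propNum = 0;                        // number of our latest proposal
    std::map<std::string, long> saProposals; // latest "SA" proposal seen per server

    // Called on blocking detection or on receipt of an "SA" proposal.
    void onSAProposal(const std::string& from, long num) {
        saProposals[from] = num;
        if (num > propNum) {                 // bigger number than PropNum:
            propNum = num;                   // respond with the same number
            saProposals[me] = num;           // stands in for broadcasting our SA proposal
        }
        // Deliver a view only with a "full house" of same-number proposals.
        if (saProposals.size() != servers.size()) return;
        for (const auto& p : saProposals)
            if (p.second != propNum) return;
        // ... deliver the view with identifier propNum (delivery to members omitted)
    }
};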
How Typical is the “typical” Case?
• Depends on the notification service (NS)
– Classify NS good behaviors: symmetric and transitive perception of failures
• Transitivity depends on logical topology, how suspicions propagate
• Typical case should be very common
• Need to measure
Implementation
• Use CONGRESS [Anker et al]
– NS for WAN
– Always symmetric, can be non-transitive
– Logical topology can be configured
• Moshe servers extend CONGRESS servers
• Socket interface with processes
The Experiment
• Run over the Internet
– In the US: MIT, Cornell (CU), UCSD
– In Taiwan: NTU
– In Israel: HUJI
• Run for 10 days in one configuration, 2.5 days in another
• 10 clients at each location
– Continuously join/leave 10 groups
Two Experiment Configurations
[Diagram: two logical topologies connecting the five sites MIT, UCSD, CU, NTU, and HUJI.]
Percentage of “Typical” Cases
• Configuration 1:
– MIT: 10,786 views, 10,661 in one round - 98.84%
– Other sites: 98.8%, 98.9%, 98.97%, 98.85%
• Configuration 2:
– MIT: 2,559 views, 2,555 in one round - 99.84%
– Other sites: 99.82%, 99.79%, 99.81%, 99.84%
• Overwhelming majority for one round!
• Depends on topology - can scale
Performance: Surprise!
[Histogram of Moshe duration (milliseconds) vs. number of runs: MIT, configuration 1, runs up to 4 seconds (97%).]
Performance: Part II
[Histogram of Moshe duration (milliseconds) vs. number of runs: MIT, configuration 2, runs up to 3 seconds (99.7%).]
Performance over the Internet: What is Going on?
• Without message loss, running time is close to the biggest round-trip time, ~650 ms.
– As expected
• Message loss has a big impact
• Configuration 2 has much less loss, hence more cases of good performance
“Slow” versus “Typical”
• Slow can take 1 or 2 rounds once it is run
– Depending on PropNum
• Slow after NE
– One-round is run first, then detection, and slow
– Without loss - 900 ms., 40% more than usual
• Slow without NE
– Detection by unexpected proposal
– Only slow algorithm is run
– Runs less time than one-round
Unstable Periods: No Obsolete Views
• “Unstable” =
– constant changes; or
– connected processes differ in failure detection
• Configuration 1:
– 379 of the 10,786 views took over 4 seconds, 3.5%
– 167 took over 20 seconds, 1.5%
– Longest running time: 32 minutes
• Configuration 2:
– 14 of 2,559 views took over 4 seconds, 0.5%
– Longest running time: 31 seconds
Scalability Measurements
• Controlled experiment at MIT and UCSD
– Prototype NS, based on TCP/IP (Sasson)
– Inject faults to test “slow” case
• Vary number of members, servers
• Measure end-to-end latencies at the member, from join/leave/suspicion to the corresponding view
• Average of 150 (50 slow) runs
End-to-End Latency: Scalable!
• Member scalability: 4 servers (constant)
[Graph: latency (milliseconds) vs. number of members, for sync and out-of-sync runs.]
• Server and member scalability: 4-14 servers
[Graph: latency (milliseconds) vs. number of members, for sync and out-of-sync runs.]
Conclusion: Moshe Features
• Avoiding obsolete views
• A single round
– 98% of the time in one configuration
– 99.8% of the time in another
• Using a notification service for WANs
– Good abstraction
– Flexibility to configure multiple ways
– Future work: configure more ways
• Scalable “divide and conquer” architecture
Retrospective: Role of Theory
• Specification
– Possible to implement
– Useful for applications (composable)
• Specification can be met in one round “typically” (unlike Consensus)
• Correctness proof exposes subtleties
– Need to avoid live-lock
– Two types of detection mechanisms needed
Future Work: The QoS Challenge
• Some distributed applications require QoS
– Guaranteed available bandwidth
– Bounded delay, bounded jitter
• The membership algorithm terminates in one round under certain circumstances
– Can we leverage that to guarantee QoS under certain assumptions?
• Can other primitives guarantee QoS?