Scalable Group Communication for the Internet Idit Keidar MIT Lab for Computer Science Theory of Distributed Systems Group The main part of this talk is joint work with Sussman, Marzullo and Dolev
Date posted: 21-Dec-2015

Page 1:

Scalable Group Communication for the Internet

Idit Keidar

MIT Lab for Computer Science

Theory of Distributed Systems Group

The main part of this talk is joint work with Sussman, Marzullo and Dolev

Page 2:

Collaborators

• Tal Anker
• Ziv Bar-Joseph
• Gregory Chockler
• Danny Dolev
• Alan Fekete
• Nabil Huleihel
• Kyle Ingols
• Roger Khazan
• Carl Livadas
• Nancy Lynch
• Keith Marzullo
• Yoav Sasson
• Jeremy Sussman
• Alex Shvartsman
• Igor Tarashchanskiy
• Roman Vitenberg
• Esti Yeger-Lotem

Page 3:

Outline

• Motivation
• Group communication - background
• A novel architecture for scalable group communication services in WAN
• A new scalable group membership algorithm
  – Specification
  – Algorithm
  – Implementation
  – Performance
• Conclusions

Page 4:

Modern Distributed Applications (in WANs)

• Highly available servers
  – Web
  – Video-on-Demand
• Collaborative computing
  – Shared white-board, shared editor, etc.
  – Military command and control
  – On-line strategy games
• Stock market

Page 5:

Important Issues in Building Distributed Applications

• Consistency of view
  – Same picture of game, same shared file
• Fault tolerance, high availability
• Performance
  – Conflicts with consistency?
• Scalability
  – Topology - WAN, long unpredictable delays
  – Number of participants

Page 6:

Generic Primitives - Middleware, “Building Blocks”

• E.g., total order, group communication
• Abstract away difficulties, e.g.,
  – Total order - a basis for replication
  – Mask failures
• Important issues:
  – Well specified semantics - complete
  – Performance

Page 7:

Research Approach

• Rigorous modeling, specification, proofs, performance analysis

• Implementation and performance tuning

• Services Applications

• Specific examples General observations

Page 8:

Group Communication

[Figure: a group G addressed as a single entity with Send(G)]

• Group abstraction - a group of processes is one logical entity
• Dynamic Groups (join, leave, crash)

Systems: Ensemble, Horus, ISIS, Newtop, Psync, Sphynx, Relacs, RMP, Totem, Transis

Page 9:

Virtual Synchrony [Birman, Joseph 87]

• Group members all see events in same order
  – Events: messages, process crash/join
• Powerful abstraction for replication
• Framework for fault tolerance, high availability
• Basic component: group membership
  – Reports changes in set of group members

Page 10:

Example: Highly Available VoD [Anker, Dolev, Keidar ICDCS 1999]

• Dynamic set of servers
• Clients talk to “abstract” service
• Server can crash, client shouldn’t know

[Figure: two panels, each showing a client connected to the VoD Service, which consists of server 1 and server 2]

Page 11:

VoD Service: Exploiting Group Communication

• Group abstraction for connection establishment and transparent migration (with simple clients)
• Membership services detect conditions for migration - fault tolerance and load balancing
• Reliable group multicast among servers for consistently sharing information
• Virtual Synchrony allows servers to agree upon migration immediately (no message exchange)
• Reliable messages for control
• Server: ~2500 C++ lines
  – All fault tolerance logic at server

Page 12:

Related Projects

[Diagram of related projects, with the area labels “Group communication” and “Applications”:]

• Moshe: Group Membership (ICDCS 00)
• Architecture for Group Membership in WAN (DIMACS 98)
• Specification Survey (99)
• Virtual Synchrony (ICDCS 00)
• Inheritance-based Modeling (ICSE 00)
• Object Replication (PODC 96)
• CSCW (NGITS 97)
• Highly Available VoD (ICDCS 99)
• Dynamic Voting (PODC 97)
• QoS Support (TINA 96, OPODIS 00)
• Optimistic VS (SRDS 00)

Page 13:

A Scalable Architecture for Group Membership in WANs

Tal Anker, Gregory Chockler, Danny Dolev, Idit Keidar

DIMACS Workshop 1998

Page 14:

Scalable Membership Architecture

• Dedicated distributed membership servers: “divide and conquer”
  – Servers involved only in membership changes
  – Members communicate with each other directly (implement “virtual synchrony”)
• Two levels of membership
  – Notification Service NSView - “who is around”
  – Agreed membership views

Page 15:

Architecture

[Figure: processes A, B, C, D and E connected over a WAN to two membership servers. Each server sits on top of the Notification Service (NS), which supplies the NSView (“who is around”: failure/join/leave). The membership layer delivers the agreed view (member set and identifier), shown as Membership {A,B,C,D,E}, 7.]

Page 16:

The Notification Service (NS)

• Group members send requests:
  – join(Group G),
  – leave(Group G)
  directly to (local) NS
• NS detects faults (member / domain)
• Information propagated to all NS servers
• NS servers notify membership servers of new NSView
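
The request flow on this slide can be pictured with a small sketch. It is an illustration only, not CONGRESS or Moshe code: the class `NotificationService`, its method names and the callback shape are all assumptions made for this example, and everything runs in one process. Only the join and leave requests, fault detection and the NSView notification come from the slide.

```python
from typing import Callable, Dict, FrozenSet, List, Set

NSView = FrozenSet[str]  # "who is around" in one group


class NotificationService:
    """Toy, single-process stand-in for an NS server (hypothetical API)."""

    def __init__(self) -> None:
        self._groups: Dict[str, Set[str]] = {}
        self._listeners: List[Callable[[str, NSView], None]] = []

    def subscribe(self, callback: Callable[[str, NSView], None]) -> None:
        # A membership server registers to be told about every new NSView.
        self._listeners.append(callback)

    def join(self, group: str, member: str) -> None:
        self._groups.setdefault(group, set()).add(member)
        self._notify(group)

    def leave(self, group: str, member: str) -> None:
        self._groups.setdefault(group, set()).discard(member)
        self._notify(group)

    def suspect(self, member: str) -> None:
        # A detected fault removes the member from every group it was in.
        for group, members in self._groups.items():
            if member in members:
                members.discard(member)
                self._notify(group)

    def _notify(self, group: str) -> None:
        view = frozenset(self._groups.get(group, set()))
        for callback in self._listeners:
            callback(group, view)


ns = NotificationService()
seen: List[NSView] = []
ns.subscribe(lambda group, view: seen.append(view))
ns.join("G", "A")
ns.join("G", "B")
ns.suspect("A")  # simulated failure detection
```

After these three calls, `seen` holds one NSView per change, which is the stream a membership server would react to.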

Page 17:

The NS Communication: Reliable FIFO links

• Membership servers can send each other messages using NS
• FIFO order: if S1 sends m1 and later m2, then any server which receives both receives m1 first.
• Reliable links: if S1 sends m to S2, then eventually either S2 receives m or S1 suspects S2 (and all of its clients).
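
The two link properties can be restated as simple checks over a recorded trace. This is a sketch for intuition, not part of the NS: the function names are hypothetical, messages are assumed to be distinct, and "eventually" is modelled by looking at the final state of a finished run.

```python
from typing import Hashable, Sequence


def respects_fifo(sent: Sequence[Hashable], received: Sequence[Hashable]) -> bool:
    """True if messages from one sender were received in their send order.

    Gaps are allowed: the property only constrains messages that were
    actually received. Assumes every message in `sent` is distinct.
    """
    position = {msg: i for i, msg in enumerate(sent)}
    last = -1
    for msg in received:
        if msg not in position:
            return False  # received something this sender never sent
        if position[msg] <= last:
            return False  # out of order, or a duplicate
        last = position[msg]
    return True


def reliable_link_satisfied(receiver_got_message: bool, sender_suspects_receiver: bool) -> bool:
    """Reliable-link property at the end of a run: delivered, or suspected."""
    return receiver_got_message or sender_suspects_receiver
```

For example, receiving `m1` and `m3` out of `m1, m2, m3` is allowed, while receiving `m2` before `m1` is not.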

Page 18:

Moshe: A Group Membership Algorithm for WANs

Idit Keidar, Jeremy Sussman, Keith Marzullo, Danny Dolev

ICDCS 2000

Page 19:

Membership in WAN: the Challenge

• Message latency is large and unpredictable
• Frequent message loss
  – Time-out failure detection is inaccurate
  – We use a notification service (NS) for WANs
  – Number of communication rounds matters
  – Algorithms may change views frequently
  – View changes require communication for state transfer, which is costly in WAN

Page 20:

Moshe’s Novel Concepts

• Designed for WANs from the ground up
  – Previous systems emerged from LAN
• Avoids delivery of “obsolete” views
  – Views that are known to be changing
  – Not always terminating (but NS is)
• Runs in a single round (“typically”)

Page 21:

Member-Server Interaction

[Figure: a Member and its server (Moshe on top of the NS); the events exchanged are labelled startChange and View]

Page 22:

Moshe Guarantees

• View - <members, identifier>
  – Identifier is monotonically increasing
• Conditional liveness property: Agreement on views
  – If “all” eventually have the same last NSView then “all” eventually agree on the last view
  [Figure label: obsolete views]
• Composable
  – Allows reasoning about individual components
  – Useful for applications
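
As a reading aid, the view structure and the identifier guarantee can be written down directly. The type and function names here are hypothetical, and the check treats "monotonically increasing" as strictly increasing, which is an assumption consistent with the later slide's "identifier higher than all previous".

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class View:
    """A membership view: <members, identifier>."""
    members: FrozenSet[str]
    identifier: int


def identifiers_increase(delivered: List[View]) -> bool:
    """True if each view delivered to a member has a higher identifier than the one before."""
    return all(a.identifier < b.identifier for a, b in zip(delivered, delivered[1:]))
```

A member could run `identifiers_increase` over its own delivery history as a sanity check; the agreement property, by contrast, compares the last view across members and cannot be checked locally.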

Page 23:

Moshe Operation: Typical Case

• In response to new NSView (members),
  – send proposal to other servers with NSView
  – send startChange to local members (clients)
• Once proposals from all servers of NSView members arrive, deliver view:
  – members - NSView,
  – identifier higher than all previous
  to local members
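
The typical case can be sketched as a small state machine. This is a simplified illustration, not the published algorithm: the class and field names are hypothetical, message transport is left to the caller, and the way the identifier is chosen (each server proposes its last identifier plus one, and the view takes the maximum over the proposals) is an assumption that merely satisfies "higher than all previous". The out-of-sync cases are not handled.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Tuple


@dataclass(frozen=True)
class Proposal:
    sender: str               # proposing server
    ns_view: FrozenSet[str]   # the NSView the proposal is for
    proposed_id: int          # candidate view identifier (assumed scheme)


class OneRoundServer:
    def __init__(self, name: str, server_of: Dict[str, str]) -> None:
        self.name = name
        self.server_of = server_of  # member -> its membership server
        self.ns_view: FrozenSet[str] = frozenset()
        self.last_id = 0
        self.proposals: Dict[str, Proposal] = {}
        self.events: List[str] = []  # notifications sent to local members
        self.delivered: List[Tuple[FrozenSet[str], int]] = []

    def on_ns_view(self, ns_view: FrozenSet[str]) -> Proposal:
        """Handle a new NSView; the caller sends the returned proposal to the other servers."""
        self.ns_view = ns_view
        self.events.append("startChange")
        proposal = Proposal(self.name, ns_view, self.last_id + 1)
        self.on_proposal(proposal)  # our own proposal counts too
        return proposal

    def on_proposal(self, proposal: Proposal) -> None:
        self.proposals[proposal.sender] = proposal
        self._try_deliver()

    def _try_deliver(self) -> None:
        if not self.ns_view:
            return
        needed = {self.server_of[m] for m in self.ns_view}
        matching = [
            self.proposals[s]
            for s in needed
            if s in self.proposals and self.proposals[s].ns_view == self.ns_view
        ]
        if len(matching) == len(needed):  # proposals from all servers arrived
            view_id = max(p.proposed_id for p in matching)
            self.last_id = view_id
            self.delivered.append((self.ns_view, view_id))
            self.proposals.clear()


server_of = {"A": "S1", "B": "S1", "C": "S2"}
s1, s2 = OneRoundServer("S1", server_of), OneRoundServer("S2", server_of)
view = frozenset({"A", "B", "C"})
p1, p2 = s1.on_ns_view(view), s2.on_ns_view(view)
s1.on_proposal(p2)  # one message exchange, then both deliver
s2.on_proposal(p1)
```

In this run both servers deliver the same view after a single exchange of proposals, which is the one-round behaviour the slide describes.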

Page 24:

Goal: Self Stabilizing

• Once the same last NSView is received by all servers:
  – All send proposals for this NSView
  – All the proposals reach all the servers
  – All servers use these proposals to deliver the same view

And they live happily ever after!

Page 25:

Out-of-Sync Case: unexpected proposal

[Figure: timelines of servers A, B and C, with +c and -c events, two proposal messages, and X marks]

To avoid deadlock: A must respond

Page 26:

Out-of-Sync Case: unexpected proposal

[Figure: timelines of servers A, B and C, with -C, +C and +AB events and two delivered views]

Extra proposals are redundant, responding with a proposal may cause live-lock

Page 27:

Out-of-Sync Case: missing proposal

[Figure: timelines of servers A, B and C, with -C, +C and +AB events and three delivered views]

This case exposed by correctness proof

Page 28:

Missing Proposal Detection

[Figure: timelines of servers A, B and C. After the +C and +AB events, each server sends a proposal numbered 1 and a view is delivered with Used: 1,1,1. After -C and +C, a proposal numbered 2 carrying [1,1,1] is sent. At the server whose PropNum is still 1: Used[C] = 1 = PropNum, hence detection.]
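
One reading of the figure is that a proposal carries the vector of proposal numbers its sender last used to deliver a view, and that a server detects a missing proposal when the entry for itself equals its own current PropNum. The sketch below encodes only that comparison; the function name and the dictionary representation of `Used` are assumptions, not taken from the Moshe implementation.

```python
from typing import Dict


def missing_proposal_detected(me: str, my_prop_num: int, used: Dict[str, int]) -> bool:
    """True if the sender of an incoming proposal has already used our latest proposal.

    `used` is the Used vector carried by the incoming proposal, keyed by server.
    If used[me] equals our current PropNum, the sender has consumed our latest
    proposal in a view and is waiting for a new one that we have not sent.
    """
    return used.get(me) == my_prop_num


# Values from the slide: C has PropNum 1 and receives proposal 2 carrying [1,1,1].
detected = missing_proposal_detected("C", 1, {"A": 1, "B": 1, "C": 1})
```

With the slide's values the check fires; had C already moved on to PropNum 2, the same incoming proposal would not trigger detection.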

Page 29:

Handling Out-of-Sync Cases: “Slow Agreement”

• Also sends proposals, tagged “SA”
• Invoked upon blocking detection or upon receipt of “SA” proposal
• Upon receipt of “SA” proposal with bigger number than PropNum, respond with same number
• Deliver view only with “full house” of same number proposals
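
The last two rules on this slide can be written as two small predicates. This is a sketch of the rules as stated, with hypothetical names. The slide does not say what happens when the incoming number is not bigger than PropNum, so returning `None` (no reply) in that case is an assumption.

```python
from typing import Dict, Iterable, Optional


def sa_response_number(prop_num: int, incoming_number: int) -> Optional[int]:
    """Number to put on our reply to an "SA" proposal, or None for no reply.

    Slide rule: if the incoming "SA" proposal has a bigger number than our
    PropNum, respond with the same number.
    """
    return incoming_number if incoming_number > prop_num else None


def full_house(servers: Iterable[str], numbers: Dict[str, int]) -> bool:
    """True when every server has sent an "SA" proposal and all carry the same number."""
    servers = list(servers)
    if not servers or any(s not in numbers for s in servers):
        return False
    return len({numbers[s] for s in servers}) == 1
```

Echoing the bigger number is what lets the servers converge on one number, at which point `full_house` holds and the view can be delivered.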

Page 30:

How Typical is the “typical” Case?

• Depends on the notification service (NS)
  – Classify NS good behaviors: symmetric and transitive perception of failures
• Transitivity depends on logical topology, how suspicions propagate
• Typical case should be very common
• Need to measure

Page 31:

Implementation

• Use CONGRESS [Anker et al]
  – NS for WAN
  – Always symmetric, can be non-transitive
  – Logical topology can be configured
• Moshe servers extend CONGRESS servers
• Socket interface with processes

Page 32:

The Experiment

• Run over the Internet
  – In the US: MIT, Cornell (CU), UCSD
  – In Taiwan: NTU
  – In Israel: HUJI
• Run for 10 days in one configuration, 2.5 days in another
• 10 clients at each location
  – continuously join/leave 10 groups

Page 33:

Two Experiment Configurations

[Figure: two logical topologies, one per configuration, each connecting MIT, UCSD, CU, NTU and HUJI]

Page 34:

Percentage of “Typical” Cases

• Configuration 1:
  – MIT: 10,786 views, 10,661 one round - 98.84%
  – Other sites: 98.8%, 98.9%, 98.97%, 98.85%
• Configuration 2:
  – MIT: 2,559 views, 2,555 one round - 99.84%
  – Other sites: 99.82%, 99.79%, 99.81%, 99.84%
• Overwhelming majority for one round!
• Depends on topology, can scale
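
The MIT percentages follow directly from the view counts on this slide. The snippet below only redoes that arithmetic; the function name is made up for the example.

```python
def one_round_share(one_round: int, total: int) -> float:
    """Percentage of views delivered in a single round, rounded to two decimals."""
    return round(100.0 * one_round / total, 2)


config1 = one_round_share(10_661, 10_786)  # MIT, configuration 1
config2 = one_round_share(2_555, 2_559)    # MIT, configuration 2
print(config1, config2)  # prints: 98.84 99.84
```

Put the other way round, 125 of the 10,786 views in configuration 1 and 4 of the 2,559 views in configuration 2 needed more than one round.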

Page 35:

Performance: Surprise!

[Figure: Histogram of Moshe duration, MIT, configuration 1, runs up to 4 seconds (97%). X axis: milliseconds, 0 to 4000 in steps of 200. Y axis: number of runs, 0 to 1400.]

Page 36:

Performance: Part II

[Figure: Histogram of Moshe duration, MIT, configuration 2, runs up to 3 seconds (99.7%). X axis: milliseconds, 0 to 3000 in steps of 150. Y axis: number of runs, 0 to 450.]

Page 37:

Performance over the Internet: What is Going on?

• Without message loss, running time is close to biggest round-trip-time, ~650 ms.
  – As expected
• Message loss has a big impact
• Configuration 2 has much less loss, more cases of good performance

Page 38:

“Slow” versus “Typical”

• Slow can take 1 or 2 rounds once it is run
  – Depending on PropNum
• Slow after NE
  – One-round is run first, then detection, and slow
  – Without loss - 900 ms., 40% more than usual
• Slow without NE
  – Detection by unexpected proposal
  – Only slow algorithm is run
  – Runs less time than one-round

Page 39:

Unstable Periods: No Obsolete Views

• “Unstable” =
  – constant changes; or
  – connected processes differ in failure detection
• Configuration 1:
  – 379 of the 10,786 views took over 4 seconds, 3.5%
  – 167 took over 20 seconds, 1.5%
  – Longest running time 32 minutes
• Configuration 2:
  – 14 of 2,559 views took over 4 seconds, 0.5%
  – Longest running time 31 seconds

Page 40:

Scalability Measurements

• Controlled experiment at MIT and UCSD
  – Prototype NS, based on TCP/IP (Sasson)
  – Inject faults to test “slow” case
• Vary number of members, servers
• Measure end-to-end latencies at member, from join/leave/suspicion to corresponding view
• Average of 150 (50 slow) runs

Page 41:

End-to-End Latency: Scalable!

• Member scalability: 4 servers (constant)
  [Figure: latency in milliseconds (ticks at 200, 300, 400) against number of members (ticks at 20, 70, 120), with one curve for sync and one for out-of-sync]
• Server and member scalability: 4-14 servers
  [Figure: latency in milliseconds (ticks at 200, 300, 400) against number of members (ticks at 80, 180, 280), with one curve for sync and one for out-of-sync]

Page 42:

Conclusion: Moshe Features

• Avoiding obsolete views
• A single round
  – 98% of the time in one configuration
  – 99.8% of the time in another
• Using a notification service for WANs
  – Good abstraction
  – Flexibility to configure multiple ways
  – Future work: configure more ways
• Scalable “divide and conquer” architecture

Page 43:

Retrospective: Role of Theory

• Specification
  – Possible to implement
  – Useful for applications (composable)
• Specification can be met in one round “typically” (unlike Consensus)
• Correctness proof exposes subtleties
  – Need to avoid live-lock
  – Two types of detection mechanisms needed

Page 44:

Future Work: The QoS Challenge

• Some distributed applications require QoS
  – Guaranteed available bandwidth
  – Bounded delay, bounded jitter
• Membership algorithm terminates in one round under certain circumstances
  – Can we leverage that to guarantee QoS under certain assumptions?
• Can other primitives guarantee QoS?