Distributed Systems · Distributed Systems Principles and Paradigms Chapter 04 (version February...
Transcript of Distributed Systems · Distributed Systems Principles and Paradigms Chapter 04 (version February...
Distributed SystemsPrinciples and Paradigms
Chapter 04(version February 18, 2008)
Maarten van Steen
Vrije Universiteit Amsterdam, Faculty of ScienceDept. Mathematics and Computer Science
Room R4.20. Tel: (020) 598 7784E-mail:[email protected], URL: www.cs.vu.nl/∼steen/
01 Introduction02 Architectures03 Processes04 Communication05 Naming06 Synchronization07 Consistency and Replication08 Fault Tolerance09 Security10 Distributed Object-Based Systems11 Distributed File Systems12 Distributed Web-Based Systems13 Distributed Coordination-Based Systems
00 – 1 /
Layered Protocols
• Low-level layers
• Transport layer
• Application layer
• Middleware layer
04 – 1 Communication/4.1 Layered Protocols
Basic Networking Model
Physical
Data link
Network
Transport
Session
Application
Presentation
Application protocol
Presentation protocol
Session protocol
Transport protocol
Network protocol
Data link protocol
Physical protocol
Network
1
2
3
4
5
7
6
Drawbacks:
• Focus on message-passing only• Often unneeded or unwanted functionality• Question: Violates transparency?
04 – 2 Communication/4.1 Layered Protocols
Low-level layers
Physical layer: contains the specification and imple-mentation of bits, and their transmission betweensender and receiver
Data link layer: prescribes the transmission of a se-ries of bits into a frame to allow for error and flowcontrol
Network layer: describes how packets in a networkof computers are to be routed.
Observation: for many distributed systems, the lowest-level interface is that of the network layer.
04 – 3 Communication/4.1 Layered Protocols
Transport Layer
Important: The transport layer provides the actualcommunication facilities for most distributed systems.
Standard Internet protocols:
• TCP: connection-oriented, reliable, stream-orientedcommunication• UDP: unreliable (best-effort) datagram communi-
cation
Note: IP multicasting is generally considered a stan-dard available service.
04 – 4 Communication/4.1 Layered Protocols
Middleware Layer
Observation: Middleware is invented to provide com-mon services and protocols that can be used by manydifferent applications:
• A rich set of communication protocols , but whichallow different applications to communicate
• (Un)marshaling of data, necessary for integratedsystems
• Naming protocols , so that different applicationscan easily share resources
• Security protocols , to allow different applicationsto communicate in a secure way
• Scaling mechanisms , such as support for repli-cation and caching
Note: what remains are truly application-specificprotocols
Question: Such as...?04 – 5 Communication/4.1 Layered Protocols
Types of Communication (1/3)
Client
Server
Synchronize after processing by server
Synchronize at request delivery
Synchronize at request submission
Request
Reply
Storage facility
Transmission interrupt
Time
Distinguish:
• Transient versus persistent communication• Asynchrounous versus synchronous commu-
nication
04 – 6 Communication/4.1 Layered Protocols
Types of Communication (2/3)
Client
Server
Synchronize after processing by server
Synchronize at request delivery
Synchronize at request submission
Request
Reply
Storage facility
Transmission interrupt
Time
Transient communication: A message is discardedby a communication server as soon as it cannot bedelivered at the next server, or at the receiver.
Persistent communication: A message is stored ata communication server as long as it takes to deliverit at the receiver.
04 – 7 Communication/4.1 Layered Protocols
Types of Communication (3/3)
Client
Server
Synchronize after processing by server
Synchronize at request delivery
Synchronize at request submission
Request
Reply
Storage facility
Transmission interrupt
Time
Places for synchronization:
• At request submission• At request delivery• After request processing
04 – 8 Communication/4.1 Layered Protocols
Client/Server
Some observations: Client/Server computing is gen-erally based on a model of transient synchronouscommunication :
• Client and server have to be active at the time ofcommunication• Client issues request and blocks until it receives
reply• Server essentially waits only for incoming requests,
and subsequently processes them
Drawbacks synchronous communication:
• Client cannot do any other work while waiting forreply• Failures have to be dealt with immediately (the
client is waiting)• In many cases the model is simply not appropri-
ate (mail, news)
04 – 9 Communication/4.1 Layered Protocols
Messaging
Message-oriented middleware: Aims at high-levelpersistent asynchronous communication :
• Processes send each other messages, which arequeued• Sender need not wait for immediate reply, but can
do other things• Middleware often ensures fault tolerance
04 – 10 Communication/4.1 Layered Protocols
Remote Procedure Call (RPC)
• Basic RPC operation
• Parameter passing
• Variations
04 – 11 Communication/4.2 Remote Procedure Call
Basic RPC Operation (1/2)
Observations:
• Application developers are familiar with simple pro-cedure model• Well-engineered procedures operate in isolation
(black box)• There is no fundamental reason not to execute
procedures on separate machine
Conclusion: communication between caller & calleecan be hidden by using procedure-call mechanism.
Call local procedureand return results
Call remoteprocedure
Returnfrom call
Client
Request Reply
ServerTime
Wait for result
04 – 12 Communication/4.2 Remote Procedure Call
Basic RPC Operation (2/2)
Implementationof add
Client OS Server OS
Client machine Server machine
Client stub
Client process Server process1. Client call to
procedure
2. Stub buildsmessage
5. Stub unpacksmessage
6. Stub makeslocal call to "add"
3. Message is sentacross the network
4. Server OShands messageto server stub
Server stubk = add(i,j) k = add(i,j)
proc: "add"int: val(i)int: val(j)
proc: "add"int: val(i)int: val(j)
proc: "add"int: val(i)int: val(j)
1. Client procedure calls client stub as usual.2. Client stub builds message and calls local OS.3. Client’s OS sends message to remote OS.4. Remote OS gives message to server stub.5. Server stub unpacks parameters and calls server.6. Server does work and returns result to the stub.7. Server stub packs it in message and calls OS.8. Server’s OS sends message to client’s OS.9. Client’s OS gives message to client stub.
10. Client stub unpacks result and returns to the client.
04 – 13 Communication/4.2 Remote Procedure Call
RPC: Parameter Passing (1/2)
Parameter marshaling: There’s more than just wrap-ping parameters into a message:
• Client and server machines may have differentdata representations (think of byte ordering)• Wrapping a parameter means transforming a value
into a sequence of bytes• Client and server have to agree on the same en-
coding:– How are basic data values represented (inte-
gers, floats, characters)– How are complex data values represented (ar-
rays, unions)• Client and server need to properly interpret mes-
sages, transforming them into machine-dependentrepresentations.
04 – 14 Communication/4.2 Remote Procedure Call
RPC: Parameter Passing (2/2)
RPC parameter passing:
• RPC assumes copy in/copy out semantics:while procedure is executed, nothing can be as-sumed about parameter values (only Ada sup-ports this model).• RPC assumes all data that is to be operated on is
passed by parameters. Excludes passingreferences to (global) data .
Conclusion: full access transparency cannot be re-alized.
Observation: If we introduce a remote reference mech-anism, access transparency can be enhanced:
• Remote reference offers unified access to remotedata• Remote references can be passed as parameter
in RPCs
04 – 15 Communication/4.2 Remote Procedure Call
Asynchronous RPCs
Essence: Try to get rid of the strict request-reply be-havior, but let the client continue without waiting for ananswer from the server.
Call local procedure
Call remoteprocedure
Returnfrom call
Request Accept request
Wait for acceptance
Call local procedureand return results
Call remoteprocedure
Returnfrom call
Client Client
Request Reply
Server ServerTime Time
Wait for result
(a) (b)
Variation: deferred synchronous RPC :
Call local procedure
Call remoteprocedure
Returnfrom call
Client
RequestAcceptrequest
ServerTime
Wait foracceptance
Interrupt client
Returnresults Acknowledge
Call client withone-way RPC
04 – 16 Communication/4.2 Remote Procedure Call
RPC in Practice
Essence: Let the developer concentrate on only theclient- and server-specific code; let the RPC system(generators and libraries) do the rest.
C compiler
Uuidgen
IDL compiler
C compiler C compiler
Linker Linker
C compiler
Server stubobject file
Serverobject file
Runtimelibrary
Serverbinary
Clientbinary
Runtimelibrary
Client stubobject file
Clientobject file
Client stubClient code Header Server stub
Interfacedefinition file
Server code
#include#include
04 – 17 Communication/4.2 Remote Procedure Call
Client-to-Server Binding (DCE)
Issues: (1) Client must locate server machine, and(2) locate the server.
Example: DCE uses a separate daemon for eachserver machine.
Endpointtable
Server
DCEdaemon
Client1. Register endpoint
2. Register service3. Look up server
4. Ask for endpoint
5. Do RPC
Directoryserver
Server machineClient machine
Directory machine
04 – 18 Communication/4.2 Remote Procedure Call
Message-OrientedCommunication
• Transient Messaging
• Message-Queuing System
• Message Brokers
• Example: IBM Websphere
04 – 19 Communication/4.3 Message-Oriented Communication
Transient Messaging: Sockets
Example: Consider the Berkeley socket interface ,which has been adopted by all UNIX systems, as wellas Windows 95/NT/2000/XP:
SOCKET Create a new communication endpointBIND Attach a local address to a socketLISTEN Announce willingness to accept N con-
nectionsACCEPT Block until someone remote wants to
establish a connectionCONNECT Attempt to establish a connectionSEND Send data over a connectionRECEIVE Receive data over a connectionCLOSE Release the connection
connect
socket
socket
bind listen read
read
write
write
accept close
close
Server
Client
Synchronization point Communication
04 – 20 Communication/4.3 Message-Oriented Communication
Message-Oriented Middleware
Essence: Asynchronous persistent communicationthrough support of middleware-level queues . Queuescorrespond to buffers at communication servers.
PUT Append a message to a specified queueGET Block until the specified queue is nonempty, and
remove the first messagePOLL Check a specified queue for messages, and re-
move the first. Never blockNOTIFY Install a handler to be called when a message is
put into the specified queue
04 – 21 Communication/4.3 Message-Oriented Communication
Message Broker
Observation: Message queuing systems assume acommon messaging protocol: all applications agreeon message format (i.e., structure and data represen-tation)
Message broker: Centralized component that takescare of application heterogeneity in an MQ system:
• Transforms incoming messages to target format• Very often acts as an application gateway• May provide subject-based routing capabilities⇒ Enterprise Application Integration
Queuinglayer
Brokerprogram
Repository with conversion rules and programsSource client Destination client
OS OSOS
Message broker
Network
04 – 22 Communication/4.3 Message-Oriented Communication
IBM’s WebSphere MQ (1/3)
Basic concepts:
• Application-specific messages are put into, andremoved from queues• Queues always reside under the regime of a queue
manager• Processes can put messages only in local queues,
or through an RPC mechanism
Message transfer:
• Messages are transferred between queues• Message transfer between queues at different pro-
cesses, requires a channel• At each endpoint of channel is a message chan-
nel agent• Message channel agents are responsible for:
– Setting up channels using lower-level networkcommunication facilities (e.g., TCP/IP)
– (Un)wrapping messages from/in transport-levelpackets
– Sending/receiving packets
04 – 23 Communication/4.3 Message-Oriented Communication
IBM’s WebSphere MQ (2/3)
MCA MCAMCA MCA
MQ Interface
Stub StubServerstub
Serverstub
Send queue
Program ProgramQueuemanager
Queuemanager
Routing table
Enterprise networkRPC(synchronous)
Local network
Message passing(asynchronous)
To other remotequeue managers
Client's receivequeueSending client Receiving client
• Channels are inherently unidirectional
• MQ provides mechanisms to automatically startMCAs when messages arrive, or to have a re-ceiver set up a channel
• Any network of queue managers can be created;routes are set up manually (system administra-tion)
04 – 24 Communication/4.3 Message-Oriented Communication
IBM’s WebSphere MQ (3/3)
Routing: By using logical names , in combinationwith name resolution to local queues, it is possible toput a message in a remote queue
SQ1SQ2
SQ1
SQ1SQ1SQ2
QMBQMCQMD
SQ1
SQ1SQ1
SQ1
SQ2SQ1
SQ1
SQ1SQ1
QMA
QMA QMA
QMC
QMCQMB
QMD
QMBQMD
Routing tableRouting table
Routing table Routing table
QMB
QMC
QMA
LA1LA1
LA1
LA2LA2
LA2
QMCQMA
QMA
QMDQMD
QMC
Alias tableAlias table
Alias table
QMD
SQ1
SQ2
SQ1
Question: What’s a major problem here?
04 – 25 Communication/4.3 Message-Oriented Communication
Stream-Oriented Communication
• Support for continuous media
• Streams in distributed systems
• Stream management
04 – 26 Communication/4.4 Stream-Oriented Communication
Continuous Media
Observation: All communication facilities discussedso far are essentially based on a discrete , that istime-independent exchange of information
Continuous media: Characterized by the fact thatvalues are time dependent :
• Audio• Video• Animations• Sensor data (temperature, pressure, etc.)
Transmission modes: Different timing guarantees withrespect to data transfer:
• Asynchronous: no restrictions with respect towhen data is to be delivered• Synchronous: define a maximum end-to-end de-
lay for individual data packets• Isochronous: define a maximum and minimum
end-to-end delay (jitter is bounded)
04 – 27 Communication/4.4 Stream-Oriented Communication
Stream
Definition: A (continuous) data stream is a connection-oriented communication facility that supports isochronousdata transmission
Some common stream characteristics:
• Streams are unidirectional• There is generally a single source , and one or
more sinks• Often, either the sink and/or source is a wrapper
around hardware (e.g., camera, CD device, TVmonitor, dedicated storage)
Stream types:
• Simple : consists of a single flow of data, e.g.,audio or video• Complex : multiple data flows, e.g., stereo audio
or combination audio/video
04 – 28 Communication/4.4 Stream-Oriented Communication
Streams and QoS
Essence: Streams are all about timely delivery ofdata. How do you specify this Quality of Service(QoS)? Basics:
• The required bit rate at which data should betransported.
• The maximum delay until a session has beenset up (i.e., when an application can start sendingdata).
• The maximum end-to-end delay (i.e., how longit will take until a data unit makes it to a recipient).
• The maximum delay variance, or jitter .
• The maximum round-trip delay .
04 – 29 Communication/4.4 Stream-Oriented Communication
Enforcing QoS (1/2)
Observation : There are various network-level tools,such as differentiated services by which certain pack-ets can be prioritized.
Also : use buffers to reduce jitter:
0 5
1 2 3 4 5 6 7 8
10Time (sec)
Time in buffer
15 20
Gap in playback
Packet removed from buffer
1 2 3 4 5 6 7 8Packet arrives at buffer
1 2 3 4 5 6 7 8Packet departs source
04 – 30 Communication/4.4 Stream-Oriented Communication
Enforcing QoS (2/2)
Problem: How to reduce the effects of packet loss(when multiple samples are in a single packet)?
Solution: simply spread the samples:
1 2 3 4 5 6 7 8 9 10 11 12
1 5 9 13 2 6 10 14 3 7 11 15
13 14 15 16
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
4 8 12 16
Lost packet
Lost packet
Gap of lost frames
Lost frames
(a)
(b)
Sent
Delivered
Sent
Delivered
04 – 31 Communication/4.4 Stream-Oriented Communication
Stream Synchronization
Problem: Given a complex stream, how do you keepthe different substreams in synch?
Example: Think of playing out two channels, that to-gether form stereo sound. Difference should be lessthan 20–30 µsec!
Network
Incoming stream
Application
Receiver's machine
Procedure that readstwo audio data units foreach video data unit
OS
Alternative: multiplex all substreams into a singlestream, and demultiplex at the receiver. Synchroniza-tion is handled at multiplexing/demultiplexing point(MPEG).
04 – 32 Communication/4.4 Stream-Oriented Communication
Multicast Communication
• Application-level multicasting
• Gossip-based data dissemination
04 – 33 Communication/4.5 Multicast Communication
Application-Level Multicasting
Essence: Organize nodes of a distributed system intoan overlay network and use that network to dissem-inate data. Example: Consider a Chord-based peer-to-peer system:
1. Initiator generates a multicast identifier mid.
2. Lookup su (mid), the node responsible for mid.
3. Request is routed to succ(mid), which will be-come the root .
4. If P wants to join, it sends a join request to theroot.
5. When request arrives at Q:
• Q has not seen a join request before⇒ it be-comes forwarder ; P becomes child of Q. Joinrequest continues to be forwarded.
• Q knows about tree⇒ P becomes child of Q.No need to forward join request anymore.
04 – 34 Communication/4.5 Multicast Communication
ALM: Some costs
A
BD
C
Ra
RbRd
Rc
Internet
RouterEnd host
Overlay network
75
1
1
1
1
50
40
• Link stress : How often does an ALM messagecross the same physical link? Example: mes-sage from A to D needs to cross 〈Ra, Rb〉 twice.
• Stretch : Ratio in delay between ALM-level pathand network-level path. Example: messages B
to C follow path of length 71 at ALM, but 47 atnetwork level⇒ stretch = 71/47.
04 – 35 Communication/4.5 Multicast Communication
Epidemic Algorithms
• General background
• Update models
• Removing objects
04 – 36 Communication/4.5 Multicast Communication
Principles
Basic idea: Assume there are no write–write con-flicts:
• Update operations are initially performed at oneor only a few replicas• A replica passes its updated state to a limited
number of neighbors• Update propagation is lazy, i.e., not immediate• Eventually, each update should reach every replica
Anti-entropy : Each replica regularly chooses anotherreplica at random, and exchanges state differences,leading to identical states at both afterwards
Gossiping : A replica which has just been updated(i.e., has been contaminated ), tells a number ofother replicas about its update (contaminating themas well).
04 – 37 Communication/4.5 Multicast Communication
Anti-Entropy
• A node P selects another node Q from the systemat random.
• Push : P only sends its updates to Q
• Pull : P only retrieves updates from Q
• Push-Pull : P and Q exchange mutual updates(after which they hold the same information).
Observation: for push-pull it takesO(log(N)) roundsto disseminate updates to all N nodes (round = whenevery node as taken the initiative to start an exchange).
04 – 38 Communication/4.5 Multicast Communication
Gossiping
Basic model: A server S having an update to re-port, contacts other servers. If a server is contactedto which the update has already propagated, S stopscontacting other servers with probability 1/k.
If s is the fraction of ignorant servers (i.e., which areunaware of the update), it can be shown that withmany servers
s = e−(k+1)(1−s)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
-15.0
-12.5
-10.0
-7.5
-5.0
-2.5
k
ln(s)
Observation: If we really have to ensure that all serversare eventually updated, gossiping alone is not enough
04 – 39 Communication/4.5 Multicast Communication
Deleting Values
Fundamental problem: We cannot remove an oldvalue from a server and expect the removal to prop-agate. Instead, mere removal will be undone in duetime using epidemic algorithms
Solution: Removal has to be registered as a specialupdate by inserting a death certificate
Next problem: When to remove a death certificate (itis not allowed to stay for ever):
• Run a global algorithm to detect whether the re-moval is known everywhere, and then collect thedeath certificates (looks like garbage collection)• Assume death certificates propagate in finite time,
and associate a maximum lifetime for a certificate(can be done at risk of not reaching all servers)
Note: it is necessary that a removal actually reachesall servers.
Question: What’s the scalability problem here?
04 – 40 Communication/4.5 Multicast Communication
Example Applications
Data dissemination : Perhaps the most important one.Note that there are many variants of dissemina-tion.
Aggregation : Let every node i maintain a variablexi. When two nodes gossip, they each reset theirvariable to
xi, xj← (xi + xj)/2
Result: in the end each node will have computedthe average x̄ = ∑i xi/N.
Question: What happens if initially xi = 1 andxj = 0, j 6= i?
04 – 41 Communication/4.5 Multicast Communication