CS 704D: DCE & IPC - Message Passing

CS 704D Advanced Operating System (Distributed Computing) Debasis Das


 



Distributed Computing Environment (DCE)
MIT CS704D Advanced OS, Class of 2011

Distributed Computing Environment (DCE)
- Defined by the Open Software Foundation (OSF)
- A set of services and tools that run on existing systems
- Provides the base for distributed applications
- Operating systems to which DCE can be ported easily include: OSF/1, AIX, Domain OS, ULTRIX, HP-UX, SINIX, SunOS, UNIX System V, VMS, Windows, OS/2
- Networks on which it works: TCP/IP, X.25, etc.


DCE Structure
[Figure: layered DCE structure - DCE applications run on top of the DCE software, which runs on the underlying operating systems & networking.]

DCE Creation Process
- Based on academic work already done
- Academic institutions were asked to contribute working code, almost entirely in C
- OSF checked and integrated the code as a single package
- First release: 1992

DCE Components 1
- Threads package
  - Programming model for concurrent applications
  - Creates and controls threads within an application
  - Synchronizes access to global resources (a small sketch follows these component lists)
- Remote Procedure Call (RPC) facility
  - Client-server communications
  - Network and protocol independent
  - Automatic data conversions
- Distributed Time Service (DTS)
  - Clock synchronizing mechanisms, using NIST and other sources

DCE Components 2
- Name Services
  - Cell Directory Service (CDS)
  - Global Directory Service (GDS)
  - Global Directory Agent (GDA)
  - Servers, files and devices can be uniquely named and found without being location dependent
- Distributed File System (DFS)
  - System-wide file system
  - Location transparency, high performance and high availability
  - File services to clients of other file services
- Security Service
  - Authentication and authorization to protect against illegitimate access
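The threads package above supplies the concurrent-programming model the other DCE components build on. As a minimal sketch of that style (assuming plain POSIX threads as a stand-in, since DCE threads closely follows the POSIX threads interface), a few threads update a shared counter under a mutex:

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_counter = 0;            /* a global resource shared by all threads */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000; i++) {
            pthread_mutex_lock(&lock);        /* synchronize access to the global resource */
            shared_counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];

        for (int i = 0; i < 4; i++)           /* create threads within the application */
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; i++)           /* control them: wait for completion */
            pthread_join(t[i], NULL);

        printf("counter = %d\n", shared_counter);   /* prints 4000 */
        return 0;
    }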

DCE Components Interdependence

Component | Other DCE components it uses
--------- | ----------------------------
Threads   | None
RPC       | Threads, name, security
DTS       | Threads, RPC, name, security
Name      | Threads, RPC, DTS, security
DFS       | Threads, RPC, DTS, name, security
Security  | Threads, RPC, DTS, name

DCE Cells
- DCE is highly scalable: thousands of computers, millions of users
- The concept of cells helps make it manageable and scalable
- A cell is a group of users, machines and other resources with a common purpose, sharing common DCE services
- A minimum cell configuration requires a cell directory server, a security server, a distributed time server and one or more client machines
- Each client must have client processes for the security service, cell directory service, distributed time service, RPC facility and threads facility
- A client will also have a DFS client process if a distributed file server exists within the cell

Deciding Cell Boundaries
- Purpose
- Administration
- Security
- Overhead

Purpose
- Users working on the same goal belong in the same cell: high local interaction, low interaction with distant cells
- Cells can be product oriented or function oriented
- Product oriented: one cell for one product
- Function oriented: one function for all products in a cell, e.g. the design department of a product company

Administration
- Managing user accounts, access rights, resource usage
- The administrator needs to know the resources and users in the cell
- A cell may be decided by the administrative reach manageable by one administrator

Security
- Users who trust each other more should be in the same cell
- Cell boundaries can act as a firewall
- Require additional authentication/authorization for resources in distant cells

Overhead
- Name resolution and user authentication are always an overhead
- These overheads cause time and performance penalties
- So a cell boundary should encompass most of the resources used by a particular group of users

Message Passing
Inter-Process Communication
- Original sharing, or shared-data, approach
- Copy sharing, or message-passing, approach


[Figure: two processes P1 and P2 communicating through a shared memory area (shared-data approach) versus exchanging messages directly (message-passing approach).]
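To make the contrast concrete, here is a minimal sketch of the copy-sharing model on the right of the figure: two processes with no shared memory exchange a message over a channel. A Unix pipe between a parent (P1) and a child (P2) stands in for the transport; the slides do not prescribe a particular mechanism.

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/wait.h>

    int main(void)
    {
        int fd[2];
        if (pipe(fd) == -1)                         /* the message channel between P1 and P2 */
            return 1;

        if (fork() == 0) {                          /* P2: the receiving process */
            char buf[64];
            ssize_t n = read(fd[0], buf, sizeof buf - 1);   /* receive(message) */
            buf[n > 0 ? n : 0] = '\0';
            printf("P2 received: %s\n", buf);
            _exit(0);
        }

        const char *msg = "hello from P1";          /* P1: the sending process */
        if (write(fd[1], msg, strlen(msg)) < 0)     /* send(message): the data is copied, not shared */
            return 1;
        wait(NULL);
        return 0;
    }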

Message Passing
- No shared memory, so message passing is favored
- Simple primitives: Send, Receive

IPC in a Distributed Environment - Desirable Features
- Simplicity
- Uniform semantics
- Efficiency
- Reliability
- Correctness
- Flexibility
- Security
- Portability

Simplicity
- Simple and easy to use
- Straightforward to construct new applications that communicate via the primitives
- Send messages to distributed parts of an application without worrying about the network and other complexities
- Clean and simple semantics

Uniform Semantics
- Message passing is required for both local communication and remote communication
- It is very important that these two modes be exactly similar, or at least very close

Efficiency
- Message passing efficiency can be optimized by:
  - Avoiding the cost of establishing and closing a connection for every message
  - Minimizing the cost of maintaining connections
  - Piggybacking the acknowledgement of the previous message on the next message from the receiver

Reliability
- Node crashes and link failures can happen
- Must be able to guarantee delivery; guard against lost packets through acknowledgements and retransmission
- Handle duplicate packets, via sequence numbers etc.

Correctness
- Scenario: one sender to multiple receivers, or one receiver able to receive from multiple senders
- Properties required:
  - Atomicity: the message is delivered to all of the receivers or to none
  - Ordered delivery: messages are delivered in the proper order
  - Survivability: messages are delivered despite partial failures of processes, machines and links

Flexibility
- The IPC protocol primitives should allow the types and levels of correctness required to be chosen flexibly
- Flexibility to permit any kind of control flow, including synchronous and asynchronous send/receive primitives

Security
- Authentication of the sender at the receiver
- Authentication of the receiver by the sender
- Encryption of the message before transmission over the network

Portability
- The message passing system itself should be portable: it should be easy to build the system on a different machine
- Applications that use the IPC primitives should be portable: IPC has to take care of the differences due to heterogeneity of systems, and the higher-level message passing primitives need to hide those differences

Issues in IPC by Message Passing
- Message contents:
  - Address: sending and receiving process addresses
  - Sequence number: a message id, useful to detect lost or duplicate packets
  - Structural information: type and length of the message (data carried internally or by reference)

Typical Message Structure
[ Sender process address | Receiver process address | Sequence number / message id | Structural information: type, number of bytes/elements | Actual data or pointer to data ]
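Expressed as a C struct, the message layout above might look as follows (field names and sizes are illustrative, not a wire format defined by the slides):

    #include <stdint.h>

    enum msg_data_kind { MSG_DATA_INLINE, MSG_DATA_POINTER };  /* actual data vs. pointer to data */

    struct message {
        uint32_t sender_addr;        /* address of the sending process                   */
        uint32_t receiver_addr;      /* address of the receiving process                  */
        uint32_t sequence_number;    /* message id: detects lost or duplicate messages    */
        uint16_t type;               /* structural information: type of message           */
        uint16_t num_bytes;          /* structural information: number of bytes/elements  */
        enum msg_data_kind kind;     /* whether data is carried inline or by reference    */
        union {
            uint8_t inline_data[256];/* small messages carried inside the message itself  */
            void   *data_ptr;        /* large messages referenced indirectly              */
        } body;
    };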

Design Issues in IPC Protocols
- Who is the sender? Who is the receiver?
- One or more receivers?
- Is the message guaranteed to have been accepted by the receiver?
- Should the sender wait for a reply?
- What action is required in case of a catastrophic failure of a node or link?
- What action if the receiver is not ready? Buffer the message? What if the buffer is full?
- What if more than one message is outstanding? Can the receiver accept messages out of sequence?

Synchronization
- Semantics for synchronization:
  - Blocking: the process is blocked/suspended after executing a send or receive primitive
  - Non-blocking: the process can continue even after executing a send/receive primitive
- Question: how does the receiver know when a message has arrived?
  - Polling
  - Interrupts

Conditional Receive Primitive
- A variation on the non-blocking receive primitive
- Returns the message if one is available, or indicates that there are no messages
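A sketch of such a conditional receive, built here on a POSIX message queue opened with O_NONBLOCK (an assumption; any non-blocking channel would serve):

    #include <mqueue.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <stddef.h>

    /* Returns 1 and fills buf if a message was available, 0 if no message was
     * pending, and -1 on error.  The caller is never blocked. */
    int conditional_receive(mqd_t q, char *buf, size_t len)
    {
        ssize_t n = mq_receive(q, buf, len, NULL);
        if (n >= 0)
            return 1;                 /* a message was available: return it            */
        if (errno == EAGAIN)
            return 0;                 /* no message pending: indicate that and return   */
        return -1;                    /* some other failure                             */
    }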

[Figure: synchronous communication with blocking send/receive primitives - the sending process blocks at send and is unblocked when the reply arrives; the receiving process blocks at receive until the message arrives.]

Synchronous Communication
- Advantages:
  - Simple to implement
  - Assured reliability: no backward error recovery is necessary if a message is lost or undelivered

Blocked Forever!
- A blocking send can stay blocked if the reply is never received
- A blocking receive can stay blocked if the message is never received

Buffering
- Message buffering is strongly linked to the synchronization strategy
- Synchronous communication does not need a buffer
- Asynchronous communication would need an unbounded set of buffers
- Something in between: a single-message buffer, or a multiple-message (bounded) buffer


[Figure: buffering mechanisms between sending and receiving processes - null buffer, single-message buffer, and a multiple-message buffer with n slots.]

Null Buffer Strategy
- Alternative 1: the send is delayed until the receiver is ready. Block the sending process, back off, wait for the receiver's acknowledgement, then wake the sending process, which restarts the send.
- Alternative 2: send the message and start a timeout; send again on timeout; retry up to n times (sketched below).
- The message is copied directly from the sending process's address space to the receiving process's address space in a single copy.
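A sketch of alternative 2: send, arm a timeout, retransmit on expiry, and give up after n attempts. ipc_send() and wait_for_ack() are hypothetical primitives standing in for the real transport; the stubs only let the sketch compile on its own.

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical transport primitives; a real system would provide these. */
    static bool ipc_send(int dest, const void *msg, size_t len)
    {
        (void)dest; (void)msg; (void)len;
        return true;
    }
    static bool wait_for_ack(int dest, int timeout_ms)
    {
        (void)dest; (void)timeout_ms;
        return true;
    }

    /* Returns true once the receiver has acknowledged the message,
     * false after n unsuccessful attempts. */
    bool send_with_retry(int dest, const void *msg, size_t len, int n, int timeout_ms)
    {
        for (int attempt = 0; attempt < n; attempt++) {
            if (!ipc_send(dest, msg, len))
                continue;                   /* local send failed: count it and retry    */
            if (wait_for_ack(dest, timeout_ms))
                return true;                /* receiver was ready and accepted the msg  */
            /* timeout expired: receiver not ready yet, fall through and retransmit */
        }
        return false;
    }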

Single Message Buffer
- Use a single buffer at the receiving node
- Avoids delays in a networked/distributed system
- In synchronous mode at most one message can be outstanding
- The buffer can be in kernel or application address space

Unbounded Capacity Buffer
- Applicable to the general asynchronous communication situation
- Unbounded buffers are required to ensure all messages sent are received, even if several are outstanding

Finite-Bound / Multi-Message Buffer
- Buffer overflow can happen; two strategies to manage it:
  - Unsuccessful communication: return an error message when the buffer is full; this makes message passing unreliable
  - Flow-controlled communication: the sender is blocked until some messages are accepted by the receiver; the truly asynchronous nature is lost and unpredictable deadlocks can occur

Buffer Capacity in Multi-Message Buffers
- Deciding buffer capacity needs careful planning
- Create-buffer-like system calls can be used
- Creating buffers in kernel space may limit the buffers
- Creating buffers in the application space is more flexible; the normal memory allocation process allocates the memory
- Better concurrency and flexibility, but more difficult to design
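A sketch of a finite multi-message buffer at the receiving node, using the "unsuccessful communication" strategy above: when all slots are in use the put fails and the sender is told the buffer is full. The slot count and message size are arbitrary choices for the sketch.

    #include <stdbool.h>
    #include <string.h>

    #define BUF_SLOTS 8
    #define MSG_SIZE  128

    struct bounded_buffer {                  /* zero-initialize before first use */
        char messages[BUF_SLOTS][MSG_SIZE];
        int  head, tail, count;              /* circular buffer bookkeeping      */
    };

    /* Called when a message arrives; fails (error to the sender) if the buffer is full. */
    bool buffer_put(struct bounded_buffer *b, const char *msg)
    {
        if (b->count == BUF_SLOTS)
            return false;                    /* buffer full: unsuccessful communication */
        strncpy(b->messages[b->tail], msg, MSG_SIZE - 1);
        b->messages[b->tail][MSG_SIZE - 1] = '\0';
        b->tail = (b->tail + 1) % BUF_SLOTS;
        b->count++;
        return true;
    }

    /* Called when the receiving process executes receive(); fails if nothing is buffered. */
    bool buffer_get(struct bounded_buffer *b, char *out)
    {
        if (b->count == 0)
            return false;
        memcpy(out, b->messages[b->head], MSG_SIZE);
        b->head = (b->head + 1) % BUF_SLOTS;
        b->count--;
        return true;
    }

The flow-controlled alternative would instead block in buffer_put() until count drops below BUF_SLOTS.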

Multi-Datagram Messages
- Every network has a maximum data packet size, the maximum transfer unit (MTU)
- Any message larger than the MTU has to be split into multiple pieces, i.e. multiple datagrams
- There is an additional process of reassembling the message at the receiver
- This can introduce reliability issues, complicating communication

Encoding/Decoding of Message Data
- Problems with the semantics of data:
  - Absolute pointers lose their meaning. For example, for a tree object all the node values and the information about its structure are required; the sender and receiver may have different architectures, and the receiver needs to reconstruct the object
  - Program objects have different sizes; the receiver needs to know the object type
- So encoding is needed at the sender and decoding at the receiver

Types of Encoding
- Tagged representation: a tag sent along with each data item describes that item. Examples: the CCITT ASN.1 standard, the Mach distributed OS
- Untagged representation: the receiver knows the structure, so it can reconstruct the data. Example: Sun XDR (eXternal Data Representation)
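A sketch of the tagged representation: each item travels as a (tag, length, value) triple so the receiver can decode it without prior knowledge of the structure. The tag values and layout are purely illustrative, not ASN.1.

    #include <stdint.h>
    #include <string.h>

    enum tag { TAG_INT32 = 1, TAG_STRING = 2 };    /* illustrative tag values */

    /* Append one tagged item to 'out' and return the number of bytes written.
     * A real codec would also convert the length and value to a standard byte
     * order; that step is omitted here. */
    size_t encode_tagged(uint8_t *out, enum tag t, const void *value, uint32_t len)
    {
        out[0] = (uint8_t)t;                       /* tag: describes the data item */
        memcpy(out + 1, &len, sizeof len);         /* length of the value in bytes */
        memcpy(out + 1 + sizeof len, value, len);  /* the value itself             */
        return 1 + sizeof len + len;
    }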

Process Addressing
- Explicit addressing: the process id is a parameter of the communication primitive
  - send(process_id, message)
  - receive(process_id, message)
- Implicit (functional) addressing:
  - send_any(service_id, message)
  - receive_any(process_id, message)

Process Addressing in Unix BSD
- machine_id@local_id
- 32-bit IP address as the machine_id, 16-bit local process id
- No global coordination required
- A process cannot be transferred for load balancing or any other reason

Process Addressing with Migration
- machine_id, local_id, and a second machine_id
- The third field is the latest known location of the process
- A link holding the original machine id and the new machine id is left on the old machine
- On the new machine a mapping table points to the new local_id against the old id

Disadvantages of Addressing with Migration
- The overhead of locating a process increases as the process migrates several times
- The chain can be broken and the process become unlocatable if an intermediate node is down
- Location transparency fails
- Problems of poor reliability and scalability

Process Addressing - Achieving Transparency
- Maintain a central allocator (a simple counter): it allocates the current count as the process id and increments the counter
- Or use name servers:
  - A table maps a high-level id to the corresponding low-level id
  - The high-level id is an ASCII string
  - The low-level id is in machine_id@local_id format
- Problems of poor reliability and scalability
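A sketch of the name-server idea: a table mapping a high-level ASCII id to a low-level machine_id@local_id address. The entries and the lookup are illustrative; a real name server would be a separate service reached over the network.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    struct low_level_id {
        uint32_t machine_id;               /* e.g. a 32-bit IP address       */
        uint16_t local_id;                 /* e.g. a 16-bit local process id */
    };

    struct name_entry {
        const char          *high_level_id;   /* ASCII name of the service/process */
        struct low_level_id  addr;             /* machine_id@local_id               */
    };

    /* Hypothetical registrations. */
    static const struct name_entry name_table[] = {
        { "print_service", { 0xC0A80001u, 512 } },
        { "time_service",  { 0xC0A80002u, 513 } },
    };

    bool name_lookup(const char *name, struct low_level_id *out)
    {
        for (size_t i = 0; i < sizeof name_table / sizeof name_table[0]; i++) {
            if (strcmp(name_table[i].high_level_id, name) == 0) {
                *out = name_table[i].addr;
                return true;               /* found: caller can now send to this address */
            }
        }
        return false;                      /* not registered with the name server */
    }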

Failure Handling
- Distributed systems offer the potential for parallel operation
- But they are prone to partial failures:
  - Node crashes
  - Communication link failures
- These lead to the following problems

Problems Caused by Communication Failures
- Loss of the request message: link failure, or the receiver node is down
- Loss of the response message: link failure, or the sender node is down
- Unsuccessful execution of the request: the receiving node crashed while processing it


[Figure: 4-message IPC protocol - client sends a request, server acknowledges it, server sends the reply, client acknowledges the reply.]

[Figure: 3-message IPC protocol - client sends a request, server sends the reply, client acknowledges the reply.]

[Figure: 2-message IPC protocol - client sends a request, server sends the reply.]

Idempotency & Handling Duplicate Requests
- Idempotency means repeatability
- Non-idempotent routines can cause problems
- Avoid executing a non-idempotent routine if the request is a duplicate:
  - Number the requests and cache the replies
  - When a request is received, check whether a reply already exists
  - If it exists, send the earlier reply again

Lost & Out-of-Sequence Packets in Multi-Datagram Messages
- The header of each packet contains two extra fields:
  - The size of the full message
  - A bitmap indicating the position of the packet in the sequence
- After a timeout the receiver sends a bitmap indicating the packets not yet received
- The sender retransmits only the missing packets (selective repeat)
- When reception is complete, the receiver sends an acknowledgement for the whole message
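A sketch of the reply-cache scheme from the idempotency slide above: requests carry a sequence number, replies are cached, and a duplicate request simply gets the cached reply resent instead of re-executing a non-idempotent operation. The sizes and the direct-mapped cache are arbitrary choices for the sketch.

    #include <stdint.h>
    #include <stdbool.h>
    #include <string.h>

    #define CACHE_SLOTS 64
    #define REPLY_SIZE  128

    struct reply_cache_entry {
        bool     valid;
        uint32_t request_no;               /* sequence number of the request       */
        char     reply[REPLY_SIZE];        /* reply that was sent for that request */
    };

    static struct reply_cache_entry cache[CACHE_SLOTS];

    /* Returns true and fills 'reply' if this request was already executed,
     * so the server should resend the old reply rather than re-execute. */
    bool lookup_cached_reply(uint32_t request_no, char *reply)
    {
        struct reply_cache_entry *e = &cache[request_no % CACHE_SLOTS];
        if (e->valid && e->request_no == request_no) {
            memcpy(reply, e->reply, REPLY_SIZE);
            return true;                   /* duplicate request: reuse the earlier reply */
        }
        return false;                      /* new request: execute it normally           */
    }

    /* Record the reply after executing a new request. */
    void store_reply(uint32_t request_no, const char *reply)
    {
        struct reply_cache_entry *e = &cache[request_no % CACHE_SLOTS];
        e->valid      = true;
        e->request_no = request_no;
        memcpy(e->reply, reply, REPLY_SIZE);
    }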

Group Communication
- One to many (single sender, multiple receivers)
- Many to one (multiple senders, one receiver)
- Many to many (multiple senders, multiple receivers)

Communication Issues - One to Many
- Also known as multicast communication
- Broadcast: multicast to ALL receivers
- Issues:
  - Group management and addressing
  - Message delivery to receiver processes
  - Buffered and unbuffered multicast
  - Send-to-all and bulletin-board semantics
  - Flexible reliability in multicast communication
  - Atomic multicast
  - Group communication primitives

Group Management
- Open and closed groups; flexibility is required to accommodate both
- Group creation/deletion and group membership can be managed by group servers
- A central group server has poor reliability and scalability
- Replication is possible, but consistency maintenance then becomes an issue/overhead

Group Addressing
- Two-level naming
- The top level is an ASCII string; the lower level is quite often dependent on the underlying hardware
- On networks with multicast addresses, use a multicast address; otherwise use a broadcast address
- If the machines are not on the same LAN, one-to-one messages have to be sent; the lower-level address is then a list of the machines running processes in the group

Message Delivery to Receiver Processes
- The application sends a message to the higher-level id
- The sender's kernel contacts the group server, which holds the lower-level id and the process ids of the group members
- The list of process ids is inserted into the message
- If the lower-level id is a multicast/broadcast address, one message goes to that address
- If it is a list of ids, the message is sent to each machine
- In a broadcast group, a machine may find the message irrelevant and discard it
- The sender is unaware of the group size

Buffered and Unbuffered Multicast
- Multicasting is asynchronous
- It is unrealistic to expect the sender to be blocked until all group members have received the message
- The sending process is unaware of how many members there are
- Unbuffered: receivers that are not ready lose the message
- Buffered: the message is buffered, so eventually everyone will receive it

Send-to-All and Bulletin-Board Semantics
- Send to all: the message goes to all members and is buffered until read by all members of the multicast group
- Bulletin board: the message is sent to a channel; receivers copy the message off the channel and act on it
- Bulletin-board semantics avoid problems as follows:
  - A receiver reads and acts on the message only when it is in an appropriate state
  - The sender receives responses only when it is in the right state
  - The message may be withdrawn when the service is no longer required

Flexible Reliability in Multicast
- The reliability of response required depends on the application and falls into one of the following classes:
  - 0-reliable: no response required, e.g. multicast of time information
  - 1-reliable: exactly one response is required, e.g. a bidding situation
  - m-out-of-n reliable: m responses (1 < m < n) are required, e.g. majority voting for a consistency check
  - All-reliable: all responses are required, e.g. files copied to file servers

Atomic Multicast
- The multicast completes only when ALL receivers have received the message
- The usual timeout-and-repeat continues until all receivers have acknowledged
- This fails if either the sender or one or more receivers fail during the process
- One way around this (Tanenbaum) is for every receiver to do an atomic multicast of the message it received; eventually all will receive the message
- Used only in rare cases

Group Communication Primitives
- One-to-one and one-to-many communication could use the same send primitive, since one message is sent out in either case
- However, there can be confusion when finding the lower-level address: should the name server or a group server be contacted?
- Separate primitives such as send and send_group allow send_group to carry parameters such as the level of reliability and atomicity

Communication Issues - Many to One
- Selective or non-selective receiver
- Non-determinism is an issue: it is not known which sender will have the right information for the receiver first
- Sometimes it is useful for a group of senders to be controlled dynamically, for example allowing producers to write only when the buffer is not full
- This is more of a programming issue than an OS issue

Communication Issues - Many to Many
- One-to-one and one-to-many are special cases of general many-to-many communication, so those constraints apply
- In addition, ordered delivery of messages is important in the many-to-many case
- Message sequencing is required; in the one-to-many case this is trivial
- In many-to-one the receiver has to order the messages it receives
- In many-to-many a message can arrive at different times at different nodes, due to the different delays seen over a LAN/WAN

Ordered Message Delivery Semantics
- Absolute ordering
- Consistent ordering
- Causal ordering

Ordered Delivery - Absolute Ordering
- Use a global timestamp as the sequence number
- Keep the clocks at every node synchronized
- Use a time window to deliver messages from the queue: a message with a lower timestamp may arrive later, and the time window takes care of that

Ordered Delivery - Consistent Ordering (1)
- The same order of arrival at all receivers matters, rather than the order in which the messages were sent
- One method is to break many-to-many communication into many-to-one and one-to-many parts
- Senders send their messages to one receiver, called the sequencer, that sequences the messages
- The sequencer multicasts to all the other designated receivers
- Each receiver holds messages in a buffer and delivers a message only if there are no gaps in the sequence, otherwise it waits for the missing messages to arrive

Ordered Delivery - Consistent Ordering (2)
- The sequencer approach has poor reliability; it is liable to a single point of failure
- The ABCAST protocol is better: it is a distributed algorithm and hence not subject to a single point of failure
  - The sender assigns a sequence number greater than any it has assigned earlier (a simple counter will do) and sends the message to all members of the group
  - Each member i returns a proposed sequence number max(Fmax, Pmax) + 1 + i/N, where i is the member number, N is the total number of members, Fmax is the largest sequence number proposed so far in the group, and Pmax is the largest number the member itself has proposed
  - The sender chooses the largest proposed number and announces it to everyone as the commit number
  - All members use this final committed number as the sequence number; delivery to applications uses this number

Ordered Delivery - Causal Ordering
- Strict consistent ordering may not always be required; a weaker causal ordering is often enough
- If there is a happened-before relationship between messages, they must be delivered in that order; otherwise the order need not be maintained

Ordered Delivery - Causal Ordering (CBCAST)
- One method is CBCAST
- Each member maintains a vector of n entries; the value at each position is the sequence number of the last in-sequence message received from that member
- When sending a message, the sender increments its own element in the vector and sends the vector along with the message
- The message is buffered in the runtime system and two tests are applied; it is delivered to the receiving process only if it passes both
- With S the vector in the message from sender i and R the receiver's vector: S[i] = R[i] + 1, and S[j] <= R[j] for all j != i
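The ABCAST proposal step can be written out directly; this small sketch just evaluates max(Fmax, Pmax) + 1 + i/N for member i (the fractional part keeps proposals from different members distinct):

    /* Member i's proposed sequence number under ABCAST.
     * Fmax: largest sequence number this member has seen proposed in the group.
     * Pmax: largest sequence number this member itself has proposed so far.
     * N:    total number of group members. */
    double abcast_propose(double Fmax, double Pmax, int i, int N)
    {
        double base = (Fmax > Pmax) ? Fmax : Pmax;
        return base + 1.0 + (double)i / (double)N;
    }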


[Figure: CBCAST worked example with vectors (3, 2, 5, 1), (3, 2, 5, 1), (2, 2, 5, 1) and (3, 2, 4, 1); one receiver fails the first test (A[1] = C[1] + 1 does not hold) and another fails the second test at A[3].]
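The two delivery tests illustrated above translate directly into code. In this sketch S is the vector carried by a message from sender i, R is the receiving site's current vector, and the group size N is fixed at 4 only to match the example:

    #include <stdbool.h>

    #define N 4                              /* number of group members (matches the example) */

    /* Returns true if the message carrying vector S from 'sender' may be
     * delivered at a site whose current vector is R. */
    bool cbcast_can_deliver(const int S[N], const int R[N], int sender)
    {
        if (S[sender] != R[sender] + 1)
            return false;                    /* first test: not the next expected message from the sender */
        for (int j = 0; j < N; j++) {
            if (j != sender && S[j] > R[j])
                return false;                /* second test: a message the sender had seen is still missing here */
        }
        return true;                         /* causally safe: hand the message to the application */
    }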