CS-556: Distributed Systems Manolis Marazakis Inter-process Communication (III)


CS-556: Distributed Systems

Manolis Marazakis

Inter-process Communication (III)


Fall Semester 2005 CS-556: Distributed Systems

Berkeley Sockets (I)

Socket primitives for TCP/IP.

Primitive   Meaning
Socket      Create a new communication endpoint
Bind        Attach a local address to a socket
Listen      Announce willingness to accept connections
Accept      Block caller until a connection request arrives
Connect     Actively attempt to establish a connection
Send        Send some data over the connection
Receive     Receive some data over the connection
Close       Release the connection


Berkeley Sockets (II)

Connection-oriented communication pattern using sockets.
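The connection-oriented pattern can be sketched with Python's standard socket module. This is a minimal illustration, not part of the original slides: the function names (serve_once, echo_roundtrip) are made up for the example, and both ends run in one process on the loopback interface so that it is self-contained.

```python
import socket
import threading


def serve_once(listener: socket.socket) -> None:
    """Accept one connection and echo what arrives until the peer closes."""
    conn, _peer = listener.accept()          # Accept: block until a request arrives
    with conn:
        while True:
            data = conn.recv(4096)           # Receive
            if not data:                     # empty read: peer closed its side
                break
            conn.sendall(data)               # Send


def echo_roundtrip(payload: bytes) -> bytes:
    """Run a one-shot echo server on the loopback interface and query it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)   # Socket
    listener.bind(("127.0.0.1", 0))          # Bind: port 0 lets the OS choose
    listener.listen(1)                       # Listen
    port = listener.getsockname()[1]
    server = threading.Thread(target=serve_once, args=(listener,))
    server.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)     # Socket
    client.connect(("127.0.0.1", port))      # Connect
    client.sendall(payload)                  # Send
    client.shutdown(socket.SHUT_WR)          # signal: no more data from the client
    chunks = []
    while True:
        data = client.recv(4096)             # Receive
        if not data:
            break
        chunks.append(data)
    client.close()                           # Close
    server.join()
    listener.close()
    return b"".join(chunks)


if __name__ == "__main__":
    print(echo_roundtrip(b"hello, sockets"))
```

The server side follows socket, bind, listen, accept; the client side follows socket, connect. Both then exchange data and close. The sketch is meant for small payloads only.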


Connected vs Connectionless (I)

• IP: best-effort, unreliable, connectionless
  • Remembers nothing about a packet after it has sent it
  • Checksum computed on header only
  • No assumptions about the underlying physical medium
    • Serial link, Ethernet, Token ring, X.25, ATM, wireless CDPD, …
• UDP:
  • (optional) checksum
  • notion of port


Connected vs Connectionless (II)

• TCP: reliable, connection-oriented service
  • Segments are sent in IP datagrams
  • Checksum of data in each segment
  • Sequence # of the 1st byte in the segment
  • Acknowledge-and-retransmit mechanism
• Each side maintains a receive window
  • Range of sequence #s that this side is prepared to receive
  • Any arriving data with a sequence # outside the receive window is discarded
  • Queuing of data arriving out-of-order
  • Window slides to the right if the next expected sequence # has arrived … and an ACK is sent back with the sequence # expected next
• Send window:
  • Bytes sent but not yet acknowledged
    • RTO timer (retransmission timeout)
    • Timeout does not always mean that the data was lost !!
  • Bytes that can be sent but have not yet been sent
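The receive-window rules can be illustrated with a toy model. This is a sketch under simplifying assumptions, not how any real TCP stack is written: the class name ReceiveWindow is hypothetical, a segment is accepted only if it lies entirely inside the window, and partially overlapping segments are not handled.

```python
class ReceiveWindow:
    """Toy model of a TCP-like receive window (byte sequence numbers)."""

    def __init__(self, size: int, initial_seq: int = 0) -> None:
        self.size = size
        self.next_expected = initial_seq      # left edge of the window
        self.out_of_order = {}                # seq -> data held until the gap fills
        self.delivered = bytearray()          # in-order bytes, ready for the application

    def receive(self, seq: int, data: bytes) -> int:
        """Process one segment and return the ACK (next expected sequence #)."""
        left = self.next_expected
        right = left + self.size
        if seq < left or seq + len(data) > right:
            return self.next_expected         # outside the window: discard, re-ACK
        self.out_of_order[seq] = data         # queue the segment (may be out of order)
        # Slide the window to the right while the next expected byte is available.
        while self.next_expected in self.out_of_order:
            chunk = self.out_of_order.pop(self.next_expected)
            self.delivered += chunk
            self.next_expected += len(chunk)
        return self.next_expected
```

A segment that arrives ahead of a gap is queued and the ACK does not advance. Once the missing segment arrives, the window slides past both and the ACK jumps forward.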


UDP Failure Model
• Omission failures
  • timeouts, duplicate messages, lost messages
  • Need to maintain history
    • Last reply sent to each client, provided that a client can make only one request at a time
    • Server interprets each request as the ACK for the previous reply
    • Periodic “purge” of history
    • No ACK for the last response received before client terminates
  • Fixed max. buffer size (8 KB)
  • No message order guarantee
• Process crash failures
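A reply history of this kind might look like the following sketch. It assumes that each client has at most one outstanding request and tags requests with an identifier. The class ReplyHistory, its handler callback and the executions counter are hypothetical names introduced only for illustration.

```python
class ReplyHistory:
    """Remembers the last reply per client so duplicate requests are not re-executed."""

    def __init__(self, handler):
        self.handler = handler        # hypothetical application callback
        self.last = {}                # client id -> (request id, reply)
        self.executions = 0           # how many times the handler really ran

    def handle(self, client_id, request_id, request):
        cached = self.last.get(client_id)
        if cached is not None and cached[0] == request_id:
            return cached[1]          # duplicate (e.g. a retransmission): resend old reply
        # A new request id implicitly ACKs the previous reply, which is overwritten.
        reply = self.handler(request)
        self.executions += 1
        self.last[client_id] = (request_id, reply)
        return reply

    def purge(self, client_id):
        """Drop the history of a client, e.g. from a periodic clean-up task."""
        self.last.pop(client_id, None)
```

The purge method covers the case where the last reply is never acknowledged because the client has terminated.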


TCP Failure Model
• Reliable message delivery
  • checksums, sequence numbers, timeouts
  • no need for applications to deal with retransmissions, duplicates, reordering
  • no need for histories
• Flow control mechanism
  • large transfers without overwhelming the receiver
• … BUT not reliable sessions:
  • Connections may be severed or severely congested
  • Processes cannot distinguish network failure from process failure
  • Processes cannot tell if their recent messages were received


TCP is a stream protocol
• No inherent notion of “message boundary”
• The amount of data in a packet is not directly related to the amount of data delivered to TCP in the send() call
• No reliable way for the receiver to determine how the data was packetized
  • Several packets may have arrived between recv() calls
• The amount of data returned in any given read() is unpredictable
• Fixed-length messages
• Variable-length messages
  • End-of-record marker
  • Fixed-length header (including record length) + variable data
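The "fixed-length header + variable data" approach can be sketched as follows. The 4-byte length field in network byte order and the helper names (recv_exact, send_record, recv_record) are assumptions made for this example. They are not taken from the slides.

```python
import socket
import struct

HEADER = struct.Struct("!I")      # assumed header: 4-byte record length, network byte order


def recv_exact(sock: socket.socket, n: int) -> bytes:
    """Keep calling recv() until exactly n bytes have arrived."""
    buf = bytearray()
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection mid-record")
        buf += chunk
    return bytes(buf)


def send_record(sock: socket.socket, payload: bytes) -> None:
    """Send one record: length header followed by the payload."""
    sock.sendall(HEADER.pack(len(payload)) + payload)


def recv_record(sock: socket.socket) -> bytes:
    """Read one complete record, however the stream happened to be packetized."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

The loop in recv_exact is the essential part: a single recv() may return fewer bytes than requested, so the receiver reassembles records itself.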


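TCP Failure Modes (I)
• “TCP guarantees delivery of the data it sends”: True or False ?
  • Guarantee to whom ?
  • False … How can we handle outages & crashes ?

[Figure: data path between Application (A) and Application (B) through the TCP, IP and NIC layers of each host and the IP/NIC layers of intermediate nodes; applications run in user-space, protocol layers in kernel-space]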


TCP Failure Modes (II)
• IP is a best-effort, unreliable protocol … so the TCP layer is the first place in the data path where it makes sense to even talk about guarantees
• The sender’s TCP layer can make no guarantee about segments that arrive at the receiver’s TCP layer
  • An arriving segment may be corrupted, or it may contain duplicate data, or it may be out of order …
• The receiver’s TCP layer guarantees to the sender’s TCP layer that any segment that it ACKs & all data that came before it have been correctly received
  • This does not mean that the data has been delivered to the application … or that it will ever be delivered !!
  • For example, the receiving host may crash after the ACK but before delivery …


TCP Failure Modes (III)
• It also makes sense to talk about guarantees at application B (receiver)
  • There can be no guarantee that all data sent by application A will arrive
  • However, all data that does arrive will be in order and uncorrupted
• Avoid the attitude that “TCP will take care of everything”
• TCP is an end-to-end protocol, providing a reliable transport mechanism between peers …
  • The “peers” are the TCP layers of the sender & the receiver !!


TCP Failure Modes (IV)
• Explicit acknowledgements
  • What does the client do if the server does not ACK receipt ??
  • It may not be safe to simply resend a request …
• Network outage, peer crashes, peer’s host crash
• When a problem occurs at an endpoint, there is generally no alternative path
  • The problem persists until it is repaired
• An intermediate router may send the originator an ICMP message indicating that the destination network or the host is unreachable
• OR: The sender eventually times out & resends the segments not ACKed. This continues until the sender gives up & drops the connection (~9 minutes).
  • Pending read → ETIMEDOUT
  • Otherwise, the next write fails → SIGPIPE or EPIPE


TCP Failure Modes (V)

• Peer crash: indistinguishable from the case of the peer calling close() and then exit()
• The peer’s TCP layer issues a FIN segment
  • This does not necessarily imply that the peer has no more data to send, or even that it is not willing to receive more data …
• Reception of the FIN may come at different execution states of the application
  • If client is blocked, TCP has no way of notifying it
    • The next transmission generates a RST segment → ECONNRESET
    • If the RST is ignored & more data is transmitted → SIGPIPE
    • This may occur if the client performs >=2 consecutive write() calls without an intervening read()
    • Notification takes place only after the 2nd write()
  • If client has a pending read(), it gets an immediate error indication (eg: read() returns EOF)
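The EOF indication can be observed with a short sketch over a loopback TCP connection. The helpers loopback_pair and read_until_eof are hypothetical names for this example. An orderly close() on one side stands in for the crash, since the two cases look the same to the reader.

```python
import socket


def loopback_pair():
    """Return two connected TCP sockets on the loopback interface."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))           # port 0: let the OS pick a free port
    listener.listen(1)
    client = socket.create_connection(listener.getsockname())
    server, _addr = listener.accept()
    listener.close()
    return client, server


def read_until_eof(sock: socket.socket) -> bytes:
    """Collect data until recv() returns b"", i.e. the peer's FIN has arrived."""
    chunks = []
    while True:
        data = sock.recv(4096)
        if not data:                          # EOF: peer closed, or crashed and its OS sent the FIN
            return b"".join(chunks)
        chunks.append(data)


if __name__ == "__main__":
    client, server = loopback_pair()
    server.sendall(b"last words")
    server.close()                            # the kernel sends the FIN on behalf of the peer
    print(read_until_eof(client))
    client.close()
```

The reader learns only that no more data will arrive; it cannot tell from the EOF alone whether the peer exited normally or crashed.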


TCP Failure Modes (VI)
• Peer’s host crash:
  • The peer’s TCP cannot issue the FIN segment
  • Until recovery, this case cannot be distinguished from a network outage
  • The peer’s TCP no longer responds, but the sender keeps retransmitting …
    • Until either the host recovers, or the sender gives up the connection → ETIMEDOUT
  • If the host reboots before the sender gives up, a retransmitted segment may arrive at the TCP layer … without it having knowledge of the connection → RST
    • If sender has a read() pending → ECONNRESET
    • Else, the next write() results in a SIGPIPE signal


Behavior of Peers
• Checking for client termination
  • Heartbeats, timeouts for read operations, SO_KEEPALIVE option, …
• Checking for valid input
  • Buffer overflow errors
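Two of these checks, the SO_KEEPALIVE option and a read timeout, can be set up as in the sketch below. The function names configure_liveness and recv_or_none are hypothetical. Keep-alive probe intervals are platform-specific and are left at their system defaults here.

```python
import socket


def configure_liveness(sock: socket.socket, read_timeout: float) -> None:
    """Enable TCP keep-alive probes and bound how long a read may block."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    sock.settimeout(read_timeout)     # recv() raises socket.timeout when it expires


def recv_or_none(sock: socket.socket, n: int):
    """Return received bytes, or None if nothing arrived within the timeout."""
    try:
        return sock.recv(n)
    except socket.timeout:
        return None
```

A None result only says that the peer was silent for the timeout period. An application-level heartbeat would be needed to decide whether the peer is still alive.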


We rely on DNS …


The Message-Passing Interface

Some of the most intuitive primitives of MPI.

Primitive      Meaning
MPI_bsend      Append outgoing message to a local send buffer
MPI_send       Send a message and wait until copied to local or remote buffer
MPI_ssend      Send a message and wait until receipt starts
MPI_sendrecv   Send a message and wait for reply
MPI_isend      Pass reference to outgoing message, and continue
MPI_issend     Pass reference to outgoing message, and wait until receipt starts
MPI_recv       Receive a message; block if there are none
MPI_irecv      Check if there is an incoming message, but do not block


Group Communication
• Multicasting: 1-to-many comm. pattern
• Applications:
  • replicated services (better fault tolerance)
  • discovery of services
  • replicated data (better performance)
  • propagation of event notifications
• Failure model: depends on implementation:
  • IP multicast (UDP datagrams): omission failures
    • class-D Inet addresses: “1110” bit prefix
    • TTL
  • reliable multicast
  • ordered multicast
    • FIFO, Causal, Total
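The “1110” prefix rule for class-D addresses can be checked in a few lines. The function name is_class_d is made up for this sketch; the standard ipaddress module is used only to parse the dotted-quad string.

```python
import ipaddress


def is_class_d(address: str) -> bool:
    """True if the IPv4 address starts with the bits 1110 (224.0.0.0 to 239.255.255.255)."""
    first_octet = int(ipaddress.IPv4Address(address)) >> 24
    return first_octet >> 4 == 0b1110
```

For example, 224.0.0.1 has first octet 11100000 and qualifies, while 192.168.1.1 (11000000) does not.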


Conventional Procedure Call

a) Parameter passing in a local procedure call: the stack before the call to read

b) The stack while the called procedure is active


Software layers

[Figure: layered stack, top to bottom: Applications and Services; RPC and RMI; request-reply protocol, marshalling and external data representation; UDP and TCP. The layers between the applications and UDP/TCP are labelled "middleware".]

RPC is more than a (transport) protocol: a structuring mechanism for distributed systems


Steps of a Remote Procedure Call
1. Client procedure calls client stub in normal way
2. Client stub builds message, calls local OS
3. Client's OS sends message to remote OS
4. Remote OS gives message to server stub
5. Server stub unpacks parameters, calls server
6. Server does work, returns result to the stub
7. Server stub packs it in message, calls local OS
8. Server's OS sends message to client's OS
9. Client's OS gives message to client stub
10. Stub unpacks result, returns to client
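The ten steps can be traced in a single-process sketch. Everything here is illustrative: the message layout (a procedure number plus one 64-bit argument), the PROCEDURES table and the transport function are assumptions. A direct function call takes the place of the two operating systems and the network.

```python
import struct

PROCEDURES = {}                      # procedure number -> server implementation


def square(x: int) -> int:
    """The server procedure (step 6: does the work)."""
    return x * x


PROCEDURES[1] = square


def server_stub(request: bytes) -> bytes:
    """Steps 5-7: unpack parameters, call the server, pack the result."""
    proc, arg = struct.unpack("!Iq", request)
    result = PROCEDURES[proc](arg)
    return struct.pack("!q", result)


def transport(request: bytes) -> bytes:
    """Stand-in for steps 3-4 and 8-9 (both operating systems and the network)."""
    return server_stub(request)


def square_client_stub(x: int) -> int:
    """Steps 2 and 10: build the request message, then unpack the reply."""
    request = struct.pack("!Iq", 1, x)
    reply = transport(request)
    (result,) = struct.unpack("!q", reply)
    return result


if __name__ == "__main__":
    print(square_client_stub(11))    # step 1: the client calls the stub like a local procedure
```

The caller sees an ordinary function; all packing, sending and unpacking is hidden inside the two stubs.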


Client and Server Stubs

Principle of RPC between a client & server program.


Example (Sun RPC - ONC)
• long square(long) example
  • client ren.eecis.udel.edu 11
  • result: 121
• Need RPC specification file (square.x): defines procedure name, arguments & results
• Run rpcgen square.x: generates square.h, square_clnt.c, square_xdr.c, square_svc.c
• square_clnt.c & square_svc.c: Stub routines for client & server
• square_xdr.c: XDR (External Data Representation) code - takes care of data type conversions


RPC Specification File (square.x)

struct square_in {
    long arg1;
};

struct square_out {
    long res1;
};

program SQUARE_PROG {
    version SQUARE_VERS {
        square_out SQUAREPROC(square_in) = 1;   /* procedure # */
    } = 1;                                      /* version # */
} = 0x31230000;                                 /* program # (must fit in 32 bits) */

IDL – Interface Definition Language


Parameter Specification & Stub Generation

[Figure: a procedure and the corresponding message]


Writing a Client & a Server

The steps in writing a client & a server in DCE RPC.


Binding (SUN RPC)
• Port Mapper (rpcbind) listens at UDP port 111
• Server registers program ID & version
  • rpcinfo -p -> display all registered RPC servers
• When client issues clnt_create, the port mapper is contacted:
  • program-to-port number mapping
  • arguments: (program ID, version, protocol)
  • response: server’s port number


Binding (DCE)


Passing Value Parameters (I)


Passing Value Parameters (II)

a. Original message on Pentium (little-endian)
b. The message after receipt on SPARC (big-endian)
c. The message after being inverted.
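The three cases can be reproduced with Python's struct module. The message contents (the integer 5 followed by the 4-character string "JILL") and the helper invert_words are assumptions chosen for this sketch, since the original figure is not part of the transcript.

```python
import struct

# (a) Message as laid out by a little-endian sender: a 32-bit integer and a 4-byte string.
message = struct.pack("<I4s", 5, b"JILL")

# (b) A big-endian receiver that reads the same bytes unchanged misreads the integer.
misread_int, text_ok = struct.unpack(">I4s", message)


def invert_words(msg: bytes) -> bytes:
    """Reverse the byte order of every 4-byte word."""
    return b"".join(msg[i:i + 4][::-1] for i in range(0, len(msg), 4))


# (c) After inverting every word, the integer is correct but the string is garbled.
fixed_int, garbled_text = struct.unpack(">I4s", invert_words(message))
```

Blind inversion cannot work in general: the receiver needs to know which words are integers and which are character data, which is what an agreed external data representation provides.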


Passing Value Parameters (III)
• How to pass pointers ?
  • Meaningful only within a specific address space !
• Arrays (of known length) & structures:
  • Copy/restore semantics (between stubs)
  • IN/OUT/INOUT markers
    • Optimization: may eliminate one copy operation
• Pointer to an arbitrary data structure ?
  • No general solution
  • Work-around: pass back the pointer to its “source”


External Data Representation (I)

• Data structures:
  • “flattened” on transmission
  • rebuilt upon reception
• Primitive data types:
  • byte order (big-endian: MSB comes first)
  • ASCII vs UNICODE (2 bytes per character)
• marshalling/unmarshalling
  • to/from agreed external format


External Data Representation (II)

• XDR (RFC 1832), CDR (CORBA), Java: data -> byte stream
  • object references
• HTTP/MIME: data -> ASCII text

[Figure: fields of an object reference: IP address, port, time, object ID, interface ID]


CORBA CDR example:

The flattened form represents a Person struct with value: {‘Smith’, ‘London’, 1934}

index in sequence of bytes   4 bytes   notes on representation
0–3                          5         length of string
4–7                          "Smit"    ‘Smith’
8–11                         "h___"
12–15                        6         length of string
16–19                        "Lond"    ‘London’
20–23                        "on__"
24–27                        1934      unsigned long
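The layout in the table can be reproduced with a short sketch. It follows the slide's simplified layout rather than the full CORBA rules: big-endian byte order and zero bytes as padding are assumptions, the string length is written exactly as in the table, and the function names are made up for the example.

```python
import struct


def pad4(raw: bytes) -> bytes:
    """Pad with zero bytes up to the next multiple of 4."""
    return raw + b"\x00" * (-len(raw) % 4)


def marshal_string(text: str) -> bytes:
    """4-byte length followed by the characters, padded to a 4-byte boundary."""
    raw = text.encode("ascii")
    return struct.pack(">I", len(raw)) + pad4(raw)


def marshal_person(name: str, place: str, year: int) -> bytes:
    """Flatten the Person struct: two strings followed by an unsigned long."""
    return marshal_string(name) + marshal_string(place) + struct.pack(">I", year)


if __name__ == "__main__":
    flat = marshal_person("Smith", "London", 1934)
    print(len(flat), flat)
```

The result is 28 bytes long, matching byte indexes 0 to 27 in the table.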


Properties of TCP
• Connected vs Connectionless Protocols
• TCP is a stream protocol
• Performance of TCP
• Avoid re-inventing TCP !!
• TCP failure modes
• Behaviour of peers
• LAN vs WAN testing
• Tools & Resources


Basic socket calls

[Figure: sequence of socket calls.
SERVER: socket, bind (localhost sockaddr_in), listen, accept (peer sockaddr_in), recv/send.
CLIENT: socket, connect (peer sockaddr_in), recv/send.]


Performance of TCP (I)
• 4.4BSD Implementation:
  • UDP: ~800 LOC
  • TCP: ~4,500 LOC
• CPU processing: checksums, data copying
• TCP ACKs:
  • Receiver can piggyback the ACK
  • Usually every second segment is ACKed ..
  • May even delay ACKs (up to 0.5 sec)
• Connection setup: 3 segments
  • 1 ½ RTT: SYN, SYN+ACK, ACK
• Connection tear-down: 4 segments
  • FIN, ACK, FIN (server-to-client), ACK
  • Except the last segment, these can be combined with data-bearing segments


Performance of TCP (II)
• Results from a benchmark involving transmission of 5,000 data blocks
  • UDP datagram size = TCP write size = 1,440 bytes
    • Ethernet frame: 1,500 bytes
    • IP header: 20 bytes, TCP header: 20 bytes
    • TCP options: 12 bytes
  • Average over 50 runs
• Client produces data blocks, transmits them, and then exits
• Server may run on:
  • localhost (127.0.0.1)
  • Same host as the client, but given as an address
  • Other host


Performance of TCP (III)

Server       TCP               UDP
             time    MB/sec    time    MB/sec   drops
Client       2.88    2.5       1.96    3.67     336
Localhost    0.95    7.53      1.97    3.64     272
Remote       7.18    1.002     5.82    1.23     440

Localhost (loop-back): MTU=16,384
Client (network I/f): MTU=1,500


Performance of TCP (IV)

Server       TCP               UDP
             time    MB/sec    time    MB/sec   drops
Client       1.05    1.41      1.63    0.91     212
Remote       1.55    0.965     1.91    0.78     306

Results for write size=300 bytes


Avoid re-inventing TCP !!
• Retransmissions ?
  • RTO
    • Must be adjustable
    • Exponential back-off
• Flow control
  • Sliding window
• Congestion control
• Matching replies to requests ?
  • Sequence # for each request
• Efficiency of the implementation ?
  • TCP code base is highly optimized … and runs in kernel-space
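As a small illustration of one item an application-level protocol would have to re-implement, exponential back-off of the retransmission timeout can be sketched as follows. The function name and the cap parameter max_rto are assumptions for this example; real TCP additionally adapts the initial RTO to measured round-trip times.

```python
def backoff_schedule(initial_rto: float, max_rto: float, attempts: int) -> list:
    """Timeouts for successive retransmissions: double each time, capped at max_rto."""
    schedule = []
    rto = initial_rto
    for _ in range(attempts):
        schedule.append(rto)
        rto = min(rto * 2, max_rto)
    return schedule
```

Doubling the timeout after every unanswered attempt keeps a sender from flooding a path that is already congested or broken.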


LAN vs WAN testing
• Performance on the WAN may not be satisfactory, due to the extra latency
  • … may have to reconsider the design
• Incorrect code is more likely to be triggered on the WAN
  • … assumptions on volume/rate of arriving data


HTTP

Request:
  method | URL or pathname                | HTTP version | headers | message body
  GET    | //www.dcs.qmw.ac.uk/index.html | HTTP/1.1     |         |

Reply:
  HTTP version | status code | reason | headers | message body
  HTTP/1.1     | 200         | OK     |         | resource data

• Resource := MIME-encoded data
• Content negotiation
• Authentication

Methods:
• GET, HEAD, POST
• PUT, DELETE, TRACE, OPTIONS
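The request and reply formats can be made concrete with two small helpers. Both function names are hypothetical. The request is written in the common HTTP/1.1 form, with a pathname on the request line and the host in a Host header; this is an assumption of the sketch rather than the exact form shown on the slide.

```python
def build_request(method: str, path: str, host: str) -> bytes:
    """Request line, headers, and an empty line ending the header section."""
    lines = [
        f"{method} {path} HTTP/1.1",   # method, pathname, HTTP version
        f"Host: {host}",               # header required by HTTP/1.1
        "Connection: close",
        "",                            # blank line: end of headers
        "",                            # (no message body for GET)
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_status_line(line: str):
    """Split a reply's first line into HTTP version, status code and reason."""
    version, code, reason = line.split(" ", 2)
    return version, int(code), reason
```

For instance, parse_status_line("HTTP/1.1 200 OK") yields the three fields shown in the reply format: version, status code and reason.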


Tools (I)
• ping
  • IP header + ICMP echo request/reply
• tcpdump
  • Network analyzer – “sniffer”
• traceroute
  • Determine the network path by forcing each intermediate router to send an ICMP error message to the originator
  • Send a UDP datagram with TTL=1, so that the 1st router in the path will discard it !
  • Send a 2nd UDP datagram with TTL=2, so that the 2nd router in the path will discard it ! …
  • At the last hop, TTL=1 & an attempt is made to deliver the datagram (generating the ICMP error message “port unreachable”)


Tools (II)
• ttcp
  • Benchmarking tool, with many parameters
  • UDP or TCP transfers, buffers, size of reads/writes
• lsof
  • Determine which process has a “file descriptor” open (file or socket)
  • lsof -i TCP:6000
  • lsof -i @remotehost.xdomain.net
• netstat
  • Active sockets: netstat -af inet
  • Interfaces: netstat -i
  • Routing table: netstat -rn
  • Protocol statistics: netstat -sp tcp
• System call tracers: strace, truss, ktrace


Resources
• Books:
  • Richard Stevens:
    • TCP/IP Illustrated series: Protocols, Implementation, T/TCP/HTTP/NNTP/Domain Sockets
    • UNIX Network Programming series: Networking APIs (Sockets, XTI), Interprocess Communication
  • J.C. Snader: “Effective TCP/IP Programming”
• RFCs: http://www.rfc-editor.org