
Virtual Network Transport Protocols for Myrinet

Brent N. Chun, Alan M. Mainwaring, and David E. Culler

University of California at Berkeley

Computer Science Division

{bnc,alanm,culler}@cs.berkeley.edu

Abstract

This paper describes a protocol for a general-purpose cluster communication system that supports multiprogramming with virtual networks, direct and protected network access, reliable message delivery using message timeouts and retransmissions, a powerful return-to-sender error model for applications, and automatic network mapping. The protocols use simple, low-cost mechanisms that exploit properties of our interconnect without limiting flexibility, usability, or robustness. We have implemented the protocols in an active message communication system that runs on a network of 100+ Sun UltraSPARC workstations interconnected with 40 Myrinet switches. A progression of microbenchmarks demonstrates good performance (42 microsecond round-trip times and 31 MB/s node-to-node bandwidth) as well as scalability under heavy load and graceful performance degradation in the presence of high contention.

1 Introduction

With microsecond switch latencies, gigabytes per second of scalable bandwidth, and low transmission error rates, cluster interconnection networks such as Myrinet [BCF+95] can provide substantially more performance than conventional local area networks. These properties stand in marked contrast to the network environments for which traditional network and internetwork protocols were designed. By exploiting these features, previous efforts in fast communication systems produced a number of portable communication interfaces and implementations. For example, Generic Active Messages (GAM) [CLM+95], Illinois Fast Messages (FM) [PKC97, PLC95], the Real World Computing Partnership's PM [THI96], and BIP [PT97] provide fast communication layers. By constraining and specializing communication layers for an environment, for example by only supporting single-program multiple-data parallel programs or by assuming a perfect, reliable network, these systems achieved high performance, oftentimes on par with massively parallel processors.

(This research is supported in part by ARPA grant F30602-95-C-0014, generous contributions from Sun Microsystems, Inc., the California State Micro Program, Professor David E. Culler's NSF Presidential Faculty Fellowship CCR-9253705, NSF Infrastructure Grant CDA-8722788, an NSF Graduate Research Fellowship, and a National Semiconductor Corporation Graduate Research Fellowship.)

Bringing this body of work into the mainstream requires more general-purpose and more robust communication protocols than those used to date. The communication interfaces should support client-server, parallel, and distributed applications in a multi-threaded and multiprogrammed environment. Implementations should use process scheduling as an optimization technique rather than as a requirement for correctness. In a time-shared system, implementations should provide protection and the direct application access to network resources that is critical for high performance. Finally, the protocols that enable these systems should provide reliable message delivery, automatically handle non-catastrophic network errors, and support automatic network management tasks such as topology acquisition and route distribution.

Section 2 presents a core set of requirements for our cluster protocol and states our specific assumptions. Section 3 presents an overview of our system architecture and briefly describes the four layers of our communication system. Then, in Section 4, we examine the issues and design decisions for our protocols, realized in our system in network interface card (NIC) firmware. Section 5 analyzes performance results for several challenging microbenchmarks. We finish with related work and conclusions.

2 Requirements

Our cluster protocol must support multiprogramming, direct access to the network for all applications, protection from errant programs in the system, reliable message delivery with respect to buffer overruns as well as dropped or corrupted packets, and mechanisms for automatically discovering the network's topology and distributing valid routes. Multiprogramming is essential for clusters to become more than personal supercomputers. The communication system must provide protection between applications and isolate their respective traffic. Performance requires direct network access, bypassing the operating system for all common-case operations. The system should be resilient to transient network errors and faults (programmers ought not be bothered with transient problems that retransmission or other mechanisms can solve), but catastrophic problems require handling at the higher layers. Finally, the system should support automatic network management, including the periodic discovery of the network's topology and distribution of mutually deadlock-free routes between all pairs of functioning network interfaces.

Our protocol architecture makes a number of assumptions about the interconnect and the system. First, it assumes that the interconnect has network latencies on the order of a microsecond, has link bandwidths of a gigabit per second or more, and is relatively error-free. Second, the interconnect and host interfaces are homogeneous, and the problem of interest is communication within a single cluster network, not a cluster internet. System homogeneity eliminates a number of issues, such as handling different network maximum transmission units and packet formats and probing for network operating parameters (e.g., as done by TCP slow-start), and it guarantees that the network fabric and the protocols used between its network interfaces are identical. This does not preclude the use of heterogeneous hosts at the endpoints, such as hosts with different endianness. Lastly, the maximum number of nodes attached to the cluster interconnect is limited. This enables trading memory resources proportional to the number of network interfaces (NICs) in exchange for reduced computational costs on critical code paths. (In our system, we limit the maximum number of NICs to 256, though it would be straightforward to change the compile-time constants and scale to a few thousand.)

3 Architecture

Our system has four layers: (1) an active message application programming interface, (2) a virtual network system that abstracts network interfaces and communication resources, (3) firmware executing on an embedded processor on the network interface, and (4) processor and interconnect hardware. This section presents a brief overview of each layer and highlights important properties relevant to the NIC-to-NIC transport protocols described thoroughly in Section 4.

3.1 AM-II API

The Active Messages 2.0 API (AM-II) [MC96] provides applications with the interface to the communication system. It allows an arbitrary number of applications to create multiple communication endpoints used to send and receive messages through a procedural interface to active message primitives. Three message types are supported: short messages containing 4- to 8-word payloads, medium messages carrying a minimum of 256 bytes, and bulk messages providing large memory-to-memory transfers. Medium and bulk message data can be sent from anywhere in a sender's address space. The communication layer provides pageable storage for receiving medium messages. Upon receiving a medium message, its active message handler is passed a pointer to the storage and can operate directly on the data. Bulk message data are deposited into per-endpoint virtual memory regions. These regions can be located anywhere in a receiver's address space; receivers identify them with a base address and length. Applications can set and clear event masks to control whether semaphores associated with endpoints are posted when a message arrives into an empty receive queue. By setting the mask and waiting on the semaphore, multi-threaded applications have the option of processing messages in an event-driven way.
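For concreteness, the following sketch shows the shape of the sending-side calls for the three message classes. The declarations are illustrative placeholders, not the actual AM-II signatures (which are defined in [MC96]):

    /* Illustrative sketch of the three AM-II message classes; these
     * declarations are invented for exposition. */
    typedef struct am_endpoint ep_t;
    enum { H_PING = 1 };                     /* hypothetical handler index */

    /* Short: 4- to 8-word payload, written with programmed I/O. */
    extern int am_request_short(ep_t *ep, int dst, int handler,
                                long a0, long a1, long a2, long a3);

    /* Medium: at least 256 bytes, staged through a per-endpoint
     * staging area; the receiver's handler gets a pointer to it. */
    extern int am_request_medium(ep_t *ep, int dst, int handler,
                                 void *buf, int nbytes);

    /* Bulk: memory-to-memory transfer into a virtual memory region
     * the receiver registered with a base address and length. */
    extern int am_request_bulk(ep_t *ep, int dst, int handler,
                               void *buf, int nbytes, long dst_offset);

    void example(ep_t *ep)
    {
        long payload[1024] = {0};
        am_request_short(ep, 3, H_PING, 1, 2, 3, 4);
        am_request_medium(ep, 3, H_PING, payload, (int)sizeof payload);
    }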


Figure 1: Processor/NIC node.

Isolating message traffic for unrelated applications is done using per-endpoint message tags specified by the application. Each outgoing message contains a message tag for its destination endpoint. Messages are delivered only if the tag in the message matches the tag of the destination endpoint. The AM-II API provides an integrated return-to-sender error model for both application-level errors, such as non-matching tags, and catastrophic network failures, such as losing connectivity with a remote endpoint. Any message that cannot be delivered to its destination is returned to its sender. Applications can register per-endpoint error handlers to process undeliverable messages and to implement recovery procedures if so desired. If the system returns a message to an application, simply retransmitting the message is highly unlikely to succeed.
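A sketch of how an application might use this model follows; the registration call and handler signature here are assumptions for illustration, not the actual AM-II API:

    #include <stdio.h>

    typedef struct am_endpoint ep_t;
    typedef void (*am_err_fn)(ep_t *ep, void *msg, int nbytes);

    /* Hypothetical registration call; the real interface is in [MC96]. */
    extern void am_set_error_handler(ep_t *ep, am_err_fn fn);

    /* Invoked by the library with each message returned to the sender. */
    static void on_undeliverable(ep_t *ep, void *msg, int nbytes)
    {
        /* Blind retransmission would almost certainly fail again (the
         * tag no longer matches, or the NIC is unreachable), so report
         * the failure and begin application-level recovery instead. */
        fprintf(stderr, "%d-byte message returned to sender\n", nbytes);
    }

    void install_handler(ep_t *ep)
    {
        am_set_error_handler(ep, on_undeliverable);
    }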

3.2 Virtual Networks

Virtual networks are collections of endpoints with mutual addressability and the tags necessary for communication. While AM-II provides an abstract view of endpoints as virtualized network interfaces, virtual networks view collections of endpoints as virtualized interconnects. There is a one-to-one correspondence between AM-II endpoints and virtual network endpoints.

The virtual networks layer provides direct network access via endpoints, protection between unrelated applications, and on-demand binding of endpoints to physical communication resources. Figure 1 illustrates this idea. Applications create one or more communication endpoints using API functions that call the virtual network segment driver to create endpoint address space segments.

Figure 2: Data paths for sending and receiving short, medium, and bulk active messages. Short messages are transferred using programmed I/O directly on endpoints in NIC memory. Medium messages are sent and received using per-endpoint medium message staging areas in the pageable kernel heap that are mapped into a process's address space. A medium message is a single-copy operation at the sending host and a zero-copy operation at the receiving host. Bulk memory transfers, currently built using medium messages, are single-copy operations on the sender and single-copy operations on the receiver.

Pages of network interface memory provide the backing store for active endpoints, whereas host memory acts as the backing store for less active endpoints evicted from the on-NIC endpoint "cache". Endpoints are mapped into a process's address space, where they are directly accessed by both the application and the network interface, thus bypassing the operating system. Because endpoint management uses standard virtual memory mechanisms, it leverages the inter-process protection enforced between all processes running on a system.

Applications may create more endpoints than the NIC can accommodate in its local memory. Provided that applications exhibit bursty communication behavior, only a small fraction of these endpoints may be active at any time. Our virtual network system takes advantage of this when virtualizing the physical interface resources. Specifically, on our Myrinet system, it uses NIC memory as a cache of active endpoints and pages endpoints on and off the NIC on demand, much like virtual memory systems do with memory pages and frames. Analogous to page faults, endpoint faults can occur when either an application writes a message into a non-resident endpoint or a message arrives for a non-resident endpoint. Endpoint faults also occur whenever messages (sent or received) reference host memory resources (the medium message staging area, arbitrary user-specified virtual memory regions for sending messages, or the endpoint virtual memory segment for receiving messages) that are not pinned or for which there are no current DMA mappings. Network interface virtualization, including endpoint cache management and the paging of endpoints, is handled by a custom virtual network segment driver.
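The fault path can be sketched as follows; the structure and helper names are invented for illustration and are not the actual segment-driver code:

    /* Sketch of the endpoint-cache fault path, assuming invented
     * helpers; the real driver logic is more involved. */
    enum residency { ON_NIC, IN_HOST_MEMORY };

    struct endpoint_slot {
        enum residency where;
        int nic_frame;                      /* frame index when resident */
    };

    extern int  choose_victim_frame(void);        /* e.g. least recently active */
    extern void evict_to_host_memory(int frame);  /* write victim back to host */
    extern void copy_endpoint_to_nic(struct endpoint_slot *ep, int frame);
    extern void remap_user_pages(struct endpoint_slot *ep, int frame);

    /* Called when a host store to, or a packet arrival for, a
     * non-resident endpoint faults. */
    void endpoint_fault(struct endpoint_slot *ep)
    {
        if (ep->where == ON_NIC)
            return;                         /* lost a race with another fault */
        int frame = choose_victim_frame();
        evict_to_host_memory(frame);        /* page the victim endpoint out */
        copy_endpoint_to_nic(ep, frame);    /* page the faulting endpoint in */
        remap_user_pages(ep, frame);        /* redirect the user mappings */
        ep->where = ON_NIC;
        ep->nic_frame = frame;
    }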

3.3 NIC Firmware

The firmware implements a protocol that provides reliable and unduplicated message delivery between NICs. The protocols must address four core issues: the scheduling of outgoing traffic from a set of resident endpoints, NIC-to-NIC flow control mechanisms and policies, timer management to schedule and perform packet retransmissions, and detecting and recovering from errors. Details on the NIC protocols are given in Section 4.

The protocols implemented in firmware determine the structure of an endpoint. Each endpoint has four message queues: request send, reply send, request receive, and reply receive. Each queue entry holds an active message. Short messages are transferred directly into resident endpoint memory using programmed I/O. Medium and bulk messages use programmed I/O for the active message portion and DMA for the associated bulk data transfer. Figure 2 illustrates the data flows for short, medium, and bulk messages through the interface. Medium messages require one copy on the sender and zero copies on the receiver. (Bulk messages, currently implemented using medium messages, require one copy on the sender and one copy on the receiver. The code for zero-copy bulk transfers exists but has not been sufficiently tested.)
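The queue structure implies an endpoint layout along the following lines; the entry format and depth shown here are invented for illustration (the depth of 16 anticipates the 4 x K_user sizing of Section 4.2.4):

    #include <stdint.h>

    #define QDEPTH 16                     /* 4 * K_user; see Section 4.2.4 */

    /* Illustrative queue entry: one active message. */
    typedef struct {
        uint32_t handler;                 /* active message handler index */
        uint32_t tag;                     /* destination endpoint tag */
        uint32_t payload[8];              /* up to 8-word short payload */
    } am_entry_t;

    /* Illustrative endpoint: the four queues the firmware expects. */
    typedef struct {
        am_entry_t req_send[QDEPTH];
        am_entry_t rep_send[QDEPTH];
        am_entry_t req_recv[QDEPTH];
        am_entry_t rep_recv[QDEPTH];
    } endpoint_t;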


Figure 3: Berkeley NOW network topology as discovered by the mapper. The network mapping daemons periodically explore and discover the network's current topology, in this case a fat-tree-like network with 40 Myrinet switches. The three sub-clusters are currently connected through two switches using only 11 cables.

3.4 Hardware

The system hardware consists of 100+ 167-MHz Sun UltraSPARC workstations interconnected with Myrinet (Figure 3) [BCF+95], a high-speed local area network with cut-through routing and link-level back-pressure. The network uses 40 8-port crossbar switches with 160 MB/s full-duplex links. Each host contains a LANai 4.1 network interface card on the SBus. Each NIC contains a 37.5 MHz embedded processor, 256 KB of SRAM, and a single host SBus DMA engine but independent network send and receive DMA engines.

4 NIC Protocols

This section describes a set of lightweight NIC protocols that support the AM-II communications API and virtual networks and also provide basic transport protocol functionality. The protocols are implemented as firmware running on Myricom LANai 4.1 cards.

4.1 Endpoint Scheduler

Because our system supports both direct network access and multiprogramming, the NIC has a new task of endpoint scheduling, i.e., sending messages from the current set of cached endpoints. This situation differs from that of traditional protocol stacks, such as TCP/IP, where messages from applications pass through layers of protocol processing and multiplexing before ever reaching the network interface, so the NIC services shared outbound (and inbound) message queues. With virtual networks, the queues are differentiated.

The endpoint scheduling policy chooses how long to service any one endpoint and which endpoint to service next. A simple round-robin algorithm that gives each endpoint equal but minimal service time is fair and starvation-free. If all endpoints always have messages waiting to send, this algorithm might be satisfactory. However, if application communication is bursty [LTW+94], spending equal time on each resident endpoint is not optimal. Better strategies exist which minimize the use of critical NIC resources on examining empty queues.

The endpoint scheduling policy must balance optimizing the throughput and responsiveness of a particular endpoint against aggregate throughput and response time. Our current algorithm uses a weighted round-robin policy that focuses resources on active endpoints. Empty endpoints are skipped. For an endpoint with pending messages, the NIC makes 2^k attempts to send, for some parameter k. This holds even after the NIC empties a particular endpoint: it loiters should the host enqueue additional messages. Loitering also allows the firmware to cache state, such as packet headers and constants, while sending messages from an endpoint, lowering per-packet overheads. While larger k's result in better performance during bursts, too large a k degrades system responsiveness with multiple active endpoints. Empirically, we have chosen a k of 8.
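A minimal sketch of this policy follows, reading the parameter above as 2^k attempts (so k = 8 gives 256 attempts per visit) and using invented helper names:

    #define K_PARAM  8
    #define ATTEMPTS (1 << K_PARAM)       /* 2^k send attempts per visit */

    extern int num_resident;              /* endpoints cached on the NIC */
    extern int has_pending(int ep);       /* any message queued to send? */
    extern int try_send_packet(int ep);   /* 1 if a packet went out */

    void endpoint_scheduler(void)
    {
        int ep = 0;
        for (;;) {                        /* firmware main loop */
            ep = (ep + 1) % num_resident;
            if (!has_pending(ep))
                continue;                 /* skip empty endpoints */
            /* Loiter on the endpoint for up to 2^k attempts, reusing
             * cached headers and constants; keep going even if the
             * queue momentarily empties and the host refills it. */
            for (int n = 0; n < ATTEMPTS; n++)
                try_send_packet(ep);
        }
    }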

4.2 Lightweight Flow Control

In our system, a flow control mechanism has two requirements. On one hand, it should allow an adequate number of unacknowledged messages to be in flight in order to fill the communication pipe between a sender and a receiver. On the other, it should limit the number of outstanding messages and manage receiver buffering to make buffer overruns infrequent. In steady state, a sender should never wait for an acknowledgment in order to send more data. Assuming the destination process is scheduled and attentive to the network, given a bandwidth B and a round-trip time RTT, this requires allowing at least B · RTT bytes of outstanding data.

Our system addresses flow control at three levels: (1) user-level active message credits for each endpoint, (2) NIC-level stop-and-wait flow control over multiple, independent logical channels, and (3) network back-pressure.

4.2.1 User-level Credits

The user-level credits rely upon the request-reply nature of AM-II, allowing each endpoint to have at most K_user outstanding requests waiting for responses. By choosing K_user large enough, endpoint-to-endpoint communication proceeds at the maximum rate. To prevent receive buffer overflow, endpoint request receive queues are large enough to accommodate several senders transmitting at full speed. Because senders have at most a small number, K_user, of outstanding requests, setting the request receive queue to a small multiple of K_user is feasible. Additional mechanisms, discussed shortly, engage when overruns do occur.

In our protocol, with 8 KB packets the bandwidth-delay product is 31 MB/s · 349 us = 11,345 bytes, less than two 8 KB messages. For short packets the bandwidth-delay product is 62,578 msgs/s · 42 us = 2.63 messages. To provide slack at the receiver and to optimize arithmetic computations, K_user is rounded to 4. The NIC must provide at least this number of logical channels to accommodate this number of outstanding messages, as discussed next.
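The credit accounting itself is simple enough to sketch in a few lines; the names are invented, and the real bookkeeping lives in the AM-II library:

    #define K_USER 4                      /* outstanding requests allowed */

    struct credit_state { int credits; }; /* initialized to K_USER */

    /* Returns 1 if the request may be sent now; 0 means the sender
     * must wait until some earlier request is answered. */
    int try_request(struct credit_state *c)
    {
        if (c->credits == 0)
            return 0;
        c->credits--;                     /* consumed when a request departs */
        return 1;
    }

    void on_reply(struct credit_state *c)
    {
        c->credits++;                     /* restored when its reply arrives */
    }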

4.2.2 Stop-and-wait Over Logical Channels

To keep the communication pipe full in steady state, the NIC, which is responsible for timeout/retry, must allow at least K_user outstanding messages, since K_user is chosen to match the bandwidth-delay product of the network. It accomplishes this by overlaying multiple, independent logical channels (K_user channels for requests and K_user channels for replies) over each physical route to each destination NIC. K_user logical channels for both requests and replies ensure that neither the sender nor the receiver is a bottleneck in steady state with respect to outstanding data. Having separate request and reply logical channels is necessary in order to prevent deadlock.


Figure 4: NIC channel tables. The NIC channel tables provide easy access to NIC flow control, timeout/retry, and error detection information. The NIC uses stop-and-wait flow control on each channel and manages communication state information in channel table entries. In the send table (left), each entry includes timeout/retry information (packet timestamp, pointer to an unacknowledged packet, number of retries with no receiver feedback), sequencing information (next sequence number to use), and whether the entry is in use or not. In the receive table (right), each entry contains sequencing information for incoming packets (expected sequence number). Half of the channels are reserved for requests; the other half are reserved for replies. This is necessary in order to prevent deadlock.

Two simple data structures manage NIC-to-NIC flow control information. These data structures also record timeout/retry and error detection information. Each row of the send channel control table in Figure 4 holds the states of all channels to a particular destination interface. Each intersecting column holds the state for a particular logical channel. This implicit bound on the number of outstanding messages enables implementations to trade storage for reduced arithmetic and address computation. Two simple and easily-addressable data structures with O(#NICs · #channels) entries are sufficient.
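In C, the two tables might look like the following; the field widths are illustrative, and the eight-channel count reflects the 2 · K_user = 8 request and reply channels per destination described above:

    #include <stdint.h>

    #define NUM_NICS 256        /* compile-time limit from Section 2 */
    #define CHANNELS 8          /* half for requests, half for replies */

    struct send_entry {
        uint32_t timestamp;     /* when the unacked packet was sent */
        void    *unacked_pkt;   /* copy kept for retransmission */
        uint8_t  retries;       /* sends with no receiver feedback */
        uint8_t  seqno;         /* next sequence number (one bit used) */
        uint8_t  in_use;        /* channel holds an unacked packet */
    };

    struct recv_entry {
        uint8_t  expected_seqno; /* for duplicate detection */
    };

    /* O(#NICs · #channels) storage, indexed in constant time by
     * (destination NIC, channel). */
    struct send_entry send_tbl[NUM_NICS][CHANNELS];
    struct recv_entry recv_tbl[NUM_NICS][CHANNELS];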

4.2.3 Link-level Back-pressure

Link-level back-pressure ensures that under heavy load the network does not drop packets. Credit-based flow control in the AM-II library throttles individual senders but cannot prevent high contention for a common receiver. With link-level back-pressure, end-to-end flow control remains effective and its overheads remain small. This trades network utilization under load (allowing packets to block and to consume link and switch resources) for simplicity. Section 5 shows that this hybrid scheme performs very well.

4.2.4 Receiver Buffering

Some fast communication layers prevent buffer overruns by dedicating enough receiver buffer space to accommodate all messages potentially in flight. With P processors, credits for K outstanding messages, and a single endpoint per host, this requires O(K · P) storage. Small-scale systems with one endpoint made allocating O(K · P) storage practical. However, a large-scale system with a large number of communication endpoints requires O(K · E) storage, where E is the number of endpoints in a virtual network. This has serious scaling and storage utilization problems that make pre-allocation approaches impractical, as the storage grows proportionally to virtualized resources and not physical ones. Furthermore, with negligible packet retransmission costs, alternative approaches involving modest pre-allocated buffers and packet retransmission become practical.

We provide request and response receive queues, each with 16 entries (4 · K_user), for each endpoint. These are sufficient to absorb load from up to four senders transmitting at their maximum rates. When buffer overflow occurs, the protocol drops packets and NACKs senders. The firmware automatically retransmits such messages. An important consequence of sizing request and reply queues to be 4 · K_user entries deep is that our virtual network segment driver can use a single virtual memory page per endpoint, simplifying its memory management activities.

4.3 Timeout and Retry

To guarantee at-most-once delivery semantics and address transient hardware errors (e.g., CRC errors and truncated packets), a communication system must perform timeout and retransmission of packets. The timeout/retry algorithm determines how packet retransmission events are scheduled, how they are deleted, and how retransmission is performed. Sending a packet schedules a timer event, receiving an acknowledgment deletes the event, and all send table entries are periodically scanned for packets to retransmit. The per-packet timeout/retry costs must be small. This requires that the costs of scheduling a retransmission event on each send operation and deleting a retransmission event on an acknowledgment reception be negligible. Depending on the granularity of the timeout quantum and the frequency of timeout events, different trade-offs exist that shift costs between per-packet operations and retransmissions. For example, we use a larger timer quantum and low per-packet costs at the price of more expensive retransmissions. Section 5 shows that this scheme has near-zero amortized cost for workloads where packets are not retransmitted.

Our transport protocol implements timeout and retry with positive acknowledgments in the interface firmware. This provides efficient acknowledgments and minimizes expensive SBus transactions. (We currently do not perform the obvious piggybacking of ACKs and NACKs on active message reply messages.) Channel management tables store timeout and transmission state. Sending a packet involves reading a sequence number from the appropriate entry in the send table, indexed by the destination NIC and a free channel, saving a pointer to the packet for potential retransmissions, and recording the time the packet was sent. The receiving NIC then looks up sequencing information for the incoming packet in the appropriate receive table entry, indexed by the sending NIC's id and the channel on which the message was sent. If the sequencing information matches, the receiver sends an acknowledgment to the sender. Upon its receipt, the sender updates its sequencing information and frees the channel for use by a new packet.

By using simple and easily-addressable data structures, each with O(#NICs · #channels) entries, scheduling and deleting packet retransmission events take constant time. For retransmissions, though, the NIC performs O(#NICs · #channels) work. Maintaining unacknowledged packet counts for each destination NIC reduces this cost significantly. Sending a packet increments a counter for the packet's destination NIC, and receiving the associated acknowledgment decrements the counter. These counts reduce retransmission overheads to be proportional to the total number of network interfaces.
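Putting the last two paragraphs together, the send, acknowledgment, and scan paths might look as follows, reusing the illustrative tables sketched in Section 4.2.2 (all helper names are invented):

    #include <stdint.h>

    #define NUM_NICS 256
    #define CHANNELS 8

    extern struct send_entry {
        uint32_t timestamp;
        void    *unacked_pkt;
        uint8_t  retries, seqno, in_use;
    } send_tbl[NUM_NICS][CHANNELS];

    extern uint16_t unacked[NUM_NICS];    /* outstanding packets per NIC */
    extern uint32_t nic_now(void);        /* coarse-grained NIC timer */
    extern void     net_dma_send(int nic, int ch, void *pkt);

    /* Constant-time event scheduling: remember the packet and when it
     * left, and bump the per-destination counter. */
    void send_on_channel(int dst, int ch, void *pkt)
    {
        struct send_entry *e = &send_tbl[dst][ch];
        e->unacked_pkt = pkt;
        e->timestamp   = nic_now();
        e->in_use      = 1;
        unacked[dst]++;
        net_dma_send(dst, ch, pkt);
    }

    /* Constant-time event deletion on acknowledgment receipt. */
    void handle_ack(int dst, int ch)
    {
        struct send_entry *e = &send_tbl[dst][ch];
        e->seqno  ^= 1;                   /* advance the alternating bit */
        e->in_use  = 0;                   /* channel free for a new packet */
        e->retries = 0;
        unacked[dst]--;
    }

    /* Periodic scan; the counters let the NIC skip destinations with
     * nothing outstanding. */
    void retransmit_scan(uint32_t timeout_ticks)
    {
        for (int nic = 0; nic < NUM_NICS; nic++) {
            if (unacked[nic] == 0)
                continue;
            for (int ch = 0; ch < CHANNELS; ch++) {
                struct send_entry *e = &send_tbl[nic][ch];
                if (e->in_use && nic_now() - e->timestamp > timeout_ticks) {
                    e->retries++;         /* 255 retries => return-to-sender */
                    e->timestamp = nic_now();
                    net_dma_send(nic, ch, e->unacked_pkt);
                }
            }
        }
    }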

4.3.1 Unavailable Resources

Virtual networks introduce new issues for at-most-once delivery semantics in the presence of hardware errors. Because endpoints may be non-resident, or may not have DMA resources (e.g., medium message staging areas) set up, a packet may need to be retried because of unavailable resources.

Our system sends NACKs to senders when destination endpoint resources are unavailable. Upon receiving such a NACK, a sender notes that the receiving interface is still reachable. The packet will be retried by the same timeout/retry mechanism used to deal with hardware errors. Like hardware errors, resource unavailability is expected to be infrequent. Therefore, using the same timeout/retry mechanism, which may use coarse-grained timeouts, should add very little overhead.

4.4 Error Handling

Our system addresses packet delivery problems at three levels: the NIC-to-NIC transport protocols, the AM-II API return-to-sender error model, and the user-level network management daemons. The transport protocols are the building blocks on which the higher-level API error model and the network management daemons depend. The transport protocols handle transient network errors by detecting and dropping each erroneous packet and relying upon timeouts and retransmissions for recovery. After 255 retransmissions for which no ACKs or NACKs were received, the protocol declares a message undeliverable and returns it to the AM-II layer. (The timeout/retransmission mechanisms require that sending interfaces keep a copy of each unacknowledged message anyway.) The AM-II library invokes a per-endpoint error handler function so that applications may take appropriate recovery actions.

4.4.1 Transient Errors

Positive acknowledgment with timeout and retransmission ensures the delivery of packets with valid routes. Not only can data packets be dropped or corrupted, but protocol control messages can be as well. To ensure that data packets are never delivered more than once to a destination despite retransmissions, they are tagged with sequence numbers and timestamps. With a maximum of 2^k outstanding messages, detecting duplicates requires 2^(k+1) sequence numbers. For our alternating-bit protocol on independent logical channels, k = 0.
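For k = 0 this reduces to a single expected bit per channel, as the following sketch shows (the data layout is illustrative):

    #include <stdint.h>

    struct chan_recv { uint8_t expected; };   /* alternating bit: 0 or 1 */

    /* Returns 1 if the packet is new and should be delivered; 0 if it
     * is a duplicate of an already-accepted packet, in which case the
     * receiver delivers nothing but still acknowledges it. */
    int accept_packet(struct chan_recv *c, uint8_t seq_bit)
    {
        if (seq_bit != c->expected)
            return 0;                 /* retransmission of an acked packet */
        c->expected ^= 1;             /* flip for the next packet */
        return 1;
    }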

4.4.2 Return-to-Sender

The NIC determines that destination endpoints are unreachable by relying upon its timeout and retransmission mechanisms. If after 255 retries (i.e., several seconds) the NIC receives no ACKs or NACKs from the receiver, the protocol deems the destination endpoint unreachable. When this happens, the protocol marks the sequence number of the channel as uninitialized and returns the original message back to user level via the endpoint's reply receive queue. The application handles an undeliverable message as it would any other active message, with a user-specifiable handler function. Should no route to a destination NIC exist, all of its endpoints are trivially unreachable.

4.4.3 Network Management Errors

The system uses privileged mapper daemons, one for each interface on each node of the system, to probe and discover the current network topology. Given the current topology, the daemons elect a leader that derives and distributes a set of mutually deadlock-free routes to all NICs in the system [MCS+97]. Discovering the topology of a source-routed, cut-through network with anonymous switches like Myrinet requires the use of network probe packets that may potentially deadlock on themselves or on other messages in the network. Hence, online mapping daemons can cause truncated and corrupted packets to be received by interfaces (as a result of switch hardware detecting and breaking deadlocks) even when the hardware is working perfectly. From the transport protocol's perspective, mapper daemons perform two specialized functions: (1) sending and receiving probe packets with application-specified source-based routes to discover links, switches, and hosts, and (2) reading and writing entries in NIC routing tables. These special functions can be performed using privileged endpoints available to privileged processes.

5 Performance Results

This section presents performance measurements and analysis. The first microbenchmarks characterize the system using the LogP communication model and lead to a comparison with a previous generation of an active message system and to an understanding of the costs of the added functionality. The next benchmarks examine performance between hosts under varying degrees of destination endpoint contention. The section concludes with an examination of system performance as the number of active virtual networks increases. All programs were run on the Berkeley Network of Workstations system in a stand-alone environment. Topology acquisition and routing daemons were disabled, eliminating background communication activity normally present.

5.1 LogP Characterization

The LogP communication model uses four parameters to characterize the performance of communication layers. This parameterization enables the fair comparison of different communication layers. The microbenchmarks of [CLM+95] automatically derive the model parameters L (latency), overhead (o), and gap (g); the number of processors (P) is given. The overhead has two components, the sending overhead (Os) and the receiving overhead (Or). These measure the host processor time spent writing a message to and reading a message from an endpoint, respectively. The gap measures the time through the rate-limiting stage in the system, and the latency is the remaining time unaccounted for by the overheads.

[Figure 5 chart data, in microseconds:]

             Os     Or      g      L     RTT
    GAM     1.90   4.00   5.80   5.50   21.00
    AM-II   4.09   4.28  15.98  12.60   41.94

Figure 5: Performance characterization using the LogP model. LogP parameters for two active message systems on identical hardware: AM-II, our new general-purpose active message system with virtual networks and the return-to-sender error model, and GAM, an earlier active message system for SPMD parallel programs without virtual networks, an error model, and other features.

Figure 5 shows the LogP parameters for AM-II and GAM. It shows the measured AM-II round-trip time of 41.94 microseconds as compared with GAM's round-trip time of 21 microseconds. Of the 17.37 microsecond one-way time, the system spends 5 microseconds writing the message into the sender's endpoint and 3.3 microseconds reading the message from the receiver's endpoint. The two network interfaces spend 13.82 microseconds transmitting the data message as well as transmitting its acknowledgment. For AM-II, the gap is larger than (Os + Or) because the network interface firmware limits the message rate. For GAM, the gap is smaller than its (Os + Or), indicating that the active message library code executing on the host processors limits the message rate.

Although in both cases the microbenchmarks use active messages with 4-word payloads, the AM-II send overhead is larger because additional information, such as a tag, is stored to the network interface across the SBus. The AM-II gap is also larger because the firmware constructs a private header for each message, untouchable by any application, that is sent using a separate DMA operation. This requires additional firmware instructions and memory accesses.
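As a consistency check, assuming the usual LogP accounting in which a one-way message costs Os + L + Or, the AM-II parameters in Figure 5 reproduce the measured round-trip time exactly:

    RTT = 2 (Os + L + Or) = 2 (4.09 + 12.60 + 4.28) us = 41.94 us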


Figure 6: Sending bandwidth as a function of message size in bytes. Consistent host-to-NIC DMA operations across the SBus have higher performance for small transfers. Streaming transfers obtain higher performance once the data transfer times swamp the cost of flushing a hardware stream buffer in the SBus bridge chip.

5.2 Contention-Free Performance

Figure 6 shows the endpoint-to-endpoint bandwidth between two machines. Because the NIC can only DMA messages between the network and its local memory, a store-and-forward delay is introduced for large messages moving data between host memory and the interface. Although the current network interface firmware does not pipeline bulk data transfers to eliminate this delay, streaming transfers nevertheless reach 31 MB/s with 4 KB messages. (With GAM, pipelining of DMA operations to receive messages from the network with DMA operations to move buffers to host memory increased bulk transfer performance to 38 MB/s.)

    Perm        Avg BW        Agg BW       Avg RTT
    Cshift      25.42 MB/s    2.33 GB/s    67.8 us
    Neighbor    30.97 MB/s    2.85 GB/s    47.5 us
    Bisection    5.65 MB/s    0.52 GB/s    50.8 us

Table 1: Aggregate bandwidth and average round-trip times for 92 nodes with different message permutations. In the cshift permutation, each node sends requests to its right neighbor and replies to requests received from its left neighbor. With neighbor, adjacent nodes perform pairwise exchanges. In bisection, pairs of nodes separated by the network bisection perform pairwise exchanges. Bandwidth measurements used medium messages, whereas RTT measurements used 4-word active messages. (More recent work has considerably improved upon this performance. See http://now.cs.berkeley.edu)

Table 1 presents three permutations and their resulting average per-host sending bandwidths, aggregate sending bandwidths, and per-message round-trip times when run on 92 machines of the NOW. Each column shows that the bandwidth scales as the system reaches a non-trivial size. The first two permutations, circular shift and neighbor exchange, are communication patterns with substantial network locality. As expected, these cases perform well, with bandwidths near their peaks and per-message round-trip times within a factor of 2 of optimal. The bisection exchange pattern shows that a large number of machines can saturate the network bisection bandwidth. Refer to Figure 3 to see the network topology and the small number of bisection cables.

5.3 Single Virtual Network

The next three figures show the performance of the communication subsystem in the presence of contention, specifically when all hosts send to a common destination host. All traffic destined for the common host is also destined for the same endpoint. For reasons that will become clear, the host with the common destination is referred to as the server and all other hosts are referred to as the clients.

Figure 7 shows the aggregate message rate of the server (top line) as the number of clients sending it 4-word requests and receiving 4-word response messages increases.



Figure 7: Active Message rates with destinationendpoint contention within a single virtual network.


Figure 8: Delivered bandwidths with destinationendpoint contention within a single virtual network.

It also shows the average per-client message rate (bottom line) as the number of clients increases to 92. Figure 8 presents similar results, showing the sustained bandwidth delivered to the server as the number of clients sending it 1 KB messages and receiving 4-word replies increases. The average per-client bandwidth degrades gracefully and fairly. We conjecture that the fluctuation in the server's aggregate message rates and bandwidths arises from acknowledgments for reply messages encountering congestion (namely, other requests also destined for the server). The variation in per-sender rates and bandwidths is too small to be observable on the printed page. Figure 9 shows the average per-client round-trip time as the number of clients grows to 92 hosts. The slope of the line is exactly the gap measured in the LogP microbenchmarks.


Figure 9: Round-trip times with destination end-point contention within a single virtual network.

5.4 Multiple Virtual Networks

We can extend the previous benchmark to stress virtual networks: first by increasing the number of server endpoints up to the maximum of 7 that can be cached in the interface memory, and then by continuing to incrementally add endpoints to increasingly overcommit the resources. Thus, rather than clients sharing a common destination endpoint, each client endpoint now has its own dedicated server endpoint. With N clients, the server process has N different endpoints, each paired with a different client, resulting in N different virtual networks. This contains client messages within their virtual network and guarantees that messages in other virtual networks make forward progress.


Figure 10: Aggregate server and per-client messagerates with small numbers of virtual networks.

Figure 10 shows the average server message rate and per-client message rates (with error bars) over a five-minute interval. The number of clients continuously making requests of the server varies from one to seven. In this range, the network interface's seven endpoint frames can accommodate all server endpoints. This scenario stresses both the scheduling of outgoing replies and the multiplexing of incoming requests on the server. The results show server message rates within 11% of their theoretical peak of 62,578 messages per second given the measured LogP gap of 15.98 microseconds. The per-client message rates are within 16% of their ideal fair share of 1/N of the server's throughput. Steady server performance and the graceful response of the system to increasing load demonstrate the effective operation of the flow control, endpoint scheduling, and multiplexing mechanisms throughout the system.
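The theoretical peak cited above is simply the reciprocal of the measured gap, since the gap is the time through the rate-limiting stage of the system:

    peak rate = 1 / g = 1 / 15.98 us = 62,578 msgs/s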


Figure 11: Aggregate server and per-client messagerates with large numbers of virtual networks.

Figure 11 extends the scenario shown in Figure 10. Previously, the server host ran a single-threaded process, polling its endpoints in a round-robin fashion. When the number of busy endpoints exceeds the network interface capacity, the virtual network system actively loads and unloads endpoints into and out of interface memory in an on-demand fashion. When the server attempts to write a reply message into a non-resident endpoint (or when a request arrives for a non-resident endpoint), a page fault occurs and the virtual network driver moves the backing storage and re-maps the endpoint pages as necessary. However, during this time a single-threaded server process would be suspended, neither sending nor receiving additional messages, while messages arriving for non-resident endpoints and for endpoints being relocated are NACKed. This would result in a significant performance drop when interface endpoint frames become overcommitted.

To extend this scenario and to avoid the pitfalls of blocking, the server spawns a separate thread (and Solaris LWP) per client endpoint. Each server thread waits on a binary semaphore posted by the communication subsystem upon a message arrival that causes an endpoint receive queue to become non-empty. Additional messages may be delivered to the endpoint while the server thread is scheduled. Once running, the server thread disables further message arrival events and processes a batch of requests before re-enabling arrival events and again waiting on the semaphore. Apart from being a natural way to write the server, this approach allows a large number of server threads to be suspended pending resolution of their endpoint page faults while server threads with resident endpoints remain runnable and actively send and receive messages.
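The per-endpoint server loop can be sketched as follows, using POSIX threads and semaphores; the am_* calls are illustrative stand-ins for the AM-II event-mask and polling primitives described in Section 3.1:

    #include <pthread.h>
    #include <semaphore.h>
    #include <stddef.h>

    typedef struct am_endpoint ep_t;
    extern void am_set_event_mask(ep_t *ep, int on);  /* hypothetical */
    extern int  am_poll(ep_t *ep);   /* run handlers; 0 when queue empty */

    struct server_arg { ep_t *ep; sem_t *arrival; };

    static void *serve_client(void *p)
    {
        struct server_arg *a = p;
        for (;;) {
            sem_wait(a->arrival);         /* posted when the receive queue
                                           * becomes non-empty */
            am_set_event_mask(a->ep, 0);  /* batch: no events while draining */
            while (am_poll(a->ep))        /* an endpoint fault here suspends
                                           * only this thread */
                ;
            am_set_event_mask(a->ep, 1);  /* re-arm arrival events */
        }
        return NULL;
    }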

The results show that the event mechanisms and thread overheads degrade peak server message rates by 15%, to 53,488 messages per second. While variation in average per-client message rates across the five-minute sampling interval remains small, the variation in message rates between clients increases with load, with some clients' rates 40% higher than average while others are 36% lower than average. A finer-grained time series analysis (not shown) of client communication rates reveals the expected behavior: clients with resident server endpoints burst messages at the rates shown in Figure 10, while others wait until both their endpoints become resident and the appropriate server thread is scheduled.

6 Related Work

Recent communication systems can be categorized by their support for virtualization of network interfaces and communication resources and by their positions on multiprogramming and error handling.

GAM, PM, and FM use message-based APIs with little to no support for multiprogramming. GAM is the canonical fast active message layer. PM and FM add support for gang-scheduling of parallel programs. These systems are driven primarily by the needs of SPMD parallel computing, such as support for MPI and portability to MPPs. FM handles receive buffer overruns but ignores other types of network errors. None of these systems has an explicit error model, which hinders the implementation of highly-available and non-scientific applications.

SHRIMP, U-Net [vEBB+95], and Hamlyn are closer to our system. These systems provide direct, protected access to network interfaces using techniques similar to those found in application device channels [DPD94]. The SHRIMP project, which uses a virtual memory-mapped communication model, has run multiple applications and has preliminary multiprogramming results. U-Net and U-Net/MM can support multiprogramming. Hamlyn presented a vision of sender-based communication that should have been able to support multiprogramming, but demonstrated results using only ping-pong style benchmarks.

The most important distinction between previous work and our own lies in the virtualization of network interfaces and communication resources. In SHRIMP, the level of indirection used to couple virtual memory to communication effectively virtualizes the network. U-Net provides virtualized interfaces, but leaves routing, buffer management, reliable message delivery, and other protocol issues to higher layers. Hamlyn allows a process to map contiguous regions of NIC-addressable host memory into its address space. These "message areas" afford a level of indirection that allows the system to virtualize the network interface. The position taken on virtualization has a direct impact on the error model. In the event of an error, SHRIMP and Hamlyn deliver signals to processes. U-Net delegates responsibility for providing adequate buffer resources and conditioning traffic to higher-level protocols, and drops packets when resources are unavailable.

7 Conclusions

Bringing direct, protected communication into mainstream computing requires general-purpose and robust communication protocols. The AM-II API and virtual networks abstraction extend traditional active messages with reliable message delivery and a simple yet powerful error model, and support use in arbitrary sequential and parallel programs. We have presented the design of the NIC-to-NIC transport protocols required by this more general system. For our Myrinet implementation, we have measured the costs of the generality relative to GAM, a minimal active message layer, on the same hardware. In particular, we have explored the costs associated with endpoint scheduling, flow control, timer management, reliable message delivery, and error handling.

The implementation achieves end-to-end latencies of 21 microseconds for short active messages with a peak bandwidth of 31 MB/s. These numbers represent twice the end-to-end latency and 77% of the bandwidth provided by GAM. The cost of reliable message delivery makes the most significant contribution above basic communication costs. Using additional benchmarks, we have demonstrated that the protocols provide robust performance and graceful degradation for the virtual networks abstraction, even when physical network interface resources are overcommitted by factors of 12 or more. These benchmarks demonstrate the feasibility of truly virtualizing network interfaces and their resources and show the importance of supporting multi-threaded applications.

The NIC-to-NIC protocols discussed in this paper perform well and enable a diverse set of timely research efforts. Other researchers at Berkeley are actively using this system to investigate explicit and implicit techniques for the co-scheduling of communicating processes [DAC96], an essential part of high-performance communication in multiprogrammed clusters of uniprocessor and multiprocessor servers. Related work on clusters of SMPs [LMC97] investigates the use of multiple network interfaces and multiprotocol active message layers. The impact of packet-switched networks, such as gigabit Ethernet, on cluster interconnect protocols is an open question. We are eager to examine the extent to which our existing protocol mechanisms and policies apply in this new regime.

8 Acknowledgments

We thank Rich Martin and Remzi Arpaci-Dusseau for providing valuable feedback on earlier versions of this paper. Bob Felderman provided valuable information on Myrinet at all hours of the night. We thank Alec Woo for his debugging efforts as well as his port of the software to Pentium-based platforms, and Eric Anderson for discussions on specializations and improvements to the API error model. We especially thank Andrea Arpaci-Dusseau for comments and suggestions for improving this paper.

References

[BDF+95] M. Blumrich, C. Dubnicki, E. Felten, and K. Li. Virtual-Memory Mapped Network Interfaces. IEEE Micro, February 1995, pp. 21-28.

[BCF+95] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su. Myrinet: A Gigabit per Second Local Area Network. IEEE Micro, February 1995, pp. 29-36.

[CLM+95] D. Culler, L. Liu, R. Martin, and C. Yoshikawa. Assessing Fast Network Interfaces. IEEE Micro, February 1996, pp. 35-43.

[DAC96] A. Dusseau, R. Arpaci, and D. Culler. Effective Distributed Scheduling of Parallel Workloads. In Proceedings of the 1996 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, 1996.

[DPD94] P. Druschel, L. Peterson, and B. Davie. Experiences with a High-speed Network Adaptor: A Software Perspective. In Proceedings of the ACM SIGCOMM '94 Symposium, August 1994, pp. 2-13.

[LMC97] S. Lumetta, A. Mainwaring, and D. Culler. Multi-Protocol Active Messages on a Cluster of SMP's. In Proceedings of SC97, November 1997.

[LTW+94] W. Leland, M. Taqqu, W. Willinger, and D. Wilson. On the Self-Similar Nature of Ethernet Traffic (Extended Version). IEEE/ACM Transactions on Networking, 2(1):1-15, February 1994.

[MC96] A. Mainwaring and D. Culler. Active Message Application Programming Interface and Communication Subsystem Organization. Technical Report CSD-96-918, University of California at Berkeley, October 1996.

[MCS+97] A. Mainwaring, B. Chun, S. Schleimer, and D. Wilkerson. System Area Network Mapping. In Proceedings of the 9th ACM Symposium on Parallel Algorithms and Architectures, June 1997, pp. 116-126.

[PKC97] S. Pakin, V. Karamcheti, and A. Chien. Fast Messages (FM): Efficient, Portable Communication for Workstation Clusters and Massively-Parallel Processors. IEEE Parallel and Distributed Technology, Vol. 5, No. 2, April-June 1997.

[PLC95] S. Pakin, M. Lauria, and A. Chien. High Performance Messaging on Workstations: Illinois Fast Messages (FM) for Myrinet. In Proceedings of Supercomputing '95, December 1995.

[PT97] L. Prylli and B. Tourancheau. Protocol Design for High Performance Networking: a Myrinet Experience. Laboratoire de l'Informatique du Parallelisme, Lyon, France, Research Report No. 97-22, July 1997.

[THI96] H. Tezuka, A. Hori, and Y. Ishikawa. PM: A High-Performance Communication Library for Multi-user Parallel Environments. RWC, Tsukuba, Japan, Technical Report TR-96015, 1996.

[vEBB+95] T. von Eicken, A. Basu, V. Buch, and W. Vogels. U-Net: A User-level Network Interface for Parallel and Distributed Computing. In Proceedings of the 15th ACM Symposium on Operating System Principles, December 1995.