
System Support for Bandwidth Management and Content Adaptation in Internet Applications

David Andersen, Deepak Bansal, Dorothy Curtis, Srinivasan Seshan*, Hari Balakrishnan

M.I.T. Laboratory for Computer Science

Cambridge, MA 02139

{dga, bansal, dcurtis, srini, hari}@lcs.mit.edu

* Carnegie Mellon University, Pittsburgh, PA; [email protected]

Abstract

This paper describes the implementation and evaluation of an operating system module, the Congestion Manager (CM), which provides integrated network flow management and exports a convenient programming interface that allows applications to be notified of, and adapt to, changing network conditions. We describe the API by which applications interface with the CM, and the architectural considerations that factored into the design. To evaluate the architecture and API, we describe our implementations of TCP, a streaming layered audio/video application, and an interactive audio application using the CM, and show that they achieve adaptive behavior without incurring much end-system overhead. All flows including TCP benefit from the sharing of congestion information, and applications are able to incorporate new functionality such as congestion control and adaptive behavior.

1 Introduction

The impressive scalability of the Internet infrastructure is in large part due to a design philosophy that advocates a simple architecture for the core of the network, with most of the intelligence and state management implemented in the end systems [10]. The service model provided by the network substrate is therefore primarily a "best-effort" one, which implies that packets may be lost, reordered or duplicated, and end-to-end delays may be variable. Congestion and accompanying packet loss are common in heterogeneous networks like the Internet because of overload, when demand for router resources, such as bandwidth and buffer space, exceeds what is available. Thus, end systems in the Internet should incorporate mechanisms for detecting and reacting to network congestion, probing for spare capacity when the network is uncongested, as well as managing their available bandwidth effectively.

Previous work has demonstrated that the result of uncontrolled congestion is a phenomenon commonly called "congestion collapse" [8, 13]. Congestion collapse is largely alleviated today because the popular end-to-end Transmission Control Protocol (TCP) [30, 40] incorporates sound congestion avoidance and control algorithms. However, while TCP does implement congestion control [18], many applications including the Web [6, 12] use several logically different streams in parallel, resulting in multiple concurrent TCP connections between the same pair of hosts. As several researchers have shown [2, 3, 27, 28, 42], these concurrent connections compete with, rather than learn from, each other about network conditions to the same receiver, and end up being unfair to other applications that use fewer connections. The ability to share congestion information between concurrent flows is therefore a useful feature, one that promotes cooperation among different flows rather than adverse competition.

A growing number of applications in today's Internet do not use TCP as their underlying transport, because of the constraining reliability and ordering semantics imposed by its in-order byte-stream abstraction. Streaming audio and video [25, 34, 41] and customized image transport protocols are significant examples. Such applications use custom protocols that run over the User Datagram Protocol (UDP) [29], often without implementing any form of congestion control. The unchecked proliferation of such applications would have a significant adverse effect on the stability of the network [3, 8, 13].

Many Internet applications deliver documents and images or stream audio and video to end users and are interactive in nature. A simple but useful figure-of-merit for interactive content delivery is the end-to-end download latency; users typically wait no more than a few seconds before aborting a transfer if they do not observe progress. Therefore, it would be beneficial for content providers to adapt what they disseminate to the state of the network, so as not to exceed a threshold latency. Fortunately, such content adaptation is possible for most applications. Streaming audio and video applications typically encode information in a range of formats corresponding to different encoding (transmission) rates and degrees of loss resiliency. Image encoding formats accommodate a range of qualities to suit a variety of client requirements.

Today, the implementor of an Internet content dissemination application has a challenging task: for her application to be safe for widespread Internet deployment, she must either use TCP and suffer the consequences of its fully-reliable, byte-stream abstraction, or use an application-specific protocol over UDP. With the latter option, she must re-implement congestion control mechanisms, thereby risking errors not just in the implementation of her protocol, but also in the implementation of the congestion controller. Furthermore, neither alternative allows for sharing congestion information across flows. Finally, the common application programming interface (API) classes for network applications (Berkeley sockets, streams, and Winsock [31]) do not expose any information about the state of the network to applications in a standard way.¹ This makes it difficult for applications running on existing end host operating systems to make an informed decision, taking network variables into account, during content adaptation.

1.1 The Congestion Manager

Our previous work provided the rationale, initial design, and simulation of the Congestion Manager, an end-system architecture for sharing congestion information between multiple concurrent flows [3]. In this paper, we describe the implementation and evaluation of the CM in the Linux operating system. We focus on a version of the CM where the only changes made to the current IP stack are at the data sender, with feedback about congestion or successful data receptions being provided by the receiver CM applications to their sending peers, which communicate this information to the CM via an API. We present a summary of the API used by applications to adapt their transmissions to changing network conditions, and focus on those elements of the API that changed in the transition from the simulation to the implementation.

We evaluate the Congestion Manager by posing and answering several key questions:

Is its callback interface, used to inform applications of network state and other events, effective for a diverse set of applications to adapt without placing a significant burden on developers?

Because most robust congestion control algorithms rely on receiver feedback, it is natural to expect that a CM receiver is needed to inform the CM sender of successful transmissions and packet losses. However, to facilitate deployment, we have designed our system to take advantage of the fact that several protocols including TCP and other applications already incorporate some form of application-specific feedback, providing the CM with the loss and timing information it needs to function effectively.

¹ Utilities like netstat and ifconfig provide some information about devices, but not end-to-end performance information that can be used for adapting content.

Using the CM API, we implement several case studies both in and out of the kernel, showing the applicability of the API to many different application architectures. Our implementation of a layered streaming audio/video application demonstrates that the CM architecture can be used to implement highly adaptive congestion controlled applications. Adaptation via the CM helps these applications achieve better performance and also be fair to other flows on the Internet.

We have also modified a legacy application, the Internet audio tool vat from the MASH toolkit [23], to use the CM to perform adaptive real-time delivery. Since fewer than one hundred lines of source code modifications were required to CM-enable this complex application and make it adapt to network conditions, we believe it demonstrates the ease with which the CM makes applications adaptive.

Is the congestion control correct?

As a trusted kernel module, the CM frees both transport protocols and applications from the burden of implementing congestion management. We show that the CM behaves in the same network-friendly manner as TCP for single flows. Furthermore, by integrating flow information between both kernel protocols and user applications, we ensure that an ensemble of concurrent flows is not an overly aggressive user of the network.

In today's off-the-shelf operating systems, does the CM place any performance limitations upon applications?

We find that our implementation of TCP (which uses the CM for its congestion control) has essentially the same performance as standard TCP, with the added benefits of integrated congestion management across flows, with only small (0-3%) CPU overhead.

In a CM system where no changes are made to the receiver protocol stack, UDP-based applications must implement a congestion feedback mechanism, resulting in more overhead compared to the TCP applications. However, we show that these applications remain viable, and that the architectural change and API calls reduce worst-case throughput by 0-25%, even for applications that desire fine-grained information about the network on a per-packet basis.

To our knowledge, this is the first implementation of a general application-independent system that combines integrated flow management with a convenient API to enable content adaptation. The end result is that applications achieve the desirable congestion control properties of long-running TCP connections, together with the flexibility to adapt data transmissions to prevailing network conditions.

The rest of this paper is organized as follows. Section 2 describes our system architecture and implementation. Section 3 describes how network-adaptive applications can be engineered using the CM, while Section 4 presents the results of several experiments. In Section 5, we discuss some miscellaneous details and open issues in the CM architecture. We survey related work in Section 6 and conclude with a summary in Section 7.

2 System Architecture and Implementation

The CM performs two important functions. First, it enables efficient multiplexing and congestion control by integrating congestion management across multiple flows. Second, it enables efficient application adaptation to congestion by exposing its knowledge of network conditions to applications. Most of the CM functionality in our Linux implementation is in-kernel; this choice makes it convenient to integrate congestion management across both TCP flows and other user-level protocols, since TCP is implemented in the kernel.

To perform efficient aggregation of congestion information across concurrent flows, the CM has to identify which flows potentially share a common bottleneck link en route to various receivers. In general, this is a difficult problem, since it requires an understanding of the paths taken by different flows. However, in today's Internet, all flows destined to the same end host take the same path in the common case, and we use this group of flows as the default granularity of flow aggregation.² We call this group a macroflow: a group of flows that share the same congestion state, control algorithms, and state information in the CM. Each flow has a sending application that is responsible for its transmissions; we call this a CM client. CM clients are in-kernel protocols like TCP or user-space applications.

The CM incorporates a congestion controller that performs congestion avoidance and control on a per-macroflow basis. It uses a window-based algorithm that mimics TCP's additive-increase/multiplicative-decrease (AIMD) scheme to ensure fairness to other TCP flows on the Internet. However, the modularity provided by the CM encourages experimentation with other non-AIMD schemes that may be better suited to specific data types such as audio or video.

While the congestion controller determines what the current window (rate) ought to be for each macroflow, a scheduler decides how this is apportioned among the constituent flows. Currently, our implementation uses a standard unweighted round-robin scheduler.

In-kernel CM clients such as a TCP sender use CM function calls to transmit data and learn about network conditions and events. In contrast, user-space clients interact with the CM using a portable, platform-independent API described in Section 2.1. A platform-dependent CM library, libcm, is responsible for interfacing between the kernel and these clients, and is described in Section 2.2. These components are shown in Figure 1.

² This is not strictly true in the presence of network-layer differentiated services. We address this issue later in this section and in Section 5.

Figure 1. Architecture of the congestion manager at the data sender, showing the CM library and the CM. The dotted arrows show callbacks, and solid lines show the datapath. UDP-CC is a congestion-controlled UDP socket implemented using the CM.

When a client opens a CM-enabled socket, the CM allocates a flow to it and assigns the flow to the appropriate macroflow based on its destination. The client initiates data transmission by requesting permission to send data. At some point in the future depending on the available rate, the CM issues a callback permitting the client to send data. The client then transmits data, and tells the CM it has done so. When the client receives feedback from the receiver about its past transmissions, it notifies the CM about these and continues.

When a client makes a request to send on a flow, the scheduler checks whether the corresponding macroflow's window is open. If so, the request is granted and the client notified, upon which it may send some data. Whenever any data is transmitted, the sender's IP layer notifies the CM, allowing it to "charge" the transmission to the appropriate macroflow. When the client receives feedback from its remote counterpart, it informs the CM of the loss rate, number of bytes transmitted correctly, and the observed round-trip time. On a successful transmission, the CM opens up the window according to its congestion management algorithm and grants the next, if any, pending request on a flow associated with this macroflow. The scheduler also has a timer-driven component to perform background tasks and error handling.


2.1 CM API

The CM API is specified as a set of functions and callbacks that a client uses to interface with the CM. It specifies functions for managing state, for performing data transmissions, for applications to inform the CM of losses, for querying the CM about network state, and for constructing and splitting macroflows if the default per-destination aggregation is unsuitable for an application. The CM API is discussed in detail in [3], which presents the design rationale for the Congestion Manager. Here we provide an overview of the API and a discussion of those features which changed during the transition from simulation to implementation.

2.1.1 State management

All CM applications call cm_open() before using the CM, passing the source and destination addresses and transport-layer port numbers, in the form of a struct sockaddr. The original CM API required only a destination address, but the source address specification was necessary to handle multihomed hosts. cm_open() returns a flow identifier (cm_flowid), which is used as a handle for all future CM calls. Applications may call cm_mtu(cm_flowid) to obtain the maximum transmission unit to a destination. When a flow terminates, the application should call cm_close(cm_flowid).
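As a concrete illustration, here is a minimal sketch of that flow lifecycle; the argument types in the cm_* declarations are assumptions paraphrased from the text, not the implementation's actual headers:

    #include <sys/socket.h>

    /* Paraphrased libcm declarations; exact prototypes are assumptions. */
    typedef int cm_flowid;
    extern cm_flowid cm_open(const struct sockaddr *src,
                             const struct sockaddr *dst);
    extern int  cm_mtu(cm_flowid flow);
    extern void cm_close(cm_flowid flow);

    void flow_lifecycle(const struct sockaddr *src, const struct sockaddr *dst)
    {
        cm_flowid flow = cm_open(src, dst); /* handle for all later CM calls */
        int mtu = cm_mtu(flow);             /* size packets to the path MTU */
        /* ... transmit data in units of at most mtu bytes ... */
        (void)mtu;
        cm_close(flow);                     /* release the CM flow state */
    }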

2.1.2 Data transmission

There are three ways in which an application can use the CM to transmit data. These allow a variety of adaptation strategies, depending on the nature of the client application and its software structure.

(i) Buffered send. This API uses a conventional write() or sendto() call, but the resulting data transmission is paced by the Congestion Manager. We use this to implement a generic congestion-controlled UDP socket (without content adaptation), useful for bulk transmissions that do not require TCP-style reliability or fine-grained control over what data gets sent at a given point in time.

(ii) Request/callback. This is the preferred mode of communication for adaptive senders that are based on the ALF (Application-Level Framing [11]) principle. Here, the client does not send data via the CM; rather, it calls cm_request(cm_flowid) and expects a notification via the cmapp_send(cm_flowid) callback when this request is granted by the CM, at which time the client transmits its data. This approach puts the sender in firm control of deciding what to transmit at a given time, and allows the sender to adapt to sudden changes in network performance, which is hard to do in a conventional buffered transmission API. The client callback is a grant for the flow to send up to MTU bytes of data. Each call to cm_request() is an implicit request for sending up to MTU bytes, which simplifies the internal implementation of the CM. This API is ideally suited for an implementation of TCP, since it needs to make a decision at each stage about whether to retransmit a segment or send a new one. In the implementation, the cmapp_send callback now provides the client with the ID of the flow that may transmit. To allow for client programming flexibility, the client may now specify its callback function via cm_register_send(). (A sketch of this pattern follows this list.)

(iii) Rate callback. A self-timed application transmitting on a fixed schedule may receive callbacks from the CM notifying it when the parameters of its communication channel have changed, so that it can change the frequency of its timer loop or its packet size. The CM informs the client of the rate, round-trip time, and packet loss rate for a flow via the cmapp_update() callback. During implementation, we added a registration function, cm_register_update(), to select the rate callback function, and the cm_thresh(down, up) function: if the rate reduces by a factor of down or increases by a factor of up, the CM calls cmapp_update(). This transmission API is ideally suited for streaming layered audio and video applications.
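The sketch below illustrates the request/callback pattern of item (ii); cm_request() and cm_register_send() follow the description above, while send_freshest_mtu() is a hypothetical application routine:

    typedef int cm_flowid;
    extern void cm_request(cm_flowid flow);
    extern void cm_register_send(void (*cb)(cm_flowid flow));

    /* Hypothetical application routine: choose and transmit the most
     * up-to-date MTU-sized unit of data for this flow. */
    extern void send_freshest_mtu(cm_flowid flow);

    /* Invoked by libcm when the CM grants this flow permission to send
     * up to MTU bytes. */
    static void my_cmapp_send(cm_flowid flow)
    {
        send_freshest_mtu(flow); /* decide what to send at the last moment */
        cm_request(flow);        /* implicit request for the next MTU */
    }

    void start_sending(cm_flowid flow)
    {
        cm_register_send(my_cmapp_send); /* select the send callback */
        cm_request(flow);                /* ask for the first grant */
    }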

2.1.3 Application notifications

One of the goals of our work was to investigate a CM implementation that requires no changes at the receiver. Performing congestion management requires feedback about transmissions: TCP provides this feedback automatically; some UDP applications may need to be modified to do so, but without any system-wide changes. Senders must then inform the CM about the number of sent and received packets, the type of congestion loss if any, and a round-trip time sample using the cm_update(cm_flowid, nsent, nrecd, lossmode, rtt) function. The CM distinguishes between "persistent" congestion as would occur on a TCP timeout, versus "transient" congestion when only one packet in a window is lost. It also allows congestion to be notified using Explicit Congestion Notification (ECN) [32], which uses packet markings rather than drops to infer congestion.

To perform accurate bookkeeping of the congestion window and outstanding bytes for a macroflow, the CM needs to know of each successful transmission from the host. Rather than encumber clients with reporting this information, we modify the IP output routine to call cm_notify(cm_flowid, nsent) on each transmission. (The IP layer obtains the cm_flowid using a well-defined CM interface that takes the flow parameters (addresses, ports, protocol field) as arguments.) However, if a client decides not to transmit any data upon a cmapp_send() callback invocation, it is expected to call cm_notify(dst, 0) to allow the CM to permit some other flows on the macroflow to transmit data.
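To make the feedback path concrete, the sketch below shows a UDP sender folding an application-level acknowledgement into cm_update(); the ack structure and the numeric loss-mode codes are our own illustration (the text names only the persistent/transient/ECN distinction):

    typedef int cm_flowid;
    /* Paraphrased from the text: report sent/received counts, the kind of
     * congestion observed, and a round-trip time sample. */
    extern void cm_update(cm_flowid flow, int nsent, int nrecd,
                          int lossmode, int rtt_us);

    /* Illustrative loss-mode codes; the implementation's names differ. */
    enum { LM_NONE = 0, LM_TRANSIENT = 1, LM_ECN = 2 };

    /* Hypothetical ack record carried back by the application protocol. */
    struct app_ack { int nsent; int nrecd; int ecn_seen; int rtt_us; };

    void on_app_ack(cm_flowid flow, const struct app_ack *ack)
    {
        int lossmode = LM_NONE;
        if (ack->nrecd < ack->nsent)
            lossmode = LM_TRANSIENT;  /* some packets in the window lost */
        if (ack->ecn_seen)
            lossmode = LM_ECN;        /* marked, not dropped */
        cm_update(flow, ack->nsent, ack->nrecd, lossmode, ack->rtt_us);
    }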

2.1.4 Querying

If a client wishes to learn about its (per-flow) available bandwidth and round-trip time, it can use the cm_query() call, which returns these quantities. This is especially useful at the beginning of a stream when clients can make an informed decision about the data encoding to transmit (e.g., a large color or smaller grey-scale image).
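For example, an image server might use cm_query() at stream start to choose an encoding. The sketch below assumes an out-parameter form of the call and a hypothetical three-second latency budget; neither detail is specified by the text:

    #include <stddef.h>

    typedef int cm_flowid;
    /* Assumed out-parameter form of the query call described above. */
    extern void cm_query(cm_flowid flow, double *rate_bps, double *rtt_s);

    /* Pick the large color image if it can be delivered within a few
     * seconds at the current rate; otherwise fall back to grey-scale. */
    const char *choose_encoding(cm_flowid flow, size_t color_bytes)
    {
        double rate_bps, rtt_s;
        cm_query(flow, &rate_bps, &rtt_s);
        double est_s = rtt_s + 8.0 * (double)color_bytes / rate_bps;
        return est_s < 3.0 ? "color" : "greyscale";
    }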

2.2 libcm: The CM library

The CM library provides users with the convenience of a callback-based API while separating them from the details of how the kernel-to-user callbacks are implemented. While direct function callbacks are convenient and efficient in the same address space, as is the case when the kernel TCP is a client of the CM, callbacks from the kernel to user code in conventional operating systems are more difficult. A key decision in the implementation of libcm was choosing a kernel/user interface that maximizes portability, and minimizes both performance overhead and the difficulty of integration with existing applications. The resulting internal interface between libcm and the kernel is:

1. select() on a single per-application CM control socket. The write bit indicates that a flow may send data, and the exception bit indicates that network conditions have changed.

2. Perform an ioctl to extract a list of all flow IDs which may send, or to receive the current network conditions for a flow.

Note that client programs of the CM do not see this interface; they see only the standard cm_* functions provided by libcm. The use of sockets or signals does change the way the application's event handling loop interacts with libcm; after passing the socket into libcm, the library performs the appropriate ioctls and then calls back into the application.
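The fragment below sketches what one libcm dispatch step might look like under this interface; the ioctl request codes, the flow-list layout, and app_cmapp_send() are illustrative assumptions, not the actual kernel interface:

    #include <sys/select.h>
    #include <sys/ioctl.h>

    #define CM_IOC_SENDABLE 1  /* assumed ioctl codes for illustration */
    #define CM_IOC_STATUS   2

    /* Hypothetical hook: the send callback the application registered. */
    extern void app_cmapp_send(int flowid);

    void cm_dispatch_once(int ctl_sock)
    {
        fd_set wfds, efds;
        FD_ZERO(&wfds);
        FD_ZERO(&efds);
        FD_SET(ctl_sock, &wfds);  /* write bit: some flow may send */
        FD_SET(ctl_sock, &efds);  /* exception bit: conditions changed */
        if (select(ctl_sock + 1, NULL, &wfds, &efds, NULL) <= 0)
            return;
        if (FD_ISSET(ctl_sock, &wfds)) {
            int flows[64];
            /* Assume the ioctl fills 'flows' and returns how many flow
             * IDs may send, so one crossing drains all ready flows. */
            int n = ioctl(ctl_sock, CM_IOC_SENDABLE, flows);
            for (int i = 0; i < n; i++)
                app_cmapp_send(flows[i]);
        }
        if (FD_ISSET(ctl_sock, &efds)) {
            /* Fetch the current network-state estimate and hand it to
             * the registered cmapp_update() callback. */
            ioctl(ctl_sock, CM_IOC_STATUS, NULL);
        }
    }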

2.2.1 Implementation alternatives

We considered a number of mechanisms with which to implement libcm. In this section, we discuss our reasons for choosing the control-socket+select+ioctl approach.

While much research has focused on reducing the cost of crossing the user/kernel boundary (extensible kernels in SPIN [7]; fast, generic IPC in Mach [5]; etc.), many conventional operating systems remain limited to more primitive methods for kernel-to-user notification, each with its own advantages and disadvantages. While functionality like the Mach port-set-based IPC would be ideal for our purposes, pragmatically we considered four common mechanisms for kernel-to-user communication: signals, system calls, semaphores, and sockets. A discussion of the merits of each follows.

Signals have several immediate drawbacks. First, if the CM were to appropriate an existing signal for its own use, it might conflict with an application using the same signal. Avoiding this conflict would require the standardization of a new signal type, a process both slow and of questionable value, given the existence of better alternatives. Second, the cost to an application to receive a signal is relatively high, and some legacy applications may not be signal-safe. While the new POSIX 1003.1b [17] soft realtime signals allow delivering a 32-bit quantity with a signal, applications would need to follow up a signal with a system call to obtain all of the information the kernel wished to deliver, since multiple flows may become ready at once. For these reasons, we consider mandating the use of signals the wrong course for implementing the kernel-to-user callbacks. However, we provide an option for processes to receive a SIGIO when their control socket status changes, akin to POSIX asynchronous I/O.

System calls that block do not integrate well with applications that already have their own event loop, since without polling, applications cannot wait on the results of multiple system calls. A system call is able to return immediately with the data the user needs, but the impediments it poses to application integration are large. System calls would work well in a threaded environment, but this presupposes threading support, and the select-based mechanism we describe below can be used in a threaded system without major additional overhead.

Semaphores suffer from the immediate drawback that they are not commonly used in network applications. For an application that uses semop on an array of semaphores as its event loop, a CM semaphore might be the best implementation avenue, for many of the same reasons that we chose sockets for network-adaptive applications. However, most network applications use socket sets instead of semaphore sets, and sockets have a few other benefits, which we discuss next.

Sockets provide a well-defined and flexible interface for applications in the form of the select() system call, though they have a downside similar to that of signals: an application wishing to receive a notification via a socket in a non-blocking manner must select() on the socket, and then perform a system call to obtain data from the socket. However, a select-based interface meshes well with many network applications that already have a select-loop based architecture. Utilizing a control socket also helps restrict the code changes caused by the CM to the networking stack.

Finally, we decided to use a single control socket instead of one control socket per flow to avoid unnecessary overhead in applications with large numbers of open socket descriptors, such as select()-based webservers and caches. Because some aspects of select scale linearly with the number of descriptors, and many operating systems have limits on the number of open descriptors, we deemed doubling the socket load for high-performance network applications a bad idea.

2.2.2 Extracting data from the socket

Select provides notification that "some event" has occurred. In theory, 7 different events could be sent by abusing the read, write, and exception bits, but applications need to extract more information than this. The CM provides two types of callbacks. Generally speaking, the first is a "permission to send" callback for a particular flow. To maintain even distribution of bandwidth between flows, a loose ordering should be preserved with these messages, but exact ordering is unimportant provided no flows are ignored until the application receives further updates (thereby starving the flows). If multiple permission notifications occur, the application should receive all of them so it can send data on all available flows. The second callback is a "status changed" notification. If multiple status changes occur before the application obtains this data from the kernel, then only the current status matters.

The weak ordering and lack of history prompted us to choose an ioctl-based query instead of a read or message queue interface, minimizing the state that must be maintained in the kernel. Status updates simply return the current CM-maintained network state estimate, and "who can send" queries perform a select-like operation on the flows maintained by the kernel, requiring no extra state, instead of a potentially expensive per-process message queue or data stream. Returning all available flows has an added benefit of reducing the number of system calls that must be made if several flows become ready simultaneously.

3 Engineering Network-adaptive Applications

In this section, we describe several different classes of applications, and describe the ways those applications can make use of the CM. We explore two in-kernel clients, and several user-space data server programs, and examine the task of integrating each with the CM.

3.1 Software Architecture Issues

Typical network applications fall into one of several categories:

- Data-driven: Applications that transmit prespecified data, such as a single file, then exit.

- Synchronous event-driven: Self-timed data delivery servers, like streaming audio servers.

- Asynchronous event-driven: File servers (http, ftp) and other network-clocked applications.

The CM library provides several options for adaptive applications that wish to make use of its services:

1. Data-driven applications may use the buffered API to efficiently pace their data transmissions.

2. An application may operate in an entirely callback-based manner by allowing libcm to provide its own event loop, calling into the application when flows are ready. This is most useful for applications coded with the CM in mind.

3. Signal-driven applications may request a SIGIO notification from the CM when an event occurs.

4. Applications with select-based event loops can simply add the CM control socket into their select set, and call the libcm dispatcher when the socket is ready (a sketch follows this list). Rate-clocked applications (or polling-based applications) can perform a similar non-blocking select test on the descriptor when they awaken to send data, or, if they sleep, can replace the sleep with a timed blocking select call.

5. Applications may poll the CM on their own schedule.
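Option 4 amounts to folding the CM control socket into the application's existing select loop. A minimal sketch, assuming a hypothetical libcm_socket()/libcm_dispatch() pair for obtaining the control socket and running the library's callbacks:

    #include <sys/select.h>

    extern int  libcm_socket(void);       /* hypothetical: the control socket */
    extern void libcm_dispatch(void);     /* hypothetical: run pending callbacks */
    extern void handle_client_io(int fd); /* application-specific */

    void event_loop(int app_fd)
    {
        int cm_fd = libcm_socket();
        for (;;) {
            fd_set rfds, wfds, efds;
            FD_ZERO(&rfds); FD_ZERO(&wfds); FD_ZERO(&efds);
            FD_SET(app_fd, &rfds);  /* ordinary application I/O */
            FD_SET(cm_fd, &wfds);   /* CM: a flow may send */
            FD_SET(cm_fd, &efds);   /* CM: network conditions changed */
            int maxfd = app_fd > cm_fd ? app_fd : cm_fd;
            if (select(maxfd + 1, &rfds, &wfds, &efds, NULL) <= 0)
                continue;
            if (FD_ISSET(cm_fd, &wfds) || FD_ISSET(cm_fd, &efds))
                libcm_dispatch();   /* libcm ioctls, then calls back in */
            if (FD_ISSET(app_fd, &rfds))
                handle_client_io(app_fd);
        }
    }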

The remainder of this section describes how particular clients use different CM APIs, from the low-bandwidth vat audio application, to the performance-critical kernel TCP implementation. Note that all UDP-based clients must implement application-level data acknowledgements in order to make use of the CM.

3.2 TCP

We implemented TCP as an in-kernel CM client. TCP/CM offloads all congestion control to the CM, while retaining all other TCP functionality (connection establishment and termination, loss recovery, and protocol state handling). TCP uses the request/callback API as low-overhead direct function calls in the same protection domain. This gives TCP the tight control it needs over packet scheduling. For example, while the arrival of a new acknowledgement typically causes TCP to transmit new data, the arrival of three duplicate ACKs causes TCP to retransmit an old packet.

Connection creation. When TCP creates a new connection via either accept (inbound) or connect (outbound), it calls cm_open() to associate the TCP connection with a CM flow. Thereafter, the pacing of outgoing data on this connection is controlled by the CM. When application data becomes available, after performing all the non-congestion-related checks (e.g., the Nagle algorithm [40], etc.), data is queued and cm_request() is called for the flow. When the CM scheduler schedules the flow for transmission, the cmapp_send() routine for TCP is called. The cmapp_send() for TCP transmits any retransmission from the retransmission queue; otherwise, it transmits the data present in the transmit socket buffer, sending up to one maximum segment size of data per call. Finally, the IP output routine calls cm_notify() when the data is actually sent out.

TCP input. The TCP input routines now feed back to the CM. Round-trip time (RTT) sample collection is done as usual using either RFC 1323 timestamps [19] or Karn's algorithm [21], and is passed to the CM via cm_update(). The smoothed estimate of the RTT (srtt) and the round-trip time deviation are calculated by the CM, which can now obtain a better average by combining samples from different connections to the same receiver. This is available to each TCP connection via cm_query(), and is useful in loss recovery.

Data acknowledgements. On arrival of an ACK for new data, the TCP sender calls cm_update() to inform the CM of a successful transmission. Duplicate acknowledgements cause TCP to check its dupack count (dup_acks). If dup_acks < 3, then TCP does nothing. If dup_acks == 3, then TCP assumes a simple, congestion-caused packet loss, and calls cm_update() to inform the CM. TCP also enqueues a retransmission of the lost segment and calls cm_request(). If dup_acks > 3, TCP assumes that a segment reached the receiver and caused this ACK to be sent. It therefore calls cm_update(). Unlike duplicate ACKs, the expiration of the TCP retransmission timer notifies the sender of a more serious batch of losses, so it calls cm_update() with the CM_LOST_FEEDBACK option set to signify the occurrence of persistent congestion to the CM. TCP also enqueues a retransmission of the lost segment and calls cm_request().
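The acknowledgement handling just described maps onto roughly the following logic; apart from cm_update(), cm_request(), and CM_LOST_FEEDBACK, all names, constants, and the connection structure are illustrative assumptions:

    /* Illustrative connection state; field names are assumptions. */
    struct tcp_cm_conn { int flow; int dup_acks; int rtt_sample; };

    extern void cm_update(int flow, int nsent, int nrecd,
                          int lossmode, int rtt);
    extern void cm_request(int flow);
    extern void enqueue_retransmission(struct tcp_cm_conn *tp); /* assumed */

    #define CM_NO_FEEDBACK    0  /* assumed codes; only CM_LOST_FEEDBACK */
    #define CM_TRANSIENT      1  /* is actually named in the text */
    #define CM_LOST_FEEDBACK  2

    void tcp_cm_ack_input(struct tcp_cm_conn *tp, int is_dup)
    {
        if (!is_dup) {                /* ACK for new data: success */
            tp->dup_acks = 0;
            cm_update(tp->flow, 1, 1, CM_NO_FEEDBACK, tp->rtt_sample);
            return;
        }
        if (++tp->dup_acks < 3)
            return;                   /* fewer than 3 dupacks: do nothing */
        if (tp->dup_acks == 3) {      /* simple congestion-caused loss */
            cm_update(tp->flow, 1, 0, CM_TRANSIENT, tp->rtt_sample);
            enqueue_retransmission(tp);
            cm_request(tp->flow);
        } else {                      /* >3: a segment did reach the peer */
            cm_update(tp->flow, 1, 1, CM_NO_FEEDBACK, tp->rtt_sample);
        }
    }

    /* On retransmission-timer expiry: persistent congestion. */
    void tcp_cm_timeout(struct tcp_cm_conn *tp)
    {
        cm_update(tp->flow, 1, 0, CM_LOST_FEEDBACK, tp->rtt_sample);
        enqueue_retransmission(tp);
        cm_request(tp->flow);
    }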

TCP/CM Implementation. The integration of TCP and the CM required less than 100 lines of changes to the existing TCP code, demonstrating both the flexibility of the CM API and the low programmer overhead of implementing a complex protocol with the Congestion Manager.

3.3 Congestion-controlled UDP sockets

The CM also provides congestion-controlled UDP sockets. They provide the same functionality as standard Berkeley UDP sockets, but instead of immediately sending the data from the kernel packet queue to lower layers for transmission, the buffered socket implementation schedules its packet output via CM callbacks. When a CM UDP socket is created, it is bound to a particular flow. When data enters the packet queue, the kernel calls cm_request() on the flow associated with the socket. When the CM schedules this flow for transmission, it calls udp_ccappsend() in the CM UDP module. This function transmits one MTU from the packet queue, and requests another callback if packets remain. The in-kernel implementation of the CM UDP API adds no data copies or queue structures, and supports all standard UDP options. Modifying existing applications to use this API requires only providing feedback to the CM, and setting a socket option on the socket.

A typical client of the CM UDP sockets will behave as follows, after its usual network socket initialization:

    flow = cm_open(dst, port)
    setsockopt(flow, ..., CM_BUF)
    loop:
        <send data on flow>
        <receive data acknowledgements>
        cm_update(flow, sent, received, ...)

3.4 Streaming Layered Audio and Video

Streaming layered audio or video applications that have a number of discrete rates at which they can transmit data are well-served by the CM rate callbacks. Instead of requiring a comparatively expensive notification for each transmission, these applications are instead notified only in the rare event that their network conditions change significantly. Layered applications open their usual UDP socket, and call cm_open() to obtain a control socket. They operate in their own clocked event loop while listening for status changes on either their control socket or via a SIGIO signal. They use cm_thresh() to inform the CM about network changes for which they should receive callbacks.
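A sketch of such a layered sender, assuming a four-layer encoding; cm_register_update() and cm_thresh() follow the earlier API description, while the layer rates, the callback's exact signature, and the threshold values are illustrative:

    typedef int cm_flowid;
    extern void cm_register_update(void (*cb)(cm_flowid flow, double rate_bps,
                                              double rtt_s, double loss_rate));
    extern void cm_thresh(double down, double up);

    /* Illustrative encoding rates in bits/sec, lowest to highest layer. */
    static const double layer_rate[4] = { 2e5, 5e5, 1e6, 2e6 };
    static int current_layer; /* read by the application's clocked send loop */

    static void my_cmapp_update(cm_flowid flow, double rate_bps,
                                double rtt_s, double loss_rate)
    {
        (void)flow; (void)rtt_s; (void)loss_rate;
        int l = 0; /* pick the highest layer that fits the reported rate */
        while (l + 1 < 4 && layer_rate[l + 1] <= rate_bps)
            l++;
        current_layer = l;
    }

    void setup_rate_callback(void)
    {
        cm_register_update(my_cmapp_update);
        cm_thresh(2.0, 1.5); /* callback on 2x rate drop or 1.5x rise */
    }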

3.5 Real-time Adaptive Applications

Applications that desire last-minute control over their data transmission (i.e., those that do not want any buffering inside the kernel) use the request callback API provided by the CM. When given permission to transmit via the cmapp_send() callback from the CM, they may use cm_query() to discover the current network conditions and adapt their content based on that. Other servers may simply wish to send the most up-to-date content possible, and so will defer their data collection until they know they can send it. The rough sequence of CM calls that are made to achieve this in the application is:

    flow = cm_open(dst)
    cm_request(flow)
    <receive cmapp_send() callback from libcm>
    cm_query(flow, ...)
    <send data>
    <receive data acks>
    cm_update(flow, sent, lost, ...)

Other options exist for applications that wish to exploit the unique nature of their network utilization to reduce the overhead of using the services of the Congestion Manager. We discuss one such option below in the manner in which we adapted the vat interactive audio application to use the CM.


Figure 2. The adaptive vat architecture.

3.6 Interactive Real-time Audio

The vat application provides a constant bit-rate source of interactive audio. Its inability to downsample its audio reduces the avenues it has available for bandwidth adaptation. Therefore, the best way to make vat behave in a network-friendly and backwards-compatible manner is to preemptively drop packets to match the available network bandwidth. There are, of course, complications. Network applications experience two types of variation in available network bandwidth: long-term variations due to changes in actual bandwidth, and short-term variations due to the probing mechanisms of the congestion control algorithm. Short-term variation is typically dealt with by buffering. Unfortunately, buffering, especially FIFO buffering with drop-tail behavior, the de facto standard for kernel buffers and network router buffers, can result in long delay and significant delay variation, both of which are detrimental to vat's audio quality. Vat, therefore, needs to act like an ALF application, managing its own buffer space with drop-from-head behavior when the queue is full.

The resulting architecture is detailed in Figure 2. The input audio stream is first sent to a policer, which provides long-term adaptation via preemptive packet dropping. The policer outputs into the application-level buffer, which can be configured in various sizes and drop policies. This buffer feeds into the kernel buffer on demand as packets are available for transmission.
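A minimal sketch of the drop-from-head application buffer described above; the ring-buffer layout, sizes, and names are our own illustration, not vat's actual code:

    #define QCAP  64   /* illustrative capacity, in audio packets */
    #define PKTSZ 320  /* illustrative packet size, in bytes */

    struct pkt { unsigned char data[PKTSZ]; };

    static struct pkt q[QCAP];
    static int head, count; /* ring buffer of queued audio packets */

    /* Enqueue a packet; when full, drop from the head so the freshest,
     * lowest-latency audio survives (unlike drop-tail kernel buffers). */
    void abuf_put(const struct pkt *p)
    {
        if (count == QCAP) {            /* full: discard the oldest packet */
            head = (head + 1) % QCAP;
            count--;
        }
        q[(head + count) % QCAP] = *p;
        count++;
    }

    /* Dequeue the oldest packet for hand-off to the kernel buffer on a CM
     * send grant; returns 0 if the application buffer is empty. */
    int abuf_get(struct pkt *out)
    {
        if (count == 0)
            return 0;
        *out = q[head];
        head = (head + 1) % QCAP;
        count--;
        return 1;
    }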

4 Evaluation

This section describes several experiments that quantify the costs and benefits of our CM implementation. Our experiments show that using the Congestion Manager in the kernel has minimal costs, and that even the worst-case overhead of the request/callback user-space API is acceptably small.

The tests were performed on the Utah Network Testbed [22] using 600 MHz Intel Pentium III processors, 128 MB PC100 ECC SDRAM, and Intel EtherExpress Pro/100B Ethernet cards, connected via 100 Mbps Ethernet through an Intel Express 510T switch, with Dummynet channel simulation. CM tests were run on Linux 2.2.9, with Linux and FreeBSD clients.

Figure 3. Comparing throughput vs. loss for TCP/CM and TCP/Linux. Rates are for a 10 Mbps link with a 60 ms RTT.

To ensure the proper behavior of a flow, the congestion control algorithm must behave in a "TCP-compatible" [8] manner. The CM implements a TCP-style window-based AIMD algorithm with slow start. It shares bandwidth between eligible flows in a round-robin manner with equal weights on the flows.

Figure 3 shows the throughput achieved by the Linux TCP implementation (TCP/Linux) and TCP with congestion control performed by the CM (TCP/CM). The Linux kernel against which we compare has two algorithmic differences from the Congestion Manager: it starts its initial window at 2 packets, and it assumes that each ACK is for a full MTU. The Congestion Manager instead performs byte-counting for its AIMD algorithm. The first issue is Linux-specific, and the second is a feature of the CM.

4.1 Kernel Overhead

To measure the kernel overhead, we measured the CPU and throughput differences between the optimized TCP/Linux and TCP/CM. The midrange machines used in our test environment are sufficiently powerful to saturate a 100 Mbps Ethernet with TCP traffic.

There are two components to the overhead imposed by the congestion manager: the cost of performing accounting as data is exchanged on a connection, and a one-time connection setup cost for creating CM data structures. A microbenchmark of the connection establishment time of TCP/CM vs. TCP/Linux indicates that there is no appreciable difference in connection setup times.

We used long (megabytes to gigabytes) connections with the ttcp utility to determine the long-term costs imposed by the congestion manager.


Figure 4. 100 Mbps TCP throughput comparison. Note that the absolute difference in the worst case between the Congestion Manager and the native TCP is only 0.5%, and that the Y axis begins at 11 megabytes per second.

The impact of the CM on extremely long-term throughput was negligible: in a 1-gigabyte transfer, the congestion manager achieved performance (91.6 Mbps) identical to native Linux. On shorter runs, the throughput of the CM diverged slightly from that of Linux, but only by 0.5%. The throughput rates are shown in Figure 4. The difference is due to the CM using an initial window of 1 MTU and Linux using 2 MTU, not CPU overhead.

Because both implementations are able to saturate the network connection, we looked at the CPU utilization during these transmissions to determine the steady-state overhead imposed by the Congestion Manager. In Figure 5 we see that the CPU difference between TCP/Linux and TCP/CM converges to slightly less than 1%.

4.2 User-space API Overhead

The overhead incurred by our adaptation API occurs primarily because the applications must process their ACKs in user-space instead of in the kernel. Therefore, these programs incur extra data copies and user/kernel boundary crossings. To quantify this overhead, our test programs sent packets of specified sizes on a UDP socket, and waited for acknowledgement packets from the server. We compare these programs to a webserver-like TCP client which sent data to the server, and performed a select() on its socket to determine if the server had sent any data back. To facilitate comparison, we disabled delayed ACKs for the one TCP test to ensure that our packet counts were identical.

Figure 6 shows the wall-clock time required to send and process the acknowledgement for a packet, based on transmitting 200,000 packets. For comparison, we include TCP statistics as well, where the TCP programs set the maximum segment size to achieve identical network performance. The "nodelay" variant is TCP without delayed ACKs. The tests were run on a 100 Mbps network on which no losses occurred.

Figure 5. CPU overhead comparison between TCP/Linux and TCP/CM. For long connections, the CPU overhead converges to slightly under 1% for the unoptimized implementation of the CM.

    ALF/noconnect   1 cm_notify (ioctl)
    ALF             1 cm_request (ioctl), 1 extra socket
    Buffered        1 recv, 2 gettimeofday
    TCP/CM          (baseline)

Table 1. Cumulative sources of overhead for different APIs using the Congestion Manager relative to sending data with TCP.

Table 1 breaks down the sources of overhead for using the different APIs. Using the CM with UDP requires that applications compute the round-trip time (RTT) of their packets, requiring a system call to gettimeofday, and requires that they process their ACKs in user-space, requiring a system call to recv and the accompanying data copy into their address space. The ALF API further requires that the application obtain an additional control socket and select upon it, and that it make an explicit call to cm_request before transmitting data. Finally, if the kernel is unable to determine the flow to which to charge the transmission, as with an unconnected UDP socket, the application must explicitly call cm_notify.

These test cases represent the worst-case behavior of serving a single high-bandwidth client, because no aggregation of requests to the CM may occur. The CM programs can achieve similar reductions in processing time by using delayed ACKs, so the real API overhead can be determined by comparing the ALF/noconnect case to the TCP/CM case. For 168-byte packets, ALF/noconnect results in a 25% reduction in throughput relative to TCP without delayed ACKs.

Figure 6. API throughput comparison on a 100 Mbps link. The worst-case throughput reduction incurred by the CM is 25%, from TCP/CM nodelay to ALF/noconnect.

4.3 Benefits of Sharing

One benefit of integrating congestion information with the CM is immediately clear. A client that sequentially fetches files from a webserver with a new TCP connection each time loses its prior congestion information, but with concurrent connections with the CM, the server is able to use this information to start subsequent connections with more accurate congestion windows. Figure 7 shows a test we performed across the vBNS between MIT and the University of Utah, where an unmodified (non-CM) client performed 9 retrievals of the same 128k file with a 500 ms delay between retrievals, resulting in a 40% improvement in the transfer time for the later requests. (Other file sizes and delays yield similar results, so long as they overlap. The benefits are comparatively greater for smaller files.) The CM requires an additional RTT (about 75 ms) for the first transfer, because Linux sets its initial congestion window to 2 MTUs instead of 1. This pattern of multiple connections is still quite common in webservers despite the adoption of persistent connections: many browsers open 4 concurrent connections to a server, and many client/server combinations do not support persistent connections. Persistent connections [28] provide similar performance benefits, but suffer from their own drawbacks, which we discuss in Section 6.

4.4 Adaptive Applications

In this section, we demonstrate some of the network-adaptive behaviors enabled by the CM.

Figure 7. Sharing TCP state: the client requests the same file 9 times with a 500 ms delay between request initiations. By sharing congestion information and avoiding slow start, the CM-enabled server is able to provide faster service for subsequent requests, despite a smaller initial congestion window.

As noted earlier, applications that require tight control over data scheduling use the request/callback (ALF) API, and are notified by the CM as soon as they can transmit data. The behavior of an adaptive layering application run across the vBNS using this API is shown in Figure 8. This application chooses a layer to transmit based upon the current rate, but sends packets as rapidly as possible to allow its client to buffer more data. We see that the CM is able to provide sufficient information to the application to allow it to adapt properly to the network conditions.

For self-clocked applications that base their transmitted data upon the bandwidth to the client (such as conventional layered audio servers), the CM rate callback mechanism provides a low-overhead mechanism for adaptation, and allows clients to specify thresholds for the notification callbacks. Figure 9 shows application adaptation using rate callbacks for a connection between MIT and the University of Utah. Here, the application decides which of the four layers it should send based on notifications from the CM about rate changes.

From Figures 8 and 9, we see from the increased oscillation rate in the transmitted layer that the ALF application is more responsive to smaller changes in available bandwidth, whereas the rate callback application relies occasionally on short-term kernel buffering for smoothing. There is an overhead vs. functionality trade-off in the decision of which API to use, given the higher overhead of the ALF API, but applications face a more important decision about the behavior they desire.

Some applications may be concerned about the overhead from receiver feedback. To mitigate this, an application may delay sending feedback; we see this in a minor and inflexible way with TCP delayed ACKs.


Figure 8. Bandwidth perceived by an adaptive layered application using the request callback (ALF) API.

Figure 9. Bandwidth perceived by an adaptive layered application using the rate callback API.

In Figure 10, we see that delaying feedback to the CM causes burstiness in the reported bandwidth. Here, the feedback by the receiver was delayed by min(500 ACKs, 2000 ms). The initial slow start is delayed by 2 s waiting for the application; then the update causes a large rate change. Once the pipe is sufficiently full, 500 ACKs come relatively rapidly, and the normal, though bursty, non-timeout behavior resumes.

5 Discussion

We have shown several benefits of integrated flow management and the adaptation API, and have explored the design features that make the API easy to use. This section describes an optimization useful for busy servers, and discusses some drawbacks and limitations of the current CM architecture.

Figure 10. Adaptive layered application using the rate callback API with delayed feedback (min(500 packets, 2 s)).

Optimizations. Servers with large numbers of concurrent clients are often very sensitive to the overhead caused by multiple kernel boundary crossings. To reduce this overhead, we can batch several sockets into the same cm_request call with the cm_bulk_request call, and likewise for query, notify, and update calls.

By multiplexing control information for many sockets on each CM call, the overhead from kernel crossings is mitigated at the expense of managing more complicated data structures for the CM interface. Bulk querying is already performed in libcm when multiple flows are ready during a single ioctl to determine which flows can send data, but this completes the interface.

Trust issues. Because our goal was an architecture that did not require modifications to receivers, we devised a system where applications provide feedback using the cm_update() call. The consequence of this is that there is a potential for misuse, due to bugs or malice. For example, the CM client could repeatedly misinform the CM about the absence of congestion along a path and obtain higher bandwidth. This does not increase the vulnerability of the Internet to such problems, because such abuse is already trivial. More important are situations where users on the same machine could potentially interfere with each other. To prevent this, the Congestion Manager would need to ensure that only kernel-mediated (e.g., TCP) flows belonging to different users can belong in the same macroflow. Our current implementation does not make an attempt to provide this protection. Savage [37] presents several methods by which a malicious receiver can defeat congestion control. The solutions he proposes can be easily used with the CM; we have already implemented byte-counting to prevent ACK division.

Macroflow construction. When differentiated services, or any system which provides different service to flows between the same pair of hosts, start being deployed, the CM would have to reconsider the default choice of a macroflow. We expect to be able to gain some benefit by including the IP differentiated-services field in deciding the composition of a macroflow.

Finally, we observe that remote LANs are not often the bottleneck for an outside communicator. As suggested in [42, 36] among others, aggregating congestion information about remote sites with a shared bottleneck and sharing this information with local peers may benefit both users and the network itself. A macroflow may thus be extended to cover multiple destination hosts behind the same shared bottleneck link. Efficiently determining such bottlenecks remains an open research problem.

Limitations. The current CM architecture is designed only to handle unicast flows. The problem of congestion control for multicast flows is a much more difficult problem which we deliberately avoid. UDP applications using the CM are required to perform their own loss detection, requiring potential additional application complexity. Implementing the Congestion Manager protocol discussed in [3] would eliminate this need, but remains to be studied.

6 Related Work

Designing adaptive network applications has been an active area of research for the past several years. In 1990, Clark and Tennenhouse [11] advocated the use of application-level framing (ALF) for designing network protocols, where protocol data units are chosen in concert with the application. Using this approach, an application can have a greater influence over deciding how loss recovery occurs than in the traditional layered approach. The ALF philosophy has been used with great benefit in the design of several multicast transport protocols including the Real-time Transport Protocol (RTP) [38], frameworks for reliable multicast [14, 33], and Internet video [24, 35].

Adaptation APIs in the context of mobile information access were explored in the Odyssey system [26]. Implemented as a user-level module in the NetBSD operating system, Odyssey provides API calls by which applications can manage system resources, with upcalls to applications informing them when changes occur in the resources that are available. In contrast, our CM system is implemented in-kernel since it has to manage and share resources across applications (e.g., TCP) that are already in-kernel. This necessitates a different approach to handling application callbacks. In addition, the CM approach to measuring bandwidth and other network conditions is tied to the congestion avoidance and control algorithms, as compared to the instrumentation of the user-level RPC mechanism in Odyssey. We believe that our approach to providing adaptation information for bandwidth, round-trip time, and loss rate complements Odyssey's management of disk space, CPU, and battery power.

The CM system uses application callbacks or upcalls as an abstraction, an old idea in operating systems. Clark describes upcalls in the Swift operating system, where the motivation is a lower layer of a protocol stack synchronously invoking a higher-layer function across a protection boundary [9]. The Mach system used the notion of ports, a generic communication abstraction for fast inter-process communication (IPC). POSIX specifies a standard way of passing "soft real-time signals" that can be used to send a notification to a user-level process, but it restricts the amount of data that can be communicated to a 32-bit quantity.

Event delivery abstractions for mobile computing have been explored in [1], where "monitored" events are tracked using polling and "triggered" events (e.g., PC card insertion) are notified using IPC. This work defines a language-level mechanism based on C++ objects for event registration, delivery, and handling. This system is implemented in Mach using ports for IPC.

Our approach is to use a select() call on a control socket to communicate information between the kernel and user level. The recent work of Banga et al. [4] on improving the performance of this type of event delivery can be used to further improve our performance.
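
In outline, the pattern looks like the following event-loop sketch. Here cm_fd stands for the control-socket descriptor and struct cm_notification for its message format; both are illustrative placeholders, as the actual notification layout of our implementation is not reproduced here.

    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    struct cm_notification {    /* hypothetical message layout */
        int flow_id;
        int event;              /* e.g., "ok to send", rate change */
    };

    void event_loop(int cm_fd, int app_fd)
    {
        for (;;) {
            fd_set rfds;
            FD_ZERO(&rfds);
            FD_SET(cm_fd, &rfds);        /* kernel notifications     */
            FD_SET(app_fd, &rfds);       /* ordinary application I/O */
            int maxfd = (cm_fd > app_fd ? cm_fd : app_fd) + 1;

            if (select(maxfd, &rfds, NULL, NULL, NULL) < 0)
                continue;                /* e.g., EINTR: retry       */

            if (FD_ISSET(cm_fd, &rfds)) {
                struct cm_notification n;
                if (read(cm_fd, &n, sizeof n) == (ssize_t)sizeof n)
                    printf("flow %d: event %d\n", n.flow_id, n.event);
            }
            if (FD_ISSET(app_fd, &rfds)) {
                /* handle normal socket traffic here */
            }
        }
    }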

The Microsoft Winsock implementation is largely callback-based, but there callbacks are implemented as conventional function calls, since Winsock is a user-level library within the same protection boundary as the application [31]. The main reason we did not implement the CM as a user-level daemon is that TCP is already implemented in-kernel in most UNIX operating systems, and it is important to share network information across TCP flows.

Quality-of-service (QoS) interfaces have been explored in several operating systems, including Nemesis [16]. Like the exokernel approach [20] and SPIN [7], Nemesis enables applications to perform as much of the processing as possible on their own using application-specific policy, supported by a set of operating system abstractions different from those in UNIX. Whereas Nemesis treats local network-interface bandwidth as the resource to be managed, we take a more end-to-end approach of discovering the end-to-end performance to different end-hosts, enabling sharing across common network paths. Furthermore, the API exported by Nemesis is useful for applications that can make resource reservations, while the CM API provides information about network conditions. Some "web switches" [?] provide traffic shaping and QoS based upon application information, but do not provide integrated flow management or feedback to the applications creating the data.

Multiple concurrent streams can cause problems for TCP congestion control. First, the ensemble of flows probes more aggressively for bandwidth than a single flow. Second, upon experiencing congestion along the path, only a subset of the connections usually reduce their window. Third, these flows do not share any information with each other. While we propose a general solution to these problems, application-specific solutions have been proposed in the literature. Of particular importance are approaches that multiplex several logically distinct streams onto a single TCP connection at the application level, including Persistent-connection HTTP (P-HTTP [28], part of HTTP/1.1 [12]), the Session Control Protocol (SCP) [39], and the MUX protocol [15]. Unfortunately, these solutions suffer from two important drawbacks. First, because they are application-specific, they require each class of applications (Web, real-time streams, file transfers, etc.) to re-implement much of the same machinery. Second, they cause an undesirable coupling between logically different streams: if packets belonging to one stream are lost, another stream can stall even if none of its own packets are lost, because of the in-order "linear" delivery forced by TCP. Independent data units belonging to different streams are no longer independently processible, and the parallelism of downloads is often lost.
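
The coupling is easy to see in a sketch of such application-level multiplexing. The length-prefixed frame format below is our own illustration rather than the actual wire format of P-HTTP, SCP, or MUX; the point is in the comment: TCP delivers the frames in strict byte order, so a loss that delays one stream's frame also delays every frame queued behind it, whatever stream it belongs to.

    #include <arpa/inet.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>

    /* Send one frame of a logical stream over a shared TCP
     * connection: a 4-byte header (stream id, payload length)
     * followed by the payload.  Because TCP presents bytes in
     * order, a lost segment carrying stream 1's frame stalls a
     * fully received frame of stream 2 that follows it in the
     * byte stream until the retransmission arrives. */
    int send_frame(int tcp_fd, uint16_t stream_id,
                   const void *data, uint16_t len)
    {
        uint8_t hdr[4];
        uint16_t id_n  = htons(stream_id);   /* which logical stream */
        uint16_t len_n = htons(len);         /* payload length       */
        memcpy(hdr,     &id_n,  2);
        memcpy(hdr + 2, &len_n, 2);

        if (write(tcp_fd, hdr, sizeof hdr) != (ssize_t)sizeof hdr)
            return -1;
        return write(tcp_fd, data, len) == (ssize_t)len ? 0 : -1;
    }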

7 Conclusion

The CM system enables applications to obtain an unprecedented degree of control over what they can do in response to different network conditions. It incorporates robust congestion control algorithms, freeing each application from having to re-implement them. It exposes a rich API that allows applications to adapt their transmissions at a fine-grained level, and allows the kernel and applications to integrate congestion information across flows.

Our evaluation of the CM implementation shows that the callback interface is effective for a variety of applications, and does not unduly burden the programmer with restrictive interfaces. From a performance standpoint, the CM itself imposes very little overhead; that which remains is mostly due to the unoptimized nature of our implementation. The architecture of programs implemented using UDP imposes some additional overhead, but the cost of using the CM after this architectural conversion is quite small.

Many systems exist to deliver content over the Internet using TCP or home-grown UDP protocols. We believe that by providing an accessible, robust framework for congestion control and adaptation, the Congestion Manager can help improve both the implementation and performance of these systems.

The Congestion Manager implementation for Linux is available from our web page, http://nms.lcs.mit.edu/projects/cm/.

Acknowledgements

We would like to thank Suchitra Raman and Alex Snoeren for their helpful feedback and suggestions, and the Flux research group at the University of Utah for providing their network testbed (supported by NSF grant ANI-00-82493, DARPA/AFRL grant F30602-99-1-0503, and Cisco). Finally, we would like to thank our shepherd, Peter Druschel, and the anonymous reviewers for their numerous helpful comments. This work was supported by grants from DARPA (Grant No. MDA972-99-1-0014), NSF, IBM, Intel, and NTT Corporation.

References

[1] Badrinath, B. R., and Welling, G. Event Delivery Abstraction for Mobile Computing. Tech. Rep. LCSR-TR-242, Rutgers University, 1995.

[2] Balakrishnan, H., Padmanabhan, V. N., Seshan, S., Stemm, M., and Katz, R. TCP Behavior of a Busy Web Server: Analysis and Improvements. In Proc. IEEE INFOCOM (San Francisco, CA, Mar. 1998).

[3] Balakrishnan, H., Rahul, H. S., and Seshan, S. An Integrated Congestion Management Architecture for Internet Hosts. In Proc. ACM SIGCOMM (Sept. 1999).

[4] Banga, G., Mogul, J. C., and Druschel, P. A scalable and explicit event delivery mechanism for UNIX. In Proc. of the USENIX 1999 Annual Technical Conference (June 1999).

[5] Barrera, J. S. A fast Mach network IPC implementation. In Proc. of the Second USENIX Mach Symposium (Nov. 1991), pp. 1–12.

[6] Berners-Lee, T., Fielding, R., and Frystyk, H. Hypertext Transfer Protocol -- HTTP/1.0. Internet Engineering Task Force, May 1996. RFC 1945.

[7] Bershad, B. N., Savage, S., Pardyak, P., Sirer, E. G., Fiuczynski, M. E., Becker, D., Chambers, C., and Eggers, S. Extensibility, safety, and performance in the SPIN operating system. In Proc. 15th ACM Symposium on Operating Systems Principles (SOSP '95) (Dec. 1995), pp. 267–284.

[8] Braden, B., Clark, D., Crowcroft, J., Davie, B., Deering, S., Estrin, D., Floyd, S., Jacobson, V., Minshall, G., Partridge, C., Peterson, L., Ramakrishnan, K., Shenker, S., Wroclawski, J., and Zhang, L. Recommendations on Queue Management and Congestion Avoidance in the Internet. Internet Engineering Task Force, Apr. 1998. RFC 2309.

[9] Clark, D. The Structuring of Systems Using Upcalls. In Proceedings of the 10th ACM Symposium on Operating Systems Principles (SOSP '85) (Dec. 1985), pp. 171–180.

[10] Clark, D. The Design Philosophy of the DARPA Internet Protocols. In Proc. ACM SIGCOMM (Aug. 1988).

[11] Clark, D., and Tennenhouse, D. Architectural Considerations for a New Generation of Protocols. In Proc. ACM SIGCOMM (Sept. 1990).

[12] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., and Berners-Lee, T. Hypertext Transfer Protocol -- HTTP/1.1. Internet Engineering Task Force, Jan. 1997. RFC 2068.

[13] Floyd, S., and Fall, K. Promoting the Use of End-to-End Congestion Control in the Internet. IEEE/ACM Trans. on Networking 7, 4 (Aug. 1999).

[14] Floyd, S., Jacobson, V., McCanne, S., Liu, C. G., and Zhang, L. A Reliable Multicast Framework for Light-weight Sessions and Application Level Framing. In Proc. ACM SIGCOMM (Sept. 1995).

[15] Gettys, J. MUX protocol specification, WD-MUX-961023. http://www.w3.org/pub/WWW/Protocols/MUX/WD-mux-961023.html, 1996.

[16] Leslie, I., McAuley, D., Black, R., Roscoe, T., Barham, P., Evers, D., Fairbairns, R., and Hyden, E. The design and implementation of an operating system to support distributed multimedia applications. IEEE Journal on Selected Areas in Communications 14, 7 (Sept. 1996), 1280–1297.

[17] Institute of Electrical and Electronics Engineers, Inc. IEEE Standard for Information Technology -- Portable Operating System Interface (POSIX) -- Part 1: System Application Programming Interface (API) -- Amendment 1: Realtime Extension [C Language], 1994. Std 1003.1b-1993.

[18] Jacobson, V. Congestion Avoidance and Control. In Proc. ACM SIGCOMM (Aug. 1988).

[19] Jacobson, V., Braden, R., and Borman, D. TCP Extensions for High Performance. Internet Engineering Task Force, May 1992. RFC 1323.

[20] Kaashoek, M. F., Engler, D. R., Ganger, G. R., Briceño, H. M., Hunt, R., Mazières, D., Pinckney, T., Grimm, R., Jannotti, J., and Mackenzie, K. Application performance and flexibility on exokernel systems. In Proceedings of the 16th ACM Symposium on Operating Systems Principles (SOSP '97) (Saint-Malo, France, Oct. 1997), pp. 52–65.

[21] Karn, P., and Partridge, C. Improving Round-Trip Time Estimates in Reliable Transport Protocols. ACM Transactions on Computer Systems 9, 4 (Nov. 1991), 364–373.

[22] Lepreau, J., Alfeld, C., Andersen, D., and Van Maren, K. A large-scale network testbed. Unpublished, in 1999 SIGCOMM works-in-progress. http://www.cs.utah.edu/flux/testbed/, Sept. 1999.

[23] The MASH Project Home Page. http://www-mash.cs.berkeley.edu/mash/, 1999.

[24] McCanne, S., Jacobson, V., and Vetterli, M. Receiver-driven Layered Multicast. In Proc. ACM SIGCOMM (Aug. 1996).

[25] Microsoft Windows Media Player. http://www.microsoft.com/windows/mediaplayer/.

[26] Noble, B., Satyanarayanan, M., Narayanan, D., Tilton, J., Flinn, J., and Walker, K. Agile Application-Aware Adaptation for Mobility. In Proc. 16th ACM Symposium on Operating Systems Principles (Oct. 1997).

[27] Padmanabhan, V. Addressing the Challenges of Web Data Transport. PhD thesis, Univ. of California, Berkeley, Sept. 1998.

[28] Padmanabhan, V. N., and Mogul, J. C. Improving HTTP Latency. In Proc. Second International WWW Conference (Oct. 1994).

[29] Postel, J. B. User Datagram Protocol. Internet Engineering Task Force, Aug. 1980. RFC 768.

[30] Postel, J. B. Transmission Control Protocol. Internet Engineering Task Force, Sept. 1981. RFC 793.

[31] Quinn, B., and Shute, D. Windows Sockets Network Programming. Addison-Wesley, Jan. 1999.

[32] Ramakrishnan, K., and Floyd, S. A Proposal to Add Explicit Congestion Notification (ECN) to IP. Internet Engineering Task Force, Jan. 1999. RFC 2481.

[33] Raman, S., and McCanne, S. Scalable Data Naming for Application Level Framing in Reliable Multicast. In Proc. ACM Multimedia (Sept. 1998).

[34] Real Networks. http://www.real.com/.

[35] Rejaie, R., Handley, M., and Estrin, D. RAP: An End-to-end Rate-based Congestion Control Mechanism for Realtime Streams in the Internet. In Proc. IEEE INFOCOM (Mar. 1999).

[36] Savage, S., Cardwell, N., and Anderson, T. The Case for Informed Transport Protocols. In Proc. 7th Workshop on Hot Topics in Operating Systems (HotOS VII) (Mar. 1999).

[37] Savage, S., Cardwell, N., Wetherall, D., and Anderson, T. TCP Congestion Control with a Misbehaving Receiver. In ACM Computer Comm. Review (Oct. 1999).

[38] Schulzrinne, H., Casner, S., Frederick, R., and Jacobson, V. RTP: A Transport Protocol for Real-Time Applications. Internet Engineering Task Force, Jan. 1996. RFC 1889.

[39] Spero, S. Session Control Protocol (SCP). http://www.w3.org/pub/WWW/Protocols/HTTP-NG/http-ng-scp.html, 1996.

[40] Stevens, W. R. TCP/IP Illustrated, Volume 1. Addison-Wesley, Reading, MA, Nov. 1994.

[41] Tan, W., and Zakhor, A. Real-time Internet Video Using Error Resilient Scalable Compression and TCP-friendly Transport Protocol. IEEE Trans. on Multimedia (May 1999).

[42] Touch, J. TCP Control Block Interdependence. Internet Engineering Task Force, Apr. 1997. RFC 2140.