
Making a Case for Proactive Flow Control in

Optical Circuit-Switched Networks

M. Kumar1, V. Chaube1, P. Balaji2, W. Feng1, and H.-W. Jin3

1 Dept. of Computer Science, Virginia Tech

{mithil,vineetac,feng}@cs.vt.edu

2 Mathematics and Computer Science Division, Argonne National Laboratory

[email protected]

3 Dept. of Computer Sc. and Engg., Konkuk University
[email protected]

Abstract. Optical circuit-switched networks such as National LambdaRail (NLR) offer dedicated bandwidth to support large-scale bulk data transfer. Though a dedicated circuit-switched network eliminates congestion from the network itself, it effectively “pushes” the congestion to the end hosts, resulting in lower-than-expected throughput. Previous approaches either use an ad-hoc proactive approach that does not generalize well or a sluggish reactive approach where the sending rate is only adapted based on synchronous feedback from the receiver. We address the shortcomings of such approaches using a two-step process. First, we improve the adaptivity of the reactive approach by proposing an asynchronous, fine-grained, rate-based approach. While this approach enhances performance, its limitation is that it is still reactive. Consequently, we then analyze the predictive patterns of load on the receiver and provide strong evidence that a proactive approach is not only possible, but also necessary, to achieve the best performance in dynamically varying end-host conditions.

Key words: rate-based protocol, circuit switched, optical networks, LambdaGrid

1 Introduction

Rapid advances in optical networking are producing increases in bandwidth that have outpaced the geometric increase in semiconductor chip capacity predicted by Moore's law. This means that the burden is now on the end points of communication networks to process the high-bandwidth data being delivered by the network. This trend is more prominent in circuit-switched optical networks, like those found in LambdaGrids [12, 4].


A LambdaGrid is a distributed grid of resources that consists of dedicated high-bandwidth optical networks, computing clusters, and data repositories. Such a distributed supercomputer will enable scientists and engineers to analyze, correlate, and visualize extremely large and remote datasets on-demand and in real time. The dedicated high-bandwidth optical networks found in LambdaGrids translate into no internal network congestion and result in pushing congestion to the end hosts. Thus, the efficiency of a network protocol at an end host is greatly influenced by its ability to adapt its transmission rate to dynamic network and endpoint conditions, thereby minimizing packet loss and maximizing network throughput.

Currently, the computational power of an end host is not sufficient to handle the high network bandwidth that is available in a dedicated circuit-switched network. This puts a cap on the maximum throughput that can be achieved. Additionally, the computational power per process is effectively reduced if there are several processes running on the end host that are competing for CPU resources. Hence, we need to understand end-host contention and come up with effective rate-adaptation techniques, which are critical when networks are fast enough to push congestion to the end hosts.

In this paper, we address the shortcomings of existing approaches in two steps. First, we present an asynchronous, fine-grained, rate-control approach that solves some performance issues but is still reactive in nature. Next, we present our observations from a series of experiments and analyze the predictive patterns of load on the receiver node. Based on our analysis, we provide evidence that a proactive approach is both possible and required in such environments to achieve the best performance in dynamically varying end-host conditions.

2 Background

A rate-based approach is one where a constant sending rate is negotiated between a sender and receiver. Rate-based protocols perform well for high bandwidth-delay product networks. Reliable Blast UDP (RBUDP) [7] is an example of such a protocol, in which the sender transmits data using UDP packets at a rate specified by the user. At the end of data transmission, the receiver sends an acknowledgment via a bitmap of missing packets. The sender then re-transmits the missing packets. The process continues until all packets are received. The mechanism is aggressive and provides reliability, but it is not adaptive to packet loss.
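As a concrete illustration of this blast-and-repair cycle, below is a minimal sketch of an RBUDP-style sender loop. It is our reconstruction, not the authors' code; send_udp() and recv_bitmap() are hypothetical hooks standing in for the UDP data path and the control-channel acknowledgment.

def rbudp_send(packets, send_udp, recv_bitmap):
    """Blast every packet over UDP, then retransmit the missing ones
    until the receiver's bitmap reports nothing absent."""
    missing = set(range(len(packets)))            # nothing delivered yet
    while missing:
        for seq in sorted(missing):
            send_udp(seq, packets[seq])           # paced at the user-specified rate
        bitmap = recv_bitmap()                    # bitmap[seq] is True if seq was lost
        missing = {seq for seq in missing if bitmap[seq]}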

RAPID [1] is an end-host aware, rate-adaptive protocol, where rate adaptation is based on proactive feedback from the receiver. The receiver is monitored by a soft real-time process that attempts to guess the time and duration for which the receive process is context-switched and replaced by another process. It then informs the sender, which suspends transmission to avoid packet loss while the receive process is context-switched and suspended. Although it is a proactive approach, a major drawback of this protocol is that it is difficult to predict the exact time when the receive process will be suspended. It is even more difficult to match the sender's transmission suspension with the receive process' suspend interval, especially because the two nodes are physically separated by a reasonable amount of network delay [2]. Moreover, stopping data transmission completely is a drastic step to take when we aim to keep network utilization high.

Another RBUDP variant, RBUDP+ [3], uses the same scheme but a different estimate of the time and duration for which the receive process has been rescheduled. As with RAPID, however, predicting this time and duration is nearly impossible.

An enhancement of RAPID and RBUDP+ is RAPID+ [2], which focuses on dynamically monitoring the packet loss at the receiving end host, so that it can be used to adapt the sending rate. Accurate prediction of when packet loss would occur increases performance and circuit utilization. However, in this case, rate adaptation is initiated only after most of the losses have already occurred.

The Simple Available Bandwidth Utilization Library (SABUL) [6] is a rate-based as well as window-based protocol that has been designed for data-intensive applications over a shared network. Its delay- and window-based congestion control makes it TCP-friendly but brings along the same characteristics of TCP that make it inefficient when congestion has been pushed to the endpoints.

The Group Transport Protocol (GTP) [15] is a receiver-driven protocol that performs well in multipoint-to-point and multipoint-to-multipoint environments and ensures fairness between different connections on the same end host. Like other rate-based protocols, SABUL and GTP wait for packet loss to occur before providing feedback to the sender to transmit at revised sending rates.

TCP does not perform well for large bandwidth-delay product networks, as found in LambdaGrids, because of the overhead involved in congestion and flow control. In order to improve TCP performance, significant research has been done to improve TCP congestion control [16, 13, 10, 9]. Other complementary research has been done to improve flow control [11, 14, 5]. However, none of these improvements or variants of TCP were designed for networks with nearly zero congestion. In this paper, we use UDP because it is lighter and faster and can be enhanced to provide reliability without worrying about network congestion.

3 Asynchronous Fine-Grained Rate Control

In this section, we introduce our fine-grained, rate-based control protocol called ASYNCH and compare its performance with existing rate-based protocols.

3.1 Basic Idea

Many existing rate-based protocols, such as RAPID+, SABUL, and GTP, use reactive rate control to avoid end-host congestion, where the sending rate is varied only after an event, such as a packet drop, occurs. Further, these approaches use a coarse-grained feedback mechanism. Specifically, they utilize multi-round communication: in the first round, the sender attempts to send data at the maximum rate; any dropped packets are then sent at a slower pace in the second round. These rounds continue until all the data has been successfully communicated.

While such an approach is simple, it has several performance implications. First, a coarse-grained feedback approach, such as that described above, pays a high price for inaccuracy. For example, if a receiver is only capable of receiving data at 5 Gbps, a sender transmitting a 10-GB file at 10 Gbps would end up dropping half the packets (5 GB). The sender, however, would receive feedback about its excessive sending rate only at the end of the first communication round, i.e., after the entire file has been sent out. In the second round, the remaining 5 GB has to be retransmitted. Thus, an inaccurate sending rate (in this case, 10 Gbps) can result in a major loss of performance, which is expected to worsen further as the amount of data being communicated increases.
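The cost of this inaccuracy can be made concrete with a back-of-the-envelope calculation; the numbers below simply restate the example above.

# Coarse-grained feedback in the example above: a 10-GB file blasted at
# 10 Gbps toward a receiver that can only absorb 5 Gbps.
file_gb, send_gbps, recv_gbps = 10, 10, 5

blast_time = file_gb * 8 / send_gbps        # 8 s to emit the entire file
received   = blast_time * recv_gbps / 8     # only 5 GB survive at the receiver
wasted     = file_gb - received             # 5 GB transmitted for nothing

print(f"{wasted:.0f} GB ({100 * wasted / file_gb:.0f}% of the file) is dropped; "
      f"the sender learns this only after {blast_time:.0f} s.")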

Second, a reactive approach is fundamentally limited in its ability to handle dynamic environments where the receiving end host is executing other processes. By the time the sender receives feedback for rate adaptation, the receiver's capability to receive data might have already changed.

In order to analyze the behavior of the receiver end host and its dynamics, we implemented an asynchronous, fine-grained, reactive rate-control protocol named ASYNCH. This protocol, while still reactive, addresses the issue of coarse-grained feedback; it asynchronously sends feedback to the sender at regular intervals of time, instead of waiting for the entire file to be transferred before sending the feedback. This mechanism allows feedback to be independent of the file size. In contrast, with synchronous feedback, as in RAPID+, the receiver sends feedback only at the end of each communication round.
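A minimal sketch of the receiver side of such asynchronous feedback is shown below. The 10-ms interval and the two-counter message format are our assumptions for illustration; the paper does not specify them. ctrl_sock is assumed to be a connected TCP control socket.

import struct
import threading

FEEDBACK_INTERVAL = 0.010   # assumed: report every 10 ms, regardless of file size

def start_feedback_loop(counters, ctrl_sock):
    """Periodically push (packets received, packets lost) to the sender,
    decoupling feedback frequency from the length of the transfer."""
    def tick():
        report = struct.pack("!QQ", counters["received"], counters["lost"])
        ctrl_sock.sendall(report)                         # over the TCP control channel
        threading.Timer(FEEDBACK_INTERVAL, tick).start()  # re-arm the timer
    tick()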

In ASYNCH, upon receiving feedback, the sender calculates the current loss rate. If an increase in loss rate (compared to the last calculated value) is detected, the sending rate is decreased. The sender considers at least two feedback messages from the receiver when deciding what kind of rate adaptation is required. If a high loss rate has been observed in only one round, it is treated as a temporary loss, and no action is taken. If the receiver does not experience any packet loss for k successive rounds, then the sender will increase its sending rate. This reactive rate adaptation is expected to reduce any further loss at the receiver.
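The sender-side rule can be sketched as follows. The multiplicative adjustment factors and the value of k are illustrative assumptions; the paper does not give concrete values.

K_CLEAN_ROUNDS = 3               # assumed k: loss-free feedbacks before speeding up
DECREASE, INCREASE = 0.9, 1.05   # assumed adjustment factors

class RateController:
    """Reactive adaptation: back off only when two consecutive feedbacks show
    rising loss (a single spike counts as temporary); speed up after k clean rounds."""
    def __init__(self, rate_mbps):
        self.rate = rate_mbps
        self.prev_loss = 0.0
        self.rising = 0          # consecutive feedbacks with increasing loss
        self.clean = 0           # consecutive feedbacks with zero loss

    def on_feedback(self, loss_rate):
        if loss_rate > self.prev_loss:
            self.rising += 1
            self.clean = 0
            if self.rising >= 2:             # confirmed trend, not a one-off spike
                self.rate *= DECREASE
        elif loss_rate == 0.0:
            self.rising = 0
            self.clean += 1
            if self.clean >= K_CLEAN_ROUNDS:
                self.rate *= INCREASE
                self.clean = 0
        else:                                # nonzero but not rising: hold steady
            self.rising = 0
            self.clean = 0
        self.prev_loss = loss_rate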

3.2 Performance Evaluation

We evaluate the performance of ASYNCH and RAPID+ under various conditions on the receiver end host and analyze the results. We first show how throughput degrades when the sending rate increases beyond the receiver's capacity to receive. We then explain the roles played by the round-trip time (RTT) latency and the receiver's socket buffer size in packet loss rate and network throughput.

Methodology The testbed is a three-node sender-receiver setup with the middle node acting as a wide-area network (WAN) emulator using Netem [8]. Unless otherwise specified, all experiments use a file size of 1 GB, an RTT of 56 ms, and an MTU of 9000 bytes. This setup and configuration were chosen to emulate a real dedicated circuit-switched, long-fat network. In order to obtain consistent results, we bind the receive process to the same core for our experiments. The system configuration is summarized in Table 1.

Table 1. System Configuration

Processor Dual-core AMD Opteron 2218

Cache Size 1024 KB

RAM 4 GB

Network Adaptor Myrinet 10 Gbps

Kernel 2.6.18

High Network Speed One of the primary reasons for packet drops is the discrepancy in the protocol processing requirements for data sending and receiving. Specifically, data transmission with TCP/IP or UDP/IP is typically a lighter-weight operation than data reception, owing to various optimizations, such as integrated checksum-and-copy, and the lack of data demultiplexing requirements that are necessary on the receiver side. In our experiments with a 10 Gbps network, the sender, for example, is able to transmit packets at 7 Gbps. However, the receiver is only able to receive data at 5.5 Gbps.

[Figure] Fig. 1. Network throughput degrades after the sending rate reaches an optimal point (throughput in Mbps vs. sending rate in Mbps, for RAPID+ and ASYNCH)

Figure 1 demonstrates this behavior. For sending rates less than 5.5 Gbps, the achieved throughput is about the same as the sending rate, and there is no packet loss. Beyond 5.5 Gbps, however, there is a sharp rise in the packet loss rate, resulting in a decline in throughput. Unlike RAPID+, ASYNCH utilizes a fine-grained feedback mechanism to adapt its rate quickly, resulting in up to 58% better throughput than RAPID+. For a sending rate of 7 Gbps, the loss rate for ASYNCH is only 9.5%, compared to 21.49% for RAPID+.

For the rest of the experiments, we are interested in studying the capabilities of the receiver end host; we therefore use a peak sending rate of 5.5 Gbps.

Socket Buffer Size of the Receiver UDP/IP communication takes place through socket buffers. Received data is stored in the socket buffer until the application reads it. Thus, while a large socket buffer can provide more tolerance for a receiver's slow receiving capability, it can result in memory wastage. Accordingly, an optimal buffer size needs to be chosen to balance memory wastage and packet loss.
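For reference, requesting a larger UDP receive buffer is a one-line socket option. The 2-MB figure below is illustrative, and on Linux the kernel caps the grant at net.core.rmem_max, so the granted size should be checked.

import socket

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, 2 * 1024 * 1024)  # ask for 2 MB
granted = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(f"kernel granted {granted} bytes")   # Linux reports roughly double the request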

Figure 2 shows the effect of the socket buffer size on loss rate. For very small buffer sizes (10-100 KB), substantial packet loss occurs, resulting in poor throughput. For buffer sizes larger than 100 KB, ASYNCH's loss rate drops to about 5-7% or less. While the loss rate for RAPID+ also decreases with the socket buffer size, we notice that ASYNCH's loss rate is much smaller than that of RAPID+ for all buffer sizes.

[Figure] Fig. 2. Effect of socket buffer size on loss rate (loss rate in % vs. buffer sizes from 0.01 MB to 2.5 MB, for RAPID+ and ASYNCH)

Round-Trip Time Unlike TCP, UDP data transfers are largely insensitive to RTT, owing to the lack of congestion- and flow-control mechanisms in UDP. However, to achieve reliability, both RAPID+ and ASYNCH use a TCP control channel; the sender waits for packet-loss feedback from the receiver. This feedback mechanism directly depends on the RTT and causes the throughput to drop with increasing RTT. Figure 3 illustrates this relationship: for both RAPID+ and ASYNCH, the throughput drops by about 65 Mbps for every 20-ms increase in RTT.

[Figure] Fig. 3. Effect of RTT on throughput (throughput in Mbps vs. RTT from 0.1 ms to 160 ms, for RAPID+ and ASYNCH)

4 A Case for Proactive Rate Control

Here we analyze the receiver end host in environments where a number of competing processes are scheduled on the receiver node.

4.1 Effect of Load on the Receiver End-Host

In environments where a number of competing processes are scheduled, the receiver's network process must compete with the other processes on the end host for CPU resources. The competing processes may be of various types and can be I/O-bound or CPU-bound.

We use four types of loads for our experiments. The first is a purely computational workload that runs entirely in user space. The second is a system-call load that reads and writes data to a flat file. The third is a memory-intensive load that reads data from a 1-GB buffer. Finally, we have a network load that acts as a forwarder for packets received by the receiver process. We use a four-node setup; two of the nodes perform the actual communication, one node acts as a delay node emulating a high-latency network, and the fourth node acts as a gateway, where the receive and forward processes are running. The gateway node performs some computation on every received packet and then forwards it to the final recipient of the data.

With the exception of the network load, all loads run on the same processor core as the receive process. For the network load, we run the receiver on P1C2 (Processor 1, Core 2) and all forwarder threads on P1C1, since cores on the same processor share cache and memory. This helps obtain maximum end-to-end throughput.

Figures 4 and 5 compare the loss rate and the throughput, respectively, of RAPID+ and ASYNCH with increasing load. For the purely computational load, ASYNCH quickly adapts its sending rate, thus avoiding any severe packet loss. However, neither the loss rate nor the throughput follows any fixed pattern with increasing load.

[Figure] Fig. 4. Effect of various loads on loss rate (loss rate in % vs. number of loads, for computational, system-call, memory, and network loads under RAPID+ and ASYNCH)

For the system-call load, the average loss rate is much higher for both ASYNCH and RAPID+. This is attributed to the higher priority assigned by the operating system to system-call workloads. Consequently, fewer CPU resources are allotted to the receiver process, resulting in an increased loss rate. The memory-intensive load, on the other hand, demonstrates behavior that is better than the system-call workload but worse than the compute workload.

For the network load, the throughput, as seen by the receiver process, is not affected significantly by the increase in the number of forwarder threads, mainly because the forwarder is running on a separate core. However, the end-to-end throughput (not shown in the figure) is reduced, since the forwarder threads, all running on the same core, compete amongst themselves for CPU resources.

[Figure] Fig. 5. Effect of various loads on throughput (throughput in Mbps vs. number of loads, for computational, system-call, memory, and network loads under RAPID+ and ASYNCH)


4.2 Loss Patterns

Figure 6 shows the pattern of packet loss when data is transmitted with rate adaptation disabled. We disable rate adaptation in this experiment in order to observe the loss patterns, which help us understand whether the current reactive approaches are able to adapt precisely during intervals of packet loss.

[Figure] Fig. 6. Loss patterns for computational load and network load (loss rate in % over time in ms)

The sharp spikes for the pure computational load reveal that all packets in a small interval of time are lost. Similar spikes were observed for the system-call and memory-intensive loads, although they are not shown in the figure. When rate adaptation is enabled, ASYNCH and RAPID+ interpret such a spike as a heavy loss rate and wrongly adapt the sending rate. In general, the presence of spikes in any loss pattern is likely to convey a wrong signal to the reactive rate-adaptation algorithm. The loss pattern for a network load has no spikes and flattens at the top. Thus, reactive rate adaptation is expected to work in this case, as the future loss pattern is likely to remain the same as the current loss pattern.

4.3 Reactive versus Proactive Approach

Based on the loss patterns described above, it is clear that a reactive approach is not suitable for adapting the rate to dynamic conditions at the receiver end host. A reactive approach will work if the conditions at the receiver are static, resulting in steady loss patterns, e.g., the loss patterns caused by another network load on the receiver. On the other hand, a proactive approach can potentially predict the time when a loss event will occur and take the necessary action to prevent packet drops. However, designing a proactive approach is a difficult problem that requires understanding of the way an operating system scheduler handles processes.


Fig. 7. Timeslice consumption of processes showing intervals when each of them is currently scheduled for execution

Figure 7 shows the timeslices awarded to the receive process and a memory-intensive load process. The exact times when packet loss occurs are marked in the figure. These points are always located in the same time interval in which the load process is running, depriving the network process of CPU resources and thus causing it to drop packets. In other words, packet losses occur exclusively because the receive process gets rescheduled and is replaced by the competing load process.

4.4 A Proactive Approach

In designing a proactive approach, we need to estimate in advance when the receive process will be rescheduled and replaced by another process for execution. We will refer to this time as the context-switch time, since this rescheduling essentially involves a context switch. There are two approaches for predicting when a context switch for the receive process will occur, as discussed below.

Polling Dynamic Priority Polling the dynamic priority of the receive process to estimate the context-switch time has been used by Banerjee et al. [1]. The idea is to note the dynamic priority of the receive process when a loss actually occurred. During polling, if the dynamic priority reaches one less than the previously noted value, the sender is notified to suspend transmission for a duration proportional to the average sleep time of the process. However, as seen in Figure 8 for a memory-intensive load, the dynamic priority of the receive process takes only three values, and just prior to the process getting context-switched, there are no changes to the dynamic-priority value. This behavior has been observed for other kinds of loads as well; therefore, this approach is not reliable.
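For concreteness, the polling idea can be sketched as below, assuming the Linux 2.6-era /proc interface in which field 18 of /proc/<pid>/stat holds the task's dynamic priority. The poll interval and the notify_sender() hook are hypothetical.

import time

def poll_dynamic_priority(pid, loss_priority, notify_sender, interval=0.001):
    """Poll the receive process's dynamic priority and warn the sender when it
    reaches one less than the value previously recorded at a loss event."""
    with open(f"/proc/{pid}/stat") as f:
        while True:
            f.seek(0)
            rest = f.read().rsplit(")", 1)[1].split()  # fields after the comm field
            priority = int(rest[15])                   # field 18 overall: priority
            if priority == loss_priority - 1:
                notify_sender()                        # ask sender to suspend briefly
            time.sleep(interval)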


[Figure] Fig. 8. Sleep average and dynamic priority for a memory-access-intensive process (sleep average in ms and dynamic priority, over time in ms)

Polling Sleep Average Figure 8 shows the average sleep time of the receive process. We see that the average sleep time of a process follows a saw-tooth pattern: a context switch happens whenever the average sleep time reaches a local minimum. Because of this uniform and periodic pattern, it is possible to estimate the time of an upcoming context switch.

If at any instant of time we know the maximum and minimum average sleep times (given by MAX_SLEEP_AVG and MIN_SLEEP_AVG), we know that the process started its execution with its average sleep time equal to MAX_SLEEP_AVG and will be rescheduled when its average sleep time reaches MIN_SLEEP_AVG. We take action when the average sleep time reaches (MAX_SLEEP_AVG + MIN_SLEEP_AVG)/2. We must constantly update the maximum and minimum values, since they are likely to change.
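A minimal sketch of this trigger is given below. How the average sleep time is obtained is left abstract: read_sleep_avg() is a hypothetical hook (e.g., an instrumented kernel exporting the scheduler's per-task sleep_avg), since stock kernels do not export it directly.

import time

def midpoint_trigger(read_sleep_avg, notify_sender, interval=0.001):
    """Track the saw-tooth's running extremes and fire at their midpoint,
    ahead of the context switch expected at the local minimum."""
    max_avg = min_avg = read_sleep_avg()
    armed = True
    while True:
        s = read_sleep_avg()
        max_avg = max(max_avg, s)        # constantly update MAX_SLEEP_AVG
        min_avg = min(min_avg, s)        # constantly update MIN_SLEEP_AVG
        if armed and s <= (max_avg + min_avg) / 2:
            notify_sender()              # act before the expected context switch
            armed = False
        elif s > (max_avg + min_avg) / 2:
            armed = True                 # re-arm as the saw-tooth climbs again
        time.sleep(interval)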

5 Conclusion and Future Work

In this paper, we presented an asynchronous, feedback-based, reactive rate-control protocol called ASYNCH that features a fine-grained rate-control mechanism. Our protocol effectively solves some of the problems faced by current rate-based protocols that adapt the sending rate, leading to more accurate rate adaptation and therefore higher throughput. We also analyzed end-host behavior in dynamic environments and made a case for a proactive protocol, which is better suited to handling such environments.

References

1. A. Banerjee, W. Feng, B. Mukherjee, and D. Ghosal. RAPID: An End-System Aware Protocol for Intelligent Data Transfer over LambdaGrids. In 20th IEEE International Parallel and Distributed Processing Symposium, 2006.

2. P. Datta, W. Feng, and S. Sharma. End-System Aware, Rate-Adaptive Protocol for Network Transport in LambdaGrid Environments. In ACM/IEEE Supercomputing 2006, 2006.

3. P. Datta, S. Sharma, and W. Feng. A Feedback Mechanism for Network Scheduling in LambdaGrids. In IEEE International Symposium on Cluster Computing and the Grid, 2006.

4. T. DeFanti, C. Laat, J. Mambretti, K. Neggers, and B. Arnaud. TransLight: A Global-Scale LambdaGrid for e-Science. Communications of the ACM, pages 34-41, 2003.

5. M. Fisk and W. Feng. Dynamic Adjustment of TCP Window Sizes. Los Alamos Unclassified Report 00-3321, 2000.

6. R.L. Grossman, M. Mazzucco, H. Sivakumar, Y. Pan, and Q. Zhang. Simple Available Bandwidth Utilization Library for High-Speed Wide Area Networks. Journal of Supercomputing, 34(3):231-242, 2005.

7. E. He, J. Leigh, O. Yu, and T.A. DeFanti. Reliable Blast UDP: Predictable High Performance Bulk Data Transfer. In IEEE International Conference on Cluster Computing, pages 317-324, 2002.

8. S. Hemminger. Network Emulation with NetEm. In Australia's National Linux Conference (LCA '05), 2005.

9. U. Hengartner, J. Bolliger, and T. Gross. TCP Vegas Revisited. In IEEE INFOCOM, volume 3, pages 1546-1555, March 2000.

10. D. Katabi, M. Handley, and C. Rohrs. Congestion Control for High Bandwidth-Delay Product Networks. Computer Communication Review, 32(4):89-102, 2002.

11. J. Semke, J. Mahdavi, and M. Mathis. Automatic TCP Buffer Tuning. In ACM SIGCOMM '98, pages 315-323, New York, NY, USA, 1998.

12. N. Taesombut and A.A. Chien. Distributed Virtual Computers (DVC): Simplifying the Development of High Performance Grid Applications. In IEEE International Symposium on Cluster Computing and the Grid, pages 715-722, April 2004.

13. K. Tan, J. Song, Q. Zhang, and M. Sridharan. A Compound TCP Approach for High-Speed and Long Distance Networks. In 25th IEEE International Conference on Computer Communications, pages 1-12, 2006.

14. E. Weigle and W. Feng. Dynamic Right-Sizing: A Simulation Study. In Computer Communications and Networks, pages 152-158, 2001.

15. R.X. Wu and A.A. Chien. GTP: Group Transport Protocol for LambdaGrids. In IEEE International Symposium on Cluster Computing and the Grid, pages 228-238, 2004.

16. L. Xu, K. Harfoush, and I. Rhee. Binary Increase Congestion Control (BIC) for Fast Long-Distance Networks. In IEEE INFOCOM, volume 4, pages 2514-2524, Hong Kong, China, 2004.

The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory (“Argonne”). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up, nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.