
Reservation-Based Packet Buffers with Deterministic Packet Departures

Hao Wang, Student Member, IEEE, and Bill Lin, Member, IEEE

The authors are with the Department of Electrical and Computer Engineering, University of California, San Diego, La Jolla, CA 92093. E-mail: {wanghao, billlin}@ucsd.edu.

Abstract—High-performance routers need to temporarily store a large number of packets in response to congestion. DRAM is typically needed to implement large packet buffers, but the worst-case random access latencies of DRAM devices are too slow to match the bandwidth requirements of high-performance routers. Existing DRAM-based architectures for supporting linespeed queue operations can be classified into two categories: prefetching-based and randomization-based. Both are based on interleaving memory accesses across multiple parallel DRAM banks to achieve higher memory bandwidth, but they differ in their packet placement and memory operation scheduling mechanisms. In this paper, we describe novel reservation-based packet buffer architectures with interleaved memories that take advantage of known packet departure times to achieve simplicity and determinism. The number of interleaved DRAM banks required to implement the proposed packet buffer architectures is independent of the number of logical queues, yet the proposed architectures can achieve the performance of an SRAM implementation. Our reservation-based solutions are scalable to the growing packet storage requirements of routers while matching increasing line rates.

Index Terms—Buffer memories, packet switching, random access memories, deterministic algorithms


1 INTRODUCTION

HIGH-PERFORMANCE routers need to temporarily store a large number of packets at each linecard in response to congestion. One of the most difficult challenges in designing high-performance routers is the implementation of packet buffers that operate at extremely fast line rates. That is, at every time slot, a newly arriving packet may be written to the packet buffer, and an existing packet may be read from the packet buffer for departure. To provide sufficient packet buffering at times of congestion, a common buffer sizing rule is B = RTT × C, where B is the size of the buffer, RTT is the average round-trip time of a flow passing through the link, and C is the line rate [3]. At an average round-trip time of 250 ms on the Internet [4] and a line rate of 40 Gb/s (OC-768), this rule of thumb translates to a memory requirement of over 1.25 GB at each linecard, which clearly makes an SRAM implementation impractical. While some research shows that this rule is an overestimate of the total buffer size [5] under certain traffic assumptions, other results such as [6] show that even larger buffers may be required under different traffic patterns. The state-of-the-art commercial SRAM holds 36 Mb [7], with a random access latency below 4 ns and a power consumption of 1.8 W. A packet buffer using only SRAM would require 285 SRAM devices and consume approximately 513 W of power. Moreover, the number of SRAM devices required and the power consumption grow linearly as the line rate grows. On the other hand, DRAMs have much larger capacity and consume significantly less power. The state-of-the-art commercial DRAM has 4 Gb of storage [8]. It has a random access latency of 36 ns and consumes only 1.76 W of power. A packet buffer using only DRAM would require merely three DRAM devices and consume approximately 5.28 W of power. However, the worst-case random access times of DRAM are not fast enough to match line rates of 40 Gb/s or beyond, making a naïve DRAM-based solution inadequate. Therefore, researchers have explored prefetching-based [9], [10] and randomization-based [11], [12] memory systems that aim to provide the memory access speeds of SRAM with the density of DRAM. These architectures can be used to implement a packet buffer with logically separate FIFO queues.
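As a sanity check, the sizing arithmetic above works out as follows (a worked calculation; the quoted count of 285 SRAM devices follows if the 1.25 GB figure is interpreted with binary prefixes, i.e., GB = 2^30 bytes and Mb = 2^20 bits):

```latex
% Buffer size from the rule of thumb:
B = RTT \times C = 250\,\mathrm{ms} \times 40\,\mathrm{Gb/s}
  = 10^{10}\,\mathrm{bits} \approx 1.25\,\mathrm{GB}.
% SRAM device count and power (36 Mb and 1.8 W per device):
\left\lceil \frac{1.25 \times 2^{33}\,\mathrm{bits}}{36 \times 2^{20}\,\mathrm{bits}} \right\rceil = 285
\;\Rightarrow\; 285 \times 1.8\,\mathrm{W} = 513\,\mathrm{W}.
% DRAM device count and power (4 Gb and 1.76 W per device):
\left\lceil \frac{1.25 \times 2^{33}\,\mathrm{bits}}{4 \times 2^{30}\,\mathrm{bits}} \right\rceil = 3
\;\Rightarrow\; 3 \times 1.76\,\mathrm{W} = 5.28\,\mathrm{W}.
```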

For example, virtual output queues (VOQs) are used in crossbar routers, where each VOQ corresponds to a logical FIFO queue for buffering packets to a particular output port. To enable the random servicing of VOQs at SRAM speeds, the prefetching-based solutions described in [9], [10] employ hybrid SRAM/DRAM designs. These architectures support linespeed queue operations by aggregating and prefetching packets for parallel DRAM transfers using fast SRAM caches. However, these architectures require complex memory management algorithms for real-time sorting and a substantial amount of SRAM for caching the head and tail portions of the logically separated queues to handle worst-case access patterns. On the other hand, randomization-based architectures [11], [12] are based on a random placement of packets so that the memory loads across the DRAM banks are balanced. While effective, these architectures only provide statistical guarantees.

Although the assumption of servicing VOQs in random order is reasonable for crossbar-based router architectures, this assumption is unnecessarily general for several important router architectures. In particular, deterministic packet service models exist for those architectures where the departure times of packets from packet buffers can be deterministically calculated exactly in advance, before the packet is inserted into a packet buffer, when best-effort routing is considered. One such router architecture is the switch-memory-switch router architecture [13], [14], [15], [16], [17]. This architecture can efficiently mimic an output queueing switch. The switch-memory-switch architecture is based on a set of physically separated packet buffers that are sandwiched between two crossbar switches. When a new packet arrives at an input port, the departure time of the packet is deterministically calculated to exactly mimic the packet departure order of an output queueing switch. A matching problem is then solved to select a packet buffer to store the packet. When the packet is inserted into a packet buffer, its departure time has already been decided. Implementations of such a router architecture include the M-series Internet core routers from Juniper Networks [18] and storage switches from Brocade [19].

Another router architecture is the load-balanced router architecture [20], [21], [22], [23]. This architecture has been shown to have interesting scalability properties and throughput guarantees. The load-balanced router architecture is also based on a set of physically separated packet buffers that are sandwiched between two switches. However, in contrast to the switch-memory-switch architecture, there is no central scheduler, and the two switches have a fixed configuration that is independent of the arrival traffic. Instead, in this architecture each packet buffer maintains a set of VOQs that are serviced deterministically in round-robin order. Therefore, the departure time of a packet from its packet buffer can also be calculated deterministically in advance of writing the packet into the packet buffer. Recently, with the advancement of cloud computing on the Internet, data centers have become popular hosts for many applications. Load-balanced routers have been widely deployed in such data centers to distribute traffic across server racks for optimal resource utilization [22], [23].

In this paper, we propose reservation-based packet buffer architectures. Our new architectures require a simple deterministic memory management scheme that exploits the known departure times of packets to achieve the performance of an SRAM packet buffer implementation, by using multiple DRAM modules in an interleaved manner. With interleaved memories, a high effective memory bandwidth is achieved by accessing K slower memories independently. Using the known departure times of packets that need to be stored in the packet buffers, we can deterministically assign packets to interleaved DRAM modules in such a manner that access conflicts are avoided and packets are guaranteed to be retrieved before their departure times. Our proposed packet buffer architectures can support an arbitrary number of logical FIFO queues, with the only assumption being that the departure times of packets can be determined upon packet arrival. We prove that the number of interleaved DRAM modules required is a constant with respect to the number of logical queues. A simple bypass scheme is employed using a small amount of SRAM to buffer packets whose departure times are less than the round-trip latency of writing to and reading from a DRAM bank. The size of this SRAM is again a constant independent of the number of logical queues. Although interleaved memories have been used in previously proposed fast packet buffer schemes [9], [10], [11], [12], [24], [25], our proposed techniques take advantage of known departure times to achieve simplicity and determinism.

We first describe a frame-sized reservation-based approach, an early summary of which was presented in [2]. Although this approach is effective, it has one drawback: the size of the bitmap grows linearly with the size of the packet buffer, requiring a potentially non-trivial amount of SRAM for bookkeeping. To reduce the memory requirement, we further present an efficient block-sized reservation-based packet buffer architecture, based on the concept of blocks, that supports deterministic packet departures; a much shorter version of the present work was presented in [1]. This refined solution aggregates packets into blocks so that the amount of bookkeeping information in SRAM is minimized. In particular, the total size of the reservation table grows only logarithmically with the total size of the packet buffer, which results in an order of magnitude reduction in the SRAM requirement for current implementations. For both schemes, we prove that the required number of interleaved DRAM banks is only a small constant independent of the arrival traffic patterns, the number of flows, and the number of priority classes. Therefore, the reservation-based designs are scalable to the growing packet storage requirements of routers while matching increasing line rates. Furthermore, we provide a variation of our designs, called the randomized block-sized packet buffer, that achieves the performance of randomization-based schemes by sacrificing reliability to further reduce the SRAM requirement.

The rest of the paper is organized as follows. In Section 2, we review related work on memory designs. In Section 3, we describe the packet buffer abstraction. In Section 4, we present the architecture and operation of the frame-sized reservation-based packet buffer. In Section 5, we present the architecture and operation of the block-sized reservation-based packet buffer. In Section 6, we present the architecture of the randomized block-sized packet buffer. The performance analysis of these architectures is presented in Section 7. Finally, we conclude in Section 8.

2 RELATED WORK

The problem of designing memory systems for different network applications has been addressed in many past research efforts. Several solutions have been offered to build memory systems specialized to serve as packet buffers. In an Internet router, packet buffers must operate at wire speed and provide bulk storage. As link speeds increase, the sizes of packet buffers have been increasing as well, which makes a direct DRAM implementation infeasible at wire speed. To tackle the problem, several packet buffer designs based on hybrid SRAM/DRAM architectures or memory-interleaved DRAM architectures have been proposed [9], [10], [11], [12], [26], [27], [28], [29]. Simple memory interleaving solutions are proposed in [26], [27], [28], where bank-specific techniques are developed. In these schemes, memory locations are carefully mapped to memory banks using a certain pseudo-random function, so that when memory accesses are evenly distributed across different memory banks, the number of bank conflicts is minimized. The effectiveness of these solutions relies heavily upon how the arriving memory accesses are distributed across memory banks, which greatly restricts the allowable memory access patterns and limits the applications of the schemes. For example, in a high-speed packet buffer supporting millions of flows, the order in which packets are retrieved from memory banks is typically determined by a packet scheduler implementing a specific queueing policy, which may cause packets to be retrieved in an unbalanced fashion and lead to dramatic degradation of system performance.

Advanced memory interleaving schemes share the idea of utilizing multiple DRAM banks to achieve SRAM throughput while reducing [11], [12], [29] or eliminating [9], [10] potential bank conflicts. Previous designs can be categorized into two types. The first type is the prefetching-based packet buffer design [9], [10]. Prefetching-based architectures support linespeed queue operations by aggregating and prefetching packets for parallel DRAM transfers using fast SRAM caches. These packet buffer architectures assume a very general model in which a buffer consists of many logically separated FIFO queues that may be accessed in random order. However, these architectures require complex memory management algorithms based on real-time sorting and a substantial amount of SRAM for caching the head and tail portions of the logically separated queues to handle worst-case access patterns. The other type is the randomization-based packet buffer design. The architectures proposed in [11], [12] are based on random placement of packets into different memory banks so that memory loads across DRAM banks are balanced. These architectures also support a very general queueing model in which logically separated queues may be accessed in random order. Various memory management algorithms are proposed to handle packets pending access to a memory bank. In-order packet departures across different flows are provided in [12], while in [11] there is no such guarantee. In general, while effective, they only provide statistical guarantees, as there are deterministic packet sequences that can cause the system to overflow.

In [21], a number of crossbar-based single-buffered (SB) routers are analyzed. The sizes of the middle-stage memories for routers with centralized shared memory are investigated. It is shown that for several such routers to emulate the behavior of an FCFS shared memory router, the total memory bandwidth of the middle-stage memories has to be between 2NR and 4NR, where N is the number of input-output pairs in the router and R is the line rate. In this paper, we adapt the Constraint Sets method described in [21], which is essentially the pigeonhole principle, to the problem of designing deterministic packet buffers inside a single linecard at high line rates for a large number of active flows. More specifically, in this paper we present novel reservation-based architectures where a reservation table serves as a memory management module to select one bank out of all available banks to store an arriving packet such that it will not cause any conflicts at the time of packet arrival or departure. In order to avoid memory conflicts, these architectures need to use about three times the number of DRAM banks, which is a moderate increase considering the extremely low price of current DRAM banks.¹ Moreover, the number of interleaved DRAM banks required to implement the proposed packet buffer architecture is independent of the number of logical queues, yet the proposed architecture can achieve the performance of an SRAM implementation and scale to growing packet storage requirements in routers while matching increasing line rates.

1. As of this writing, 2 GB of DRAM (2 Gb/bank × 8 banks) costs under $20, i.e., over 100 MB/$.

3 SYSTEM MODEL—PACKET BUFFER ABSTRACTION

In this section, we describe the packet buffer system model. First, we introduce a packet buffer abstraction. Throughout this paper we assume that all incoming variable-size packets are segmented into fixed-size packets/cells, or simply packets/cells, and reassembled when leaving the router, which is a function supported by most current routers. We also assume that time is slotted, where a time slot is defined to be the amount of time it takes for a (fixed-size) packet/cell to arrive (or depart) at a given line rate. For example, at 40 Gb/s the time slot for a 40-byte packet is 8 ns. For the rest of the paper, we will simply use packets to denote fixed-size packets/cells. A packet buffer implemented in a linecard temporarily stores arriving packets from its input ports in order to mitigate congestion on the network links and also to provide fast retransmission when a packet is lost on its egress link. DRAM banks are commonly utilized to provide storage for packet buffers in linecards. However, the DRAM access rate is much lower than the line rate. Our packet buffer takes advantage of the deterministic packet buffer assumption, so that by providing a small number of redundant DRAM banks, it is guaranteed that all packets arriving at a packet buffer can be admitted and all packets departing from a packet buffer can be sent out on time.

The deterministic packet buffer assumption is as follows. At each time slot t, at most one new packet p arrives at the packet buffer with a future departure time T_d(p), and at most one existing packet q departs from the packet buffer in the current time slot (i.e., T_d(q) = t), with both occurring in the same time slot in the worst case. The departure times of packets are assumed to be known and fixed at the time of packet arrival, which is a common assumption in building input-queued systems that can emulate output-queued systems. We make no restrictions on what departure times are assigned, except that the departure times of different packets are unique, to ensure that only one packet departs per time slot. Like previous work on fast packet buffers [9], [10], [11], [12], our packet buffer consists of an arbitrary number of logical queues. But instead of assuming random access, we assume deterministic access based on pre-determined departure times. We refer to this abstraction as the single-write-single-read deterministic packet buffer model, or the deterministic packet buffer model for short. A fundamental problem for buffers with deterministic departures is that packets arriving or departing at about the same time may need to access memory banks that conflict at write or read time, since it takes a DRAM bank several time slots to finish a write or read transaction. In our proposed reservation-based packet buffers, we shall prove that with a small number of DRAM banks and an efficient bank selection algorithm, all memory conflicts can be resolved.
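The single-write-single-read model can be summarized as the following interface (a minimal sketch; the type and function names are ours, not the paper's):

```c
/* Sketch of the single-write-single-read deterministic packet
 * buffer model: per time slot t, at most one packet arrives
 * carrying a pre-assigned, unique departure time, and at most one
 * packet whose departure time equals t leaves. */
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint64_t departure_time; /* T_d(p): fixed and unique on arrival */
    uint8_t  data[40];       /* one fixed-size cell (40 B at OC-768) */
} packet_t;

/* Called at most once per time slot: admit packet p, which must
 * later be produced exactly at slot p->departure_time. */
bool buffer_write(packet_t *p);

/* Called every time slot t: returns the packet q with
 * T_d(q) == t, or NULL if no packet is scheduled to depart at t. */
packet_t *buffer_read(uint64_t t);
```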

The stability of a buffer architecture requires that the amount of buffering needed to provide reliable storage be finite. The stability of the reservation-based packet buffers is achieved by ensuring that all packets depart at their pre-determined departure times. Therefore, only a finite number of packets are stored in the packet buffer until their departure times. If the arrival traffic is admissible, meaning that the packet arrival rate is no larger than the departure rate, then our deterministic reservation-based packet buffer architectures are stable. In the rare event that the incoming traffic becomes inadmissible, overshooting with a large number of packets at certain input ports due to oversubscribed bursty arrivals, more DRAM banks are required to buffer the overloaded traffic at a linecard. It will become evident later in the paper that if the maximum arrival overshooting factor is H_os, the total number of DRAM banks required at the linecard increases by a factor of H_os as well.

4 DETERMINISTIC FRAME-SIZED PACKET BUFFER

We next describe the deterministic frame-sized reservation-based packet buffer architecture, as depicted in Fig. 1. In the frame-sized packet buffer, the packets stored in the DRAM banks are managed in groups called frames. The size of a frame is the same as the number of time slots it takes for a DRAM bank to finish a memory transaction.

Fig. 1. Deterministic frame-sized packet buffer architecture.

4.1 Architecture

To provide bulk storage, our architecture uses an interleaved memory architecture with multiple slower DRAM modules. In particular, let there be K memories. We denote the DRAM banks as D_1, D_2, ..., D_K, respectively. DRAM access latencies are usually much slower than the line rate. Hence, multiple time slots are required to complete a memory transaction. Let b be the number of time slots required to complete a single DRAM write or read transaction. Once a memory transaction has been initiated to a DRAM, that memory will remain busy for b time slots until the transaction completes. Specifically, if a memory transaction is initiated to a particular memory bank D_i at time slot t, then the memory bank D_i is said to be busy from time slot t to time slot t + b - 1. We assume that SRAM can write an arriving packet and read a departing packet in each time slot, which is a common assumption in packet buffer designs. A time interval that covers b time slots is defined as a frame. A time slot t is said to belong to frame $f = \lceil t/b \rceil$. For example, time slots t = 1, 2, ..., b belong to frame f = 1, time slots t = b + 1, b + 2, ..., 2b belong to frame f = 2, etc. Each of the K memory banks is segmented into several cells, with each cell the size of one packet. The jth cell in bank D_i is denoted C_i(j). We define a stripe, or a memory frame, as

$S_{\mathrm{DRAM}}(j) = \bigcup_{i=1}^{K} C_i(j).$  (1)

Thus the jth memory frame, which is the collection of the jth cells from all DRAM banks, is of size K packets. Packets with the same departing time frame are stored in the same memory frame.
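The frame indexing can be sketched as follows (a minimal sketch; the helper names are ours, and the constants b = 16 and K = 3b - 1 are borrowed from the parameters used in Section 7):

```c
/* Frame and stripe indexing under the paper's definitions: time
 * slot t belongs to frame f = ceil(t / b), and the j-th memory
 * frame S_DRAM(j) is the union of the j-th cells C_i(j) over all
 * K banks. */
#include <stdint.h>

enum { B = 16, K = 47 };   /* example values: b = 16, K = 3b - 1 */

/* Frame that time slot t (1-indexed) belongs to. */
static inline uint64_t frame_of(uint64_t t) {
    return (t + B - 1) / B;          /* ceil(t / b) */
}

/* A packet departing at slot t is stored somewhere in memory
 * frame S_DRAM(f), i.e., in cell C_i(f) of some bank i chosen by
 * the memory manager to avoid read/write conflicts. */
static inline uint64_t cell_index(uint64_t departure_slot) {
    return frame_of(departure_slot); /* j = f */
}
```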

4.2 Packet Access Conflicts

Please refer to the section Packet Access Conflicts in the supplementary file, which can be found on the Computer Society Digital Library at http://doi.ieeecomputersociety.org/10.1109/TPDS.2013.89.

4.3 Memory Management Implementation for Frame-Sized Design

Please refer to the section Memory Management Implementation for Frame-Sized Design in the supplementary file, available online.

5 DETERMINISTIC BLOCK-SIZED PACKET BUFFER

Our frame-sized packet buffer in Section 4 suffers from the problem that the total SRAM required grows linearly with the line rate, which makes it less attractive for higher line rates, such as OC-768. In this section we describe the deterministic block-sized reservation-based packet buffer architecture, which minimizes the total SRAM requirement, as shown in Fig. 2.

Fig. 2. Deterministic block-sized packet buffer architecture.

5.1 Architecture

We use the same notation as in Section 4. Furthermore, in a block-sized packet buffer, each of the K memory banks is segmented into several sections. One section can hold up to N packets. The jth section in DRAM bank D_i is denoted D_i(j). We define a memory block B_DRAM(j) as follows:

$B_{\mathrm{DRAM}}(j) = \bigcup_{i=1}^{K} D_i(j).$  (2)

Thus the jth memory block, which is the collection of the jth sections from all DRAM banks, is of size N × K packets.

In the frame-sized scheme, 1 bit is needed to track each packet location in the DRAM banks. Therefore the total SRAM requirement grows linearly with the line rate due to the buffer sizing rule B = RTT × C, which is not scalable to future high line rates. In the block-sized scheme we use a module, namely a reservation table, for bookkeeping purposes. Each entry in the reservation table keeps track of the number of packets currently stored in a section of the DRAM banks. If the size of a DRAM section is N, then only log2 N bits are required to track the number of packets in a section. In contrast, a total of N bits are required to pinpoint exactly which packets are currently stored in a section, as in the frame-sized scheme. To avoid memory conflicts during packet arrivals and departures, in the block-sized scheme the packets in a DRAM block are accessed collectively. The detailed function of the reservation table is presented in the next section.

An incoming packet is stored in a section of a DRAM bank belonging to a memory block chosen based on its departure time. There is a one-to-one mapping from a reservation table entry to a memory section. Once a packet is stored in a memory section, the corresponding reservation table entry is updated by adding one to its current value. Arriving packets with departure times within the same N frames (i.e., N × b time slots) are written to the same memory block, where N is also the size of a DRAM section in packets. So there can be at most N × b arriving packets stored in a memory block, which can potentially hold up to N × K packets. Upon departure, the packets in a memory block are moved to a departure reorder buffer. The algorithm for selecting a DRAM bank to store an incoming packet based on its departure time using the reservation table is introduced in Section 5.4.
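The bookkeeping just described can be sketched as follows (a minimal sketch, not the paper's implementation; the names, the table dimensions, and the constants N = 128 and MAX_BLOCKS are our assumptions):

```c
/* Reservation-table bookkeeping for the block-sized design: one
 * counter of log2(N) bits per DRAM section counts how many of its
 * N slots are used. Packets whose departure times fall within the
 * same N frames map to the same memory block. */
#include <stdint.h>

enum { B = 16, K = 47, N = 128, MAX_BLOCKS = 8192 };

/* counters[j][i]: packets currently stored in section D_i(j);
 * each entry needs only log2(N) = 7 bits rather than an N-bit
 * bitmap, so a uint8_t suffices here. */
static uint8_t counters[MAX_BLOCKS][K];

/* Memory block that a packet with departure slot t belongs to:
 * N consecutive frames (N * b time slots) share one block. */
static inline uint64_t block_of(uint64_t t) {
    return t / ((uint64_t)N * B);
}

/* Record an admitted packet in bank `bank` of its block. */
static void reserve(uint64_t departure_slot, int bank) {
    counters[block_of(departure_slot)][bank]++;
}
```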

5.2 Reservation Table

The reservation table is implemented in on-chip SRAM for fast lookups and updates. The reservation table is necessary for keeping track of the packets in the DRAM banks for the following reasons. First, our architecture needs to guarantee that the packets in the DRAM banks depart at their pre-determined departure times. Second, a finite-size DRAM bank must not admit too many arriving packets in a short time, which would otherwise cause it to overflow.

For the bitmap module in Section 4, although only a single bit is required per packet location, which is substantially smaller than storing the packets themselves, the size of the bitmap nonetheless grows linearly with the number of DRAM packet buffer locations. Larger packet buffers will be required in the future to match increasing line rates, and the bitmap would grow proportionally as a consequence.

Instead of using N bits as a bitmap to represent the N packet locations in a memory section, the block-sized packet buffer uses a counter of log2 N bits to keep track of the actual number of packets present in these N packet locations. Since N can be represented using log2 N bits, log2 N bits are sufficient for maintaining a total of N packet locations. These packets will be reordered upon departure in a departure reorder buffer. We denote the size of the reservation table as size_R. Let the packet buffer be of size N_max packets. Then the linear bitmap module in Section 4 would require a size of N_max bits. However, our reservation table is only of size

$\mathrm{size}_R = \frac{N_{\max} \cdot \log_2 N}{N}.$  (3)

If we choose N = aN_max, where a is a constant and a < 1 (i.e., the size of a memory section is a fixed fraction of the packet buffer, which is indeed the case), then

$\mathrm{size}_R = \frac{N_{\max} \cdot \log_2(aN_{\max})}{aN_{\max}} = \frac{\log_2 N_{\max}}{a} + \frac{\log_2 a}{a}.$  (4)

Therefore the size of the reservation table grows logarithmically as the DRAM bank size grows.
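Plugging in the paper's 40 Gb/s parameters gives a feel for the savings (a worked example covering only the reservation table; the totals reported in Section 7 also include the reorder and bypass buffers):

```latex
% Buffer capacity in packets, with RTT = 250 ms, C = 40 Gb/s, P = 40 B:
N_{\max} = \frac{RTT \cdot C}{P}
         = \frac{10^{10}\,\mathrm{bits}}{320\,\mathrm{bits}}
         = 31.25 \times 10^{6}\ \text{packets}.
% Reservation table with N = 128 (\log_2 N = 7 bits per counter):
\mathrm{size}_R = \frac{N_{\max}}{N}\log_2 N
               = \frac{31.25 \times 10^{6}}{128} \times 7
               \approx 1.71\,\mathrm{Mb} \approx 0.21\,\mathrm{MB},
% versus the frame-sized bitmap of N_max bits:
N_{\max}\,\mathrm{bits} = 31.25\,\mathrm{Mb} \approx 3.91\,\mathrm{MB}.
```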

The reduction in the size of the reservation table is achieved at the cost of larger departure reorder buffers implemented in SRAM. Intuitively, since we only count the number of packets in a memory section, information about the relative ordering of the packets is lost. For a fixed-size buffer, the larger N is, the larger the constant a becomes. Therefore the size of the reservation table size_R decreases as N increases, as is evident in Equation (4). On the other hand, as N increases, the size of the departure reorder buffers increases, since there are more packets in a memory block to be reordered before departure. In Section 7, we will show that for a fixed-size buffer there is an optimal value of N such that the total SRAM requirement is minimized.

5.3 Packet Access Conflicts

As in the frame-sized packet buffer, there need to be enough DRAM banks in the system to store all the arriving packets and to ensure packets depart at their departure times. First, let's consider the three kinds of memory conflicts.

• Arrival write conflicts: See Section 4.2.

• Arrival read conflicts: See Section 4.2.

• Departure conflicts (or overflow conflicts): When a new packet arrives, it will be stored in a memory block based on its departure time. Incoming packets with departure times within the same N frames are stored in the same memory block. There are at most N × b packets departing within N frames. However, a memory section can store only N packets at most. Without an effective algorithm for choosing DRAM banks to store the incoming packets, in the worst case more than N packets may be admitted to the same memory section, which would cause the bank to overflow.

Theorem 1. With at least 3b - 1 DRAM banks, where b is the number of time slots it takes for a DRAM bank to finish one memory transaction, it is always possible for a deterministic block-sized reservation-based packet buffer to admit all the arriving packets and write them into memory blocks based on their departure times.

Proof. It is already clear that we need b - 1 banks to resolve the arrival conflicts and b banks to resolve the departure conflicts. We only need to prove that b - 1 extra DRAM banks are sufficient to resolve the overflow conflicts.

With K = 3b - 1, a memory block can hold a total of up to N × K = (3b - 1) × N packets. For any DRAM bank, one of its sections can hold at most N packets. On the other hand, there are at most N × b incoming packets within N frames. At any given time slot, there are at most b - 1 arrival conflicts and b departure conflicts, so there are at least b free DRAM banks to store the current incoming packet. To choose one of the b free banks to store the current incoming packet, we use the following algorithm.

Water-Filling Algorithm: For an incoming packet p with departure time slot T_d(p), we check the corresponding memory block. The numbers of packets stored in the sections of that block in the b free DRAM banks are denoted C(D_{g1}), C(D_{g2}), ..., C(D_{gb}), respectively. We choose the DRAM bank D_{gi} to store packet p according to

$i = \arg\min_{1 \le j \le b} C(D_{g_j}).$  (5)
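A minimal sketch of this bank choice (the function and variable names are ours; the caller is assumed to have already collected the currently free banks and their section occupancy counts):

```c
#include <stdint.h>

/* Water-filling choice of Eq. (5): among the currently free banks
 * g_1..g_b, return the one whose section of the target memory
 * block holds the fewest packets. count[j] is C(D_g_j), aligned
 * with free_banks[j]. */
static int water_fill_pick(const uint8_t *count,
                           const int *free_banks,
                           int num_free) {
    int best = free_banks[0];
    uint8_t best_count = count[0];
    for (int j = 1; j < num_free; j++) {
        if (count[j] < best_count) {   /* arg min over free banks */
            best_count = count[j];
            best = free_banks[j];
        }
    }
    return best;   /* bank index g_i, the emptiest section */
}
```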

To prove the theorem, let's consider the following scenario. After the current incoming packet is written to one of the b free DRAM banks, the bank stays busy for the next b time slots. After b time slots, the same b DRAM banks are free again for the newly arriving packet, which happens with non-zero probability. If a packet with a departure time corresponding to the same memory block arrives, the packet will then be written to the same block in one of the b free banks (but possibly to a different bank among the b banks, depending on which one contains the fewest packets). Therefore, the worst case happens when all of the packets with departure times corresponding to the same memory block are to be stored in the same b DRAM banks. In this case, there are N × b packets, and the memory block in the b banks can store up to N × b packets. The water-filling algorithm above guarantees that with a storage space of N × b packets, we can always store all of the N × b arriving packets, since the emptiest bank is filled first. So in the worst case, b - 1 extra DRAM banks are sufficient to resolve the overflow conflicts, but fewer banks cannot. Therefore, with the water-filling algorithm, a total of 3b - 1 DRAM banks is sufficient to store all arriving packets. The memory block to store a packet is decided by the packet departure time. □

As in the frame-sized packet buffer, there are two departure reorder buffers in the block-sized packet buffer. Each buffer is capable of buffering up to N × b packets, which is the maximum number of packets in a memory block. The packets are read from DRAM banks to a departure reorder buffer following a simple algorithm: we choose the bank with the largest number of packets in its corresponding section among all of the free banks, based on the departure time. There are at most N packets in a memory section. On the other hand, there are N × b time slots to read the packets from a memory block to a departure reorder buffer. With the assumption that a read transaction in DRAM takes b time slots, it is always possible to read all the packets from a memory section even if the section is full. As in the frame-sized buffer, the packets are written to deterministic locations in a departure reorder buffer based on their departure times. The packet with the earliest departure time leaves the buffer first.

Based on this operation model, in the worst case the minimum round-trip latency for storing and retrieving a packet to and from one of the DRAM banks is (2N + 1) × b time slots, including the current time slot. First, b time slots are needed to write a packet into a DRAM bank, since it takes a DRAM bank b time slots to finish a write transaction. Upon packet departure, at most N × b time slots are needed to read packets from a memory block into one of the departure reorder buffers, since there are at most N packets stored in a memory section. Finally, at most N × b time slots are needed for the packets to depart from a departure reorder buffer. Therefore, an incoming packet with a departure time less than (2N + 1) × b time slots away may not be retrieved in time for departure if written to a memory bank. To guarantee deterministic packet departures, a bypass buffer in SRAM is required to store these packets. The structure of the bypass buffer is similar to the one in the frame-sized packet buffer, as shown in Section 4.2.
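The latency accounting in the preceding paragraph can be summarized as follows (the numeric instance uses the Section 7 parameters N = 128 and b = 16, and the 8 ns time slot from Section 3):

```latex
\underbrace{b}_{\text{DRAM write}}
+ \underbrace{N b}_{\text{block read-out}}
+ \underbrace{N b}_{\text{reorder-buffer drain}}
= (2N + 1)\,b \ \text{time slots};
\qquad
(2 \cdot 128 + 1) \cdot 16 = 4112\ \text{slots} \approx 33\,\mu\mathrm{s}\ \text{at } 8\,\mathrm{ns/slot}.
```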

5.4 Memory Management Implementation for Block-Sized Design

Please refer to the section Memory Management Implementation for Block-Sized Design in the supplementary file, available online.

6 RANDOMIZED BLOCK-SIZED PACKET BUFFER

Please refer to the section Randomized Block-Sized Packet Buffer in the supplementary file, available online.

7 PERFORMANCE ANALYSIS

In this section, we first investigate the optimal value of N, the number of packets in a DRAM section, such that the size of SRAM is minimized in the deterministic block-sized packet buffer. As shown in Section 5.4, the sizes of the reservation table, the departure reorder buffers, and the bypass buffers are all functions of N. As N grows, the size of the reservation table decreases, and the sizes of the departure reorder buffers and the bypass buffer increase. In Table 1, the line rate is 40 Gb/s. We choose RTT = 250 ms, b = 16, and K = 3b - 1. The fixed packet size is P = 40 bytes. The optimal value of N that minimizes the SRAM requirement is N = 128. The total SRAM needed in this optimal case is about 0.97 MB.

TABLE 1. SRAM Sizes (in MB) for Different N with Line Rate 40 Gb/s for the Deterministic Block-Sized Packet Buffer

In Table 2, the line rate is 100 Gb/s. We choose RTT = 250 ms, b = 16, K = 3b - 1, and P = 40 bytes. The SRAM size is minimized with N = 256. The total SRAM needed is about 1.57 MB. The optimal value of N that minimizes the SRAM size increases as the line rate increases.

TABLE 2. SRAM Sizes (in MB) for Different N with Line Rate 100 Gb/s for the Deterministic Block-Sized Packet Buffer

It is worth noting that when N = 1, the reservation table degrades to the read transaction bitmap of the frame-sized packet buffer.

In Table 3, we compare the memory requirements of the block-sized packet buffer and the frame-sized one. The line rates are 10, 40, and 100 Gb/s. Let RTT = 250 ms, b = 16, K = 3b - 1, and P = 40 bytes. In general, the required SRAM size in the block-sized packet buffer grows only logarithmically with the line rate, while it grows linearly for the frame-sized packet buffer. At a line rate of 10 Gb/s, the optimal block-sized packet buffer demands only 16 percent of the total SRAM required by the frame-sized one. This ratio decreases to 8.1 and 6.5 percent at line rates of 40 and 100 Gb/s, respectively. The ratio only decreases as the line rate grows, which is a great advantage of our block-sized packet buffer over the frame-sized packet buffer at future high line rates.

TABLE 3. Comparison of SRAM Size Requirements

In Table 4, we compare the SRAM requirement of the prefetching-based packet buffer [9] with the deterministic packet buffers in this paper. We use RTT = 250 ms and a line rate of 40 Gb/s. Also, we have b = 16, K = 3b - 1, and P = 40 bytes. The SRAM requirement for our block-sized scheme is based on Section 5.4. We assume the same setting for our frame-sized scheme, where one bit of SRAM is required in the bitmap for each packet. The total SRAM size in our frame-sized packet buffer is only 18.8 percent of that of the state-of-the-art SRAM/DRAM prefetching buffer scheme, while the total SRAM size in our block-sized packet buffer is only 1.5 percent of the prefetching buffer scheme. In the prefetching-based scheme, SRAM is required to implement head and tail caches whose sizes grow linearly with the number of logical queues. Commercial routers today support many logical queues to implement classes of service. With Juniper routers supporting up to 64K logical queues, the minimum SRAM required is 64 MB, assuming packets are written to bulk DRAM in blocks of size 1,000 bits [9]. Further, the prefetching-based schemes require complex control logic for managing the head and tail caches. Although these schemes are more general in that they can handle random packet departures, we show in this paper that our reservation-based designs are significantly more SRAM-efficient for router applications where the deterministic packet departure setting is valid.

TABLE 4. SRAM Requirement Comparison with Prefetching-Based Packet Buffer

In Table 5, we compare the proposed randomized block-sized packet buffer with the other randomized packet buffer schemes. We assume RTT = 250 ms, a line rate of C = 40 Gb/s, and b = 16. With an average packet size of 40 bytes, the total size of the DRAM banks is 30 Gb, or about 3.75 GB, to store packets of total size up to 1.25 GB in our randomized block-sized packet buffer scheme. The memory sizes shown in the table are for providing an overflow probability smaller than 10^{-10}.²

2. The DRAM sizes are calculated for a fair comparison, since in [11], [12] a DRAM bank is assumed to work at double rate with one read and one write every b time slots, while in our randomized block-sized buffer only one memory transaction is allowed in a DRAM bank every b time slots.

TABLE 5. Comparison of Randomized Packet Buffer Schemes

It is worth noting that the randomized scheme in [11] does not guarantee in-order departures across different flows, since there is no reorder mechanism to ensure that packets depart in their designated order. On the other hand, both our randomized block-sized packet buffer and the scheme in [12] provide in-order packet departures across different flows. Our randomized block-sized packet buffer essentially trades slightly more DRAM banks for a reduction in the total amount of SRAM required compared to [12]. While the efficiency of such a tradeoff is debatable depending on the prices of DRAM and SRAM, we present our randomized scheme to show that our deterministic packet buffer architecture can be adapted to provide statistical guarantees while requiring less SRAM storage.

All the results above are based on the assumption of admissible arrivals at any moment. For non-admissible arrivals, the number of DRAM banks required can be easily derived. Consider traffic that overshoots at some input ports of a linecard by a factor of H_os due to oversubscribed bursty arrivals. That is, during a short period of time, the number of packets arriving at a packet buffer is H_os times the number of packets under the maximum admissible traffic. It is desirable to design a packet buffer that can handle such oversubscribed bursty arrivals, as long as the arrival traffic is admissible in the long term (since otherwise the packet buffer would be unstable), to provide more robust service. For bursty arrivals overshooting a linecard by a factor of H_os, the size of a time slot is effectively reduced by a factor of H_os and the effective line rate is increased by a factor of H_os. If the DRAM access latency stays unchanged, it would take a DRAM bank bH_os time slots to finish a memory transaction. From Section 5.4, the total size of the DRAM banks required is increased by a factor of H_os, by using 3bH_os - 1 memory banks. Also, the sizes of the reservation table, bypass buffer, and departure reorder buffers are increased by a factor of H_os, in order to guarantee that no packet is dropped during short-lived oversubscribed bursty traffic.

In our deterministic frame-sized packet buffer, the round-trip time of a packet is at least 3b time slots, which is also the time it takes for a packet to be written to the packet buffer and then retrieved for departure to the output, as shown in Section 4.2. In our deterministic block-sized packet buffer, the round-trip time of a packet is (2N + 1) × b time slots, as shown in Section 5.3. In our randomized block-sized packet buffer, the round-trip time of a packet is at least (2N + L_max) × b time slots, as shown in Section 6. Due to this round-trip delay, bypass buffers are included in all three designs so that packets with departure times less than the round-trip delay can be stored in the bypass buffers and leave the packet buffer on time. Therefore, the round-trip times in all three designs are only relevant to the sizing of the bypass buffers. The round-trip time of packets in a packet buffer does not represent the overall system delay of our designs. In our deterministic designs, it is guaranteed that all arriving packets can leave the packet buffer at their predetermined departure times. In our randomized design, we provide a statistical guarantee that all arriving packets can leave the packet buffer at their predetermined departure times with high probability. It is also shown in this work that for our randomized design the probability of a packet being dropped from the packet buffer due to buffer overflow is bounded by a very small value (10^{-10}) by choosing L_max to be larger than 14, which makes the design applicable for most network applications.

In our design and the other packet buffer designs (such as [11], [12]), there is no need for companion logic. Due to the simplicity of the logic, network processors, which already exist in network routers, can easily handle the memory management and bit operations (e.g., the NOR operation used to locate an available memory bank).

Our block-sized packet buffer architectures are readily implementable at 100 Gb/s line rates. With a minimum packet size of 64 bytes, we have a roughly 5 ns time slot in which to process each packet. Given the small SRAM size of our architecture (about 1.6 MB at 100 Gb/s), it can fit entirely in on-chip SRAM. Each packet requires 4 clock cycles to read two reservation table entries, select the memory bank, and update a reservation table entry. At a 1 GHz clock, these 4 clock cycles correspond to 4 ns, which is less than the 5 ns time slot budget. Once the reservation table has been updated, the packet can be written to the DRAM banks or the bypass buffer, which can occur concurrently with the reservation table processing for the next packet.
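A quick check of this timing budget (a worked calculation; the 64-byte minimum packet and the 1 GHz clock are the paper's figures):

```latex
t_{\mathrm{slot}} = \frac{64\,\mathrm{B} \times 8\,\mathrm{bits/B}}{100\,\mathrm{Gb/s}} = 5.12\,\mathrm{ns},
\qquad
4\ \text{cycles} \times \frac{1}{1\,\mathrm{GHz}} = 4\,\mathrm{ns} < 5.12\,\mathrm{ns}.
```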

8 CONCLUSION

In this paper, we described novel reservation-based packet buffer architectures that take advantage of the known departure times of arriving packets to achieve simplicity and determinism. The switch-memory-switch router architecture [16], [17] and the load-balanced router architecture [20], [21], amongst others, are example architectures in which departure times for packets can be deterministically calculated in advance of packet insertion into packet buffers when best-effort routing is considered. We developed three new packet buffer architectures: a deterministic frame-sized packet buffer, a deterministic block-sized packet buffer, and a randomized block-sized packet buffer. In particular, a key contribution of this paper is the reservation table in the deterministic block-sized design, which grows only logarithmically with line rates, leading to more than an order of magnitude reduction in SRAM requirements compared to the frame-sized design. Further, in our proposed architectures, the number of required interleaved DRAM banks is a small constant independent of the arrival traffic pattern, the number of flows, and the number of priority classes, making them scalable to the growing packet storage requirements of future routers while matching increasing line rates.

ACKNOWLEDGMENTS

This work was presented in part at the 11th International Conference on High Performance Switching and Routing (HPSR '10), Dallas, TX, June 2010 [1], and at the 14th IEEE Symposium on High-Performance Interconnects (Hot Interconnects '06), Stanford, CA, August 2006 [2].

REFERENCES

[1] H. Wang and B. Lin, "Block-Based Packet Buffer with Deterministic Packet Departures," Proc. 11th Workshop High Performance Switching and Routing (HPSR '10), June 2010.
[2] M. Kabra, S. Saha, and B. Lin, "Fast Buffer Memory with Deterministic Packet Departures," Proc. 14th Ann. Symp. High Performance Interconnects (Hot Interconnects '06), pp. 67-72, Aug. 2006.
[3] C. Villamizar and C. Song, "High Performance TCP in ANSNET," ACM SIGCOMM Computer Comm. Rev., vol. 24, no. 5, pp. 45-60, 1994.
[4] CAIDA, "Round-Trip Time Measurements from CAIDA's Macroscopic Internet Topology Monitor," http://www.caida.org/analysis/performance/rtt/walrus0202, 2013.
[5] G. Appenzeller, I. Keslassy, and N. McKeown, "Sizing Router Buffers," ACM SIGCOMM Computer Comm. Rev., vol. 34, pp. 281-292, 2004.
[6] A. Dhamdhere and C. Dovrolis, "Open Issues in Router Buffer Sizing," ACM SIGCOMM Computer Comm. Rev., vol. 36, pp. 87-92, Jan. 2006.
[7] Samsung, "Samsung K7S3236U4C QDRII SRAM," http://www.samsung.com/, 2013.
[8] Samsung, "Samsung K4B4G0446A DDR3 SDRAM," http://www.samsung.com/, 2013.
[9] S. Iyer, R.R. Kompella, and N. McKeown, "Designing Packet Buffers for Router Linecards," IEEE/ACM Trans. Networking, vol. 16, no. 3, pp. 705-717, June 2008.
[10] J. García, M. March, L. Cerdà, J. Corbal, and M. Valero, "A DRAM/SRAM Memory Scheme for Fast Packet Buffers," IEEE Trans. Computers, vol. 55, no. 5, pp. 588-602, May 2006.
[11] G. Shrimali and N. McKeown, "Building Packet Buffers Using Interleaved Memories," Proc. Sixth Workshop High Performance Switching and Routing (HPSR '05), pp. 1-5, May 2005.
[12] D. Lin, M. Hamdi, and J. Muppala, "Distributed Packet Buffers for High-Bandwidth Switches and Routers," IEEE Trans. Parallel and Distributed Systems, vol. 23, no. 7, pp. 1178-1192, July 2012.
[13] H. Attiya, D. Hay, and I. Keslassy, "Packet-Mode Emulation of Output-Queued Switches," IEEE Trans. Computers, vol. 59, no. 10, pp. 1378-1391, Oct. 2010.
[14] J. Turner, "Strong Performance Guarantees for Asynchronous Buffered Crossbar Schedulers," IEEE/ACM Trans. Networking, vol. 17, no. 4, pp. 1017-1028, Jan. 2009.
[15] A. Prakash, A. Aziz, and V. Ramachandran, "Randomized Parallel Schedulers for Switch-Memory-Switch Routers: Analysis and Numerical Studies," Proc. IEEE INFOCOM, vol. 3, pp. 2026-2037, Mar. 2004.
[16] A. Prakash, S. Sharif, and A. Aziz, "An O(log² N) Parallel Algorithm for Output Queuing," Proc. IEEE INFOCOM '02, vol. 3, pp. 1623-1629, June 2002.
[17] S. Iyer, R. Zhang, and N. McKeown, "Routers with a Single Stage of Buffering," ACM SIGCOMM Computer Comm. Rev., vol. 32, no. 4, pp. 251-264, 2002.
[18] Juniper Networks, http://www.juniper.net/us/en/products-services/routing/m-series, 2013.
[19] Brocade, http://www.brocade.com/products/all/switches/index.page, 2013.
[20] C.-S. Chang, D.-S. Lee, and Y.-S. Jou, "Load Balanced Birkhoff-von Neumann Switches, Part I: One-Stage Buffering," Computer Comm., vol. 25, pp. 611-622, 2002.
[21] I. Keslassy, S.-T. Chuang, K. Yu, D. Miller, M. Horowitz, O. Solgaard, and N. McKeown, "Scaling Internet Routers Using Optics," ACM SIGCOMM Computer Comm. Rev., vol. 33, pp. 189-200, 2003.
[22] A. Greenberg, P. Lahiri, D.A. Maltz, P. Patel, and S. Sengupta, "Towards a Next Generation Data Center Architecture: Scalability and Commoditization," Proc. ACM Workshop Programmable Routers for Extensible Services of Tomorrow (PRESTO), pp. 57-62, Aug. 2008.
[23] Q. Huang, Y.-K. Yeo, and L. Zhou, "A Single-Stage Optical Load-Balanced Switch for Data Centers," Optics Express, vol. 20, no. 22, pp. 25014-25021, 2012.
[24] H. Zhao, H. Wang, B. Lin, and J. Xu, "Design and Performance Analysis of a DRAM-Based Statistics Counter Array Architecture," Proc. Fifth ACM/IEEE Symp. Architectures for Networking and Comm. Systems (ANCS '09), pp. 84-93, Oct. 2009.
[25] H. Wang, H. Zhao, B. Lin, and J. Xu, "Design and Analysis of a Robust Pipelined Memory System," Proc. IEEE INFOCOM '10, Mar. 2010.
[26] B.R. Rau, "Pseudo-Randomly Interleaved Memory," Proc. 18th Ann. Int'l Symp. Computer Architecture (ISCA '91), 1991.
[27] S.I. Hong, S.A. McKee, M.H. Salinas, R.H. Klenke, J.H. Aylor, and W.A. Wulf, "Access Order and Effective Bandwidth for Streams on a Direct Rambus Memory," Proc. Fifth Int'l Symp. High Performance Computer Architecture (HPCA '99), p. 80, Jan. 1999.
[28] W. Lin, S.K. Reinhardt, and D. Burger, "Reducing DRAM Latencies with an Integrated Memory Hierarchy Design," Proc. Seventh Int'l Symp. High Performance Computer Architecture (HPCA '01), Jan. 2001.
[29] J. Hasan, S. Chandra, and T.N. Vijaykumar, "Efficient Use of Memory Bandwidth to Improve Network Processor Throughput," ACM SIGARCH Computer Architecture News, vol. 31, no. 2, pp. 300-313, 2003.

Hao Wang (S'06) received the BE degree in electrical engineering from Tsinghua University, Beijing, P.R. China, and the MS and PhD degrees in electrical and computer engineering from the University of California, San Diego, in 2005, 2008, and 2011, respectively. He is currently with Oracle Corporation in Redwood Shores, California. His research interests include architectures and scheduling algorithms for high-speed switching and routing, robust memory system design, network measurement, traffic management for wireless and wired network systems, the large deviation principle, convex ordering, and coding and information theory. He is a student member of the IEEE.

Bill Lin (M'97) received the BS, MS, and PhD degrees in electrical engineering and computer sciences from the University of California, Berkeley. He is currently a professor of electrical and computer engineering and an adjunct professor of computer science and engineering, both at the University of California, San Diego. At UCSD, he is actively involved with the Center for Wireless Communications (CWC), the Center for Networked Systems (CNS), and the California Institute for Telecommunications and Information Technology (CAL-IT2) in industry-sponsored research efforts. Prior to joining the faculty at UCSD, he was the head of the System Control and Communications Group at IMEC, Belgium. IMEC is the largest independent microelectronics and information technology research center in Europe. It is funded by European funding agencies in joint projects with major European telecom and semiconductor companies. His research has led to over 130 journal and conference publications. He has received a number of publication awards, including the 1995 IEEE Transactions on VLSI Systems Best Paper Award, a Best Paper Award at the 1987 ACM/IEEE Design Automation Conference, Distinguished Paper citations at the 1989 IFIP VLSI Conference and the 1990 IEEE International Conference on Computer-Aided Design, a Best Paper nomination at the 1994 ACM/IEEE Design Automation Conference, and a Best Paper nomination at the 1998 Conference on Design Automation and Test in Europe. He holds four awarded patents. He is a member of the IEEE.
