Network interfaces for programmable NICs and multicore platforms



Computer Networks 54 (2010) 357–376

Contents lists available at ScienceDirect

Computer Networks

journal homepage: www.elsevier.com/locate/comnet

Network interfaces for programmable NICs and multicore platforms

Andrés Ortiz a,*, Julio Ortega b, Antonio F. Díaz b, Alberto Prieto b

a Department of Communications Engineering, University of Málaga, Spain
b Department of Computer Architecture and Technology, University of Granada, Spain


Article history: Received 2 December 2008. Accepted 23 September 2009. Available online 26 September 2009. Responsible Editor: I.F. Akyildiz

Keywords: Full-system simulation; HDL simulation; LAWS model; Protocol offloading; Network interfaces; Simics

1389-1286/$ - see front matter © 2009 Elsevier B.V. All rights reserved.
doi:10.1016/j.comnet.2009.09.011

* Corresponding author. Tel.: +34 952134166; fax: +34 952132027. E-mail address: [email protected] (A. Ortiz).

The availability of multicore processors and programmable NICs, such as TOEs (TCP/IP Offloading Engines), provides new opportunities for designing efficient network interfaces to cope with the gap between the improvement rates of link bandwidths and microprocessor performance. This gap poses important challenges related to the high computational requirements associated with the traffic volumes and the wider functionality that the network interface has to support. Thus, taking into account the rate of link bandwidth improvement and the ever changing and increasing application demands, efficient network interface architectures require scalability and flexibility. An opportunity to reach these goals comes from exploiting the parallelism in the communication path by distributing the protocol processing work across the processors available in the computer, i.e. multicore microprocessors and programmable NICs.

Thus, after a brief review of the different solutions that have been previously proposed for speeding up network interfaces, this paper analyzes the onloading and offloading alternatives. Both strategies try to release host CPU cycles by executing the communication workload on other processors present in the node. Nevertheless, whereas onloading uses another general-purpose processor, either included in a chip multiprocessor (CMP) or in a symmetric multiprocessor (SMP), offloading takes advantage of processors in programmable network interface cards (NICs). From our experiments, implemented by using a full-system simulator, we provide a fair and more complete comparison between onloading and offloading. It is shown that the relative improvement in peak throughput offered by offloading and onloading depends on the ratio of application workload to communication overhead, on the message sizes, and on the characteristics of the system architecture, more specifically the bandwidth of the buses and the way the NIC is connected to the system processor and memory. In our implementations, offloading provides lower latencies than onloading, although CPU utilization and interrupt counts are lower for onloading. Taking into account the conclusions of our experimental results, we propose a hybrid network interface that can take advantage of both programmable NICs and multicore processors.

© 2009 Elsevier B.V. All rights reserved.

1. Introduction

The availability of high-bandwidth links (Gigabit Ethernet, Myrinet, QsNet, etc.) [51], and the scaling up of network I/O bandwidths to multiple gigabits per second, have shifted the communication bottleneck towards the network nodes. Therefore, an optimized design of the network interface (NI) is becoming decisive in the overall communication path performance.

The gap between the user communication requirements of in-order and reliable message delivery and deadlock safety, and network features such as arbitrary delivery order, limited fault-handling, and finite buffering capabilities, makes it necessary to have (layers of) protocols that provide the communication services not supported by the network hardware and required by the applications [35]. Besides processing these protocols, the data to be sent or received need to be transferred between the main memory and the buffers of the NIC (Network Interface Circuit) that accesses the network links through the media access control (MAC). Moreover, this data transfer requires cooperation between the device driver in the OS and the NIC, which have to stay coordinated by using information about the state of the frames being transmitted. In this way, the driver has to indicate to the NIC that there is data to be sent or buffer space for receiving data, and the NIC has to notify the OS that data has been sent to or received from the network.
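As a rough illustration of this driver/NIC coordination, the following Python sketch models a hypothetical receive descriptor ring: the NIC fills posted buffers and flags them done, and the driver later reaps them and recycles the descriptors. The ring size, field names and methods are invented for the example and are not taken from any real driver.

```python
RING_SIZE = 4  # descriptors in the (hypothetical) receive ring

class RxRing:
    """Toy model of the driver/NIC frame-state coordination described above."""
    def __init__(self):
        # Driver posts empty descriptors; NIC fills them and marks them 'done'.
        self.descriptors = [{"buffer": None, "done": False}
                            for _ in range(RING_SIZE)]

    def nic_receive(self, frame):
        # NIC side: take the next free descriptor and mark it ready for the OS.
        for d in self.descriptors:
            if not d["done"]:
                d["buffer"], d["done"] = frame, True
                return True
        return False  # ring full: the driver must reap entries first

    def driver_reap(self):
        # Driver side: collect completed frames and recycle their descriptors.
        frames = [d["buffer"] for d in self.descriptors if d["done"]]
        for d in self.descriptors:
            d["buffer"], d["done"] = None, False
        return frames

ring = RxRing()
for i in range(3):
    ring.nic_receive(f"frame-{i}")
print(ring.driver_reap())  # the three frames, in posting order
```

The essential point the sketch captures is that neither side interrupts the other per frame; they exchange state through the shared descriptors.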

Thus, the sources of the network processing overhead are protocol processing; operating system activities such as kernel processes, device driver overheads, context switching, interrupt handling and buffer management; and memory access overheads due to packet and data copies and processor stalls. As higher link bandwidths become available, memory access and some operating system overheads grow even more important due to the poor cache locality of the kernel network processing tasks [43]. Thus, most memory accesses are DRAM accesses, and both the lower rate of improvement in DRAM access latency and the memory bus bandwidth shift the bottleneck to this overhead component. Moreover, as the OS and the NIC have to exchange a large volume of data and control information through the I/O buses, these buses are another important component of the network overhead [34].

Much research work has been carried out in order to improve communication performance. This research can be classified into two complementary lines. One of them seeks to decrease the software overhead in the communication protocols, either by optimizing the TCP/IP layers or by using new and lighter protocols. Moreover, these new protocols usually fall into one of two types: protocols that optimize the operating system communication support (GAMMA [27], CLIC [28], etc.), and user-level network interfaces [29], such as the VIA (Virtual Interface Architecture) standard [30]. These proposals include features that have been specifically proposed to accelerate network processing in this context, along with some improvements in the NICs.

Table 1. Relation among NI optimizing features and the main sources of communication overhead (a check marks the overhead sources addressed by each feature).

Feature                    Protocol processing   Memory accesses   Operating system overhead
Zero-copy                                        ✓                 ✓
Interrupt coalescing                                               ✓
Jumbo frames               ✓                                       ✓
Checksum offloading        ✓
Header splitting           ✓                     ✓                 ✓
Large send offload (LSO)   ✓

Some of these features are the following ones (Table 1 shows their relation with the above mentioned overhead sources):

– Zero-copy [36] tries to eliminate the copies between the user and kernel buffers by solving some difficulties related to the interaction of OS buffering schemes, virtual memory, and the API [38].

– Interrupt optimization techniques [29] either reduce the interrupt frequency by interrupting the CPU once multiple packets have arrived instead of issuing one interrupt per packet (interrupt coalescing), or use polling instead of interrupts (interrupts are only generated whenever the NIC has not been polled after a given amount of time) [37–39].

– Jumbo frames, proposed by Alteon [52], allow the use of frames of up to 9000 bytes, which are larger than the Ethernet maximum frame size of 1500 bytes, in order to reduce the per-frame processing overhead.

– Checksum offloading [40] allows the NIC to compute and insert the checksums into the packets to be sent, and to check the received packets, in order to avoid the corresponding CPU overhead.

– Header splitting [7,34] separates protocol headers and payload in different buffers to decrease the CPU cache pollution when the headers are processed and to aid in the zero-copy of the received payloads.

– Large send offload (LSO) [7,34] builds in the NIC the (large) TCP segments to be sent.
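As a back-of-the-envelope illustration of the per-frame overhead argument behind jumbo frames, the following Python sketch counts the frames needed to carry a made-up 9 MB transfer at the standard and jumbo MTUs (protocol headers are ignored for simplicity):

```python
import math

def frames_needed(payload_bytes, mtu):
    """Frames required to carry a payload at a given MTU (headers ignored)."""
    return math.ceil(payload_bytes / mtu)

transfer = 9_000_000  # a 9 MB transfer, chosen only for illustration
std = frames_needed(transfer, 1500)    # standard Ethernet maximum frame size
jumbo = frames_needed(transfer, 9000)  # Alteon jumbo frames

print(std, jumbo, std / jumbo)  # 6000 1000 6.0
```

Since much of the receive-side cost (interrupts, driver work, protocol headers) is paid per frame, a 6x reduction in frame count translates into a correspondingly lower per-frame processing overhead.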

Another possibility to reduce communication overhead is to distribute the network workload among the other processors present in the node. This way, the software is partitioned between the host CPU and another processor that executes the communication tasks. Thus, the host CPU does not have to process the network protocols and can devote more cycles to the user applications and other operating system tasks. Two main alternatives have been distinguished, depending on the location of the processor where the communication tasks are executed.

One of these alternatives proposes the use of processors included in the network interface cards (NICs) for protocol processing. In this case, the NIC can directly interact with the network without the host CPU's participation, thus allowing not only a decrease in the CPU overhead for interrupt processing, but also a protocol latency reduction for short control messages (such as ACKs) that, this way, do not have to reach the main memory through the I/O bus. There are many commercial designs that offload different parts of the TCP/IP protocol stack onto a NIC attached to the I/O bus [16–18]. These devices are called TCP/IP Offload Engines (TOEs), and thus this alternative is usually called protocol offloading. Other techniques related to protocol offloading are connection handoff [15] and network interface data caching [41]. The connection handoff technique allows the operating system to control the number of TCP connections that are offloaded to the NIC, to take advantage of its features without overloading it. Network interface data caching reduces traffic along the local buses by caching frequently requested data in on-board DRAM included in the programmable NIC.


Nevertheless, besides the works showing the advantages of protocol offloading, some papers have presented results arguing that this technique does not benefit the user applications. Particularly, TCP/IP offloading has been highly controversial because, as some studies have shown, TCP/IP processing costs are small (particularly after those tasks were optimized in the late 1980s [7,8]) compared to data transfer overheads and the costs of interfacing the protocol stack to the NIC and the operating system. As the experimental results provided by the papers advocating each alternative are not conclusive, because they depend on the specific technologies, systems and applications used in the experiments, interest in offloading performance analysis still remains.

Another alternative to release host CPU cycles is the so-called protocol onloading or, more specifically, TCP onloading [7]. This technique proposes the use of a general-purpose processor in a CMP or in an SMP for protocol processing. Although it has been proposed in opposition to NIC offloading, despite its name it can also be considered a full offload to another processor in the node, rather than to the NIC [5,15]. Nevertheless, in what follows, we will maintain both terms to distinguish between the use of a processor in the NIC for protocol processing (offloading) and the use of another general-purpose processor (onloading).

Some papers compare both techniques [5,20]. Nevertheless, it is difficult to run experiments on systems with similar characteristics and to explore the whole parameter space. Thus, simulation is necessary to draw sound conclusions about the performance of both techniques across a representative set of parameter values. In this paper, we provide an approach to this by using a full-system simulator. We first provide a brief introduction to protocol offloading and onloading that outlines their characteristics, differences, and relative advantages and drawbacks (Section 2). Then, the previously proposed LAWS model [10] is described. It allows a first analysis of the possible benefits of moving the communication workload to other processors in the node, in terms of the relative computation/communication workloads and the different technological capabilities of the processors (Section 3). In Section 4, we describe our offloading and onloading implementations and the experimental setup based on the full-system simulator SIMICS. Finally, Section 5 provides the experimental results, and the conclusions are given in Section 6.

2. Offloading and onloading approaches to improve communication

As noted above, many works have proposed improvements in the communication architecture of high performance platforms that use commodity networks and generic protocols such as TCP/IP. The use of other processors in the computer to reduce the communication overhead in the host CPUs that run the applications has been proposed in many works over the years. Moreover, the parallelization of network protocols and the exploitation of parallelism in programmable network interfaces have also been proposed and analyzed [44–46].

Thus, one alternative, usually called protocol offloading, proposes the use of processors included in the NIC, whereas the onloading [7] strategy tries to take advantage of the existence of multiple cores in a CMP or processors in an SMP. These two strategies have been commercially released through implementations such as TOEs, TCP Offloading Engines [16–18], and the Intel I/OAT, I/O Acceleration Technology [22,23,42], which includes an optimized onloaded protocol stack as one of its features, along with header splitting, interrupt coalescing, and enhanced DMA transfers through asynchronous I/O copy [34].

Onloading and offloading have common advantages and drawbacks, and also have differences that determine their specific performance. Among their common advantages, we have the following ones:

– The availability of CPU cycles for the applications increases, as the host CPU does not have to process communication protocols. The overlap between communication and computation also increases.

– The host CPU receives fewer interrupts to attend to the received messages.

– The use of specific programmable processors with resources that exploit different levels of parallelism could improve the efficiency of communication protocol processing and enable dynamic protocol management, in order to use the most adequate protocol (according to the data to communicate and the destination) to build the message.

In addition to these advantages, offloading offers some others:

– As the NIC implements the communication protocols, it can directly interact with the network without CPU involvement. Thus, the protocol latency can be reduced, as short messages, such as ACKs, do not need to travel through the I/O bus, and the CPU does not have to process the corresponding interrupts and context switches to attend to these messages.

– As protocol offloading can help to avoid traffic on the I/O bus (commands between the CPU and the NIC, and some DMA transfers between the main memory and the NIC), the bus contention could be reduced. It is also possible to improve the efficiency of the DMA transfers from the NIC if short messages are assembled to generate fewer DMA transfers.

With respect to the drawbacks of offloading, some works [6–9] provide experimental results to argue that protocol offloading, in particular TCP offloading, does not clearly benefit the communication performance of the applications. Among the reasons supporting these conclusions, we have the following ones:

– The host CPU speed is usually higher than the speed of the processors in the NIC and, moreover, the increase in CPU speeds according to Moore's law tends to maintain this ratio. Thus, the part of the protocol that is offloaded would require more execution time in the NIC than in the CPU, and the NIC could become the communication bottleneck. Moreover, the limitations in the resources (particularly memory) available in the NIC could imply restrictions in the system scalability (for example, limitations in the size of the IP routing table).

– The communication between the NIC (executing the offloaded protocol) and the CPU (executing the API) could be as complex as the protocol to be offloaded [6,9]. Protocol offloading requires coordination between the NIC and the OS for a correct management of resources such as the buffers, the port numbers, etc. In the case of protocols such as TCP, the control of the buffers is complicated and could hamper the offloading benefits (for example, the TCP buffers must be held until acknowledged or pending reassembly) [6].

The main specific advantage of onloading is precisely related to the first drawback of offloading: in onloading, the processor that executes the communication software has the same speed and characteristics as the host CPU. In particular, it can access the main memory with the same rights as the host CPU. This alternative exploits the current trend towards multicore architectures and SMPs. However, in relation to this point, some objection could be raised, as the use of simpler and more power- and area-efficient cores added to the general-purpose host CPU (a microarchitecture usually found in network processors) could reach similar network processing acceleration with lower cost in area, power, and complexity [21].

In some recent proposals, onloading is applied along with other strategies to define technologies for accelerating network processing [22]. These are sometimes considered alternatives opposed to the use of TOEs [20]. Nevertheless, some of the strategies comprised in these technologies can also be implemented when protocol processing is carried out in the NIC. For example, the reduction in the number of interrupts, the DMA transfer optimizations, and the use of mechanisms to avoid bottlenecks on the receiver side, such as split headers, asynchronous copy by using DMA, and multiple receive queues [23,25], could also be implemented in TOE-based approaches.

In any case, it is clear that network interface optimization requires a system approach that takes into account not only the processors present in the computer, but also the chipset, the buses, the memory accesses, the operating system, the computation/communication profile of the applications, and the corresponding interactions among these elements. Thus, it is not easy to predict when offloading is better than onloading or vice versa.

In this paper, we also explore the possibilities that a full-system simulator provides for that purpose. Thus, we have used the SIMICS full-system simulator with models developed by us to evaluate the onloading and offloading performance. These models allow us to overcome some limitations of SIMICS, which does not provide either accurate timing models or TOE models by itself.

3. The LAWS model

Some authors have proposed performance models [10,11] to understand the offloading fundamental principles underlying the experimental results and to drive the discussions on offloading technologies, allowing the exploration of the corresponding design space [32]. The paper [10] introduces the LAWS model to characterize the protocol offloading benefits in Internet services and streaming data applications. In [11], the EMO (Extensible Message-oriented Offload) model is proposed to analyze the performance of various offload strategies for message-oriented protocols. Here, we will use the LAWS model to understand the behavior of the simulated systems because, although it was proposed for offloading, it can in fact also be applied whenever the communication overhead is distributed between the host CPU and other processors in the node (either in the NIC or in another CPU of a multiprocessor node). In the following summary of the LAWS model, we use the generic term communication processor (CP) to refer to the processor (in the NIC or in the multiprocessor node) that executes the networking tasks in the case of offloading or onloading.

The LAWS model gives an estimation of the peak throughput of the pipelined communication path according to the throughput provided by the corresponding bottleneck in the system: the link, the CP, or the host CPU. The model only covers applications that are throughput limited (such as Internet servers), and thus fully pipelined, for which the parameters used by the model (CPU occupancy for communication overhead and for application processing, occupancy scale factors for host and NIC processing, etc.) can be accurately known. The analyses provided in [10] consider that the performance is host CPU limited before applying protocol offloading (this technique never yields any improvement otherwise).

Fig. 1 illustrates the way the LAWS model views the system before and after offloading. The notation used is similar to that of [10]. Before offloading (Fig. 1a), the system is considered as a pipeline with two stages, the host and the network. In the host, to transfer m bits, the application processing causes a host CPU work equal to aXm, and the communication processing produces a CPU work of oXm. In these processing delays, a and o are the amounts of CPU work per data unit, and X is a scaling parameter used to take into account variations in processing power with respect to a reference host. Moreover, the latency to provide these m bits through a network link with a bandwidth equal to B is m/B.

Thus, as the peak throughput provided before offloading is determined by the bottleneck stage, we have Bbefore = min(B, 1/(aX + oX)). After offloading, we have a pipeline with three stages (Fig. 1b), and a portion p of the communication overhead has been transferred to the CP. In this way, the latencies in the stages for transferring m bits are m/B for the network link, aXm + (1 − p)oXm for the CPU stage, and poYβm for the CP. In the expression for the NIC latency, Y is a scaling parameter to take into account the difference in processing power with respect to a reference, and β is a parameter that quantifies the improvement in the communication overhead that could be reached with offloading, i.e. βo is the normalized overhead that remains in the system after offloading, when p = 1 (full offloading). A similar reasoning for onloading is also possible.
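The two pipeline expressions above can be evaluated directly. The following Python sketch is only a numerical restatement of the formulas with illustrative, made-up parameter values (normalized so that o = X = 1); it is not part of any original LAWS tooling:

```python
def peak_throughput_before(B, a, o, X):
    # Two-stage pipeline: network link vs. host CPU (application + protocol work).
    return min(B, 1.0 / (a * X + o * X))

def peak_throughput_after(B, a, o, X, Y, beta, p):
    # Three-stage pipeline: link, host CPU (residual overhead), and the CP.
    cp_stage = float("inf") if p == 0 else 1.0 / (p * o * Y * beta)
    return min(B, 1.0 / (a * X + (1 - p) * o * X), cp_stage)

# Illustrative values: slightly slower CP (Y = 1.2) but a structural gain (beta = 0.8).
B, a, o, X, Y, beta, p = 1.0, 0.4, 1.0, 1.0, 1.2, 0.8, 1.0
print(peak_throughput_before(B, a, o, X))          # host-limited: ~0.714
print(peak_throughput_after(B, a, o, X, Y, beta, p))  # link-limited: 1.0
```

In this made-up configuration, full offloading moves the bottleneck from the host CPU to the link itself, which is exactly the regime in which LAWS predicts a benefit.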


Fig. 1. A view of the LAWS model before offloading (a) and after offloading (b).


In this way, after offloading/onloading, the peak throughput is Bafter = min(B, 1/(aX + (1 − p)oX), 1/(poYβ)), and the relative improvement in peak throughput is defined as δb = (Bafter − Bbefore)/Bbefore. The LAWS acronym comes from the parameters used to characterize the offloading benefits. Besides the parameter β (Structural ratio), we have the parameters α = Y/X (Lag ratio), which considers the ratio of the host CPU speed to the NIC computing speed; γ = a/o (Application ratio), which measures the compute/communication ratio of an application; and σ = 1/(oXB) (Wire ratio), which corresponds to the portion of the network bandwidth that the host can provide before offloading. In terms of the parameters α, β, γ, and σ, the relative peak throughput improvement can be expressed as:

δb = [min(1/σ, 1/(γ + (1 − p)), 1/(pαβ)) − min(1/σ, 1/(1 + γ))] / min(1/σ, 1/(1 + γ))
From LAWS, some conclusions can be derived in terms ofsimple relationships among the four LAWS ratios [10]:

– Protocol offloading/onloading provides an improvement that grows linearly in applications with a low computation/communication ratio (low γ). This profile corresponds to streaming data processing applications, network storage servers with large numbers of disks, etc. In the case of CPU-intensive applications, the throughput improvement reached by offloading is bounded by 1/γ and goes to zero as the computation cost increases (i.e. γ grows). The best improvement is obtained for γ = max(αβ, σ). Moreover, as the slope of the improvement function (γ + 1)/c − 1, with c = max(αβ, σ), is 1/c, the throughput improvement grows faster as αβ or σ decrease.

– Protocol offloading may reduce the communication throughput (negative improvement) if the function (γ + 1)/c − 1 takes negative values. This means that γ < (c − 1), and as γ > 0 and σ < 1, it must hold that c = αβ and αβ > 1. Thus, if the NIC speed is lower than the host CPU speed (α > 1), offloading may reduce performance if the NIC gets saturated before the network link, i.e. αβ > σ, as the improvement is bounded by 1/α (whenever β = 1). Nevertheless, if an efficient offload implementation (for example, by using direct data placement techniques) allows structural improvements (thus, a reduction in β), it is possible to maintain the offloading usefulness for values of α > 1.

– There is no improvement in slow networks (σ ≫ 1), where the host is able to assume the communication overhead without aid. The offload usefulness can be high whenever the host is not able to communicate at link speed (σ ≪ 1), but in these circumstances γ has to be low, as has been previously said. As there is a trend towards faster networks (σ decreases), offloading/onloading can be seen as very useful techniques. When σ is near one, the best improvement corresponds to cases with a balance between computation and communication before offloading/onloading (γ = σ = 1).
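The conclusions above can be checked numerically. The following Python sketch restates the relative-improvement expression of this section in terms of α, β, γ, σ and p; all parameter values are illustrative and made up. For instance, a NIC half as fast as the host (α = 2) with αβ > σ yields a negative improvement, as the second bullet predicts:

```python
def laws_improvement(alpha, beta, gamma, sigma, p=1.0):
    """Relative peak-throughput improvement from the LAWS expression above."""
    before = min(1.0 / sigma, 1.0 / (1.0 + gamma))
    after = min(1.0 / sigma,
                1.0 / (gamma + (1.0 - p)),
                1.0 / (p * alpha * beta))
    return (after - before) / before

# Communication-bound host (sigma < 1), NIC as fast as the CPU, full offload: gain.
print(round(laws_improvement(alpha=1.0, beta=1.0, gamma=0.5, sigma=0.5), 3))  # 0.5

# Slow NIC saturating before the link (alpha*beta > 1 and > sigma): loss.
print(laws_improvement(alpha=2.0, beta=1.0, gamma=0.2, sigma=0.5) < 0)  # True
```

A third case worth trying is σ ≥ 1 (a slow network): the improvement evaluates to zero, matching the last bullet.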

The communication path is not fully pipelined, as LAWS assumes. Thus, this performance model can only provide an upper bound on performance. Nevertheless, LAWS can be useful, as it gives an idea of the trends in the improvement that could be achieved according to the changes implemented in some relevant parameters of real communication systems. The simulation results provided in this paper also give certain experimental validation to the conclusions of the LAWS model.

4. The simulation environment and our proposed network interfaces

Simulation can be considered the most frequent technique to evaluate computer architecture proposals. It allows exploring the design space of the different architectural alternatives independently of the specific (commercial) implementations available at a given moment. Nevertheless, research in computer system design issues dealing with (high-bandwidth) networking requires an adequate simulation tool that allows running commercial OS kernels (as most of the network code runs at the system level), and other features for network-oriented simulation, such as a timing model of the network DMA activity and a coherent and accurate model of the system memory [2]. Some examples of simulators with these characteristics are M5 [3], SimOS [14], and some other simulators based on SIMICS [1,4], such as GEMS [12] and TFsim [13].

SIMICS [1,4] is a commercial full-system simulator thatallows the simulation of application code, operating sys-tem, device drivers and protocol stacks running on themodeled hardware. Although SIMICS presents some limita-tions for I/O simulation, in [19], we propose a way to over-come them in order to get accurate experimental

Fig. 2. Simulation model f

evaluation of protocol offloading. The proposed simulationmodels include customized machines and a standardEthernet network connecting them in the same way aswe could have in the real world. Nevertheless, in the sim-ulation models of [19], the cache behavior it is not in-cluded. In this paper, we use SIMICS to simulate not onlyoffloading but also onloading implementations, and wehave also developed more complete SIMICS models that in-clude cache in the nodes.

We have also developed three different system configurations for our simulations: a base system, and offloaded and onloaded implementations. Once the machines are defined, SIMICS allows an operating system to be installed on them (Debian Linux with a 2.6 kernel in our case). To make it possible to run Linux in this simulation model (without requiring any kernel change or the design of a new driver), we have taken advantage of some SIMICS features. By default, all the buses in SIMICS are simply considered as connectors. Nevertheless, although there is no functional difference between a PCI bus and the system memory bus, it is possible to define different access latencies to memory and to the PCI devices. Moreover, it is also possible to define different delays for interrupts coming from the PCI device, another processor, etc.

The first simulated platform corresponds to a base system, in which we have used a superscalar processor and the NIC models provided by SIMICS for PCI-based gigabit Ethernet cards. With this model (Fig. 2), we have determined the maximum performance that can be achieved with a machine with one processor and no offloading/onloading effects. Thus, the host CPU of the system (CPU0) executes the application and processes the communication protocols.

The operation of the network interface with offloading is summarized in Fig. 3a. After a network transfer (1), the network interface card at the receiver starts to fill the ring buffer with the received packets. When this buffer is full, data is transferred to the NIC memory (2). When the memory transfer is completed, an interrupt is sent to the host CPU (CPU0) (3) to start the NIC's driver, and a softirq [50] is forced on the CPU at the programmable NIC (CPU1) (4).

Fig. 2. Simulation model for the base system.


Fig. 3. (a) Offloading model operation; (b) onloading model operation.

A. Ortiz et al. / Computer Networks 54 (2010) 357–376 363

This softirq will process the TCP/IP stack and then copy the data to the TCP socket buffer (5). Finally, CPU0 copies (6) the data from the TCP socket buffer to the application buffer (7) by using socket-related receive system calls [50]. So, since an interrupt is only generated every time the receive ring is full, we are not using the "one interrupt per received packet" approach but the "one interrupt per event" approach, whenever the received packets fit into the NIC's receive ring, in the same way a TOE does.
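The effect of this "one interrupt per event" policy can be sketched with a small illustrative model (not the authors' code): it simply counts how many host interrupts a transfer generates under the naive per-packet policy versus the coalescing policy described above, assuming a hypothetical receive ring of 256 slots.

```python
# Illustrative sketch: interrupt counts under "one interrupt per packet"
# versus "one interrupt per event" (interrupt raised only when the receive
# ring fills, plus one final interrupt for any leftover packets).

def interrupts_per_packet(n_packets: int) -> int:
    """Naive NIC: every received packet raises a host interrupt."""
    return n_packets

def interrupts_per_event(n_packets: int, ring_slots: int) -> int:
    """Coalescing NIC: one interrupt per filled receive ring, plus one
    final interrupt for the packets left over at the end of the transfer."""
    full_rings, leftover = divmod(n_packets, ring_slots)
    return full_rings + (1 if leftover else 0)

if __name__ == "__main__":
    packets = 10_000      # packets received during a transfer
    ring = 256            # receive-ring capacity (assumed value)
    print(interrupts_per_packet(packets))        # 10000
    print(interrupts_per_event(packets, ring))   # 40 (10000 = 39*256 + 16)
```

With these assumed numbers the event-based policy cuts the interrupt count by more than two orders of magnitude, which is the behavior the TOE-like interface exploits.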

Fig. 3b shows the network interface operation with onloading. In this case, the operating system has been configured to execute the NIC driver in CPU1 instead of CPU0. Thus, once a packet is received (1) and stored in the receive ring (2), the NIC generates an interrupt that arrives at the I/O APIC (3), and the operating system launches the NIC driver execution in CPU1, so that the corresponding TCP/IP thread is executed in the same CPU1. Finally, after the software interrupt, the operating system copies the data into the user memory, (4) and (5).

Fig. 4 shows the functional models used to simulate offloading and onloading in SIMICS. Offloading has been implemented by using a PCI bus directly connected to the north bridge, along with two processors (CPU0 and CPU1) also connected to the north bridge. Nevertheless, the connection between the PCI bus and CPU1 is no more than a connector. In the same way, CPU1, which executes the protocols, has fast access to the onboard memory of the NIC. Thus, from a functional point of view, this is


equivalent to having these two devices together. As has been said, the connectors do not model any contention at simulation time. The way to simulate the contention and timing effects is by connecting a timing model interface [4] to each input of the bridge where access contention is possible. Thus, an accurate simulation of the contention behavior is provided, as can be seen from the experimental results (Section 5).
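The role of such a timing model can be illustrated with a generic sketch (this is not the actual SIMICS timing-model API, just a minimal stand-in under assumed service times): a bridge input serializes accesses, so an access that arrives while a previous one is still being serviced completes later.

```python
# Generic sketch of a bus timing model with contention (hypothetical,
# not the SIMICS interface): each access takes a fixed service time, and
# an access arriving while the bus is busy is queued behind the previous
# one, so its completion time slips.

class ContendedBus:
    def __init__(self, service_time_ns: float):
        self.service_time = service_time_ns
        self.busy_until = 0.0   # simulated time at which the bus is free

    def access(self, arrival_ns: float) -> float:
        """Return the completion time of an access arriving at arrival_ns."""
        start = max(arrival_ns, self.busy_until)   # wait while the bus is busy
        self.busy_until = start + self.service_time
        return self.busy_until

bus = ContendedBus(service_time_ns=10.0)
print(bus.access(0.0))    # 10.0 -> no contention
print(bus.access(5.0))    # 20.0 -> queued behind the first access
print(bus.access(50.0))   # 60.0 -> bus idle again, no extra delay
```

Attaching one such model per bridge input is what allows contention between the CPUs and the NIC to show up in the simulated timings even though the connectors themselves are purely functional.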

Our offloading model uses DMA for transferring data from the NIC to the main memory, and CPU1 can handle the data in the DMA area to process the protocol stack with fast access, as mentioned above, in order to simulate a fast onboard memory on the NIC.

In the offloading simulations, the interrupts generated by the NIC directly arrive at CPU1 without any significant delay since, in real systems, CPU1 is the processor included in the NIC that executes the network software. This ensures that no interrupt arrives at CPU0 under any condition (i.e., whenever DMA is not used). Although SIMICS does not support the simulation of nodes with more than one I/O APIC, it is possible to configure them in such a way that the interrupts coming from the PCI bus are redirected towards CPU1.

For onloading (Fig. 4b), we have used two CPUs connected to the north bridge. The interrupts have to go through an APIC bus and through the I/O APIC to reach these CPUs. This produces a delay in the interrupt propagation due to the simulated interrupt controller. Moreover, this controller has to decide which interrupts finally reach each CPU.

In SIMICS, the PCI I/O and memory spaces are mapped in the main memory (MEM0 in Fig. 4). So, at the hardware level, transfers between these memory spaces would not necessarily require a bridge, because SIMICS allows the definition of hardware for full-custom architectures. We add a north bridge to our architecture in order to simulate a machine on which a standard operating system (i.e., Linux) can be installed. This way, the main differences between the models for onloading and offloading are the following ones:

Fig. 4. Functional models for offloading (a) and onloading (b).


Fig. 5. Simulation models for the proposed offloading (a) and onloading (b) implementations.


(a) Whenever a packet is received, the interrupt directly arrives at CPU1 without having to go through any other element. In the onloading model, the interrupts have to go through the interrupt controller, the APIC bus, etc. These elements have been simulated as in an SMP.

(b) In the offloading model, the NIC driver is executed in CPU0, which also runs the operating system.

(c) In both cases, offloading and onloading, the TCP/IPthreads are executed in the corresponding CPU1.

Fig. 5 shows all the elements included in the SIMICS simulation model for offloading (Fig. 5a) and onloading (Fig. 5b). The computers of Fig. 5 include two CPUs, DRAM, instruction and data L1 caches, a unified L2 cache, an APIC bus, a PCI bus with an attached PCI-based gigabit Ethernet card, and a text serial console. As our simulation models include a memory hierarchy with two cache levels, it will also be possible to draw accurate conclusions about the real behavior of the network interface and the influence of the different memory levels on the


Fig. 6. Hybrid model operation.


communication overheads. It is important to notice that the gap between processor and memory performance (the memory wall problem) is even more important in packet processing than in other usual applications [48]. Nevertheless, there are not many studies about the cache behavior of network interfaces.

In [43], a performance study of memory behavior in a core TCP/IP suite is presented. In that paper, the protocols are implemented in user space, as the authors could not access the source code of the operating system used in the SGI platform where the study was done. It is concluded that, in most scenarios, instruction cache behavior has a significant impact on performance (even higher than the data cache), and that this situation should continue to hold for small average packet sizes and zero-copy interfaces. In [43], it is also concluded that larger caches and higher associativity improve communication performance (mainly for TCP communication), as many cache misses in packet processing are conflict misses. Moreover, it is also indicated that network protocols should scale with clock speed except for cold caches (cases where caches do not store correct information of data and instructions at the beginning of packet processing), where cache performance shows an important decrease.

The papers by Mudigonda et al. [47,48] analyze the memory wall problem in the context of network processors. Paper [47] concludes that data caches can reduce the packet processing time significantly. Nevertheless, in communication applications where there is little computation per memory reference, each miss can lead to significant stall time. Thus, in these applications, data caches should be complemented with other techniques to manage memory latencies and achieve acceptable processor utilization. In this same context, [48] also analyzes the way memory accesses limit the network interface performance. This paper considers the mechanisms used by network processors [31,33] to overcome the memory bottlenecks and concludes that data caches and multithreading strategies must cooperate to achieve the goals of high packet throughput and network interface programmability.

Fig. 7. Simulation models for the proposed hybrid network interface.

Table 2
Characteristics of the cache memories on the simulated machines.

                         L1 instruction cache   L1 data cache   L2 cache
Write policy             –                      Write-through   Write-back
Number of lines          256                    256             8096
Line size (bytes)        64                     64              128
Associativity (lines)    2                      4               8
Write back (lines)       –                      –               1
Write allocate (lines)   –                      –               1
Replacement policy       LRU                    LRU             LRU
Read latency (cycles)    2                      3               5
Write latency (cycles)   1                      3               5

Fig. 6 shows the hybrid model operation. As mentioned above, the proposed hybrid model takes advantage of both offloading and onloading techniques. In this model, CPU2 is the processor included in the NIC and executes the communication protocols, whereas CPU1 executes the driver in the same way as in the onloading model; but this CPU1 is also able to execute other tasks, such as the system calls for copying the data from the TCP sockets to the application buffers. The interrupts are received by CPU1, which also executes the driver, as in the onloading alternative. So, the hybrid model does not disturb CPU0 while receiving data. Therefore, as CPU0 executes the application and is only focused on being the data sink, the reached throughput is higher than the throughput in the offloading or onloading cases.

In Fig. 6, after receiving a packet (1), the NIC stores it in the ring buffer. Whenever this buffer is full or the packet transfer has finished, the packets are moved from the ring buffer to the NIC memory (2). Then, the NIC requests a hardware interrupt to CPU1 (3). This interrupt causes the execution of the function do_irq [50], which calls the corresponding interrupt routine defined in the device driver to copy the data into the sk_buff structures [50]. Then, the driver starts a softirq [50] routine in CPU2 (4) to process the packets according to the parameters of the sk_buff structures. These parameters correspond, for example, to the protocols used in the higher layers (for example, IP and TCP). Once the protocol stack has been processed, the data is moved to the TCP socket (5), where it can be accessed by the application, (6) and (7).
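The step numbering of Fig. 6 can be laid out as a simple table (our reading of the figure, not the authors' code); the assertion captures the key property claimed above, namely that CPU0 takes no part in the receive path itself and only consumes the data at the end.

```python
# Illustrative mapping of the hybrid receive path onto processing units,
# following the step numbering of Fig. 6 (our reading of the figure).

HYBRID_RECEIVE_PATH = [
    (1, "NIC stores the received packet in the ring buffer", "NIC"),
    (2, "packets moved from the ring buffer to the NIC memory", "NIC"),
    (3, "hardware interrupt; do_irq runs the driver routine", "CPU1"),
    (4, "driver schedules a softirq for protocol processing", "CPU2"),
    (5, "TCP/IP processed; data moved to the TCP socket", "CPU2"),
    (6, "socket receive system call copies the data out", "CPU1"),
    (7, "application consumes the data from its buffer", "CPU0"),
]

# CPU0 only appears in the final, application-side step.
receive_path_units = [unit for step, _, unit in HYBRID_RECEIVE_PATH if step < 7]
assert "CPU0" not in receive_path_units
print(receive_path_units)   # ['NIC', 'NIC', 'CPU1', 'CPU2', 'CPU2', 'CPU1']
```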

Fig. 7 shows the simulation model corresponding to the hybrid network interface. Moreover, in this hybrid alternative, the NIC is directly connected to the north bridge (or directly to the multicore chip) to avoid a bottleneck on the PCI bus, and it also includes one level of cache memory. This feature is implemented in the simulation model by adjusting the timing models of the cache and the PCI bus. As in the offloading model, we are using a NIC that contains an onboard memory that serves as a receive ring. So, although the interrupts do not arrive at the main host CPU (CPU0), the "per event interrupt" approach helps to reduce the interrupts generated to CPU1. As a result, there are more idle cycles on CPU1, and it is possible to use them to execute some application threads, although prioritizing the communication tasks (i.e., the driver execution).

Fig. 8. Peak throughput comparison (including the effect of the LAWS parameter α for offloading).

Fig. 9. Effect of the cache hierarchy on the throughput improvement: (a) comparison of throughputs with and without caches; (b) effect of the increase in the L1 data cache size.

Table 3
Network latency for different interfaces and cache hierarchy configurations.

Interface                 Without detailed cache model (µs)   With detailed 2-level cache model (32 KB L1 cache/512 KB L2 cache) (µs)
Base system               68                                  38
Offloading                60                                  60
Offloading, local cache   36                                  34
Onloading                 64                                  36

5. Experimental results

In our SIMICS models, we have used Pentium 4 processors running at 400 MHz as CPU0 (this frequency is enough to reach up to 1 Gbps at link level without slowing down the simulation). We have also included in our system a 128 MB DRAM, an APIC bus, a 64-bit PCI bus with a Tigon-3 (BCM5703C) gigabit Ethernet card attached, and a text serial console. The characteristics of the cache memories used in the models are provided in Table 2.

Figs. 8 and 9 provide our first throughput comparisons between onloading and offloading. They have been obtained by using TCP as the transport protocol and netpipe [24] as the benchmark. This test measures the network performance in terms of the available throughput between two hosts and consists of two parts: a protocol-independent driver and a protocol-specific communication section that implements the connection and transfer functions. For each measurement, netpipe automatically increases the block size (see [24] for details).

The simulations corresponding to the results of Figs. 8 and 9 have been carried out by using low application workloads. In these experiments, there is almost no workload apart from the communication tasks. This initial situation allows us to validate the simulation models and the timing effects.

Fig. 8 shows that both onloading and offloading improve the peak throughput, although these improvements depend on the size of the messages. Fig. 8 also provides the throughput curves for offloading simulations that consider different speeds of the processor at the NIC (CPU1) with respect to the host processor (CPU0). Thus, the parameter α (lag ratio) in the LAWS model, which considers the ratio between the cycle times of the CPU and the NIC, is equal to 1 whenever CPU0 and CPU1 have the same speed. As CPU1 becomes slower than the host CPU, the parameter α decreases (i.e., α = 0.5 means that the host CPU is twice as fast as the CPU at the NIC), and the improvement obtained by offloading also decreases, as Fig. 8 shows. Moreover, in the case of slower CPUs at the NIC, the throughput can even be worse than in a non-offloaded system. This circumstance can be explained from the LAWS model (see the δb expression in Section 3), and it can also be observed in Fig. 8, which shows how the peak throughput decreases as α is reduced.
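These trends can be reproduced with a small numeric sketch. The expressions below are our hedged reconstruction of the normalized LAWS peak-throughput bounds (the exact δb expression appears in Section 3); host overhead and host speed are normalized to 1, and the values of B, p, and β are assumed for illustration.

```python
# Hedged reconstruction of the LAWS peak-throughput bounds. Parameter names
# follow the text: gamma = application ratio, alpha = lag ratio, p = fraction
# of the communication overhead moved off the host, beta = NIC-side
# efficiency factor. Host overhead and host speed are normalized to 1;
# B is the link rate in the same normalized units.

def peak_improvement(gamma, alpha, p, beta, B):
    before = min(B, 1.0 / (1.0 + gamma))         # host does all the work
    after = min(B,
                1.0 / (gamma + (1.0 - p)),       # residual work on the host
                alpha / (p * beta))              # offloaded work on the NIC
    return after / before - 1.0                  # relative gain, delta-b

# Same-speed NIC (alpha = 1): positive gain at a moderate application ratio.
print(round(peak_improvement(gamma=1.0, alpha=1.0, p=0.55, beta=0.4, B=0.8), 3))  # 0.379
# Much slower NIC (alpha = 0.1): offloading becomes counterproductive.
print(round(peak_improvement(gamma=1.0, alpha=0.1, p=0.55, beta=0.4, B=0.8), 3))  # -0.091
# Large gamma (application-dominated workload): the gain fades again.
print(round(peak_improvement(gamma=8.0, alpha=1.0, p=0.55, beta=0.4, B=0.8), 3))  # 0.065
```

Even this coarse sketch shows the two behaviors discussed above: the improvement shrinks as α decreases (turning negative for a sufficiently slow NIC), and it peaks at intermediate values of the application ratio.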

In the experiments corresponding to the results of Fig. 8, the cache hierarchy has been disabled. Fig. 9a and b show the effect of the L1 and L2 caches on the throughput.

Fig. 10. Comparison of signature curves using different time scales.


From Fig. 9a, it is clear that caches improve the throughput provided by each network interface alternative. Nevertheless, the amount of this improvement is different for each case. Thus, when caches are used, the interface in the base system and the onloaded interface show higher improvements than the offloaded network interface. This can be explained by taking into account that, as the interface is mainly processed in the NIC in this case, the presence of a cache in the node almost does not affect its performance, because the CPU in the NIC uses its local memory. Fig. 9b shows that increasing the cache size (the L1 data cache size) only produces a slight throughput improvement in the case of our onloading implementation.

The benefits of onloading and offloading should affect not only throughput but also message latency. Fig. 10 shows the signature curves [24] for onloading, offloading, and the base system when caches are disabled. From these curves, it is possible to derive the network latency, since it corresponds to the first point on the time axis. We have also obtained the signature curves corresponding to different cache configurations to obtain the network latency in these cases. The latencies are provided in Table 3. As can be seen from this table, the cache hierarchy contributes to reducing the latencies more in the base and onloaded interfaces than in the offloaded interface. This situation is similar to that observed in the throughputs. Thus, the onloading strategy provides better latencies whenever a cache hierarchy is present in the node, as is usual (see Fig. 10).

Fig. 11. Comparison of message latencies considering different configurations of the cache hierarchy.

Fig. 11 shows the behavior of the latencies with respect to the message sizes and different cache hierarchy configurations. It can be seen that, if the caches are disabled, the latencies are lower for offloading. As the protocol is offloaded to the network interface, it can interact with the network with fewer I/O bus transfers. Nevertheless, the presence of caches changes this situation, and a higher reduction in the latencies is observed for the base and onloaded interfaces than for the offloaded one. On the other hand, the use of a local cache in the NIC with offloading reduces the latency to a value lower than that of onloading or the base system with caches. These results demonstrate the known fact that the characteristics of the interaction between the network interface and the memory hierarchy of the node play an important role in the communication performance.

In the experimental results presented up to now, we have mainly compared the behavior of the onloading and offloading strategies against changes in the size of the messages. We are also interested in the performance as the application workload changes. The LAWS model has allowed us to organize our exploration of the space of alternatives for onloading and offloading.

As has been said, the LAWS model provides a way to understand the performance behavior under different application workloads and technological characteristics of the processors in the system. The characteristics of the application with respect to communication requirements are considered by using the ratio of application workload to communication overhead, γ, and the technological impact



is taken into account through the lag ratio, α. In LAWS, the message size can be taken into account through the communication overhead in the application ratio, γ, and through the wire ratio, σ (as the message size also affects the throughput provided by the host). Nevertheless, in our experiments, we use a given message size while we change the application workload with respect to the communication overhead to obtain the different values of the parameter γ. In this way, Fig. 12a and b compare the results obtained from our simulations with those predicted by LAWS (onloading LAWS and offloading LAWS curves in the figures) for message sizes of 2 Mbits and 8 Kbits, respectively. These figures provide the curves corresponding to the peak throughput improvement, δb (introduced in Section 3), against γ, for onloading and offloading, using TCP as the transport protocol. In the experiments, the values of β and p depend on the characteristics of our implementations of onloading and offloading. We have β = 0.4 and p = 0.55 for offloading, and β = 0.45 and p = 0.75 for onloading. The values of the other parameters are B = 1 Gbps, X = 1 GHz, and Y = 1 GHz.

The different performances that the LAWS model predicts for onloading and offloading (shown, respectively, in the onloading LAWS and offloading LAWS curves of Fig. 12a and b) can be explained by the differences in their values of p. Our onloading implementation allows a higher value for p (p = 0.75) than the offloading implementation (p = 0.55), while they have similar values for β (β = 0.45 and β = 0.4).

Fig. 12. LAWS model vs. SIMICS simulation results for messages of 2 Mbits (a) and 8 Kbits (b).

From Fig. 12, it is clear that the LAWS model represents an upper bound on the throughput improvement provided by the onloading and offloading strategies. Moreover, the LAWS model also gives useful qualitative information about the evolution of the curves with the application ratio (all the curves present a similar shape, i.e., the improvement rises as γ grows up to a maximum value and then decreases smoothly). Nevertheless, although we have observed similar qualitative behavior between our simulation results and the predictions of LAWS, there are significant quantitative differences.

By comparing Fig. 12a and b, the influence of the message size on the throughput improvement is clear. From Fig. 12a, we can conclude that, without caches, both onloading and offloading cause almost the same improvement under a low application ratio (low γ). As the application ratio increases (more application workload with respect to the communication overhead):



(i) The improvement curve for onloading grows faster than the improvement curve for offloading.

(ii) The maximum achieved improvement is higher foronloading than for offloading.

(iii) In the case of high application ratios (as the application workload grows with respect to the communication overhead), the throughput improvement decreases more slowly for onloading than for offloading. This means that the application ratio at which performance collapses can be higher for a node with onloading than with offloading.

The experiments regarding the LAWS model have been performed by using a modified version of hpcbench [26], in order to be able to change the application ratio.

Either with caches or without caches, the experimental results of Fig. 12b show that the improvements for offloading and onloading are quite similar, except for the lowest values of the application ratio, γ. The difference between this behavior and the one observed in Fig. 12a for large messages can be explained by taking into account the characteristics of the routine that implements the copies in memory (memcpy), which is more efficient for message sizes between 1 KB and 100 KB [49].

Fig. 13. Comparison of CPU utilization (with hpcbench).

Fig. 14. Comparison of interrupts vs. message size.

Fig. 15. Comparison of throughputs for the hybrid model.

Fig. 16. Comparison of latencies for the hybrid model.


The use of a cache hierarchy shifts the maximum throughput improvement towards lower values of γ, for both large and short messages, as shown in Fig. 12a and b, respectively. This can be explained by a higher reduction in the application workload with respect to the communication overhead, due to a relative decrease in the latencies of the memory accesses from CPU0.

Fig. 13 shows the host CPU utilization for offloading, onloading, and the base system at different message sizes. The results have been obtained by using oprofile [53]. In the simulations, we have used a very intensive communication workload. As can be seen, since onloading allows the NIC driver to be executed in CPU1, the CPU0 load is even lower than in the offloading case, in which, although all the protocol processing is offloaded to the NIC, the driver is executed in the host CPU (CPU0), and this causes a higher load.

Fig. 14 shows the interrupts attended by CPU0 for different message sizes. The number of interrupts has been obtained by counting them over a given amount of time (one second). In all our SIMICS simulations (base system, offloading, and onloading), we have included the mechanisms to reduce interrupts that are currently common in NICs, such as jumbo frames and interrupt coalescing. This is the reason why the numbers of interrupts for offloading and non-offloading are very similar. In [19], the decrease in the interrupts per second obtained by offloading with respect to non-offloading was about 60% for TCP and about 50% for UDP; in those simulations, the modeled NICs included neither jumbo frames nor interrupt coalescing. Fig. 14 also demonstrates the strong reduction in the number of interrupts received by CPU0 with onloading. In this case, CPU1 executes all the communication software, including the NIC driver. In the case of offloading and the base system, the number of interrupts decreases with the size of the messages, as the number of received messages decreases.
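The last observation is simple arithmetic, sketched below with assumed figures (1 Gbps sustained, and a stand-in coalescing factor of eight messages per interrupt): at a fixed throughput, larger messages mean fewer messages per second, and hence fewer interrupts per second.

```python
# Back-of-the-envelope sketch with assumed figures (not measured data):
# at a fixed throughput, the interrupt rate falls as the message size grows,
# because fewer messages (and ring-filling events) arrive per second.

def interrupts_per_second(throughput_bps, message_bits, messages_per_interrupt):
    msgs_per_s = throughput_bps / message_bits
    return msgs_per_s / messages_per_interrupt

# 8 Kbit, 64 Kbit, and 2 Mbit messages at 1 Gbps, 8 messages per interrupt.
rates = [interrupts_per_second(1e9, size, 8) for size in (8e3, 64e3, 2e6)]
print(rates)   # the interrupt rate drops monotonically with message size
```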

From the experimental results shown, it is clear that onloading and offloading offer different advantages that can be jointly exploited, and drawbacks that can be avoided, if we are able to take advantage of processors in the NIC together with additional processors having the same privileges for memory accesses as the host processor (CPU0). Thus, we have implemented a network interface that hybridizes the onloading and offloading strategies. Fig. 15 shows that our hybrid interface provides a better peak throughput improvement than either onloading or offloading. Moreover, from Fig. 16 it can be concluded that the latency achieved by this hybrid scheme is almost the same as the lowest latency, obtained by offloading with local cache.

6. Conclusions

In this paper, we have analyzed two alternatives, onloading and offloading, proposed to take advantage of the different processors available in the node in order to improve the communication performance. We have built simulation models and implementations of both techniques to evaluate and compare their behavior by using the full-system simulator SIMICS. Moreover, we have studied the conditions in which, according to the LAWS model, these two techniques improve the communication performance. Thus, the usefulness of the LAWS model to drive the experimental work and to analyze the obtained results is also shown.

With respect to the behavior of onloading and offloading, we have shown that, although both strategies contribute to reducing the number of interrupts received by the host CPU and the host CPU utilization by communication tasks, the best results correspond to onloading.

The relative improvement in throughput offered by offloading and onloading depends on the ratio of application workload to communication overhead of the implementation, on the message sizes, and on the characteristics of the system architecture. It is also shown that, in our implementations, onloading provides better throughputs than offloading whenever realistic conditions are considered with respect to the application workload (as the value of the LAWS parameter γ grows).

Offloading gives the best results with respect to message latencies in the simulations done without caches, and caches do not improve the latency when offloading is used, as the processor in the NIC does not use them. Nevertheless, caches produce an important reduction in the latencies for onloading and for the base case, as well as for offloading when a local cache is included in the NIC. This reduction implies that, with respect to latency, even the base case (without offloading) is better than offloading (without local cache). This circumstance demonstrates the relevance of the interaction between the network interface and the memory hierarchy.

Thus, we have also proposed a hybrid network interface that tries both to take advantage of the best features of onloading and offloading and to avoid their drawbacks. Our simulation results show that this interface improves on the throughput achieved by either the onloading or the offloading schemes and presents latencies almost as good as the best results, obtained by the network interface implementing the onloading alternative.

We consider that, thanks to the SIMICS simulation models we have developed, it is possible to collect detailed information about the different system events in order to make more accurate analyses of the improvement strategies that may be proposed in the future. In this way, we plan to proceed with our analysis of network interfaces by using more benchmarks and some real applications.

Acknowledgements

This work has been funded by projects TIN2007-60587(Ministerio de Ciencia y Tecnología, Spain) and TIC-1395(Junta de Andalucía, Spain).

References

[1] P.S. Magnusson et al., Simics: a full system simulation platform, IEEEComputer (2002) 50–58.

Page 19: Network interfaces for programmable NICs and multicore platforms

A. Ortiz et al. / Computer Networks 54 (2010) 357–376 375

[2] N.L. Binkert, E.G. Hallnor, S.K. Reinhardt, Network-oriented full-system simulation using M5, in: Sixth Workshop on ComputerArchitecture Evaluation using Commercial Workloads (CECW),February 2003.

[3] M5 simulator system Source Forge page, <http://sourceforge.net/projects/m5sim>.

[4] Virtutech web page, <http://www.virtutech.com/>.[5] R. Westrelin et al., Studying network protocol offload with

emulation: approach and preliminary results, in: Proceedings of the12th Annual Symposium IEEE on High Performance Interconnects,2004, pp.84–90.

[6] J.C. Mogul, TCP offload is a dumb idea whose time has come, in:Ninth Workshop on Hot Topics in Operating Systems (HotOS IX),2003.

[7] G. Regnier et al., TCP onloading for data center servers, IEEEComputer (2004) 48–58.

[8] D.D. Clark et al., An analysis of TCP processing overhead, IEEECommunications Magazine 7 (6) (1989) 23–29.

[9] M. O’Dell, Re: how bad an idea is this, Message on TSV Mailing List(2002).

[10] P. Shivam, J.S. Chase, On the elusive benefits of protocol offload, in:SIGCOMM’03 Workshop on Network-I/O Convergence: Experience,Lesons, Implications (NICELI), August 2003.

[11] P. Gilfeather, A.B. Maccabe, Modeling protocol offload for message-oriented communication, in: Proceedings of the 2005 IEEEInternational Conference on Cluster Computing (Cluster 2005), 2005.

[12] M.M. Martin et al., Multifacet’s general execution-driven multi-processor simulator (GEMS) toolset, Computer Architecture News(CAN) (2005).

[13] C.J. Mauer et al., Full-system timing-first simulation, ACM SigmetricsConference on Measurement and Modeling of Computer Systems(2002) June.

[14] M. Rosenblum et al., Using the SimOS machine simulator to studycomplex computer systems, ACM Transactions on Modeling andComputer Simulation 7 (1) (1997) 78–103.

[15] H.-Y. Kim, S. Rixner, TCP offload through connection handoff, ACMEurosys’06 (2006) 279–290.

[16] <http://www.broadcom.com/>, 2007.

[17] <http://www.chelsio.com/>, 2007.

[18] <http://www.neterion.com/>, 2007.

[19] A. Ortiz, J. Ortega, A.F. Díaz, A. Prieto, Protocol offload evaluation using Simics, in: IEEE Cluster Computing, Barcelona, 2006.

[20] Competitive Comparison, Intel I/O Acceleration Technology vs. TCP Offload Engine, <http://www.intel.com/technology/ioacceleration/316126.pdf>.

[21] B. Wun, P. Crowley, Network I/O acceleration in heterogeneous multicore processors, in: Proceedings of the 14th Annual Symposium on High Performance Interconnects (Hot Interconnects), August 2006.

[22] Intel I/O Acceleration Technology, <http://www.intel.com/technology/ioacceleration/index.htm>.

[23] K. Vaidyanathan, D.K. Panda, Benefits of I/O Acceleration Technology (I/OAT) in Clusters, Technical Report, Ohio State University (OSU_CISRC-2/07-TR13).

[24] Q.O. Snell, A. Mikler, J.L. Gustafson, NetPIPE: a network protocol independent performance evaluator, in: IASTED International Conference on Intelligent Information Management and Systems, June 1996.

[25] A. Grover, C. Leech, Accelerating network receive processing, <http://linux.inet.hr/files/ols2005/grover-reprint.pdf>.

[26] B. Huang, M. Bauer, M. Katchabaw, Hpcbench – a Linux-based network benchmark for high performance networks, in: 19th International Symposium on High Performance Computing Systems and Applications (HPCS’05), 2005.

[27] G. Ciaccio, Messaging on gigabit Ethernet: some experiments with GAMMA and other systems, in: Workshop on Communication Architecture for Clusters, IPDPS, 2001.

[28] A.F. Díaz, J. Ortega, A. Cañas, F.J. Fernández, M. Anguita, A. Prieto, A light weight protocol for gigabit Ethernet, in: Workshop on Communication Architecture for Clusters (CAC’03), IPDPS’03, April 2003.

[29] R.A.F. Bhoedjang, T. Rühl, H.E. Bal, User-level network interface protocols, IEEE Computer (1998) 53–60.

[30] Virtual Interface Developer Forum, VIDF, 2001, <http://www.vidf.org/>.

[31] S. Chakraborty et al., Performance evaluation of network processor architectures: combining simulation with analytical estimation, Computer Networks 41 (2003) 641–645.

[32] L. Thiele et al., Design space exploration of network processor architectures, in: Proceedings of the First Workshop on Network Processors (held in conjunction with the 8th International Symposium on High Performance Computer Architecture), February 2002.

[33] I. Papaefstathiou et al., Network processors for future high-end systems and applications, IEEE Micro (2004).

[34] S. GadelRab, 10-Gigabit Ethernet connectivity for computer servers, IEEE Micro (2007) 94–105.

[35] V. Karamcheti, A.A. Chien, Software overhead in messaging layers: where does time go?, in: Proceedings of ASPLOS-VI, San Jose, California, October 5–7, 1994.

[36] J.S. Chase, A.J. Gallatin, K.G. Yocum, End system optimizations for high-speed TCP, IEEE Communications (2001) 68–74.

[37] O. Maquelin et al., Polling watchdog: combining polling and interrupts for efficient message handling, in: Proceedings of the International Symposium on Computer Architecture, IEEE CS Press, 1996, pp. 179–188.

[38] M. Welsh et al., Memory management for user-level network interfaces, IEEE Micro (1998) 77–82.

[39] H. Tezuka et al., Pin-down cache: a virtual memory management technique for zero-copy communication, in: Proceedings of the International Parallel Processing Symposium, IEEE CS Press, 1998, pp. 308–314.

[40] A. Foong et al., TCP performance revisited, in: Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, IEEE CS Press, 2003, pp. 70–79.

[41] H. Kim, S. Rixner, V.S. Pai, Network interface data caching, IEEE Transactions on Computers 54 (11) (2005) 1394–1408.

[42] K. Lauritzen, T. Sawicki, T. Stachura, C.E. Wilson, Intel I/O acceleration technology improves network performance, reliability and efficiency, Technology@Intel Magazine (2005) 3–11.

[43] E.M. Nahum, D. Yates, J.F. Kurose, D. Towsley, Cache behaviour of network protocols, in: SIGMETRICS’97, 1997.

[44] E.M. Nahum, D.J. Yates, J.F. Kurose, D. Towsley, Performance issues in parallelized network protocols, in: Proceedings of Operating Systems Design and Implementation, 1994, pp. 125–137.

[45] H. Kim, V.S. Pai, S. Rixner, Exploiting task-level concurrency in a programmable network interface, in: Proceedings of the ACM PPoPP’03, 2003.

[46] M. Brogioli, P. Willman, S. Rixner, Parallelization strategies for network interface firmware, in: Proceedings of the 4th Workshop on Optimization for DSP and Embedded Systems, ODES-4, 2006.

[47] J. Mudigonda, H.M. Vin, R. Yavatkar, Managing memory access latency in packet processing, in: SIGMETRICS’05, 2005, pp. 396–397.

[48] J. Mudigonda, H.M. Vin, R. Yavatkar, Overcoming the memory wall in packet processing: hammers or ladders?, in: Proceedings of the ACM ANCS’05, 2005.

[49] D. Turner, A. Oline, X. Chen, T. Benjegerdes, Integrating new capabilities into NetPIPE, in: 10th European PVM/MPI User’s Group Meeting, 2003.

[50] C. Benvenuti, Understanding Linux Network Internals, O’Reilly Media Inc., 2005.

[51] P. Balaji, W. Feng, D.K. Panda, Bridging the Ethernet–Ethernot performance gap, IEEE Micro (2006) 24–40.

[52] Alteon WebSystems, Extended Frame Sizes for Next Generation Ethernets, <http://staff.psc.edu/mathis/MTU/AlteonExtendedFrames_W0601.pdf>.

[53] OProfile, a system profiler for Linux, <http://oprofile.sourceforge.net>.

Andrés Ortiz received the M.Sc. degree in electronics in 2000, and the Ph.D. degree in 2008, both from the University of Granada. From 2000 to 2005 he worked as a Systems Engineer with Telefónica, Madrid, Spain, where his work areas were high performance computing and network performance analysis. Since 2004 he has been with the Department of Communication Engineering at the University of Málaga as an Assistant Professor. His research interests include high performance networks, mobile communications, RFID and embedded power-constrained communication devices.


A. Ortiz et al. / Computer Networks 54 (2010) 357–376

Julio Ortega received the B.Sc. degree in electronic physics in 1985, the M.Sc. degree in electronics in 1986, and the Ph.D. degree in 1990, all from the University of Granada, Spain. His Ph.D. dissertation received the Ph.D. Dissertation Award of the University of Granada. He was an invited researcher at the Open University, UK, and at the Department of Electronics of the University of Dortmund, Germany. Currently he is a Full Professor at the Department of Computer Architecture and Technology of the University of Granada. His research interests are in the fields of parallel processing and parallel computer architectures, artificial neural networks, and evolutionary computation. He has led research projects in the areas of networks and parallel architectures, and parallel processing for optimization problems.

Antonio F. Díaz received the M.Sc. degree in electronic physics in 1991, and the Ph.D. degree in 2001, both from the University of Granada, Spain. He is currently an Associate Professor in the Department of Computer Architecture and Computer Technology. His research interests are in the areas of network protocols, distributed systems and network area storage.

Alberto Prieto earned his B.Sc. in Physics (Electronics) in 1968 from the Complutense University of Madrid. In 1976, he completed a Ph.D. at the University of Granada. From 1971 to 1984 he was founder and Head of the Computing Centre, and he headed Computer Science and Technology Studies at the University of Granada from 1985 to 1990. He is currently a full-time Professor and Head of the Department of Computer Architecture and Technology. He is the co-author of four textbooks published by McGraw-Hill and Thomson, has co-edited five volumes of the LNCS, and is co-author of more than 250 articles. His research primarily focuses on intelligent systems.
