
HyperNF: Building a High Performance, High Utilization and Fair NFV Platform

Kenichi Yasukata, NEC Laboratories Europe, [email protected]

Felipe Huici, NEC Laboratories Europe, [email protected]

Vincenzo Maffione, Università di Pisa, [email protected]

Giuseppe Lettieri, Università di Pisa, [email protected]

Michio Honda, NEC Laboratories Europe, [email protected]

ABSTRACT

Network Function Virtualization has been touted as the silver bullet for tackling a number of operator problems, including vendor lock-in, fast deployment of new functionality, converged management, and lower expenditure since packet processing runs on inexpensive commodity servers. The reality, however, is that, in practice, it has proved hard to achieve the stable, predictable performance provided by hardware middleboxes, and so operators have essentially resorted to throwing money at the problem, deploying highly underutilized servers (e.g., one NF per CPU core) in order to guarantee high performance during peak periods and meet SLAs.

In this work we introduce HyperNF, a high performance NFV framework aimed at maximizing server performance when concurrently running large numbers of NFs. To achieve this, HyperNF implements hypercall-based virtual I/O, placing packet forwarding logic inside the hypervisor to significantly reduce I/O synchronization overheads. HyperNF improves throughput by 10%-73% depending on the NF, is able to closely match resource allocation specifications (with deviations of only 3.5%), and efficiently copes with changing traffic loads.

CCS CONCEPTS

• Networks → Middle boxes / network appliances; Routers; • Software and its engineering → Communications management;

KEYWORDS

Hypervisor, Middlebox, NFV

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

SoCC ’17, September 24–27, 2017, Santa Clara, CA, USA

© 2017 Copyright held by the owner/author(s). Publication rights licensed to Association for Computing Machinery.
ACM ISBN 978-1-4503-5028-0/17/09…$15.00
https://doi.org/10.1145/3127479.3127489

ACM Reference Format:
Kenichi Yasukata, Felipe Huici, Vincenzo Maffione, Giuseppe Lettieri, and Michio Honda. 2017. HyperNF: Building a High Performance, High Utilization and Fair NFV Platform. In Proceedings of SoCC ’17, Santa Clara, CA, USA, September 24–27, 2017, 14 pages. https://doi.org/10.1145/3127479.3127489

1 INTRODUCTION

The promise of Network Function Virtualization (NFV) was a shift from expensive, difficult to manage and modify hardware middleboxes to a world of software-based packet processing running on general purpose OSes and inexpensive off-the-shelf x86 servers; this change came with several benefits for operators, including converged management, no vendor lock-in, functionality that was much easier to upgrade accompanied by much shorter deployment cycles, and reduced investment and operational costs.

Despite its advantages, the reality is that NFV platforms have unpredictable performance: the sharing of common resources such as last-level caches and CPU cores means that while the system may run reliably when a single or a few network functions (NFs) are running, under high load the platform may experience reduced throughput and high packet loss. Worse, its performance may be highly dependent on the traffic matrix and the actual NFs running on the system: for instance, a Deep Packet Inspection (DPI) function will consume many cycles on a packet with a suspicious payload, but will be cheap for "normal" packets.

Because of this, operators, for whom the imperative is to provide predictable performance and comply with Service Level Agreements (SLAs), have to opt for over-provisioning NFV platforms, often by assigning an entire CPU core to a single NF. This peak-level provisioning is one of the reasons hardware middleboxes are expensive, and results in severe under-utilization when load drops. This drop is significant: network traffic has a large gap between peak and average load (five-fold), and between peak and minimum load (ten-fold), and this gap is steadily increasing [6]. In addition, load peaks are rare (e.g., 1 hour on any given day): if we provision for peak performance, we will need five times more resources, on average, than a system that adapts to load.



On the research side, the work has largely focused on achieving high throughput. One of the most common techniques has been to leverage kernel-bypass packet I/O frameworks. NetVM [17], for instance, uses DPDK [18] to obtain high performance, but suffers from rather poor CPU utilization due to DPDK's reliance on busy polling. As another point in that space, ClickOS [24] opts for the netmap framework [32] and unikernels to obtain high throughput, but uses pinning and CPU cores dedicated to I/O (the latter is also true of NetVM) for high performance, which hurts utilization and adaptability to changing traffic loads.

What is missing is an NFV platform that is able to provide both high throughput and high utilization, all the while allowing for accurate resource allocation and the adaptability to changing traffic loads that would make NFV deployments a better value proposition. In this work we introduce HyperNF, a high performance NFV platform implemented on the Xen [3] hypervisor that meets these requirements. Our contributions are:

• An investigation of the root causes behind the difficulty of simultaneously obtaining high throughput, utilization and adaptability to changing loads. We find that using a "split" model, with cores dedicated to I/O, is wasteful and leads to unfairness; a "merged" model (VM and I/O processing happen on the same core) is better but has high synchronization overheads.

• The introduction of hypervisor-based virtual I/O, a mechanism whereby I/O is carried out within the context of a VM, reducing I/O synchronization overheads while improving fairness.

• Efficient Service Function Chaining (SFC), with higher throughput than baseline setups and delay of as little as 2 milliseconds for a chain of 50 NFs.

• A thorough performance evaluation of the system. On an inexpensive x86 server, HyperNF improves throughput by 10%-73% depending on the NF and is able to closely match specified resource allocations (with deviations of only 3.5%).

To show the generality of HyperNF's architecture, we further implement it on Linux QEMU/KVM [4] by leveraging the ptnetmap software [9, 23]. We release the Xen-based version of HyperNF as open source, downloadable from https://github.com/cnplab/HyperNF.

2 REQUIREMENTS AND PROBLEM SPACE

In order to improve the current status quo of over-provisioned NFV platform deployment, we would like our system to comply with a number of performance requirements¹. In particular, in order to reduce costs and to be able to provide accurate resource allocation and meet SLAs we would need:

• High Utilization: Ensuring that CPU cycles are neither wasted (e.g., by busy polling when no packets are available) nor idle (e.g., by dedicating cores to I/O when they could be used by VMs).

¹ Other functional requirements such as ease of management and fault tolerance are outside the scope of this paper.

Figure 2: Throughput when forwarding traffic through a firewall VM for different static CPU assignments to the VM versus the I/O threads, different packet sizes, and different number of firewall rules. No assignment yields the highest throughput for all cases.

• Adaptability: The platform should be able to dynamically assign idle resources in order to cope with peaks in traffic.

• High Throughput: To provide not only high per-NF throughput but high cumulative throughput for the platform as a whole.

• Accurate Resource Allocation: VMs and the packet processing they need to carry out should comply with the amount of resources allotted by the platform's operator (e.g., NF1 should use 30% of a CPU core, NF2 50%, etc.) with only minor deviations.

Next, we take a look at each of these in turn, investigating the problem space and especially the issues that prevent current virtual networking techniques from achieving these requirements.

2.1 High Utilization and Adaptability

Given a set of CPU cores, a number of VM threads, and a number of I/O threads, there are several ways that we can assign the threads to the available cores (see Figure 1). One of the most common models is what we term "split" (Figure 1a), whereby some cores are solely dedicated to I/O threads and others are given to VM threads². The idea here is to maximize throughput by parallelizing I/O and VM processing: while a VM/NF is busy processing packets (e.g., modifying a header for a NAT), a separate core running an I/O thread is in charge of sending and receiving other packets. Unfortunately, dedicating cores to particular tasks reduces utilization when those tasks are idle (e.g., an expensive set of regexes in a DPI needs to be applied by an overloaded CPU core while an I/O core is idle because there are no incoming/outgoing packets to service).

² This is the model used by systems like ClickOS and NetVM.



Figure 1: Different models for assigning CPU cores to VMs and I/O processing: (a) Split, (b) Merge, (c) Mix, (d) HyperNF. Split uses different CPU cores for I/O and VMs, merge runs a VM and its corresponding I/O on the same core, and mix lets the scheduler assign resources.

This is essentially a static allocation of resources, which is inefficient and exacerbated by changing traffic conditions and the type of NFs currently running on the system. To quantify this effect, we conduct a simple experiment on a 14-core server. First, we assign a single CPU core to a VM running netmap-ipfw, a netmap-based firewall [33], and another core to all I/O threads in dom0 (in our case this means all I/O threads of the VALE [15, 35] backend software switch). This configuration is identical to experiments in the ClickOS [24] paper. We then limit the number of cycles available to the VM and I/O threads by using Xen's credit scheduler's cap command and Linux's cgroup, respectively, in order to show what the effects of different static assignments are on performance.

With this in place, we then generate traffic using netmap's pkt-gen running on dom0, forward the traffic through the firewall VM, and count the resulting rate in another pkt-gen also running on dom0. We run this experiment for 64-byte and 1024-byte packets and a 10- and 50-rule firewall where packets have to traverse all rules. Figure 2 shows the results for various static assignments, all of which add up to 100% so that they can be directly compared. What is apparent from the graph is that no single assignment is ideal for all setups; in other words, static assignments, as is the norm with the split model, are practically guaranteed to under-utilize system resources and lead to sub-optimal performance.

An alternative to this is the "merge" model (see Figure 1b), in which a VM and its corresponding I/O thread run on the same CPU core. In principle, in terms of high utilization, the merge model is better than the split one, since, assuming a work-conserving scheduler, idle CPU cycles can be used by the VM or the I/O thread as needed (i.e., there is no longer a static separation of resources between VM and I/O). Consequently, in order to meet the high utilization requirement we opt for the merge model. In addition, because this model allows the system to share CPU cycles between I/O and VM/packet processing as needed, it can much better adapt to changing traffic conditions, as we will show later on. Having said that, this model comes with its own problems, which we explore and quantify next.

2.2 High Throughput

The drawbacks of the merge model are the high overheads related to the fact that VMs have to be frequently preempted in order for the I/O thread to execute. Such switching includes an expensive VMEXIT, along with thread hand-off and synchronization mechanisms between the VM and the I/O thread (e.g., for the VM to notify the I/O thread that it has placed packets in a ring buffer).

To measure these overheads, we implement a simple event notification split driver in Xen which creates a shared memory region between a Linux guest (the VM) and a kernel thread in the Linux-based driver domain (the I/O thread). In the test, the VM begins by setting a variable in the shared memory region to 1 and kicking an event to wake up the kernel thread. When Xen schedules the driver domain and then Linux schedules the kernel thread, the thread sets the variable to 0 and goes back to sleep. Finally, the VM wakes up and checks that the value is 0, and, if it is, sets the variable back to 1. This setup mimics the synchronization procedure of the merge model while reducing any extraneous costs to a minimum, thus measuring context switch costs.
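For illustration, the following self-contained userspace analogue reproduces the same ping-pong protocol, with a pthread standing in for the I/O kernel thread and a condition variable standing in for the Xen event channel. It measures plain Linux thread wake-up cost, not the VM/driver-domain switch measured here, and all names in it are our own.

    /* Userspace analogue of the synchronization benchmark: a "VM" thread
     * sets a shared flag to 1 and kicks a "backend" thread, which sets it
     * back to 0; one round trip mimics the merge model's VM <-> I/O-thread
     * hand-off. Build with: gcc -O2 -pthread pingpong.c */
    #include <pthread.h>
    #include <stdio.h>
    #include <time.h>

    #define ROUNDS 1000000

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int flag;          /* 1: request pending, 0: handled */
    static int done;

    static void *backend(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        while (!done) {
            while (flag == 0 && !done)
                pthread_cond_wait(&cond, &lock);   /* sleep until kicked */
            flag = 0;                              /* "handle" the request */
            pthread_cond_signal(&cond);            /* wake the VM thread */
        }
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;
        struct timespec a, b;

        pthread_create(&t, NULL, backend, NULL);
        clock_gettime(CLOCK_MONOTONIC, &a);
        pthread_mutex_lock(&lock);
        for (int i = 0; i < ROUNDS; i++) {
            flag = 1;                              /* request */
            pthread_cond_signal(&cond);            /* kick the backend */
            while (flag != 0)
                pthread_cond_wait(&cond, &lock);   /* wait for completion */
        }
        done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        clock_gettime(CLOCK_MONOTONIC, &b);
        pthread_join(t, NULL);

        double ns = (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
        printf("avg round-trip: %.1f ns\n", ns / ROUNDS);
        return 0;
    }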

Using the same x86 server as before, we carry out the procedure above one million times and measure the results. For this test, we use the credit2 scheduler and disable its rate limit feature. On average, it takes 26.306 microseconds to complete a single synchronization "round-trip" (for reference, the time budget for processing a single minimum-sized packet at 10Gb/s is roughly 67 nanoseconds).



Figure 3: Synchronization (measured as per-packet) costs in the merge model for different batch sizes: (a) specified application batch size versus actual measured batch size in the bridge (error bars show the standard deviation); (b) per-packet cost for different application batch sizes.

To shed light on the impact of this synchronization cost on actual packet processing, we use a VM running netmap's pkt-gen to send 64B packets out as quickly as possible to a pkt-gen receiver on dom0, connected via a VALE switch that also resides in dom0. Since delivery of a notification takes time and the application keeps updating packet buffers, upon each notification received the backend usually forwards more packets than the application batch size, up to 1024, which is our ring buffer size. We plot the correlation between the application and backend-side batch sizes in Figure 3a.

In Figure 3b, the solid line shows the per-packet synchronization cost, obtained by dividing the 26.306 microsecond synchronization cost by the backend-side batch size. Depending on the application batch size, it ranges between 25.68–165.55 nanoseconds. These synchronization costs are high, comprising 24.45–62.19% of the entire per-packet processing time shown as the dashed line in the figure.
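To make the division concrete: the per-packet cost is the measured round-trip time divided by the backend-side batch. The 1024-packet batch is the ring size mentioned above; the roughly 159-packet batch is our own illustrative value, chosen to be consistent with the reported 165.55 ns upper bound.

    c_{\mathrm{pkt}} = \frac{T_{\mathrm{sync}}}{B_{\mathrm{backend}}}, \qquad
    \frac{26.306\,\mu\mathrm{s}}{1024} \approx 25.7\,\mathrm{ns}, \qquad
    \frac{26.306\,\mu\mathrm{s}}{159} \approx 165\,\mathrm{ns}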

2.3 Accurate Resource Allocation

Schedulers in virtualization platforms have interfaces to control how many resources, in terms of percentage of a physical CPU, each VM is allowed to use. This is a useful tool that lets operators control per-NF resources (and thus throughput) in order to meet performance targets. But how accurately does this priority percentage apply to a VM that runs an NF? We expect that a VM that achieves 10 Mpps with 100% of CPU allocated achieves 5 Mpps with 50% of CPU allocated, as NF processing on fast packet I/O frameworks is usually CPU intensive.

Figure 4: Error when comparing expected throughput from two VMs (as set by Xen's prioritization mechanism) versus actual forwarded throughput.

To see what happens in reality, we run two VMs and their I/O threads on a single CPU core; one runs a bridge application and the other runs the netmap-ipfw application with 50 filtering rules. We set their priorities to 50%/50%, 30%/70% and 70%/30%, respectively. For each of the VMs we run a sender and receiver pkt-gen process in dom0.

We then measure how much the forwarded traffic deviates from the split we specified and calculate: error = (goodput - expected_throughput) / expected_throughput × 100. The results in Figure 4 show rather large errors ranging anywhere from 15% to 47%, meaning that the prioritization mechanism is not able to provide prioritized performance isolation.

Indirect Scheduling Problem: The reason for this is Xen-specific: traffic for all VMs goes through the same driver domain, and so the hypervisor has no way to separately account for I/O performed on behalf of the different VMs [8]. While it would in principle be possible to create a separate, dedicated driver domain for each VM, solving the accounting problem, this would add twice the number of VMs to the system, resulting in additional overheads. This indirect scheduling problem has also been pointed out in [11]. A solution for this and the other issues presented in this section is provided in the next section.

3 DESIGN AND IMPLEMENTATION

The goal of HyperNF, our high performance NFV platform, is to meet the four requirements previously outlined. As a starting point, the analysis in the previous section would seem to suggest the use of a merge model, since its flexible use of (idle) CPU cycles allows us to meet the high utilization and adaptability requirements. However, the merge model does not scale to high packet rates because of high context switching costs.

Is it then possible to use the merge model but significantly reduce these overheads? One possibility would be to offload the I/O thread operations to the hypervisor, performing the entire packet forwarding in the context of a hypercall.



Before designing a system around this model, we would like to understand how much it would help in reducing the synchronization overheads in the merge model. To do so, we measure hypercall overhead by implementing a NOP hypercall in Xen. On the same server as before, we execute the hypercall one million times and measure the time it takes to complete. On average, the hypercall costs about 507 nanoseconds, 51 times cheaper than the VM context switch cost of 26 microseconds reported before. This measurement shows that hypercall-based I/O has the potential to eliminate a large portion of the synchronization overheads present in the merge model.
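As a quick sanity check on the ratio quoted above (our arithmetic, using the two reported averages):

    \frac{26.306\,\mu\mathrm{s}}{507\,\mathrm{ns}} = \frac{26306\,\mathrm{ns}}{507\,\mathrm{ns}} \approx 51.9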

3.1 HyperNF Architecture

Through the analysis in the previous sections we arrive at the following three design principles:

(1) There should be no CPU cores dedicated solely to virtual I/O. This principle suggests the use of the merge model, and allows us to meet the high utilization and adaptability requirements.

(2) Virtual I/O operations should be accounted to the VM they are being carried out for. This also points to a merge model and ensures that the NFV platform can provide accurate resource allocation.

(3) VM context switches between VMs/packet processing and virtual I/O should be avoided. Without this, it is difficult to meet the high throughput requirement.

These principles point us to HyperNF's main architectural design: virtual I/O, and in particular packet forwarding, should be done within the context of a hypercall; we call this hypervisor-based virtual I/O.

A naive approach would be to attempt to embed an entire software switch such as VALE or Open vSwitch [29, 41] in the hypervisor, but this would not only unduly increase the hypervisor's Trusted Computing Base (TCB), but would also constitute a significant porting effort, since part of the code would be OS-dependent.

Instead, HyperNF runs the VALE software switch³ in a privileged VM (in our implementation on dom0) but exports to the hypervisor all structures and state needed to implement the switch's packet forwarding code (see Figure 5). All of these objects are device-independent structures, and the packet forwarding code is even OS-independent.

In all, the changes to the Xen hypervisor are kept small, and are easy to maintain throughout version updates: most of the LoC and complex parts of VALE are left in dom0. This means that the porting effort is reasonable, the amount of code is fairly minimal, and the code itself is relatively straightforward. To be more specific, the netmap code base consists of 17,655 LoC not including NIC driver code, versus 1,173 LoC for the in-hypervisor packet forwarding code (i.e., only about 6.6% of the code base). In addition, the HyperNF architecture allows us to maintain the switch's management interfaces unmodified, so it does not break support from automation frameworks.

³ While our implementation mainly focuses on a VALE software switch, HyperNF's architecture can be applied to other software switches such as Open vSwitch.

Figure 5: HyperNF architecture. Packet switching and forwarding are done through a hypercall by exporting the back-end software switch's data path to the hypervisor and, optionally, the fast path of NIC drivers. Tx/Rx operations issued from a VM are carried by the hypercall and do not require VM switches.

In sum, HyperNF works as a fastpath to the backend switch, executing all the jobs typically done by I/O threads in a hypercall. This has multiple advantages. First, the CPU time for virtual I/O is taken directly from the time allocated to the vCPU, i.e., vCPU and virtual I/O are a single resource entity as far as the hypervisor scheduler is concerned, allowing for sharing of cycles between packet and I/O processing, for better accounting of I/O operations, and for lower overheads since there is no longer expensive VM scheduling when doing I/O. Second, since VM and I/O are now merged, the hypervisor's CPU load balancing mechanism performs well and vCPU and virtual I/O are always migrated together to the same CPU core. Third, the platform can now guarantee that the batch size set by the packet sender is respected since packet forwarding is executed within the hypercall, thus offering not only high but predictable performance.

In the rest of this section we give a more detailed description of how HyperNF sets up the switch and its structures, how packet I/O is done, and what its paravirtualized and NIC drivers look like.

3.2 HyperNF Setup

After the driver domain initializes a VALE switch instance and attaches NICs and virtual ports to it as usual, HyperNF exports internal and shared objects used by the packet forwarding path to the hypervisor. These objects include device-independent structures such as locking status objects and packet buffers, and a switch structure used mainly to obtain a list of ports.

When a VM is created, HyperNF's backend driver (described later) creates a new virtual port through the standard mechanism provided by VALE. After this, HyperNF maps packet buffers, ring indexes and locking status structures of the virtual port into the hypervisor, essentially exporting VALE's forwarding plane. For the actual memory mapping, the driver domain provides the page frame numbers of the objects to the hypervisor. At this point, Xen allocates a new virtual memory region and inserts these page frame numbers in the entries of the newly allocated virtual memory region. After this operation, the virtual port and switch-related objects can be accessed from the hypervisor over the mapped virtual memory. Next, HyperNF maps the objects related to a VM's virtual port(s) into the VM's address space by using Xen's grant tables. Once this is all done, a VM can access the netmap rings and buffers in this contiguous shared memory area. Since the layout of these data structures follows the standard netmap API, VMs can access packet buffers and ring objects by offsets.

Additionally, the driver domain has to map switch structures including hash tables and associated port information to the hypervisor's address space so that the hypervisor has access to them when making forwarding decisions.

3.3 Packet I/O

HyperNF implements a new hyperio hypercall so that VMs can access the packet switching logic exported to the hypervisor. The hypercall interface allows for specifying either a TX or an RX operation.

For transmitting packets, a VM puts payloads on packet buffers and updates the ring index variables. To start packet switching, the VM invokes the hypercall specifying a TX operation. Through the hypercall, the execution context is immediately switched into the hypervisor and reaches the entry point of the newly implemented hypercall. The hypercall executes the same logic as the VALE switch implemented in the driver domain. Since all required objects are mapped in the hypervisor, the VALE logic in the hypervisor can maintain consistency. After the execution of packet switching, which includes destination lookup and data movement (memory copy), the hypercall returns, and the VM resumes its execution.

For receiving packets, a VM calls the hypercall, this time with the RX parameter. In the hypercall, HyperNF updates the indexes of the RX rings according to the packets that have arrived. After the index update, HyperNF returns to the VM's execution context. The applications running in the VM can then consume the available packets based on the updated ring indexes.

This Tx/Rx behavior closely mimics that of netmap's standard execution model, with the slight difference that we use a hypercall instead of a syscall and that I/O is not executed in the kernel but in the hypervisor.
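To make the guest-side view concrete, the sketch below shows a standard netmap transmit path as an unmodified application would issue it on a HyperNF guest: under HyperNF the NIOCTXSYNC ioctl is serviced by the frontend driver, which issues the hyperio hypercall with the TX operation (see Section 3.4). Only the public netmap API is used; the port name "vale0:vm1" and the dummy frame are our own illustrative choices, and the hypercall itself is issued by the kernel, not shown here.

    /* Unmodified netmap TX path on a HyperNF guest: fill a slot in the TX
     * ring, advance the ring indexes and kick the kernel with NIOCTXSYNC.
     * Under HyperNF, that ioctl triggers the hyperio(TX) hypercall, so the
     * forwarding runs in the hypervisor rather than in an I/O thread. */
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <sys/ioctl.h>
    #include <string.h>

    static int send_one(struct nm_desc *d, const void *pkt, unsigned int len)
    {
        struct netmap_ring *ring = NETMAP_TXRING(d->nifp, d->first_tx_ring);

        if (nm_ring_empty(ring))                 /* no free TX slot */
            return 0;

        uint32_t i = ring->cur;
        struct netmap_slot *slot = &ring->slot[i];
        char *buf = NETMAP_BUF(ring, slot->buf_idx);

        memcpy(buf, pkt, len);                   /* payload into the buffer */
        slot->len = len;
        ring->head = ring->cur = nm_ring_next(ring, i);  /* update indexes */

        /* Kick: with HyperNF this becomes a synchronous hyperio(TX) call. */
        return ioctl(d->fd, NIOCTXSYNC, NULL) == 0 ? 1 : -1;
    }

    int main(void)
    {
        struct nm_desc *d = nm_open("vale0:vm1", NULL, 0, NULL); /* assumed port */
        if (d == NULL)
            return 1;
        char frame[60] = { 0 };                  /* dummy minimum-size frame */
        send_one(d, frame, sizeof(frame));
        nm_close(d);
        return 0;
    }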

3.4 Split Driver

Following Xen's split driver model, HyperNF implements frontend and backend drivers as separate Linux kernel modules. The backend driver running in the driver domain (dom0 in our case) creates the switch's virtual ports and exposes those ports' (netmap) buffers. The frontend driver then maps those buffers into the VM's kernel space. Further, the frontend driver provides applications with an interface consisting of poll() and ioctl() syscalls so that they can invoke the hyperio hypercall, and allows VMs to run unmodified netmap applications.
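The receive side looks equally unchanged from the application's point of view: the sketch below polls the netmap file descriptor and consumes packets with the standard helpers, while on a HyperNF guest the frontend's poll() handler is what ends up issuing the hyperio hypercall with the RX operation to refresh the ring indexes before returning. The port name is again an assumed example.

    /* Unmodified netmap RX path on a HyperNF guest: poll the descriptor
     * and drain the RX ring; the frontend refreshes the ring indexes via
     * the hyperio(RX) hypercall before the application runs again. */
    #define NETMAP_WITH_LIBS
    #include <net/netmap_user.h>
    #include <poll.h>
    #include <stdio.h>

    int main(void)
    {
        struct nm_desc *d = nm_open("vale0:vm1", NULL, 0, NULL); /* assumed port */
        if (d == NULL)
            return 1;

        struct pollfd pfd = { .fd = d->fd, .events = POLLIN };
        struct nm_pkthdr h;

        for (int received = 0; received < 10; ) {
            if (poll(&pfd, 1, 1000) <= 0)        /* wait for an RX index update */
                continue;
            while (nm_nextpkt(d, &h) != NULL) {  /* consume available packets */
                printf("received %u bytes\n", h.len);
                received++;
            }
        }
        nm_close(d);
        return 0;
    }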

3.5 Physical NIC Driver

Just like VALE, HyperNF allows NICs to be directly connected to the switch. In particular, HyperNF supports two modes of operation for NICs:

• Hypervisor driver: We port only the Tx/Rx functions of the Intel Linux 40Gb driver (i40e) to the hypervisor (about 1,248 LoC versus 48,780 for the whole driver). As with the switch, we keep most of the logic in the driver domain (e.g., the driver can still be configured by ethtool) and export any necessary structures such as NIC registers to the hypervisor. Driver initialization is still done by the driver domain (i.e., Linux).

• Driver domain driver: The driver resides in the driver domain/dom0 with I/O thread(s). The hyperio hypercall notifies dom0 via an event channel after packet forwarding takes place so that dom0 updates the relevant device registers.

The first option has the advantage of eliminating I/O threads and any synchronization costs deriving from them, but has the downside of increasing the hypervisor's TCB and needing a porting effort (even if this is sometimes small) for each individual driver.

The second option has the advantage that it gives the driver domain the opportunity to mediate packets destined to the NIC, for example for packet scheduling [37]. In addition, it does not require changes to the hypervisor. A downside is the inherent overhead of using an I/O thread. We will present evaluation results for these two execution models in Section 4.

Finally, in both modes, packet reception from a physical NIC is always handled by I/O threads in the driver domain. The number of CPU cores required depends on the packet demultiplexing logic in addition to link speeds [15].

4 BASELINE EVALUATION

In this section we carry out a thorough evaluation of HyperNF's basic performance. We show high throughput and utilization results when running different individual network functions (NFs), when concurrently running many of them on the same server, and results that show HyperNF's ability to provide performance guarantees.

For all tests we use a server with an Intel Xeon E5-2690 v4 CPU at 2.6 GHz (14 cores, Turbo Boost and Hyper-Threading disabled), 64GB of DDR4 RAM and two Intel XL710 40 Gb NICs.

We use Xen version 4.8 and the credit2 scheduler, unless otherwise stated. We disable the rate limit feature of credit2.



The rate limit specification assures that a VM can run for a minimum amount of time, and is designed to avoid frequent VM preemption. However, in our use cases, frequent VM preemption is essential for fast packet forwarding, especially in the merge model.

For all VMs including dom0, we use Linux 4.6 in PVH mode since this gives better performance than PV mode [40].

For experiments with NICs we use an additional pair of servers, one for packet generation and one as a receiver; both of these have a four-core Intel Xeon E3-1231 v3 CPU at 3.40GHz, 16GB RAM and an Intel XL710 40 Gb NIC. Both of these computers are connected via direct cable to our 14-core HyperNF server.

As a baseline, we use an I/O thread-based model whose architecture is similar to the network backend presented in the ClickOS [24] paper. VMs transmit and receive packets using the paravirtual frontend driver adopting the netmap API, and packets are forwarded by the VALE switch in the driver domain. The same application binaries are run for both the baseline virtual I/O implementation and HyperNF.

We use the VALE switch as a learning bridge, which is the default implementation in VALE [15]. We set the bridge batch size of VALE to 100 packets so that the performance scales with multiple senders [15]. Each virtual port has 1024 slots for its TX and RX rings, respectively. In NF tests, each VM has two virtual ports, and NF applications forward packets from one to the other. The VM's two virtual ports are attached to two different VALE switches so that packets are forced to go through the NF VM (rather than being forwarded locally by a VALE switch).

For all cases, we assign a sufficiently higher priority to the driver domain than to the VMs using Xen's CPU resource prioritization mechanism, so that the driver domain is scheduled often enough to handle packets. In addition, for all experiments we use netmap's pkt-gen application to generate and receive packets.

Unless otherwise stated, in this section we use a single CPU core for a VM. We always pin a VM's vCPU to it, as well as the VM's I/O threads (except for the mix model). The split model requires, by definition, two CPU cores, which would make it unfair when comparing to other models and to HyperNF. To work around this, for the split model we still assign it two CPU cores (one for the VM/vCPU, one for the I/O threads running in dom0) but cap each of the cores to 50%⁴. For the driver domain driver benchmarks, we apply the same configuration as split: one CPU core is dedicated to driver I/O and the other to the vCPU. For these cases, we use the credit scheduler in order to apply the cap command.⁵

⁴ Another option would have been to downclock the cores to half, but this is not supported in PVH mode.
⁵ The credit2 scheduler in Xen 4.8 does not support cap yet.

4.1 Single NF

For the first experiment, we measure packet I/O performance by running a pkt-gen generator in one VM and a pkt-gen receiver on another VM (i.e., a baseline measurement without an NF). The VMs are connected through a VALE switch instance running a learning bridge module.

As shown in Figure 6a, HyperNF achieves the highest throughput for all packet sizes (as much as 60% higher than split for 64B packets), with the split and merge models reaching roughly similar packet rates. Next, we measure throughput when using an external host and a NIC by running a pkt-gen generator in a VM on the HyperNF server and a pkt-gen receiver on one of our 4-core servers. We run separate tests for the NIC driver running in the driver domain (DDD) and when running in the hypervisor (HVD) (Figure 6e). As shown, the split model beats merge because of the merge model's synchronization issues reported on earlier. HyperNF outperforms both, achieving close to 15Mp/s for 64B packets on a single CPU core.

Although the graphs also show that HVD achieves higher throughput than DDD, the main reason is that the sender VM is allocated only 50% of one CPU core, limiting traffic generation, whereas HVD allocates 100% of the CPU to the sender VM. We ran the case that allocates 100% of the CPU to the sender VM in DDD, and confirmed that the performance drop in comparison to HVD was at most 12.6% (not plotted in the graphs).

Bridge: For our first NF we use a netmap-based bridge running in a VM. The VM has two virtual ports, each connected to one of two different VALE switches. In addition, we run two pkt-gen processes in dom0, one acting as a generator and one as a receiver, each connected to one of the two VALE switches (so that the bridging is done by the VM and not "locally" by the dom0 VALE switches).

Figure 6b shows the results. As in the baseline experiments, HyperNF outperforms the split and merge models. This time, split does worse than merge because of a speed mismatch between the vCPU and the I/O thread [34]: whenever the I/O thread finishes quickly, the driver domain's vCPU goes into a sleep state, and waking it up incurs overheads that lower throughput. One solution would be to add a busy loop in the I/O thread, which would certainly increase throughput but at the cost of reducing efficiency and increasing energy consumption.

Next, we carry out a similar experiment using two physical NICs, each connected to an external 4-core server, one running a pkt-gen generator and the other a receiver, with the VM acting as a forwarder as before. Figure 6f shows that once again HyperNF comes out on top, achieving the best performance when using the hypervisor-based driver (HVD).

Firewall: For the second NF we use the netmap-ipfw firewall with 10 rules, making sure that packets are not filtered so that all rules have to be traversed before a packet is forwarded⁶. As with the bridge NF, we use two separate VALE switches and a pkt-gen sender/receiver pair.

⁶ We ran the test with different numbers of rules, but this does not change HyperNF's performance relative to split/merge.



Figure 6: Throughput when running different NFs one at a time using the split model, the merge model and HyperNF with virtual ports (top graphs) and physical ports (bottom graphs): (a) no NF, virtual ports; (b) bridge NF, virtual ports; (c) firewall NF, virtual ports; (d) router NF, virtual ports; (e) no NF, physical ports; (f) bridge NF, physical ports; (g) firewall NF, physical ports; (h) router NF, physical ports.

Both in the virtual port case (Figure 6c) and when using physical NICs (Figure 6g), HyperNF outperforms the merge and split models.

IP Router: For the final NF in this section we use FastClick [2], a branch of the Click modular router [20] that adds support for netmap and DPDK, among other features such as batching. We use FastClick to run a standards-compliant IP router with 10 forwarding entries in a VM.

We use the same setup as with the previous NFs but tune FastClick's batch size for each model to obtain the highest throughput: 64 for HyperNF (the recommended size from the FastClick paper [2]), 128 for split and 512 for merge (the larger size for merge is to amortize VM switch costs). The pattern is as before: merge outperforms split, and HyperNF has the highest throughput, up to 4.2Mp/s for 64B packets (Figure 6d); the same holds for the physical port case, where, as before, the hypervisor-based driver beats the setup where the driver runs in the driver domain.

Looking at these results, the reader may be wondering why in some places HyperNF DDD performs worse than split. The performance difference comes from the balance (or imbalance) of workload between the I/O threads in the driver domain and the NFs running on the VMs. In the split case, packet switching, including memory copies, is executed by an I/O thread whose CPU resource is set to 50%. On the other hand, in the DDD case, all packet switching (except for updating the physical NIC registers) is accounted to the VMs' vCPU time, which is limited to 50%. As a result, the CPU resources assigned to the kernel thread only update physical NIC registers and so are mostly idle (and so wasted), while the CPU resources for the VM's vCPU are always busy because they have to execute both the NF application and virtual I/O. In this situation, the split model provides better load balancing by splitting a CPU's resources 50/50, as opposed to the DDD case, which does not.

Finally, note that we could have selected a wider range of NFs, or more complex ones (e.g., a full-fledged DPI based on Bro [5]). We opted not to since we do not expect overly different results from them: whether a particular NF is more CPU- or I/O-intensive, HyperNF should outperform models that statically assign resources. Further, HyperNF reduces context switch overheads, which would of course apply to any NF. Having said all of this, it would be interesting to actually quantify these effects, so we leave this as future work.

4.2 NF Consolidation

In the next set of experiments we show what happens when we start consolidating multiple NFs onto shared CPU cores. We begin by using different numbers of firewall VMs with 10 rules each, up to 36 VMs total. We dedicate 6 CPU cores to the VMs, and we assign/pin the VMs to the cores in a round-robin fashion. Out of the remaining 8 cores on the server, and in order to ensure high offered throughput, we assign 7 cores to the pkt-gen generator processes (there are as many of these as VMs) and the last core to a single pkt-gen receiver (receiving is much cheaper so a single process is enough to match all the generators). For the mix model, I/O threads are scheduled by the Linux scheduler, and for the split case 1 out of the 6 cores normally given to the VMs is dedicated to I/O. Further, we set packet size to 256B.



Figure 7: Cumulative throughput on the NFV platform when multiple NFs share cores. In this experiment 6 cores are dedicated to the VMs and assigned in a round-robin fashion. Packet size is 256B.

Setup   VM1 application   VM2 application   VM1 packet size [B]   VM2 packet size [B]
A       bridge            fw 50 rules       64                    1472
B       fw 10 rules       fw 50 rules       1024                  512

Table 1: Configurations for experiments measuring the accuracy of resource allocation.


Figure 7 shows that HyperNF outperforms the split, merge and mix models, achieving close to 18Mp/s when a single NF is running on each core (6 VMs). The graph further demonstrates that HyperNF scales well when running additional numbers of NFs per core, reaching roughly 13Mp/s when 6 NFs are assigned to each core (36 VMs total). In all, this shows that HyperNF makes better use of available resources than split (because there's no separation between I/O and vCPU cores) and has lower synchronization overheads than merge.

4.3 Accurate Resource Allocation

We measure how closely HyperNF can comply with specified resource allocations, and compare it to the split and merge models. To set the experiment up, we create two VMs, both sharing the same CPU core without resource capping, and we set different priority settings for different runs of the experiments. To add variability in terms of load on the system, we vary packet sizes and the NFs run by the VMs according to the configurations listed in Table 1.

Each VM is connected through its own VALE instance to a pkt-gen sender and receiver pair running in dom0, with the sender generating packets as fast as possible. As the metric, we calculate how much the system's goodput differs from the priority settings: error = (goodput - expected_throughput) / expected_throughput × 100, where expected_throughput = baseline × VM_priority / 100 (the baseline is the performance when a VM has 100% of a CPU core for running the NF).
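As a worked instance of this metric (illustrative numbers, not measurements from the paper): with a baseline of 4 Mpps and a VM priority of 30%, the expected throughput is 1.2 Mpps, so a measured goodput of 1.5 Mpps yields an error of +25%.

    \mathit{expected\_throughput} = 4\,\mathrm{Mpps} \times \frac{30}{100} = 1.2\,\mathrm{Mpps}, \qquad
    \mathit{error} = \frac{1.5 - 1.2}{1.2} \times 100 = +25\%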

Figure 9: Comparison between the split and merge models versus HyperNF on QEMU/KVM: (a) no NF, VM-to-VM; (b) forwarding NF. Figures 9a and 9b are comparable to the Xen-based Figures 6a and 6b, respectively.

The results are shown in Figure 8 (top row graphs). We observe an error of up to 47% for the merge model and, for the split model, errors ranging from 4% to as much as 175%. The reason for the latter's large deviation is that the baseline throughput is relatively low because of dom0 going idle and into a sleep state, as mentioned before in the single NF/firewall experiment. When put under higher load (two VMs requesting I/O operations) the driver domain never goes to sleep and so overall throughput jumps up, causing the error. At any rate, in all setups HyperNF produces a maximal error of only 2.9%.

For the test with physical ports we use the same setup as above but move the pkt-gen receivers to an external host (one of our 4-core servers), and, as usual, run two experiments, one with the hypervisor driver (HVD) and another with the NIC driver in the driver domain/dom0 (DDD). The results are shown in Figure 8 (bottom row graphs). In all cases, the maximum performance reduction error for DDD is 3.55% and only 0.37% for HVD.

These results show that HyperNF achieves accurate resource allocation even if we leave the physical NIC driver in the driver domain. This is because the most expensive parts of the processing, such as packet destination look-ups and memory copies from Tx to Rx rings, are done in the hypercall context, with the driver domain only needing to update the NIC's register variables when invoking packet transmission.

4.4 Platform Independence

To show that HyperNF's architecture is not dependent on a particular hypervisor (i.e., Xen), we implemented packet transmission and forwarding in the context of the sender VM in Linux KVM by extending ptnetmap [9, 23], which by default supports the split and merge models.

Figure 9a plots throughput between two VMs using the split and merge models versus HyperNF, and Figure 9b does so when a third VM runs an NF that forwards packets between the other two VMs. As with the Xen implementation, the graphs confirm that HyperNF outperforms the other models in all cases. Margins of improvement over the merge model are smaller than those in the Xen case (Figures 6a and 6b); this is because in QEMU/KVM, a context switch to and from an I/O thread is cheaper as it is not mediated by a separate hypervisor.

Figure 8: Accuracy of resource allocation when running on virtual ports (VP) and physical ports (PP) with two VMs and different priority ratios: (a) A-50:50 (VP), (b) A-70:30 (VP), (c) A-30:70 (VP), (d) B-50:50 (VP), (e) B-80:20 (VP), (f) B-20:80 (VP), (g) A-50:50 (PP), (h) A-70:30 (PP), (i) A-30:70 (PP), (j) B-50:50 (PP), (k) B-80:20 (PP), (l) B-20:80 (PP). In VP graphs, S, M and H represent split, merge and HyperNF, respectively. For PP, the generator runs on the same host as the VMs and the receiver on an external host. The experiment configurations (A or B) are listed in Table 1.

These results speak to the generality of HyperNF, since it is able to speed up performance on Xen and KVM, two hypervisors with important architectural differences. The gains are smaller on KVM mostly because the hypervisor has better visibility into the resources used by the VMs; however, we believe the gains are still significant enough to show that HyperNF can be applied to a range of hypervisors, not just Xen.

5 NFV EVALUATION

Having provided a baseline evaluation of HyperNF, the purpose of this section is to conduct experiments that would more closely mimic the sort of loads that an NFV platform is likely to encounter when deployed. We begin with an evaluation of a number of different NFV scenarios and finish with an evaluation of Service Function Chaining (SFC).

5.1 Dynamic NFV Tests

As stated in the introduction, one of the challenges when building an NFV platform is not only to be able to provide high aggregate throughput, but also to be able to adapt to changing conditions, not only in terms of the traffic load but also with respect to the network functions currently running on the system.

To test HyperNF's ability to cope with such changing conditions, we conduct experiments for three sets of scenarios, each with an increasing level of variability:

• Dynamic per-VM traffic: Throughout the test, we vary the offered throughput sent to the NFs. The cumulative offered throughput for the entire system is constant, as is the number of VMs running on the system; all VMs run the same NF, a firewall.

• Dynamic cumulative traffic: Both the cumulative throughput and the per-VM throughput are dynamic. The number of VMs is constant and they all run the same firewall NF.

• Dynamic traffic, dynamic NFs: Both the cumulative throughput and the per-VM throughput are dynamic. The number of VMs changes throughout the lifetime of the experiment.

In all tests we run 30 VMs on 6 CPU cores but do not pin them, leaving Xen's credit2 scheduler to perform allocation. To generate traffic, we modify pkt-gen so that it is possible to script what the traffic rate should be at each point in time with second-level granularity. This lets us not only set up the scenarios described above, but also replay/reproduce the tests, both to perform fair comparisons against the split/merge/mix models and for confidence interval purposes. Each VM has its own dom0 pkt-gen generator, and all generators send to a single dom0 pkt-gen receiver (as before, we assign 7 cores to the generators and one to the receiver). In addition, each VM has two virtual ports, one connected to a VALE switch instance to which the generator is attached, and another port connected to a separate switch instance (this one common to all VMs) to which the single receiver is attached. We set packet size to 256B.
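A minimal sketch of such a rate-scripted generator is shown below; the schedule values and the send_batch() helper are our own illustrative stand-ins, not the authors' modified pkt-gen.

    /* Sketch of a rate-scripted traffic generator: a per-second schedule
     * of target rates is replayed, and each step is paced in small
     * batches. send_batch() is a placeholder for the netmap TX path. */
    #include <stdio.h>
    #include <time.h>

    struct step { double second; double mpps; };   /* schedule entry */

    static void send_batch(int n) { (void)n; /* transmit n packets here */ }

    static double now_s(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
    }

    int main(void)
    {
        /* Illustrative schedule: ramp up, then down; last step runs 2 s. */
        struct step sched[] = { {0, 0.5}, {2, 2.0}, {4, 1.0} };
        int nsteps = sizeof(sched) / sizeof(sched[0]);
        int batch = 64;
        double t0 = now_s();

        for (int i = 0; i < nsteps; i++) {
            double rate = sched[i].mpps * 1e6;          /* packets/second */
            double interval = batch / rate;             /* seconds/batch  */
            double end = (i + 1 < nsteps) ? sched[i + 1].second
                                          : sched[i].second + 2.0;
            double next = t0 + sched[i].second;

            printf("t=%.0fs: target %.2f Mpps\n", sched[i].second, sched[i].mpps);
            while (now_s() - t0 < end) {
                send_batch(batch);                      /* one paced batch */
                next += interval;
                while (now_s() < next)                  /* busy-wait pacing */
                    ;
            }
        }
        return 0;
    }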

Regarding the different models, for split we dedicate two of the cores to I/O, leaving 4 for the VMs. In the merge model, the VM and related I/O thread (i.e., dom0) should run on the same core. The problem is that since we are not pinning these (pinning them would mean under-utilization and lower throughput with changing traffic conditions), the Xen scheduler will periodically shift them to idle cores, but not necessarily to the same one. To ensure that both I/O and VM stay merged, we use a simple script that, every 10 ms, looks up which core the VM is running on and pins the I/O thread to that same core. Finally, for the mix model, I/O threads run on the same 6 cores as the VMs, but they are not pinned to a CPU core: all assignments of I/O threads to CPU cores are handled by the Linux scheduler running in dom0.

Figure 10: Cumulative forwarded throughput for the dynamic per-VM traffic scenario. Packet size is 256B and the VMs all run a 10-rule firewall. The "Total" curve is the offered throughput.

With all of this in place, we first run the dynamic per-VM traffic scenario (Figure 10). HyperNF yields the highest cumulative throughput (i.e., the aggregate of forwarded packets from all VMs).

We plot the results for the second scenario, dynamic cumulative traffic, in Figure 11a. HyperNF not only reaches the highest throughput peaks, but is also able to quickly adapt to changing traffic conditions, closely matching, for the most part, the offered throughput (i.e., the black line labeled “Total”). One exception is at 𝑡 = 18, where HyperNF is still the highest curve but experiences a noticeable drop with respect to the offered throughput. To explain this, Figure 11b plots a CDF of the traffic rates of each of the 30 flows (one per VM) present at 𝑡 = 18. The CDF shows a large number of middle-sized flows (1.5–2 Mp/s). While one such VM is running on a core, the flows’ fairly high rates mean that the queues of the other VMs waiting to be scheduled fill up quickly and drop packets. When the distribution is more skewed towards a few large flows and a number of small ones (e.g., at 𝑡 = 24 and 𝑡 = 33), the large flows are serviced, and while they are, the queues of the VMs that are not running are sufficiently large to not drop packets.
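As a rough sanity check of this explanation (the 1024-slot ring is our own illustrative assumption, not a value reported here): a flow offered at 1.75 Mp/s fills such a per-port ring in

\[ t_{\mathrm{fill}} \approx \frac{1024\ \mathrm{slots}}{1.75 \times 10^{6}\ \mathrm{pkt/s}} \approx 0.6\ \mathrm{ms}, \]

which is far shorter than the time a VM can remain descheduled when 30 VMs share 6 cores, so the queues of the VMs that are not running overflow; with a more skewed rate distribution, most queues fill much more slowly and survive the wait.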

[Figure 11: Cumulative forwarded throughput for the dynamic cumulative traffic scenario. Packet size is 256B and the VMs all run a 10-rule firewall. The “Total” curve is the offered throughput. (a) Throughput [Mpps] vs. Time [sec] for Total, Split, Merge, Mix and HyperNF. (b) Traffic matrix: CDF of the generators’ output rates [Mpps] at 18, 24 and 33 sec.]

In the final scenario (dynamic traffic, dynamic NFs) we use the same setup as in the previous scenario, except that this time we begin with only 15 VMs at 𝑡 = 0 and bring three new VMs up every 5 seconds (in actuality we unpause existing VMs so that they become immediately active), up to a total of 30 VMs. Then, starting at 𝑡 = 30, we pause 3 VMs every 5 seconds until we reach the original 15 VMs; Figure 12b plots the number of NFs over time.

The results in Figure 12a demonstrate that HyperNF provides higher throughput than merge, split and mix, and that its curve closely follows the offered throughput (except for a peak at roughly 𝑡 = 18, as in the previous experiment). This shows HyperNF’s ability to cope not only with changing traffic loads but also with variability in the number of concurrently running NFs.

[Figure 12: Cumulative forwarded throughput for the dynamic traffic, dynamic NFs scenario. Packet size is 256B and the VMs all run a 10-rule firewall. The “Total” curve is the offered throughput. (a) Throughput [Mpps] vs. Time [sec] for Total, Split, Merge, Mix and HyperNF. (b) Number of active NFs [#] vs. Time [sec].]

5.2 Service Function Chaining

Service Function Chaining has become an essential part of an NFV platform’s toolkit. In this final experiment, we evaluate HyperNF’s performance when running potentially long chains, measuring how throughput decreases and delay increases as a function of chain length.

Each VM has two virtual ports and runs a 10-rule firewall. We run a single pkt-gen sender in dom0 to send 256B packets to the first VM in the chain, and a single pkt-gen receiver, also in dom0, to sink packets from the last VM in the chain. We use 10 CPU cores for the VMs, except for the split model, which uses 9 cores for the VMs and the leftover core for I/O. In addition to throughput, we measure latency using pkt-gen in ping-pong mode.
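The chain wiring itself is straightforward but easy to get wrong at 50 hops. As an illustration of one plausible layout (one VALE switch instance per hop, with hypothetical switch and port names; the concrete attachment mechanism is not shown here), the sketch below prints the plan together with standard pkt-gen sender/receiver invocations.

```python
#!/usr/bin/env python3
"""Illustrative wiring plan for a service chain of N firewall VMs.

Sketch only: it assumes one VALE switch instance per hop (vale0 .. valeN),
the sender on vale0 and the receiver on valeN; switch and port names are
hypothetical and merely follow netmap's 'valeX:port' naming convention.
"""
import sys


def chain_plan(num_vms, pkt_len=256):
    """Return human-readable wiring lines plus the two pkt-gen commands."""
    lines = ["sender   -> vale0:gen   (pkt-gen -i vale0:gen -f tx -l %d)" % pkt_len]
    for i in range(num_vms):
        # VM i reads from the switch of hop i and writes to the next hop.
        lines.append("vm%02d (fw) : in=vale%d:vm%da  out=vale%d:vm%db"
                     % (i, i, i, i + 1, i))
    lines.append("receiver <- vale%d:sink (pkt-gen -i vale%d:sink -f rx)"
                 % (num_vms, num_vms))
    return lines


if __name__ == "__main__":
    n = int(sys.argv[1]) if len(sys.argv) > 1 else 10
    print("\n".join(chain_plan(n)))
```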


The results are plotted in Figure 13. For all chain lengths, HyperNF shows the highest throughput and the smallest drop in performance as chains get longer. In terms of latency, HyperNF yields a maximum of 2 milliseconds for a very long chain (50 NFs); this is low compared to typical end-to-end latencies in the Internet.

[Figure 13: Service chain throughput and latency performance for increasing chain lengths. (a) Throughput [Mpps] and (b) Latency [ms] vs. number of VMs constructing a chain [#], for Split, Merge, Mix and HyperNF.]

6 RELATED WORK

NFV frameworks: The research community has explored joint optimizations of NFV platforms and the individual NFs with the goal of achieving higher performance and density. NetBricks [28] and SoftFlow [19] run middleboxes as function calls in their own context; Flurries [44] and OpenNetVM [45] run lightweight containers; E2 [27] and BESS [12] provide rich middlebox programming and placement interfaces. However, these frameworks require middlebox software to be rewritten, not just to use new packet I/O APIs but to adapt to their specific execution and protection models, imposing radical software architecture redesigns. HyperNF supports a standard NFV architecture and does not impose any modifications to existing software [7, 13, 26].

Virtual networking architectures: Elvis [14] reduces the number of I/O threads in the KVM backend to reduce context switches between them. vRIO [21] runs such threads in dedicated VMs. ptnetmap [9, 23] and ClickOS [24] keep I/O threads in the backend switch in order to reduce VM scheduling. All of the above fall into either the split or the merge model, both of which are, as we have found, inefficient for NFV workloads. Hyper-Switch [30] improves inter-VM communication for bulk TCP traffic (e.g., 64KB “packets”), running an Open vSwitch datapath within the Xen hypervisor. Its architecture is thus somewhat similar to HyperNF’s, but HyperNF provides an order of magnitude higher inter-VM packet rates, addresses problems such as VM switch overheads, and can adapt to changing traffic conditions and to the number of NFs being run on the system.

Several works [25, 31, 38] optimize the standard network backend of Xen, and [36] optimizes NIC drivers for QEMU/KVM. ELI [10] mitigates overheads caused by VM exits. Reducing VM exits [1, 22] has been a common practice for performance improvement, but HyperNF shows that aggressively exploiting batching while reducing scheduling overheads dramatically amortizes the costs of kernel and hypervisor crossings [16, 39, 43]. vTurbo [42] reduces the latency of virtual I/O execution with fine-grained scheduling; HyperNF achieves the same goal by executing virtual I/O within a hypercall.

7 DISCUSSION AND CONCLUSION

This paper addressed the problem of building a high-performance and efficient VM-based NFV platform. We demonstrated the effects of CPU assignment to VMs and their I/O threads under different models, and showed that running VMs and their I/O threads on the same CPU core yields the best results. However, we showed that even with this best practice, existing VM networking architectures either under-utilize the available resources (especially under changing load conditions) or incur excessive synchronization overheads when switching between VMs and I/O threads. To address these issues, we presented HyperNF, an NFV platform that adopts an effective virtual I/O model. By implementing the data paths of a virtual switch and of a physical NIC driver in the hypervisor, and by using a hypercall as the control path to access these data paths, HyperNF eliminates scheduling and VM switching costs for each I/O operation and mitigates the cost of sharing a core between a VM and its I/O thread. The end result is a platform that provides high throughput, utilization and adaptability to changing loads.

HyperNF could also be applicable to other, non-NFV workloads that at least partly depend on I/O operations. Such applications include in-memory key-value stores, web servers, data analytics frameworks and DNS servers, to name a few. We leave an investigation of this direction of research as future work.

8 ACKNOWLEDGMENTS

This paper has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement no. 671566 (“Superfluidity”). This paper reflects only the author’s views and the European Commission is not responsible for any use that may be made of the information it contains.

REFERENCES
[1] Ole Agesen, Jim Mattson, Radu Rugina, and Jeffrey Sheldon. 2012. Software Techniques for Avoiding Hardware Virtualization Exits. In 2012 USENIX Annual Technical Conference (USENIX ATC 12). USENIX, Boston, MA, 373–385. https://www.usenix.org/conference/atc12/technical-sessions/presentation/agesen
[2] T. Barbette, C. Soldani, and L. Mathy. 2015. Fast Userspace Packet Processing. In 2015 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS). 5–16. https://doi.org/10.1109/ANCS.2015.7110116
[3] Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, and Andrew Warfield. 2003. Xen and the Art of Virtualization. In Proceedings of the Nineteenth ACM Symposium on Operating Systems Principles (SOSP '03). ACM, New York, NY, USA, 164–177. https://doi.org/10.1145/945445.945462
[4] Fabrice Bellard. 2005. QEMU, a Fast and Portable Dynamic Translator. In Proceedings of the Annual Conference on USENIX Annual Technical Conference (ATEC '05). USENIX Association, Berkeley, CA, USA, 41–41. http://dl.acm.org/citation.cfm?id=1247360.1247401
[5] Bro. 2017. The Bro Network Security Monitor. https://www.bro.org/. (2017).
[6] Cisco. 2014. The Zettabyte Era—Trends and Analysis. Cisco White Paper. http://www.cisco.com/c/en/us/solutions/collateral/service-provider/visual-networking-index-vni/vni-hyperconnectivity-wp.html. (2014).
[7] Fortinet. 2014. FortiGate Next Generation Firewall Virtual Appliances. https://www.fortinet.com/products/virtualized-security-products/fortigate-virtual-appliances.html. (2014).
[8] Keir Fraser, Steven Hand, Rolf Neugebauer, Ian Pratt, Andrew Warfield, and Mark Williamson. 2004. Reconstructing I/O. Technical Report. University of Cambridge, Computer Laboratory.
[9] S. Garzarella, G. Lettieri, and L. Rizzo. 2015. Virtual Device Passthrough for High Speed VM Networking. In 2015 ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS). 99–110. https://doi.org/10.1109/ANCS.2015.7110124
[10] Abel Gordon, Nadav Amit, Nadav Har’El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster, and Dan Tsafrir. 2012. ELI: Bare-metal Performance for I/O Virtualization. In Proceedings of the Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS XVII). ACM, New York, NY, USA, 411–422. https://doi.org/10.1145/2150976.2151020
[11] Diwaker Gupta, Ludmila Cherkasova, Rob Gardner, and Amin Vahdat. 2006. Enforcing Performance Isolation Across Virtual Machines in Xen. In Proceedings of the ACM/IFIP/USENIX 2006 International Conference on Middleware (Middleware '06). Springer-Verlag New York, Inc., New York, NY, USA, 342–362. http://dl.acm.org/citation.cfm?id=1515984.1516011
[12] Sangjin Han, Keon Jang, Aurojit Panda, Shoumik Palkar, Dongsu Han, and Sylvia Ratnasamy. 2015. SoftNIC: A Software NIC to Augment Hardware. Tech. Rep. UCB/EECS-2015-155, Dept. EECS, Univ. of California, Berkeley, CA, USA (2015).
[13] Mark Handley, Eddie Kohler, Atanu Ghosh, Orion Hodson, and Pavlin Radoslavov. 2005. Designing Extensible IP Router Software. In Proceedings of the 2nd Conference on Symposium on Networked Systems Design & Implementation - Volume 2. USENIX Association, 189–202.
[14] Nadav Har’El, Abel Gordon, Alex Landau, Muli Ben-Yehuda, Avishay Traeger, and Razya Ladelsky. 2013. Efficient and Scalable Paravirtual I/O System. In 2013 USENIX Annual Technical Conference (USENIX ATC 13). USENIX, San Jose, CA, 231–242. https://www.usenix.org/conference/atc13/technical-sessions/presentation/har
[15] Michio Honda, Felipe Huici, Giuseppe Lettieri, and Luigi Rizzo. 2015. mSwitch: A Highly-scalable, Modular Software Switch. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research (SOSR '15). ACM, New York, NY, USA, Article 1, 13 pages. https://doi.org/10.1145/2774993.2775065
[16] Michio Honda, Felipe Huici, Costin Raiciu, Joao Araujo, and Luigi Rizzo. 2014. Rekindling Network Protocol Innovation with User-level Stacks. SIGCOMM Comput. Commun. Rev. 44, 2 (April 2014), 52–58. https://doi.org/10.1145/2602204.2602212
[17] Jinho Hwang, K. K. Ramakrishnan, and Timothy Wood. 2014. NetVM: High Performance and Flexible Networking Using Virtualization on Commodity Platforms. In 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14). USENIX Association, Seattle, WA, 445–458. https://www.usenix.org/conference/nsdi14/technical-sessions/presentation/hwang
[18] Intel. 2017. Intel DPDK: Data Plane Development Kit. http://dpdk.org/. (2017).
[19] Ethan J. Jackson, Melvin Walls, Aurojit Panda, Justin Pettit, Ben Pfaff, Jarno Rajahalme, Teemu Koponen, and Scott Shenker. 2016. SoftFlow: A Middlebox Architecture for Open vSwitch. In Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference (USENIX ATC '16). USENIX Association, Berkeley, CA, USA, 15–28. http://dl.acm.org/citation.cfm?id=3026959.3026962
[20] Eddie Kohler, Robert Morris, Benjie Chen, John Jannotti, and M. Frans Kaashoek. 2000. The Click Modular Router. ACM Trans. Comput. Syst. 18, 3 (Aug. 2000), 263–297. https://doi.org/10.1145/354871.354874
[21] Yossi Kuperman, Eyal Moscovici, Joel Nider, Razya Ladelsky, Abel Gordon, and Dan Tsafrir. 2016. Paravirtual Remote I/O. In Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS '16). ACM, New York, NY, USA, 49–65. https://doi.org/10.1145/2872362.2872378
[22] Alex Landau, Muli Ben-Yehuda, and Abel Gordon. 2011. SplitX: Split Guest/Hypervisor Execution on Multi-core. In Proceedings of the 3rd Conference on I/O Virtualization (WIOV '11). USENIX Association, Berkeley, CA, USA, 1–1. http://dl.acm.org/citation.cfm?id=2001555.2001556
[23] V. Maffione, L. Rizzo, and G. Lettieri. 2016. Flexible Virtual Machine Networking Using Netmap Passthrough. In 2016 IEEE International Symposium on Local and Metropolitan Area Networks (LANMAN). 1–6. https://doi.org/10.1109/LANMAN.2016.7548852
[24] Joao Martins, Mohamed Ahmed, Costin Raiciu, Vladimir Olteanu, Michio Honda, Roberto Bifulco, and Felipe Huici. 2014. ClickOS and the Art of Network Function Virtualization. In Proceedings of the 11th USENIX Conference on Networked Systems Design and Implementation (NSDI '14). USENIX Association, Berkeley, CA, USA, 459–473. http://dl.acm.org/citation.cfm?id=2616448.2616491
[25] Aravind Menon, Alan L. Cox, and Willy Zwaenepoel. 2006. Optimizing Network Virtualization in Xen. In USENIX Annual Technical Conference.
[26] Netgate. 2017. pfSense Virtual Appliances. https://www.netgate.com/appliances/pfsense-virtual-appliances.html. (2017).
[27] Shoumik Palkar, Chang Lan, Sangjin Han, Keon Jang, Aurojit Panda, Sylvia Ratnasamy, Luigi Rizzo, and Scott Shenker. 2015. E2: A Framework for NFV Applications. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 121–136. https://doi.org/10.1145/2815400.2815423
[28] Aurojit Panda, Sangjin Han, Keon Jang, Melvin Walls, Sylvia Ratnasamy, and Scott Shenker. 2016. NetBricks: Taking the V out of NFV. In Proceedings of the 12th USENIX Conference on Operating Systems Design and Implementation (OSDI '16). USENIX Association, Berkeley, CA, USA, 203–216. http://dl.acm.org/citation.cfm?id=3026877.3026894
[29] Ben Pfaff, Justin Pettit, Teemu Koponen, Ethan Jackson, Andy Zhou, Jarno Rajahalme, Jesse Gross, Alex Wang, Joe Stringer, Pravin Shelar, Keith Amidon, and Martin Casado. 2015. The Design and Implementation of Open vSwitch. In 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15). USENIX Association, Oakland, CA, 117–130. https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/pfaff
[30] Kaushik Kumar Ram, Alan L. Cox, Mehul Chadha, and Scott Rixner. 2013. Hyper-Switch: A Scalable Software Virtual Switching Architecture. In 2013 USENIX Annual Technical Conference (USENIX ATC 13). USENIX, San Jose, CA, 13–24. https://www.usenix.org/conference/atc13/technical-sessions/presentation/ram
[31] Kaushik Kumar Ram, Jose Renato Santos, Yoshio Turner, Alan L. Cox, and Scott Rixner. 2009. Achieving 10 Gb/s Using Safe and Transparent Network Interface Virtualization. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (VEE '09). ACM, New York, NY, USA, 61–70. https://doi.org/10.1145/1508293.1508303
[32] Luigi Rizzo. 2012. Netmap: A Novel Framework for Fast Packet I/O. In Proceedings of the 2012 USENIX Conference on Annual Technical Conference (USENIX ATC '12). USENIX Association, Berkeley, CA, USA, 9–9. http://dl.acm.org/citation.cfm?id=2342821.2342830
[33] Luigi Rizzo. 2017. netmap-ipfw. https://github.com/luigirizzo/netmap-ipfw. (2017).
[34] Luigi Rizzo, Stefano Garzarella, Giuseppe Lettieri, and Vincenzo Maffione. 2016. A Study of Speed Mismatches Between Communicating Virtual Machines. In Proceedings of the 2016 Symposium on Architectures for Networking and Communications Systems (ANCS '16). ACM, New York, NY, USA, 61–67. https://doi.org/10.1145/2881025.2881037
[35] Luigi Rizzo and Giuseppe Lettieri. 2012. VALE, a Switched Ethernet for Virtual Machines. In Proceedings of the 8th International Conference on Emerging Networking Experiments and Technologies (CoNEXT '12). ACM, New York, NY, USA, 61–72. https://doi.org/10.1145/2413176.2413185
[36] Luigi Rizzo, Giuseppe Lettieri, and Vincenzo Maffione. 2013. Speeding Up Packet I/O in Virtual Machines. In Proceedings of the Ninth ACM/IEEE Symposium on Architectures for Networking and Communications Systems (ANCS '13). IEEE Press, Piscataway, NJ, USA, 47–58. http://dl.acm.org/citation.cfm?id=2537857.2537864
[37] Luigi Rizzo, Paolo Valente, Giuseppe Lettieri, and Vincenzo Maffione. 2016. PSPAT: Software Packet Scheduling at Hardware Speed. Technical Report. Università di Pisa.
[38] Jose Renato Santos, Yoshio Turner, G. Janakiraman, and Ian Pratt. 2008. Bridging the Gap Between Software and Hardware Techniques for I/O Virtualization. In USENIX 2008 Annual Technical Conference (ATC '08). USENIX Association, Berkeley, CA, USA, 29–42. http://dl.acm.org/citation.cfm?id=1404014.1404017
[39] Livio Soares and Michael Stumm. 2010. FlexSC: Flexible System Call Scheduling with Exception-less System Calls. In Proceedings of the 9th USENIX Conference on Operating Systems Design and Implementation (OSDI '10). USENIX Association, Berkeley, CA, USA, 33–46. http://dl.acm.org/citation.cfm?id=1924943.1924946
[40] Elena Ufimtseva. 2015. PVH: Faster, Improved Guest Model for Xen. http://events.linuxfoundation.org/sites/events/files/slides/PVH Oracle Slides LinuxCon final v2 0.pdf. (2015).
[41] Open vSwitch. 2017. Open vSwitch. http://www.openvswitch.org/. (2017).
[42] Cong Xu, Sahan Gamage, Hui Lu, Ramana Kompella, and Dongyan Xu. 2013. vTurbo: Accelerating Virtual Machine I/O Processing Using Designated Turbo-Sliced Core. In 2013 USENIX Annual Technical Conference (USENIX ATC 13). USENIX, San Jose, CA, 243–254. https://www.usenix.org/conference/atc13/technical-sessions/presentation/xu
[43] Kenichi Yasukata, Michio Honda, Douglas Santry, and Lars Eggert. 2016. StackMap: Low-Latency Networking with the OS Stack and Dedicated NICs. In 2016 USENIX Annual Technical Conference (USENIX ATC 16). USENIX Association, Denver, CO, 43–56. https://www.usenix.org/conference/atc16/technical-sessions/presentation/yasukata
[44] Wei Zhang, Jinho Hwang, Shriram Rajagopalan, K. K. Ramakrishnan, and Timothy Wood. 2016. Flurries: Countless Fine-Grained NFs for Flexible Per-Flow Customization. In Proceedings of the 12th International Conference on Emerging Networking Experiments and Technologies (CoNEXT '16). ACM, New York, NY, USA, 3–17. https://doi.org/10.1145/2999572.2999602
[45] Wei Zhang, Guyue Liu, Wenhui Zhang, Neel Shah, Phillip Lopreiato, Gregoire Todeschi, K. K. Ramakrishnan, and Timothy Wood. 2016. OpenNetVM: A Platform for High Performance Network Service Chains. In Proceedings of the 2016 Workshop on Hot Topics in Middleboxes and Network Function Virtualization (HotMiddlebox '16). ACM, New York, NY, USA, 26–31. https://doi.org/10.1145/2940147.2940155