
High Performance in the Cloud with FPGA Groups

Anca Iordache, IRISA / University of Rennes 1

Guillaume Pierre, IRISA / University of Rennes 1

Peter Sanders, Maxeler Technologies

Jose Gabriel de F. Coutinho, Imperial College London

Mark Stillwell, Imperial College London

ABSTRACT

Field-programmable gate arrays (FPGAs) can offer invaluable computational performance for many compute-intensive algorithms. However, to justify their purchase and administration costs it is necessary to maximize resource utilization over their expected lifetime. Making FPGAs available in a cloud environment would make them attractive to new types of users and applications and help democratize this increasingly popular technology. However, there currently exists no satisfactory technique for offering FPGAs as cloud resources and sharing them between multiple tenants. We propose FPGA groups, which are seen by their clients as a single virtual FPGA, and which aggregate the computational power of multiple physical FPGAs. FPGA groups are elastic, and they may be shared among multiple tenants. We present an autoscaling algorithm to maximize FPGA groups' resource utilization and reduce user-perceived computation latencies. FPGA groups incur a low overhead in the order of 0.09 ms per submitted task. When faced with a challenging workload, the autoscaling algorithm increases resource utilization from 52% to 61% compared to a static resource allocation, while reducing task execution latencies by 61%.

CCS Concepts

• Computer systems organization → Cloud computing; Data flow architectures;

Keywords

Cloud computing; FPGA; Virtualization.

1. INTRODUCTION

Field-programmable gate arrays (FPGAs) offer invaluable computational performance for many compute-intensive algorithms. They are becoming increasingly popular in the high-performance computing community for their excellent computation density, performance/price and performance/energy ratios [26]. FPGAs are 40 times faster than CPUs at processing some of Microsoft Bing's algorithms [19], and they are commonly used in domains as diverse as financial data processing [22], signal processing [32], and DNA sequence alignment [3]. Intel recently purchased Altera (the biggest FPGA manufacturer) for $16.7bn, and it expects that 1/3 of cloud service provider nodes will use FPGAs by 2020 [15].

To justify the purchase and administration costs of such devices it is necessary to maximize resource utilization over their expected lifetime. However, maximizing utilization may be difficult for many applications whose computation needs are well below the capacity of one FPGA, or whose workload intensity varies significantly over time. The owners of such applications are therefore likely to ignore the benefits of FPGAs and prefer less efficient but more flexible solutions.

We claim that making FPGAs available in a cloud environment would lower the barrier and make them attractive to new classes of applications. For example, FPGAs can be programmed to execute the AdPredictor click-through rate prediction algorithm [13] orders of magnitude faster than its counterpart implementations based on CPUs [9]. A personalized advertisement service using this algorithm could exploit this performance to process incoming new facts in real time and continuously adapt its recommendations to any change in user behavior. However, simply providing entire physical FPGAs attached to a virtual machine instance (similar to the GPU-enabled instance types proposed by Amazon Web Services) would not be sufficient, as the workload of a personalized advertisement service may vary considerably over time. Maximizing FPGA utilization therefore requires one to deliver elastic processing capacities ranging from fractions of a single device's capability to that of multiple devices merged together.

Turning complex FPGA devices into easy-to-use cloud resources requires one to address two main issues. First, programming FPGAs requires skills and expertise. FPGAs are essentially a set of logical gates (AND, OR, NOT) which can be dynamically wired programmatically. The best performance is obtained when the FPGA design closely matches the data-flow logic of the program itself, following a pipelined architecture. Circuit designs are typically compiled from high-level programming languages, but this process requires specific skills and experience [4]. While this is not a major issue for large organizations willing to invest massively in FPGA technologies, it may be a significant hurdle for smaller users.


The second issue is the lack of satisfactory techniques for virtualizing FPGAs. Current solutions are based either on statically partitioning the gate array between multiple applications (i.e., sharing in space), or on naive context switching (i.e., sharing in time). As discussed in Section 2, both techniques exhibit significant problems: sharing in space implies that each application must use a smaller number of digital gates, thereby negatively impacting performance. Conversely, naive time sharing incurs prohibitive context-switching costs, as reconfiguring an FPGA from one circuit design to another takes in the order of a couple of seconds.

To become fully integrated as regular cloud resources, virtual FPGAs should exhibit the following properties:

Management: FPGAs should expose an abstract interface that allows them to be managed by the cloud platform, including tasks such as allocation, deallocation, deployment, execution, and resource utilization monitoring.

Programmability: Once FPGAs have been provisioned to cloud tenants, tenants should be able to program them to suit their application needs, similar to CPU compute resources.

Sharing: Like other types of cloud resources, FPGAs should allow the same physical device to be shared by multiple tenants in order to maximize resource utilization.

Accessibility: To facilitate sharing, FPGAs should not only be made available to virtual machines executing on the same host. Rather, they should be organized as a pool of resources remotely accessible from any host.

Performance: To retain the high performance/price ratio of physical FPGAs, the performance overhead of FPGA virtualization should remain minimal.

High utilization: When multiple virtual FPGAs compete for a limited set of physical resources, the processing capacity of physical FPGAs should be dynamically assigned to the virtual FPGAs which need it the most.

Isolation: Sharing resources requires that each resource instance is completely isolated from the others, preventing tenants from accessing each other's data through the shared device.

We propose the notion of FPGA groups. An FPGA group is composed of one or more physical FPGAs which are configured with the exact same circuit design. By load-balancing incoming execution requests between its members, an FPGA group may be considered by its clients as a single virtual FPGA which aggregates the computational capacity of multiple physical FPGAs. FPGA groups are elastic, as one can easily add or remove physical devices to/from a group. An FPGA group may, for example, be created by a cloud tenant whose computational needs exceed the capacity of a single FPGA.

FPGA groups may also be shared among multiple tenants who wish to use the same circuit design. Although this condition is very unlikely in the case where tenants compile their own circuit designs from custom code, we claim it is realistic in the case of circuit designs chosen from a standard library. Such a library would typically contain highly-optimized circuits for common types of functions in domains such as data analysis (with functions such as regression, correlation and clustering), multimedia (with functions such as video encoding and fast Fourier transform), and machine learning (with functions for Bayesian and neural networks).

Finally, multiple FPGA groups may also compete for the use of a limited set of physical devices. In this case, we present an autoscaling algorithm which dynamically assigns FPGAs to FPGA groups. This algorithm maximizes FPGA utilization (which improves the cloud provider's revenues), while reducing individual task execution times in most cases.

The key novelty in this work is that virtual FPGAs can time-share a pool of physical FPGAs, which grows or shrinks dynamically in response to load. Cloud tenants view a virtual FPGA as a single device which encompasses all its logic resources and memory, but which may be backed by multiple physical FPGAs at runtime. Cloud tenants simply load their circuit once to create static or self-managed elastic groups.

Our experiments show that FPGA groups incur a low overhead in the order of 0.09 ms per task, while effectively aggregating the processing capacity of multiple FPGAs. When faced with a challenging workload, our autoscaling algorithm increases resource utilization from 52% to 61% compared to a static resource allocation, while reducing the average task execution latency by 61%.

This paper is organized as follows. Section 2 introduces the state of the art. Section 3 presents the design of FPGA groups, and Section 4 discusses elasticity and autoscaling. Finally, Section 5 evaluates our work and Section 6 concludes.

2. STATE OF THE ART

2.1 FPGA primer

FPGAs are semiconductor devices organized as two-dimensional or three-dimensional arrays of Configurable Logic Blocks (CLBs) [21]. Each CLB is an elementary circuit which can be programmatically configured to realize a variety of simple logical functions, such as AND/OR/NOT digital gates, flip-flops and 1-bit full adders. The interconnection between CLBs is highly reconfigurable as well, allowing one to realize any digital electronic circuit (within the size limits of the gate array) by proper reconfiguration of the CLBs and their interconnection matrix. FPGA boards are often also equipped with external memory and a PCI Express® interface for high-performance communication with their host machine.

FPGA circuit designs are usually implemented using hardware description languages such as Verilog and VHDL, which operate at a very low level of abstraction using logic elements such as digital gates, registers and multiplexors. These descriptions can also be automatically derived from higher-level programming languages similar to C, Java or OpenCL [4, 7]. However, to obtain the best possible performance, FPGA programmers must still have a deep understanding of the mapping between a high-level program and its low-level translation, and annotate their code with appropriate optimization guidelines [14]. Efficient programming of FPGAs therefore requires specific training and experience, which effectively creates a barrier to entry for new developers to exploit the full potential of FPGAs. For this reason, FPGA manufacturers propose highly-optimized off-the-shelf libraries for specific computations. Although many of these libraries are domain-specific, they can also provide generic functions such as various types of data analysis computations, multimedia processing, and machine learning [18, 33].

Figure 1: Local and remote virtualization of accelerators: (a) using PCI passthrough; (b) using API remoting locally; and (c) using API remoting remotely.

FPGAs can outperform CPU-based compute resources by at least one order of magnitude while running at considerably lower clock frequencies, because the physical organization of a hardware design can closely match the dataflow logic of the implemented algorithm and be implemented as a deep pipelined architecture. This allows multiple function executions to operate in parallel over different parts of the gate array, providing high-throughput performance. A typical usage pattern therefore consists of a client program issuing large batches of task execution requests, which can then be efficiently pipelined by the FPGA.
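As an illustration, the following Python sketch shows this batched submission pattern from the client side. The Task wrapper and the submit_batch()/wait() calls are invented names for illustration, not an actual vendor API.

# A hypothetical illustration of the batched usage pattern described above;
# Task, fpga.submit_batch() and handle.wait() are assumptions, not a real API.
class Task:
    """Placeholder task descriptor: an input buffer plus parameters."""
    def __init__(self, data):
        self.data = data

def run_batched(fpga, inputs, batch_size=300):
    """Submit tasks in large batches so the FPGA pipeline stays full and
    the network overhead is paid once per batch rather than once per task."""
    results = []
    for start in range(0, len(inputs), batch_size):
        batch = [Task(x) for x in inputs[start:start + batch_size]]
        handles = fpga.submit_batch(batch)          # asynchronous submission
        results.extend(h.wait() for h in handles)   # collect pipelined results
    return results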

2.2 Integrating FPGAs in virtualized environments

Client programs running on traditional machines typically use an FPGA as a co-processor device. Applications access the FPGA via local library functions designed to translate function calls to batches of asynchronous task execution requests issued via the PCIe bus.

Integrating FPGAs in a virtualized environment requires one to rethink the communication between the client program within a virtual machine and the FPGA. Ideally, this communication should exhibit as little overhead as possible so as not to reduce the interest of using accelerators. It should also allow any VM to make use of any FPGA device, regardless of the physical location of the VM and the FPGA. There exist three main techniques for this, which have been applied to support access to either FPGAs or GPGPUs (the issues and solutions being very similar for both types of devices).

I/O passthrough exposes the accelerator to a virtual machine running on the same server using the 'passthrough' feature supported by all modern hypervisors [30]. This technique, illustrated in Figure 1(a), is for example used by Amazon EC2 to create GPU-enabled virtual machine instances [2]. I/O passthrough delivers performance levels similar to the native capacity of the PCI bus. However, it restricts the usage of an accelerator only to the VMs running on the same host. In addition, it currently does not allow multiple co-located VMs to share the same physical FPGA.

Paravirtualization allows multiple VMs to share a physical accelerator by providing a device model backed by the hypervisor, the guest operating system and the device driver. Examples of this approach include GPUvm [27] and pvFPGA [31], supporting GPGPUs and FPGAs, respectively. However, this technique suffers from performance overheads and remains limited to sharing the accelerator between VMs running on the same server machine.

API remoting, shown in Figure 1(b), allows multiple applications to use a common accelerator API such as OpenCL or CUDA. Calls to the API are intercepted in the VM and passed through to the host OS on which the accelerator is accessible. Each VM appears to have exclusive access to the accelerator, and software on the host resolves any contention. A number of technologies employ API remoting in tightly coupled systems, including vCUDA [25], gVirtuS [11] and LoGV [12]. In such a case the issue of sharing becomes one of scheduling, and each application obtains a fraction of the accelerator dependent on the other users and the scheduling algorithm employed.

API remoting also allows accessing a (possibly shared) accelerator remotely, as shown in Figure 1(c). In this case, the choice of network technology becomes very important, as it can become the dominating performance factor. High-speed networks such as Infiniband, whose bandwidth is similar to that of PCI, provide a solution by enabling the use of accelerators on nodes separate from the host application. A number of technologies employ API remoting over a network, including rCUDA [10], DS-CUDA [20], dOpenCL [17] and VirtualCL [5]. Additional studies with Infiniband networks show that this is a practical solution for rCUDA [24] and DS-CUDA [16].

2.3 Related work on FPGA virtualization

Allowing cloud tenants to reserve fractions of an FPGA's processing capacity requires a cloud operator to share FPGA devices among multiple tenants. Most of the research on accelerator virtualization is focused on the many-to-one solution where several applications share a single accelerator, since the accelerator is the expensive device that needs to be shared.

Sharing in space consists in running multiple independent FPGA designs (possibly belonging to different tenants) next to each other in the gate array. This is made possible by a technique called partial reconfiguration, where the FPGA area is divided into multiple regions, allowing each region to be reconfigured with a particular circuit design [6, 8]. This approach effectively parallelizes the execution of the different designs, and increases the device's utilization. However, space sharing reduces the area that is made available to host an FPGA design, which can have a considerable performance impact because it limits the number of functional units that can work in parallel. Sharing in space also requires some switching logic in the FPGA to route incoming requests to the appropriate design, which can add additional overhead.

Sharing in time consists in executing a single FPGA design at any point of time, but switching the FPGA usage from tenant to tenant over time (similarly to operating system process-level context switching). This approach is often overlooked in the FPGA community due to the very high reconfiguration costs from one design to another: a naive implementation would impose prohibitive context switching costs in the order of a couple of seconds.

The shared FPGA groups proposed in this paper belong to the sharing-in-time category. In order to avoid issuing costly FPGA reconfigurations each time a new task is submitted, shared FPGA groups retain the same configuration across large numbers of task submissions, and incur reconfiguration costs only when an FPGA needs to be removed from one group and added to another. Apart from this infrequent operation, tasks submitted by multiple tenants can therefore execute with no reconfiguration delay.

Table 1: A summary of related work.

Approach        Accel. type   Sharing method    Access   Sharing type
AWS EC2 [2]     GPGPU         PCI passthrough   Host     Many-to-one
GPUvm [27]      GPGPU         Paravirt.         Host     Many-to-one
pvFPGA [31]     FPGA          Paravirt.         Host     Many-to-one
vCUDA [25]      GPGPU         API remoting      Host     Many-to-one
gVirtuS [11]    GPGPU         API remoting      Host     Many-to-one
rCUDA [24]      GPGPU         API remoting      Net.     Many-to-one
DS-CUDA [16]    GPGPU         API remoting      Net.     Many-to-one
[6], [8]        FPGA          Partial reconf.   Host     Many-to-one
FPGA groups     FPGA          Time sharing      Net.     Many-to-many

A summary of related work is presented in Table 1. These approaches focus on many-to-one scenarios allowing multiple VMs to share a single FPGA. To the best of our knowledge, our approach is the first method which also considers the one-to-many and many-to-many situations where an application may use a single (virtual) accelerator which is backed by multiple physical FPGAs.

3. FPGA GROUPS

A client application using FPGAs usually executes on a host that has an FPGA board connected via a dedicated bus, such as PCI Express®. The application must first reconfigure the FPGA with a specific design which provides the appropriate function, and then perform a sequence of I/O operations on the FPGA, including initializing static data on the FPGA, streaming input data to it, and streaming results from it. The computation is performed on the data as it flows through the logic of the FPGA design.

However, in this architecture the FPGAs are accessible only to applications running on the host machine, which limits the number of applications that can share the FPGAs. To support a full many-to-many mapping between FPGAs and the VMs accessing them, our solution is to co-locate a number of FPGAs in so-called "MPC-X" appliances, which clients communicate with using the API remoting technique over an Infiniband network [28]. This solution allows client applications to run on regular servers and use FPGAs remotely on the MPC-X appliance.

3.1 The MPC-X Appliance

Figure 2 provides an overview of an MPC-X. In this setup, tasks and results are sent to the appliance across an Infiniband network. This network technology is particularly suitable for such usage thanks to its low packet latency and RDMA support, which avoids unnecessary data copying within the MPC-X.

The switch fabric of the MPC-X manages the arrival and execution of tasks on the appropriate FPGA. Although it would seem natural to implement it in software and execute it on a standard processor within the MPC-X device, such a design is likely to incur significant performance overheads without RDMA support [28, 23]. Instead, the MPC-X implements the switch fabric directly in hardware. As it is only concerned with the efficient execution of tasks on FPGAs, it can be designed solely for this purpose, so its performance is not impacted by the side effects of running the application's CPU code.

Figure 3: System architecture with three application VMsand two MPC-X’s. Application A uses function f() providedby a group of five FPGAs. Applications B and C share agroup of three FPGAs supplying function g().

Figure 2: An MPC-X with eight FPGAs and two Infiniband interfaces. (The figure shows eight FPGA cards, each with 48 GB of RAM, connected through a switch fabric to the Infiniband network.)

We envisage that a cloud infrastructure supporting FPGAs in a multi-tenant environment will consist of multiple MPC-X appliances interconnected with (some of) the CPU machines using Infiniband networks. Each MPC-X provides a number of FPGA devices accessible to any CPU client machine via the Infiniband network.

Our MPC-X appliances include eight FPGAs (each equipped with 48 GB of RAM) and two Infiniband interfaces to overcome any bandwidth limitation of the Infiniband connection.

3.2 Resource ManagementMPC-X’s and physical FPGAs are passive hardware de-

vices. They have no operating system nor embedded soft-ware that can handle even basic resource management oper-ations such as allocating/deallocating FPGAs, sharing thembetween applications, and establishing a connection with au-thorized client programs.

These operations are handled by a software component called the Orchestrator, as shown in Figure 3. The Orchestrator runs on a regular server machine connected to the same Infiniband network as the MPC-X's and the client machines. Similar to cloud resource manager services such as OpenStack Nova, the Orchestrator is in charge of maintaining the reservation state of each FPGA, including whether it is currently reserved and by which tenant. When a client reserves one or more (physical) FPGAs, the Orchestrator chooses available FPGAs, updates their reservation state, and returns a handle to the client containing the address of the MPC-X it belongs to and a local identifier. The Orchestrator is also in charge of managing the FPGA groups, as we discuss next.

3.3 FPGA Groups

A client application which requires the combined processing capacity of multiple FPGAs can obviously reserve the devices it needs, configure all FPGAs with the same design, and load-balance its task execution requests across them. However, this introduces complexity in the application, as it would need to decide how many FPGAs to request and then load-balance the tasks across those FPGAs. Sharing these FPGAs across multiple client machines is even more challenging.

We propose a virtualization infrastructure using so-called FPGA groups. An FPGA group presents itself to an application as a single virtualized FPGA. However, this virtual computational resource can be backed by a pool of physical FPGAs that perform the actual computation. All the FPGAs within the group are configured with the same hardware design. The client application submits tasks to the virtual FPGA in exactly the same way it would to a single physical FPGA.

As shown in Figure 3, in our current implementation an FPGA group resides entirely within a single MPC-X. When the Orchestrator receives a request to create a new group, it allocates the required number of FPGAs within one MPC-X. It then sets up a new receive (Rx) buffer to hold the arriving task requests, and a number of queues equal to the number of physical FPGAs in the MPC-X (regardless of the number of FPGAs assigned to this group). A new Rx buffer and set of queues are created each time a new client application wants to use the FPGA group, and deleted after the client application terminates. For example, in Figure 3, the FPGA group holding function g() is shared between application B and application C; each application accesses its own Rx buffer and queues in the FPGA server.

On the application side, the group is represented as a fixed-size in-flight queue which holds references to the tasks that are being processed on the MPC-X. There is also a wait queue which stores the waiting tasks.

We carefully designed this organization to minimize the task-submission overhead. Each application can submit its own tasks from the client-side wait queue to the group-side queues with no interference from other clients. When a task arrives in the MPC-X's Rx buffer, it is placed into the task queue with the shortest length.

Creating as many task queues as there are FPGAs in the MPC-X implies that the number of task queues is always greater than or equal to the number of FPGAs in a group. This allows us (as we discuss in the next section) to minimize the time taken to dynamically scale the group at runtime. When a queue has at least one task to execute, it attempts to lock an FPGA from the pool of FPGAs in the group. When a lock has been obtained, the task is executed on the corresponding FPGA. After execution, the FPGA is unlocked and the task result is sent back to the application. Thus, each queue has a simple and sequential behavior. Note that tasks in the same task queue do not necessarily execute on the same FPGA.
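The following Python sketch illustrates this queue-to-FPGA dispatch under stated assumptions: the switch fabric is actually implemented in hardware, and FpgaDevice.execute() and task.reply() are hypothetical stand-ins for its behavior.

# A minimal sketch of the server-side dispatch described above. One worker
# serves each task queue sequentially; the real MPC-X does this in hardware.
import queue

class FpgaDevice:
    """Stub for one physical FPGA configured with the group's design."""
    def execute(self, task):
        ...  # run the task through the pipelined circuit, return the result

class FpgaGroup:
    def __init__(self, fpgas):
        self.free = queue.Queue()   # pool of idle FPGAs, all configured
        for f in fpgas:             # with the same circuit design
            self.free.put(f)

def queue_worker(task_queue, group):
    """Serve one task queue: lock any idle FPGA, execute, unlock."""
    while True:
        task = task_queue.get()     # next task placed by the Rx buffer
        fpga = group.free.get()     # lock: blocks until an FPGA is idle
        try:
            result = fpga.execute(task)
        finally:
            group.free.put(fpga)    # unlock: return the FPGA to the pool
        task.reply(result)          # stream the result back to the client

As in the text, tasks from the same queue may run on different FPGAs, and having as many queues as physical devices keeps every FPGA busy whenever work is available.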

An additional benefit of creating a separate Rx buffer per client application is that it makes it easy to revoke an application's access to the FPGA group, for example if the tenant decides to stop using a shared FPGA group.

3.4 Discussion

Memory isolation. One important issue for virtual FPGAs is memory isolation between multiple independent tasks running simultaneously in the same FPGA group, but potentially belonging to different tenants. FPGAs have neither Memory Management Units (MMUs) nor kernel/userspace separation, which makes it impossible for them to page memory the same way CPU-based servers do. When an FPGA group is used by a single user (e.g., to serve tasks issued by an elastic set of VMs), the FPGA group will consider all incoming tasks as if they had been issued by a single VM. In multi-tenant scenarios, our system currently does not address isolation issues. As a consequence, we restrict this multi-tenant scenario to the use of stateless FPGA designs such as signal processing and video encoding.

Inter-task dependencies. So far we have assumed that tasks are independent from each other: this allows us to execute incoming tasks in a worker thread-pool pattern. However, one may for example want to stream large amounts of data and to perform statistical analysis across multiple tasks. To address such cases, a client can send a pseudo-task to the FPGA group which performs a 'lock' function. This reserves a queue and its associated FPGA exclusively for this client until the 'unlock' function is sent. Obviously, only a finite number of concurrent locks can be held at any time, namely the number of FPGAs in the group.
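A hypothetical client-side use of this mechanism might look as follows; group.submit() and the task names are assumptions for illustration only.

# Hypothetical client-side use of the 'lock' pseudo-task described above;
# group.submit() and the task names are illustrative assumptions.
def stream_with_state(group, chunks):
    group.submit("lock")                    # reserve one queue and its FPGA
    try:
        for chunk in chunks:                # stateful tasks now all execute
            group.submit("process", chunk)  # on the same physical FPGA
    finally:
        group.submit("unlock")              # release the queue and FPGA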

Orchestrator scalability. Although we represented the Orchestrator (and the Autoscaler introduced in the next section) as a single server in our architecture, they maintain very simple state which may be stored in a fault-tolerant key-value store: for each FPGA group they essentially store a copy of the circuit design (which represents about 40 MB per design) and the list of FPGAs which are currently part of the group. This allows for easy replication and/or partitioning across the data center. In addition, any temporary unavailability of the Orchestrator would only impair the ability to start/delete/resize FPGA groups, without affecting the availability of the existing FPGA groups, or their functionality.

Scaling an FPGA group beyond a single MPC-X. There is no fundamental reason why FPGA groups cannot span more than one MPC-X. This is essentially a limitation of our current implementation: spreading FPGA groups across multiple servers would require an additional load-balancing mechanism at the client side, which implies that the client-side library must receive notifications upon every update in the list of MPC-X's belonging to a group. This feature is currently under development.

4. ELASTICITY AND AUTOSCALING

What has been described so far uses a fixed number of physical FPGAs in each FPGA group. However, this structure allows the easy addition and removal of FPGAs from a group without any disruption to executing tasks or any interaction with the client application. We discuss FPGA group elasticity first, then present our autoscaling algorithm.

4.1 Elasticity of virtual FPGAs


The design of FPGA groups makes it easy to add or remove FPGAs to/from an FPGA group located in the same MPC-X. All that has to be done is reconfigure the FPGA with the appropriate design and update the load-balancing information in the task queues at the server side. The resized group still presents itself to an application as a single virtualized FPGA; however, this virtual computational resource has a varying number of physical FPGAs that perform the actual computation.

This elasticity allows a cloud provider to place multiple FPGA groups on the same MPC-X and dynamically reassign physical FPGAs to groups according to the demand each group is experiencing, similar to the way hypervisors can dynamically reallocate the CPU shares granted to virtual CPUs. This automatic elasticity is managed by a software component called the Autoscaler, as shown in Figure 3.

4.2 Autoscaling of virtual FPGAs

The workload incurred by FPGA groups may significantly vary over time because of workload fluctuations in the client applications themselves, as well as the arrival or departure of client applications making use of each group. Increasing the number of applications using the same group raises the contention of the FPGAs in the group, and consequently it affects the performance of individual applications. The Autoscaler monitors the resource utilization of each group and periodically rebalances the number of FPGAs assigned to each group.

The aim of this algorithm is to maximize the infrastructure utilization while improving the overall application completion time. Therefore, more applications get access to the FPGAs, which in turn is also beneficial for the cloud provider.

The Autoscaler periodically computes the average task runtime for each FPGA group (defined as the sum of the queuing time and the processing time). It then resizes the groups to minimize the average runtime across all groups which share the same MPC-X.

Note that resizing FPGA groups is an expensive operation, as it requires reconfiguring one or more FPGAs with a new hardware design. To avoid spending more time reconfiguring FPGAs than using them for processing incoming tasks, the periodicity at which group sizes are updated (currently set by the administrators) should be at least an order of magnitude greater than the reconfiguration time, which we have observed to be in the order of a few seconds in our appliances (see Section 5).

The Autoscaler periodically computes the total queue length per client application and per FPGA group. A group with empty task queues is scaled down to the minimum working size of one FPGA, allowing other groups to utilize the idle FPGAs. When not enough FPGAs are available to satisfy all the demands, the Autoscaler balances the groups by assigning FPGAs in proportion to their workloads.

Figure 4 shows the metrics of interest in FPGA group scaling. Tasks from different applications can have different runtimes $t_i$. We define $T_i$ as the total runtime of all tasks in an application's queue $Q_i$ corresponding to a group $G$: $T_i = t_i \times \mathrm{size}(Q_i)$. The total processing time for all tasks submitted to a group $G$ is the sum of the total processing times of all the queues in the group: $R_G = \sum_i T_i$.

Figure 4: Two FPGA groups receiving tasks and the server-side task queues (client-side queues are omitted as they do not participate in the autoscaling algorithm).

The objective of the Autoscaler is to balance the overall completion times for the $n$ groups sharing FPGAs of the same MPC-X. This is achieved by minimizing the mean of the absolute values of the pairwise $R_{G_i}$ differences:

\[ \text{minimize} \quad \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left| R_{G_i} - R_{G_j} \right| \]

Solving this optimization problem requires knowing the size of the queues using a group and the runtime $t$ of every task. The information concerning the number of tasks in each queue can be retrieved periodically from the Orchestrator. On the other hand, the runtime of a task is unknown to the system. In order to determine it, we rely on the periodic measurements the Autoscaler retrieves from the Orchestrator. These measurements express how many tasks of each type have been executed in the measuring interval. Aggregating this information over several measurements helps build a linear equation system where the unknown variables are the runtimes of the tasks corresponding to the queues connected to an FPGA group, and the coefficients are the numbers of tasks executed in one measurement interval.

The resulting system for one group is as follows:

\[ \begin{aligned} A_{11} t_1 + A_{12} t_2 + \cdots + A_{1n} t_n &= I \\ A_{21} t_1 + A_{22} t_2 + \cdots + A_{2n} t_n &= I \\ &\;\vdots \\ A_{m1} t_1 + A_{m2} t_2 + \cdots + A_{mn} t_n &= I \end{aligned} \]

where:

  n    = number of applications using the group;
  m    = number of past measurements of task execution;
  I    = measuring interval;
  t_i  = task runtime corresponding to queue i;
  A_ij = number of executed tasks of type j in the interval i.

Note that the Autoscaler builds a system for each group. In order to solve the system and calculate the runtimes $t_i$, we apply a non-negative least-squares (NNLS) algorithm [29]: given an $m \times n$ matrix $A$ and an $m$-vector $I$, the algorithm computes an $n$-vector $T$ that solves the least squares problem $A \times T = I$, subject to $T \geq 0$.
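As an illustration, this estimation step can be sketched with an off-the-shelf NNLS solver. The function below is an assumption about how the Autoscaler could compute the runtimes, not the production implementation; variable names follow the equations above.

# A minimal sketch of the runtime-estimation step, assuming per-interval
# task counts are available from the Orchestrator. Uses SciPy's NNLS solver.
import numpy as np
from scipy.optimize import nnls

def estimate_task_runtimes(A, interval):
    """A[i][j] = number of type-j tasks executed during measurement i;
    interval = length I of each measurement interval (seconds).
    Returns the estimated per-queue task runtimes t_j >= 0 (seconds)."""
    A = np.asarray(A, dtype=float)         # m x n matrix of task counts
    I = np.full(A.shape[0], interval)      # right-hand side: one I per row
    t, _residual = nnls(A, I)              # solve A t = I subject to t >= 0
    return t

# Example: two applications, three 10-second measurement intervals.
counts = [[20, 5], [12, 8], [25, 2]]       # tasks executed per interval
print(estimate_task_runtimes(counts, 10.0))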

Solving the linear system provides an approximation of the task runtimes, which are then used in calculating the total processing time required of a group. The Autoscaler then assigns an FPGA share proportional to the demand on each group. The scaling algorithm is presented in Algorithm 1.


Algorithm 1 Autoscaling Algorithm

Input: Groups G = {G_1, G_2, ..., G_n}
Output: Group sizes S_new

 1: I = scaling interval (e.g., 10 s)
 2: S ← {S_1, S_2, ..., S_n} where S_i is the current size of G_i ∈ G
 3: initialize S_new = {0, 0, ..., 0}
 4: for i = 1 to |G| do
 5:     A ← {A_1, A_2, ..., A_n}  ▷ number of executed tasks for each queue of group G_i in the last measurement interval I
 6:     T ← {T_1, T_2, ..., T_n}  ▷ total runtime estimates for A_i on G_i of size S_i using NNLS
 7:     t ← {T_j / S_i} for j ∈ {1...n}  ▷ runtime estimate of T_j on a single FPGA
 8:     TW ← tasks waiting to be executed on G_i
 9:     R_i ← Σ TW × t  ▷ required processing time for G_i
10: S_new ← {R_i × (Σ_{j=1}^{n} S_j) / (Σ_{j=1}^{n} R_j)} for i ∈ {1...n}
11: return S_new
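For concreteness, the following Python sketch implements the proportional-share computation of Algorithm 1, reusing the hypothetical estimate_task_runtimes() helper from the previous sketch; the Group record and its fields are assumptions for illustration.

# A Python sketch of Algorithm 1; Group is an illustrative record of the
# per-group state the Autoscaler reads from the Orchestrator.
from dataclasses import dataclass

import numpy as np

@dataclass
class Group:
    size: int      # S_i: current number of physical FPGAs
    counts: list   # A: per-interval executed-task counts (m x n)
    waiting: list  # TW: number of waiting tasks per client queue

def autoscale(groups, interval=10.0):
    """Return the new (fractional) FPGA share for each group."""
    R = []
    for g in groups:
        T = estimate_task_runtimes(g.counts, interval)  # NNLS estimates
        t = np.asarray(T) / g.size             # runtime on a single FPGA
        R.append(float(np.dot(g.waiting, t)))  # R_i = sum_j TW_j * t_j
    total_fpgas = sum(g.size for g in groups)
    total_R = sum(R)
    if total_R == 0:                           # no demand anywhere:
        return [1.0 for _ in groups]           # keep minimal groups
    return [r * total_fpgas / total_R for r in R]

A real implementation would additionally round these shares to integers, keep the minimum working size of one FPGA per group, and only migrate an FPGA when the expected gain outweighs the few seconds of reconfiguration time.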

This algorithm tries to treat all applications as fairly as possible in order to minimize the overall runtime of the current workload. An extension to this work might rather prioritize certain applications that are considered more valuable.

5. EVALUATION

We now evaluate the performance of FPGA groups considering two perspectives: (i) the cloud provider's interest to maximize resource utilization and accommodate as many client applications as possible; and (ii) client applications aiming to minimize the runtime and cost of their tasks.

Our experiments are based on a single MPC-X equipped with eight ALTERA STRATIX-V FPGAs and two Infiniband interfaces which load-balance the workload between applications and the appliance. The Orchestrator distributes the workload by selecting which connection an application should use to access the MPC-X at load time. Latency measurements were issued at the client side, while utilization measurements were done by querying the MPC-X's public API.

In our experiments, the average time to configure an FPGA when resizing a group is 3.78 s ± 0.13, but the actual time for moving an FPGA from one group to another is slightly larger because of the need to wait until currently-running tasks have finished executing. The Autoscaler is configured to re-evaluate the FPGA group sizes every 10 seconds. An additional server machine submits tasks to the FPGAs.

5.1 Virtualization overhead

We first evaluate the performance overhead due to the FPGA group mechanisms. We submit a simple task workload to a single FPGA so that it does not overload the FPGAs: the client issues one task requiring 0.33 ms every second.

Figure 5 compares the task execution latency when the application addresses a single non-virtualized FPGA (without FPGA groups) and a virtualized FPGA group composed of a single FPGA. We can see that, even in the fastest case, the client-side task submission and network transfers between the client machine and the FPGA add a total execution latency in the order of 0.45 ms. This is the reason why FPGA programmers typically submit task execution requests in batches, so they incur the network overhead only once for the whole batch rather than once per task.

Figure 5: FPGA virtualization overhead: (a) task execution delays over time; (b) statistical distribution of task execution delays (task runtime 0.33 ms; distribution peaks around 0.78 ms for the physical FPGA and 0.87 ms for the virtual FPGA).

Figure 6: Effects of FPGA group elasticity: average task latency as a function of FPGA group size, for workloads of 1500, 3000 and 6000 tasks/s.

We can also see the performance overhead due to virtualization: the latency of tasks submitted to the virtual FPGA is on average 0.09 ms greater than when submitted to the physical FPGA. This difference is due to the additional queues introduced by FPGA groups. We can however note that virtualization does not increase the variability of execution latencies compared to a non-virtualized scenario.

5.2 FPGA group elasticity

We now evaluate the effectiveness of FPGA group elasticity to handle large task execution workloads. We issued a constant load of 1500 tasks/second, 3000 tasks/second or 6000 tasks/second to FPGA groups of various sizes. Each task has an execution time on the FPGA of 1 ms.

Figure 6 shows the effect of varying the FPGA group sizes when handling these workloads. The smallest workload (1500 tasks/second) can in principle be handled with two FPGAs. When using only one FPGA, task execution requests pile up in the queues: execution latencies are very high, with very high variability. However, with a group of size 2, the task execution latency decreases to roughly 0.64 ms, with almost no variability. When increasing the group size further, the average task execution latency decreases even further, down to 0.23 ms. Similar behavior happens with larger workloads: when the FPGA group is too small to handle the incoming stream of task execution requests, task execution latencies are very high, with large standard deviation. When the group becomes sufficiently large to handle the tasks, latencies decrease drastically, with very low standard deviation.

This demonstrates that FPGA groups can actually aggregate the processing capacity of multiple physical FPGAs. Varying the group size effectively allows one to control the capacity of the virtual FPGA, while keeping these operations totally invisible to the client application.

5.3 FPGA group autoscaling

We now evaluate the FPGA group autoscaling algorithm using three synthetic applications that can be executed using FPGA groups which compete for resources:

Application A consists of tasks which require 0.33 ms of execution time. This mimics the behavior of a Fast Fourier Transform operated over a buffer of 262,144 double-precision values. To maximize execution efficiency, tasks are submitted in batches of 300 tasks.

Application B is a scaled-up version of application A where tasks require 1 ms of execution time. Here as well, tasks are sent in batches of 300 tasks.

Application C consists of long-running tasks such as video encoding which require 100 ms of execution time. Tasks are sent in batches of 3 tasks.

Figure 7 compares the behavior and performance of static-size and autoscaled FPGA groups based on two workload scenarios. For both figures the static-size groups were provisioned with two FPGAs for application A, four for application B, and two for application C. We chose these numbers such that the group sizes would be proportional to the total workload execution time of each application.

Figure 7a shows a scenario with a light workload where the FPGA server has more than enough resources to process all incoming requests, despite workload variations in applications A, B and C (depicted in the top part of the figure). In this situation, the static FPGA allocation reaches an average resource utilization of 32%, while keeping individual batch execution latencies low (on average 0.43 s per batch).

In this scenario where no group resizing is necessary, the autoscaling system incurs the overhead of FPGA group reconfiguration while having nothing to gain from this operation. It would be very easy to adjust Algorithm 1 so that it refrains from any resizing operation when all current queue lengths are small enough. We however decided to let the Autoscaler overreact to workload variations to highlight the overhead of FPGA group reconfiguration.

The bottom half of Figure 7a shows the resizing decisions taken by the Autoscaler, and their impact on resource utilization and batch execution latency. We see a spike in execution latency each time a significant resizing operation takes place. The reason for this is threefold: first, upon any reconfiguration there is a period of several seconds during which fewer FPGAs are available to process batch execution requests; second, the FPGA group which gets shrunk may not have sufficient resources to process its workload, so requests queue up until the next resizing operation where the group size may be increased again; and, finally, in this particular experiment these unavoidable costs are not compensated by any benefits from the useless rescaling operation.

Figure 7b shows a scenario with twice as much workload as previously. Here, the static FPGA groups cannot efficiently process all incoming requests. In particular, application A does not have enough resources to handle its peak workload, so batch execution latencies for application A grow substantially, up to 22 s per batch. On the other hand, during peak activity of application A, application B underutilizes its own resources, so the obvious solution is to temporarily reassign FPGAs from application B to application A. This is exactly what the Autoscaler does in this case. As a result, application A now shows much better performance, at the expense of slightly slowing down application B. Interestingly, between times 150 s and 200 s, both applications A and B experience high workloads, which together exceed the total processing capacity of the system. In this situation, the Autoscaler maintains a reasonable fairness between all applications.

These experiments highlight the interest of dynamically resizing FPGA groups in situations where multiple applications compete for limited resources. On the one hand, autoscaling increases the average resource utilization from 52% to 61%, which allows the cloud provider to process more workload with the same resources. Using static groups only, the administrator of application A would probably rather decide to reserve additional FPGAs to handle high-workload scenarios. On the other hand, the Autoscaler benefits the tenants by reducing the average batch execution latency by 61%, from 6.49 s to 2.55 s, thanks to a better usage of these resources.

Figure 7: FPGA group autoscaling under (a) a low workload and (b) a high workload. Each scenario plots, over 600 s, the submitted jobs/sec of applications A, B and C, the utilisation and job latency of static and elastic groups, and the elastic group sizes. Low workload: static groups average 32.53% ± 18.02 utilisation and 0.43 s ± 0.15 job latency; elastic groups average 38.05% ± 19.12 and 0.59 s ± 0.64. High workload: static groups average 51.94% ± 23.24 utilisation and 6.49 s ± 6.68 job latency; elastic groups average 60.91% ± 28.59 and 2.55 s ± 2.76.

6. CONCLUSION

FPGAs have the potential of complementing the current landscape of accelerator devices for high-performance computations. When applied to suitable problems, they deliver excellent performance, computational density and energy consumption. However, FPGAs currently remain limited to scenarios where a single device is attached to one client application. Maximizing FPGA utilization in such conditions is not easy.

We proposed to virtualize FPGAs as a way to increase their usage flexibility. With FPGA groups, one can create virtual FPGAs which aggregate the processing capacity of one or more physical FPGAs. FPGA groups can be resized manually or automatically to maximize device utilization and further improve the user experience. FPGA groups can be used by a single tenant, or shared between multiple tenants. However, to isolate tenants from each other, this second scenario should be limited to stateless FPGA designs, i.e., designs that do not keep state from one job to another.

We envisage that cloud operators may exploit FPGA groups in two different ways. An Infrastructure-as-a-Service cloud may augment its compute resource offering by providing FPGA groups as a service that any tenant may create, resize and use as they wish. Alternatively, a Platform-as-a-Service cloud may use shared FPGA groups to offer efficient high-level computation services for standard algorithms such as Fast Fourier transform, machine learning and video compression, similar to the "Amazon Machine Learning" service provided by Amazon Web Services [1]. Users of these services would not even realize that FPGAs are processing their API calls. Enjoying the performance benefits of FPGA groups using a pay-as-you-go model, and without having to master their complex programming model, would arguably help democratize FPGA-based high performance computing in the cloud.

7. REFERENCES

[1] Amazon Web Services. Amazon Machine Learning. http://aws.amazon.com/aml/.

[2] Amazon Web Services. EC2: Elastic Compute Cloud. http://aws.amazon.com/ec2/.

[3] J. Arram, W. Luk, and P. Jiang. Ramethy: Reconfigurable acceleration of bisulfite sequence alignment. In Proc. ACM/SIGDA Intl. Symposium on Field-Programmable Gate Arrays, 2015.

[4] D. F. Bacon, R. Rabbah, and S. Shukla. FPGA programming for the masses. ACM Queue, 11(2), 2013.

[5] A. Barak and A. Shiloh. The VirtualCL (VCL) cluster platform. http://www.mosix.org/txt_vcl.html.

[6] S. Byma, J. G. Steffan, H. Bannazadeh, A. L. Garcia, and P. Chow. FPGAs in the cloud: Booting virtualized hardware accelerators with OpenStack. In Proc. FCCM, 2014.

[7] J. M. Cardoso and P. C. Diniz. Compilation Techniques for Reconfigurable Architectures. Springer, 2009.

[8] F. Chen, Y. Shan, Y. Zhang, Y. Wang, H. Franke, X. Chang, and K. Wang. Enabling FPGAs in the cloud. In Proc. CF, 2014.

[9] J. G. Coutinho, O. Pell, E. O'Neill, P. Sanders, J. McGlone, P. Grigoras, W. Luk, and C. Ragusa. HARNESS project: Managing heterogeneous computing resources for a cloud platform. In Reconfigurable Computing: Architectures, Tools, and Applications. Springer, 2014.

[10] J. Duato, A. J. Pena, F. Silla, R. Mayo, and E. S. Quintana-Ortí. rCUDA: Reducing the number of GPU-based accelerators in high performance clusters. In Proc. HPCS, 2010.

[11] G. Giunta, R. Montella, G. Agrillo, and G. Coviello. A GPGPU transparent virtualization component for high performance computing clouds. In Proc. Euro-Par, 2010.

[12] M. Gottschlag, M. Hillenbrand, J. Kehne, J. Stoess, and F. Bellosa. LoGV: Low-overhead GPGPU virtualization. In Proc. FHC, 2013.

[13] T. Graepel, J. Q. Candela, T. Borchert, and R. Herbrich. Web-scale Bayesian click-through rate prediction for sponsored search advertising in Microsoft's Bing search engine. In Proc. ICML, 2010.

[14] P. Grigoras, X. Niu, J. G. Coutinho, W. Luk, J. Bower, and O. Pell. Aspect driven compilation for dataflow designs. In Proc. ASAP, 2013.

[15] Intel. Acquisition of Altera. Intel Investor Conference Call Deck, 2015. http://bit.ly/1Q1VBqK.

[16] A. Kawai, K. Yasuoka, K. Yoshikawa, and T. Narumi. Distributed-shared CUDA: Virtualization of large-scale GPU systems for programmability and reliability. In Proc. FCTA, 2012.

[17] P. Kegel, M. Steuwer, and S. Gorlatch. dOpenCL: Towards a uniform programming approach for distributed heterogeneous multi-/many-core systems. In Proc. HCW, 2012.

[18] Maxeler Technologies. Maxeler AppGallery. http://appgallery.maxeler.com/.

[19] R. McMillan. Microsoft supercharges Bing search with programmable chips. Wired, 2014. http://www.wired.com/2014/06/microsoft-fpga/.

[20] M. Oikawa, A. Kawai, K. Nomura, K. Yasuoka, K. Yoshikawa, and T. Narumi. DS-CUDA: A middleware to use many GPUs in the cloud environment. In Proc. SCC, 2012.

[21] D. W. Page. Dynamic data re-programmable PLA. U.S. patent US 4524430 A, 1985. http://www.google.com/patents/US4524430.

[22] S. Parsons, D. E. Taylor, D. V. Schuehler, M. A. Franklin, and R. D. Chamberlain. High speed processing of financial information using FPGA devices. U.S. patent US 7921046 B2, 2011. https://www.google.com/patents/US7921046.

[23] S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy, and D. Panda. Efficient inter-node MPI communication using GPUDirect RDMA for InfiniBand clusters with NVIDIA GPUs. In Proc. ICPP, 2013.

[24] C. Reano, R. Mayo, E. S. Quintana-Ortí, F. Silla, J. Duato, and A. J. Pena. Influence of InfiniBand FDR on the performance of remote GPU virtualization. In Proc. IEEE Cluster, 2013.

[25] L. Shi, H. Chen, J. Sun, and K. Li. vCUDA: GPU-accelerated high-performance computing in virtual machines. IEEE Transactions on Computers, 61(6), 2012.

[26] S. Sirowy and A. Forin. Where's the beef? Why FPGAs are so fast. Technical Report MSR-TR-2008-130, Microsoft Research, 2008.

[27] Y. Suzuki, S. Kato, H. Yamada, and K. Kono. GPUvm: Why not virtualizing GPUs at the hypervisor? In Proc. USENIX ATC, 2014.

[28] Maxeler Technologies. New Maxeler MPC-X series: Maximum performance computing for big data applications, 2012. http://bit.do/b9ZYX.

[29] Turku PET Centre. libtpcmodel. http://www.turkupetcentre.net/software/libdoc/libtpcmodel/nnls_8c_source.html.

[30] J. P. Walters, A. J. Younge, D.-I. Kang, K.-T. Yao, M. Kang, S. P. Crago, and G. C. Fox. GPU-Passthrough performance: A comparison of KVM, Xen, VMWare ESXi, and LXC for CUDA and OpenCL applications. In Proc. IEEE CLOUD, 2014.

[31] W. Wang, M. Bolic, and J. Parri. pvFPGA: Accessing an FPGA-based hardware accelerator in a paravirtualized environment. In Proc. CODES+ISSS, 2013.

[32] R. Woods, J. McAllister, Y. Yi, and G. Lightbody. FPGA-based Implementation of Signal Processing Systems. Wiley, 2008.

[33] Xilinx Inc. Applications. http://www.xilinx.com/applications.html.