
Kernelet: High-Throughput GPU Kernel Executions with Dynamic Slicing and Scheduling

Jianlong Zhong, Bingsheng He

Abstract—Graphics processors, or GPUs, have recently been widely used as accelerators in shared environments such as clusters and clouds. In such shared environments, many kernels are submitted to GPUs from different users, and throughput is an important metric for performance and total ownership cost. Despite the recently improved runtime support for concurrent GPU kernel executions, the GPU can be severely underutilized, resulting in suboptimal throughput. In this paper, we propose Kernelet, a runtime system with dynamic slicing and scheduling techniques to improve the throughput of concurrent kernel executions on the GPU. With slicing, Kernelet divides a GPU kernel into multiple sub-kernels (namely slices). Each slice has tunable occupancy to allow co-scheduling with other slices and to fully utilize the GPU resources. We develop a novel and effective Markov chain based performance model to guide the scheduling decision. Our experimental results demonstrate up to 31.1% and 23.4% performance improvement on NVIDIA Tesla C2050 and GTX680 GPUs, respectively.

Index Terms —Graphics processors, Dynamic scheduling, Concurrent kernel executions, Dynamic Slicing, Performance models

1 INTRODUCTION

The graphics processing unit (or GPU) has become an effective accelerator for a wide range of applications, from computation-intensive applications (e.g., [25], [26], [43]) to data-intensive applications (e.g., [10], [18]). Compared with multicore CPUs, new-generation GPUs can have much higher computation power in terms of FLOPS and memory bandwidth. For example, an NVIDIA Tesla C2050 GPU can deliver a peak single precision floating point performance of over one TFLOPS and a memory bandwidth of 144 GB/s. Due to their immense computation power and memory bandwidth, GPUs have been integrated into clusters and cloud computing infrastructures. In the Top500 list of November 2012, two of the top ten supercomputers have GPUs integrated. Amazon and Penguin have provided virtual machines with GPUs. In both cluster and cloud environments, GPUs are often shared by many concurrent GPU programs (or kernels), most likely submitted by multiple users. Additionally, to enable sharing GPUs remotely, a number of software frameworks such as rCUDA [7] and V-GPU [45] have been developed. This paper studies whether and how we can improve the throughput of concurrent kernel executions on the GPU.

Throughput is an important optimization metric for efficiency and the total ownership cost of GPUs in such shared environments. First, many GPGPU applications such as scientific and financial computing tasks are usually throughput oriented [9]. A high throughput leads to high performance and productivity for users. Second, compared with CPUs, GPUs are still expensive devices. Therefore, a high throughput not only means high utilization of GPU resources but also reduces the total ownership cost of running applications on the GPU. That might also be one of the reasons that GPUs are usually deployed and shared to handle many kernels from users.

• J. Zhong and B. He are with the School of Computer Engineering, Nanyang Technological University, Singapore, 639798. E-mail: [email protected], [email protected]

Recently, we have witnessed the success of GPGPU research. However, most studies focus on single-kernel optimizations (e.g., new data structures [17] and GPU-friendly computing patterns [10], [11]). Despite the fruitful research, a single kernel usually severely underutilizes the GPU. This severe underutilization is mainly due to the inherent memory and computation behavior of a single kernel (e.g., irregular memory accesses and execution pipeline stalls). In our experiments, we have studied eight common kernels (details are presented in Section 5). On C2050, their average IPC is 0.52, which is far from the optimal value (1.0). Their memory bandwidth utilization ranges from only 0.02% to 14%.

Recent GPU architectures like the NVIDIA Fermi architecture [27] support concurrent kernel execution, which allows multiple kernels to be executed on the GPU simultaneously if resources allow. In particular, Fermi adopts a form of cooperative kernel scheduling: other kernels requesting the GPU must wait until the kernel occupying the GPU voluntarily yields control. Here, we use NVIDIA CUDA's terminology, simply because CUDA is nowadays widely adopted in GPGPU applications. A kernel consists of multiple executions of thread blocks with the same program on different data, where the execution order of thread blocks is not defined. On the Fermi GPUs, one kernel can take the entire GPU as long as it has sufficient thread blocks to occupy all the multi-processors (even though it can have severely low resource utilization).


Concurrent execution of two such kernels almost degrades to sequential execution on individual kernels. Recent studies [34] on scheduling concurrent kernels mainly focus on kernels with low occupancy (i.e., the thread blocks of a single kernel cannot fully utilize all GPU multiprocessors). However, the occupancy of kernels (with large input data sizes in practice) is usually high after single-kernel optimizations.

Individual kernels as a whole cannot realize real sharing of the GPU resources. A natural question is: can we slice a kernel into small pieces and then co-schedule slices from different kernels in order to improve the GPU resource utilization? The answer is yes. One observation is that GPU kernels (e.g., those written in CUDA or OpenCL) conform to the SPMD (Single Program Multiple Data) execution model. In such data-parallel executions, a kernel execution can usually be divided into multiple slices, each consisting of multiple thread blocks. Slices can be viewed as low-occupancy kernels and can be executed simultaneously with slices from other kernels. The GPGPU concurrent kernel scheduling problem is thus converted to a slice scheduling problem.

With slicing, we have two more issues to address. The first issue is the slicing itself: what is a suitable slice size, and how do we perform the slicing in a transparent manner? The smallest granularity of a slice is one thread block, which can lead to significant runtime overhead from submitting too many such small slices to the GPU for execution. At the other extreme, the largest granularity of a slice is the entire kernel, which degrades to the non-sliced execution. The second issue is how to select the slices to co-schedule in order to maximize the GPU utilization.

To address those issues, we develop Kernelet, a runtime system with dynamic slicing and scheduling techniques to improve the GPU utilization. Targeting concurrent kernel executions in the shared GPU environment, Kernelet dynamically performs slicing on the kernels, and the slices are carefully designed with tunable occupancy to allow slices from other kernels to utilize the GPU resources in a complementary way, e.g., one slice utilizes the computation units while the other utilizes the memory bandwidth. We develop a novel and effective Markov chain based performance model to guide kernel slicing and scheduling in order to maximize the GPU resource utilization. Compared with existing GPU performance models, which are limited to a single kernel only, our model is designed to handle heterogeneous workloads (i.e., slices from different kernels). We further develop a greedy co-scheduling algorithm that always co-schedules the slices from the two kernels with the highest performance gain according to our performance model.

We have evaluated Kernelet on two of the latest GPU architectures (Tesla C2050 and GTX680). The GPU kernels under study have different memory and computational characteristics. Experimental results show that 1) our analytical model can accurately capture the performance of heterogeneous workloads on the GPU, and 2) our scheduling techniques increase the GPU throughput by up to 31.1% and 23.4% on C2050 and GTX680, respectively.

Organization. The rest of the paper is organized as follows. We introduce the background and the definition of our problem in Section 2. Section 3 presents the system overview, followed by the detailed design and implementation in Section 4. The experimental results are presented in Section 5. We review the related work in Section 6 and conclude this paper in Section 7.

2 BACKGROUND AND PROBLEM DEFINITION

In this section, we briefly introduce the background on GPU architectures, and then present our problem definition.

2.1 GPU Architectures

GPUs have rapidly evolved into a powerful accelerator for many applications, especially after CUDA was released by NVIDIA [30]. The top tags on http://gpgpu.org/ show that a wide range of applications have been accelerated with GPGPU techniques, including vision and graphics, image processing, linear algebra, molecular dynamics, physics simulation, scientific computing, etc. Those applications cover quite a wide range of computation and memory intensiveness. In shared environments like clusters and clouds, it is very likely that users submit kernels from different applications to the same GPU. Thus, it is feasible to schedule kernels with different memory and computation characteristics to better utilize GPU resources.

This paper focuses on the design and implementation with NVIDIA CUDA. Since OpenCL and CUDA have very similar designs, our design and implementation can be extended to OpenCL with little modification. Kernelet takes advantage of the concurrent kernel execution capability of new-generation GPUs like NVIDIA Fermi GPUs. With the introduction of CUDA, a GPU can be viewed as a many-core processor with a set of streaming multi-processors (SMs). Each SM has a set of scalar cores, which execute instructions in the SIMD (Single Instruction Multiple Data) manner. The SMs in turn execute in the SPMD manner. The GPU program is called a kernel.

In CUDA's abstraction, GPU threads are organized in a hierarchical configuration: usually 32 threads are first grouped into a warp; warps are further grouped into thread blocks. The CUDA runtime performs mapping and scheduling at the granularity of thread blocks. Each thread block is mapped and scheduled on an SM, and cannot be split among multiple SMs. Once a thread block is scheduled, its warps become active on the SM. The warp is the smallest scheduling unit on the GPU. Each SM uses one or more warp schedulers to issue instructions of the active warps. Another important resource of the GPU is shared memory, which is a small piece of scratchpad memory at the scope of a thread block.


Fig. 1: Application scenarios of concurrent kernel executions on the GPUs. (a) GPU shared within one machine: users (VMs) access the GPUs on a GPU server through a GPU virtualization layer, with Kernelet running on the server. (b) GPU shared by remote clients: remote nodes reach the GPU server over the network through an API reception framework (like rCUDA), with Kernelet running on the server.

It is small and has very low latency, and it is visible to all the threads in the same thread block.

We define the SM occupancy as the ratio of active warps to the maximum number of active warps that are allowed to run on the SM. Higher occupancy means higher thread parallelism. The aggregated register and shared memory usage of all warps should not exceed the total amount of registers and shared memory available on an SM.

Block Scheduling. The CUDA runtime system maps thread blocks to SMs in a round-robin manner. If the number of thread blocks in a kernel is less than the number of SMs on the GPU, each thread block will be executed on a dedicated SM; otherwise, multiple thread blocks will be mapped to the same SM. The number of thread blocks that can be concurrently executed on an SM depends on their total resource requirements (in terms of registers and shared memory). If the kernel currently being executed on the GPU cannot fully utilize the GPU, the GPU can schedule thread blocks from other kernels for execution.

GPU Code Compilation. In the process of CUDA compilation, the compiler first compiles the CUDA C code to PTX (Parallel Thread Execution) code. PTX is a virtual machine assembly language and offers a stable programming model for evolving hardware architectures. PTX code is further compiled to native GPU instructions (SASS); the GPU can only execute SASS code. CUDA executables and libraries usually provide either PTX code or SASS code, or both. In shared environments, the source code is usually not available for kernel scheduling when users submit their kernels. Thus, Kernelet should be designed to work on both PTX and SASS code.
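As an illustration of why PTX (or SASS) alone is sufficient for scheduling, the following sketch loads a PTX module at runtime through the CUDA driver API and launches a kernel from it without any CUDA C source. The file name matrixAdd.ptx, the kernel name and the launch configuration are placeholders for illustration, not part of Kernelet.

#include <cuda.h>

/* Minimal sketch: load a PTX module at runtime and launch one kernel from it.
 * Error checking is omitted for brevity; the kernel name must match the
 * (possibly mangled) entry name inside the PTX file. */
int main(void) {
    CUdevice dev;
    CUcontext ctx;
    CUmodule mod;
    CUfunction func;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* The PTX (or a cubin containing SASS) is all that is needed. */
    cuModuleLoad(&mod, "matrixAdd.ptx");
    cuModuleGetFunction(&func, mod, "MatrixAdd");

    /* Kernel arguments are passed as an array of pointers to their values. */
    CUdeviceptr dA, dB;
    int width = 256;
    cuMemAlloc(&dA, width * width * sizeof(float));
    cuMemAlloc(&dB, width * width * sizeof(float));
    void *args[] = { &dA, &dB, &width };

    /* Launch a 16x16 grid of 16x16-thread blocks. */
    cuLaunchKernel(func, 16, 16, 1, 16, 16, 1, 0, NULL, args, NULL);
    cuCtxSynchronize();

    cuMemFree(dA);
    cuMemFree(dB);
    cuCtxDestroy(ctx);
    return 0;
}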

2.2 Problem Definition

Application scenario. We consider two typical application scenarios in shared environments, as shown in Figure 1. One is sharing the GPUs among multiple tenants in a virtualized environment (e.g., the cloud). As illustrated in Figure 1a, there is usually a GPU virtualization layer integrated with the hypervisor. Figure 1b shows the other scenario, in which GPU servers offer API reception software (like rCUDA [7], [45]) to support local/remote CUDA kernel launches. In both scenarios, the GPU faces multiple pending kernel launch requests. Kernelet can be applied to schedule those pending kernels.

Our study mainly considers the throughput issues of sharing a single GPU. Kernelet can be extended to multiple GPUs with a workload dispatcher in front of each individual GPU.

We have made the following assumptions on the kernels.

1. We target concurrent kernel executions on the shared GPU. The kernels are usually throughput oriented, with flexibility on the response time for scheduling. Still, we do not assume a priori knowledge of the order of kernel arrivals.

2. Thread blocks in a kernel are independent of each other. This assumption is mostly true for GPGPU kernels due to the SPMD programming model. Most kernels in the NVIDIA SDK and benchmarks like Parboil [13] do not have dependencies among thread blocks in the same kernel. This assumption ensures that our slicing technique on a given kernel is safe. Data dependencies among thread blocks can be identified with standard static or dynamic program analysis.

We formally define the terminology in Kernelet.

Kernel. A kernel K consists of k thread blocks with IDs 0, 1, 2, ..., (k − 1).

Slice. A slice is a subset of the thread blocks of a launched kernel. The block IDs of a slice are contiguous in the grid index space. The size of a slice s is defined as the number of thread blocks contained in the slice.

Slicing plan. Given a kernel K, a slicing plan S(K) is a scheme slicing K into a sequence of n slices (s0, s1, s2, ..., sn−1). We denote the slicing plan as K = s0, s1, s2, ..., sn−1.

Co-schedule. A co-schedule cs defines the concurrent execution of n (n ≥ 1) slices, denoted as s0, ..., sn−1. All the n slices are active on the GPU.

Scheduling plan. Given a set of n kernels K0, K1, ..., Kn−1, a scheduling plan C (cs0, cs1, ..., csn−1) determines a sequence of co-schedules in their order of execution: csi is launched before csj if i < j. All thread blocks of the n kernels occur in exactly one of the co-schedules. A scheduling plan embodies a slicing plan for each kernel.
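To make the terminology concrete, the following C-style sketch gives one possible in-memory representation of these concepts; the field names are our own illustration and are not prescribed by Kernelet.

/* Hypothetical data structures mirroring the definitions above. */
typedef struct {
    int id;                /* kernel identifier */
    int num_blocks;        /* k: total number of thread blocks */
    int next_block;        /* first thread block not yet scheduled */
} Kernel;

typedef struct {
    Kernel *kernel;        /* kernel this slice belongs to */
    int first_block;       /* first block ID of the contiguous range */
    int size;              /* number of thread blocks in the slice */
} Slice;

typedef struct {
    Slice slices[2];       /* slices concurrently active on the GPU (Kernelet uses two) */
    int num_slices;        /* n >= 1 */
} CoSchedule;

typedef struct {
    CoSchedule *schedules; /* co-schedules in their order of execution */
    int num_schedules;     /* every thread block appears in exactly one co-schedule */
} SchedulingPlan;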

We define the performance benefit of co-scheduling n kernels to be the co-scheduling profit (CP) in Eq. (1), where IPCi and cIPCi are the IPC (Instructions Per Cycle) of kernel i under sequential execution and concurrent execution, respectively. Our definition is similar to those in previous studies on CPU multi-threaded co-scheduling [20], [39].

CP = 1 − 1 / ( Σ_{i=1}^{n} cIPCi / IPCi )        (1)
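For reference, the helper below evaluates CP from measured (or predicted) IPC values, following the reconstructed form of Eq. (1) above; it is an illustration rather than a transcription of the authors' code.

/* Co-scheduling profit: CP = 1 - 1 / sum_i(cIPC_i / IPC_i).
 * ipc[i]  : IPC of kernel i when executed alone;
 * cipc[i] : IPC of kernel i during the concurrent execution. */
double co_scheduling_profit(const double *ipc, const double *cipc, int n) {
    double weighted_speedup = 0.0;
    for (int i = 0; i < n; i++)
        weighted_speedup += cipc[i] / ipc[i];
    return 1.0 - 1.0 / weighted_speedup;
}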


Problem definition. Given a set of kernels for execution, the problem is to determine the optimal scheduling plan (and slicing) so that the total execution time for those kernels is minimized, which corresponds to the maximized throughput. Given a set of n kernels K0, K1, ..., Kn−1, we aim at finding the optimal scheduling plan C for a minimized total execution time of C(S0(K0), S1(K1), ..., Sn−1(Kn−1)), or equivalently a maximized co-scheduling profit. Note that, in the shared GPU environment, the arrival of new kernels triggers a recalculation of the optimization over the kernel residuals and the new kernels.

3 SYSTEM OVERVIEW

In this section, we present the rationales behind the Kernelet design, followed by an overview of the Kernelet runtime.

3.1 Design Rationale

Since kernels are submitted in an ad-hoc manner in our application scenarios, the scheduling decision has to be made in real time. The optimization process should take the newly arrived kernels into consideration. Moreover, our runtime system of slicing and scheduling should be designed with light overhead; that is, the overhead of slicing and scheduling should be small compared with their performance gain.

Unfortunately, finding the optimal slicing and scheduling plans is a challenging task. The solution space for such candidate plans is large. For slicing a kernel, the factors include the number of slices as well as the slice size. For scheduling a set of slices, we can generate different scheduling plans by co-scheduling slices from different kernels. All those factors add up to a large solution space, and considering newly arrived kernels makes the huge scheduling space even larger.

Due to the real-time decision making and lightweight requirements, it is impossible to search the entire space to get a globally optimal solution. Classic Monte Carlo simulation methods are not feasible because they usually exceed our budget on the runtime overhead and violate the real-time decision making requirement. There must be a more elegant compromise between optimality and runtime efficiency. Our solution to the scheduling problem is detailed in Section 4.2.

Given the complexity of dynamic slicing and scheduling in concurrent kernel executions, we make the following considerations.

First, the scheduling considers two kernels only. Previous studies [20] on the CPU have shown that when there are more than two co-running jobs, finding the optimal scheduling plan becomes an NP-complete problem, even ignoring the job length differences and rescheduling. Following previous studies [20], [34], we make our scheduling decisions by co-scheduling two kernels only.

Second, once we choose two kernels to co-schedule, their slice sizes keep unchanged until either kernel finishes. The suitable slice size is determined according to our performance model (Section 4.4).

Fig. 2: Design overview of Kernelet. (Submitted kernels are sliced by the kernel slicer; the scheduler, guided by the performance model, dispatches co-scheduled slices to the GPU.)

3.2 System Overview

We develop Kernelet as a runtime system to generate the slicing and scheduling plans for optimized throughput.

Figure 2 shows an overview of Kernelet. Kernels are submitted and temporarily buffered in a kernel queue for further scheduling. Usually, a kernel is submitted in the form of binary (SASS) or PTX code. Submitted kernels are first preprocessed by the kernel slicer to determine the smallest slice size for a given overhead limit. If the kernel has been submitted before, we simply use the smallest slice size from the previous execution. Kernelet's scheduler determines the scheduling plan based on the results of the performance model, which estimates the performance of slices from two different kernels in a probabilistic manner. Once the scheduling plan is obtained, slices are dispatched for execution on the GPU. We will describe the detailed design and implementation of each component in the next section.
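Concurrent execution of two slices requires launching them into different CUDA streams. The sketch below shows one way a dispatcher could issue a co-schedule with the runtime API; the two toy kernels and their arguments are placeholders (in Kernelet, the real slices come from the rewritten PTX/SASS), and the caller is assumed to have allocated the device arrays.

#include <cuda_runtime.h>

/* Toy sliced kernels standing in for slices of two different user kernels. */
__global__ void sliceOfK1(float *a, int blockOffset) {
    int blk = blockIdx.x + blockOffset;            /* rectified block index */
    a[blk * blockDim.x + threadIdx.x] += 1.0f;
}

__global__ void sliceOfK2(float *b, int blockOffset) {
    int blk = blockIdx.x + blockOffset;
    b[blk * blockDim.x + threadIdx.x] *= 2.0f;
}

/* Dispatch one co-schedule: each slice goes into its own stream so that the
 * hardware may execute the two slices concurrently. */
void dispatch_coschedule(float *dA, float *dB,
                         int sizeK1, int offK1, int sizeK2, int offK2) {
    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    sliceOfK1<<<sizeK1, 256, 0, s1>>>(dA, offK1);
    sliceOfK2<<<sizeK2, 256, 0, s2>>>(dB, offK2);

    cudaStreamSynchronize(s1);
    cudaStreamSynchronize(s2);
    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
}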

4 KERNELET METHODOLOGY

In this section, we first describe our kernel slicing mechanism. Next, we present our greedy scheduling algorithm, followed by our pruning techniques on the co-schedule space and the description of the performance model.

4.1 Kernel Slicing

The purpose of kernel slicing is to divide a kernel into multiple slices so that the finer granularity of each slice as a kernel in the scheduling can create more opportunities for time sharing. Moreover, we need to determine a suitable slice size to minimize the slicing overhead (i.e., the difference between the total execution time of all slices and the original kernel execution time). Particularly, we experimentally determine the suitable slice size as the minimum slice size such that the overhead is not greater than p% of the kernel execution time; in this study, p% is set to 2% by default. We focus on the implementation of slicing on PTX or SASS code. Note that warps within the same thread block usually have data dependencies with each other, e.g., through the usage of shared memory. That is why we choose thread blocks, instead of warps, as the units for slicing.
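One way to find the smallest slice size whose overhead stays within p% is a simple calibration pass over candidate sizes, sketched below with the CUDA event API. launch_whole_kernel and launch_all_slices are hypothetical helpers that wrap the unsliced launch and the slice-launch loop of Figure 3d.

#include <cuda_runtime.h>

/* Hypothetical helpers: run the kernel unsliced, or as slices of a given size. */
void launch_whole_kernel(void);
void launch_all_slices(int slice_size);

/* Return the smallest slice size (in multiples of the SM count) whose sliced
 * execution time exceeds the unsliced time by at most max_overhead (e.g. 0.02). */
int calibrate_slice_size(int num_sms, int max_blocks, double max_overhead) {
    cudaEvent_t start, stop;
    float t_unsliced, t_sliced;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch_whole_kernel();
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&t_unsliced, start, stop);

    int chosen = max_blocks;                 /* fall back to the unsliced kernel */
    for (int size = num_sms; size <= max_blocks; size += num_sms) {
        cudaEventRecord(start, 0);
        launch_all_slices(size);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);
        cudaEventElapsedTime(&t_sliced, start, stop);
        if (t_sliced / t_unsliced - 1.0 <= max_overhead) {  /* overhead = Ts/Tns - 1 */
            chosen = size;
            break;                           /* smallest acceptable size found */
        }
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return chosen;
}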


void MatrixAdd(float *MatrixA, float *MatrixB, int width)
{
    int row = blockID_X * blockDim_X + threadID_X;
    int col = blockID_Y * blockDim_Y + threadID_Y;
    /* Each thread processes one element */
    MatrixA[row + col * width] += MatrixB[row + col * width];
}

(a) Each thread accesses the corresponding matrix element using block and thread indices.

/* Grid dimension: 16 x 16 */
dim3 gridConf(16, 16);
/* Block dimension: 16 x 16 */
dim3 blockConf(16, 16);
MatrixAdd<<<gridConf, blockConf>>>(...);

(b) Launch the same number of threads as the number of matrix elements.

void MatrixAdd(float *MatrixA, float *MatrixB, int width,
               dim3 blockOffset, dim3 gridConf)
{
    /* Compute rectified indices */
    int rBlockID_X = blockID_X + blockOffset.x;
    int rBlockID_Y = blockID_Y + blockOffset.y;
    /* Process carry bit */
    while (rBlockID_X > gridConf.x) {
        rBlockID_X -= gridConf.x;
        rBlockID_Y++;
    }
    /* Replace all subsequent accesses to blockID_X (blockID_Y)
       with rBlockID_X (rBlockID_Y) */
    ...
}

(c) Sliced kernel with rectified thread block indices.

/* Sliced grid dimension: 8 x 1 */
dim3 sGridConf(8, 1);
dim3 blockOffset(0, 0);
while (blockOffset.x < gridConf.x && blockOffset.y < gridConf.y) {
    MatrixAdd<<<sGridConf, blockConf>>>(..., blockOffset, gridConf);
    blockOffset.x += sGridConf.x;
    while (blockOffset.x > gridConf.x) {
        blockOffset.x -= gridConf.x;
        blockOffset.y++;
    }
}

(d) Launch all slices of a kernel in a loop.

Fig. 3: An example of kernel slicing.

Figure 3 illustrates the sliced execution of MatrixAdd with pseudo code. In the example, MatrixAdd is a GPU kernel that adds two 256 × 256 matrices, and each thread adds one element pair from the input matrices. Based on the matrix size, MatrixAdd is configured to launch with 16 × 16 thread blocks in total, and each block has 16 × 16 threads. Figures 3a and 3b illustrate the definition and launch of the original kernel, respectively. In comparison, the sliced version of the kernel launches a slice with 8 thread blocks each time. The built-in thread block indices (denoted as blockID_X and blockID_Y) of the sliced kernel are in a smaller index space ({(x, 0) | 0 ≤ x < 8}) compared with the original index space ({(x, y) | 0 ≤ x < 16 and 0 ≤ y < 16}). To make the slices execute as individual kernels, we apply a procedure called index rectification. As shown in Figure 3c, we add an offset value to the thread block indices and obtain the rectified index values. The rectified indices are used to replace all subsequent accesses to the built-in indices. On the CPU side, we launch the slices in a loop and adjust the offset values for each slice launch (Figure 3d).

Kernelet automatically implements slice rectification without any user intervention. With PTX or SASS code as input, Kernelet does not require source code. Kernelet interprets and modifies the PTX/SASS code at runtime. The resulting PTX code is compiled to GPU executables by the GPU driver, and the SASS code is assembled using the open source Fermi assembler Asfermi [1]. Kernelet stores the rectified block index in registers and replaces all references to the built-in variables with the new registers. Since this process may use more registers, Kernelet tries to minimize the register usage by adopting classic register minimization techniques [5], [33], e.g., variable liveness analysis. With register optimizations, the register usage after slicing remains unchanged in most of our test cases. Note that kernel slicing only requires a single scan of the input code, and the runtime overhead is negligible.

4.2 Scheduling

According to our design rationales, our scheduling decision is made on the basis of two kernels, to avoid the complexity of scheduling three or more kernels as a whole; the slice sizes of the two kernels are tuned for GPU utilization so that their execution times in the concurrent kernel execution are close. Thus, we develop a greedy scheduling algorithm, as shown in Algorithm 1. The scheduling algorithm considers newly arrived kernels in Lines 2–3 of the main algorithm. The main procedure calls the procedure FindCoSchedule to obtain the optimal co-schedule in Line 5. The co-schedule is represented by four parameters <K1, K2, size1, size2>, where K1 and K2 denote the two selected kernels, and size1 and size2 represent the corresponding slice sizes. We use the same co-schedule if the kernels pending for execution do not change, or both kernels still have thread blocks.

Algorithm 1 Scheduling algorithm of Kernelet

1: Denote R to be the set of kernels pending for execution;
2: if a new kernel K comes then
3:     Add K into R;
4: while R != null do
5:     <K1, K2, size1, size2> = FindCoSchedule(R);
6:     Denote the co-schedule to be c;
7:     Execute c on the GPU;
8:     while R does not change, or K1 and K2 both still have thread blocks do
9:         Generate a co-schedule according to c and execute it on the GPU;

Proc. FindCoSchedule(R)
Function: generate the optimal co-schedule from R.
1: Generate the candidate space of co-schedules C;
2: Perform pruning on C according to the computation and memory characteristics of the input kernels;
3: Apply the performance model (Section 4.4) to compute CP for all the co-schedules in C;
4: Obtain the optimal co-schedule with the maximized CP;
5: Return the resulting co-schedule;

In Procedure FindCoSchedule, we first consider the entire candidate space consisting of co-schedules of pair-wise kernel combinations. Because the space may contain N(N−1)/2 co-schedules (where N is the number of kernels under consideration), it is desirable to reduce the search space. Therefore, we perform pruning according to the computation and memory characteristics of the input kernels. We present the details of the pruning mechanisms in Section 4.3. After pruning, we apply the performance model (Section 4.4) to estimate the CP for all remaining co-schedules, and pick the one with the maximized CP for execution on the GPU.

4.3 Co-scheduling Space Pruning

Given a set of co-schedules as input, we aim at developing pruning techniques to remove the co-schedules that are not "promising" to deliver a performance gain, so that the overhead of running the performance model on them is avoided. The basic idea is to identify the key performance factors of a single kernel that affect the throughput of concurrent kernel executions on the GPU.

There are many factors affecting the GPU performance. According to the CUDA profiler, there are around 100 profiler counters and statistics for performance tuning. For an effective scheduling algorithm, there is no way of considering all these factors. Following a previous work [24], we use a regression model to explore the correlation between the above-mentioned factors and CP. The input values are obtained from single-kernel executions. Through detailed performance studies, we find that instruction throughput and memory bandwidth utilization are the performance factors most correlated with co-scheduling friendliness. We define PUR (Pipeline Utilization Ratio) and MUR (Memory-bandwidth Utilization Ratio) to characterize user-submitted kernels. A high PUR means the instruction pipeline is highly utilized and there is little room for performance improvement. A high MUR means a large number of memory requests are being processed and the memory latency is high. PUR and MUR are calculated as follows,

PUR = Instruction Executed / (Time × Frequency × Peak IPC)

MUR = (Dram Reads + Dram Writes) / (Time × Frequency × Peak MPC)

Peak IPC and Peak MPC represent the peak numbers of instructions and memory requests per cycle, respectively. Instruction Executed is the total number of instructions executed. Dram Reads and Dram Writes are the numbers of read and write requests to DRAM, respectively.
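Given the profiler counters named above, PUR and MUR reduce to two normalizations; the helper below makes them explicit (the argument names are ours, with the elapsed time already converted to cycles).

/* PUR and MUR from single-kernel profiling, following the formulas above.
 * time_cycles = elapsed time x clock frequency. */
double pipeline_utilization_ratio(double instructions_executed,
                                  double time_cycles, double peak_ipc) {
    return instructions_executed / (time_cycles * peak_ipc);
}

double memory_utilization_ratio(double dram_reads, double dram_writes,
                                double time_cycles, double peak_mpc) {
    return (dram_reads + dram_writes) / (time_cycles * peak_mpc);
}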

We build a set of testing kernels to demonstrate the correlation between PUR/MUR and CP. A testing kernel is a mixture of memory and computation instructions. We tune the respective instruction ratios to obtain kernels with different memory and computational characteristics. The single-kernel execution PURs and MURs of the testing kernels are in the ranges [0.26, 0.83] and [0.07, 0.84], respectively. Figure 4 shows the strong correlation between MUR/PUR and CP. The observation conforms to our expectation. First, if one kernel has high PUR while the other kernel has low PUR, the former kernel is able to utilize the idle cycles exposed by the latter kernel when co-scheduled.

Fig. 4: Correlation between MUR/PUR and CP. (a) MUR and CP. (b) PUR and CP.

Second, co-scheduling kernels with complementary memory requirements (one kernel has low MUR and the other has high MUR) will alleviate memory contention and reduce the idle cycles exposed by long-latency memory operations.

In summary, our pruning rule is to remove the co-schedules in which the two kernels have close PUR or MUR values. We set two threshold values, αp and αm, for PUR and MUR, respectively. That is, we prune a co-schedule if the two kernels have a PUR difference lower than αp, or an MUR difference lower than αm. Note that if all the co-schedules are pruned, we need to increase αp or αm. We experimentally evaluate the impact of those two threshold values in Section 5.
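The pruning rule amounts to a one-line predicate over the profiled characteristics of a kernel pair, as sketched below; the PUR/MUR values and the thresholds αp and αm are supplied by the caller.

#include <math.h>

/* Prune a candidate pair whose kernels are too similar in pipeline or memory
 * behaviour to complement each other when co-scheduled. */
int should_prune(double pur1, double mur1, double pur2, double mur2,
                 double alpha_p, double alpha_m) {
    return fabs(pur1 - pur2) < alpha_p || fabs(mur1 - mur2) < alpha_m;
}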

4.4 Performance Model

We need a performance model for two purposes: first, to select the two kernels to co-schedule; second, to determine the number of thread blocks for each kernel in the co-schedule (i.e., the slice size). Previous performance models on the GPU [19], [38], [44] assume a single kernel on the GPU and are not applicable to concurrent kernel executions. They generally assume that the thread blocks execute the same instructions in a round-robin manner on an SM. However, this is no longer true for concurrent kernel executions. The thread blocks from different kernels have interleaving executions, which causes non-determinism in the instruction execution flow. It is not feasible to statically predict the interleaving warp executions for multi-kernel executions. To capture the non-determinism, we develop a probabilistic performance model to estimate the performance of a co-schedule. Our cost model has very low runtime overhead, because it uses a series of simple parameters as input and leverages Markov chain theory to estimate the performance of concurrent kernel executions.

Table 1 summarizes the parameters and notations used in our performance model.

Since the GPU adopts the SPMD model, we use the performance estimation of one SM to represent the aggregate performance of all SMs on the GPU. We model the process of kernel instruction issuing as a stochastic process and devise a set of states for an SM during execution.


TABLE 1: Parameters and notations in the performance model.

Para. | Description
W | Maximum number of active warps
Round | A warp scheduling cycle in which all ready warps are served by the warp scheduler
Rm | Memory instruction ratio
Pr→r | Probability that a ready warp remains ready
Pr→i | Probability that a ready warp transits to idle
Pi→r | Probability that an idle warp transits to ready
Pi→i | Probability that an idle warp remains idle
Nr→r | Number of ready warps that remain ready
Nr→i | Number of ready warps that transit to idle
Ni→r | Number of idle warps that transit to ready
Ni→i | Number of idle warps that remain idle
L | Average memory latency (cycles)
B | GPU global memory bandwidth (requests/cycle)
Si | State in which i warps are idle on the SM (i = 0, 1, ..., W)
Pij | Entry of the Markov chain transition matrix P: the probability of transiting from state Si to state Sj

By modeling the SM state, we first develop our Markov chain based model for single-kernel executions (homogeneous workloads), and then extend it to concurrent kernel executions (heterogeneous workloads).

For presentation clarity, we begin our description of the model with the following assumptions, and relax them at the end of this section. First, we assume that all memory requests are coalesced. This is the best case for memory performance; we will relax this assumption by considering both coalesced and uncoalesced memory accesses. Second, we assume that the GPU has a single warp scheduler. We will extend the model to GPUs with multiple warp schedulers.

Homogeneous Workloads. We first investigate the performance of a single kernel executed on the GPU, where each SM accommodates at most W active warps.

A warp can be in two states: idle or ready. An idle warp is stalled by memory accesses, and a ready warp has one instruction ready for execution. The transitions are illustrated in Figure 5. When a warp is currently in the ready state, we have two cases for state transitions by definition:

• remaining in the ready state with the probability Pr→r = 1 − Rm;

• transiting to the idle state with the probability Pr→i = Rm.

When a warp is currently in the idle state, we also have two cases for state transitions:

• transiting to the ready state with the probability Pi→r = (W − I) / L, where I is the number of idle warps on the SM;

• remaining in the idle state with the probability Pi→i = 1 − Pi→r.

We specifically define the time step and state transitions of the Markov chain model to capture the GPU architectural features. The GPU adopts a round-robin style of scheduling [19].

Fig. 5: Warp state transition diagram. (A ready warp turns idle when it issues a long-latency memory operation whose dependency is unresolved; it stays ready when it issues a computation instruction; an idle warp turns ready when its dependency is resolved.)

In each round, the warp scheduler polls each warp to issue its ready instructions, so that all ready warps can make progress. We model the SM state by the number of idle warps: we denote by Si the SM state in which i warps are idle on the SM (i = 0, 1, ..., W). Thus, we consider the state change of the SM in one round and use the round as the time step in our Markov chain model. In each round, every ready warp has an equal chance to issue instructions. In contrast, models for the CPU assume that the CPU keeps executing one thread until this thread is suspended.

We use IPC to represent the throughput of the SM; thus, the number of idle warps on the SM is a key parameter for IPC, and we define the state of the SM as the number of idle warps (i.e., the state is Si when the number of idle warps is i). More outstanding memory requests usually lead to higher latency because of memory contention [3]. We adopt a linear memory model to account for the memory contention effects: we calculate L as L = L0 + B / (a0 · Si) + b0, where a0 and b0 are the constant parameters in the linear model. We obtain L0 and B according to hardware specifications.

For a homogeneous workload, the probabilities of state transitions are the same for all ready warps in a round. We assume that when the SM transits from Si to Sj, Ni→r idle warps transit to the ready state and Nr→i ready warps transit to the idle state. The following conditions hold by definition:

0 ≤ Ni→r ≤ Si
0 ≤ Nr→i ≤ W − Si
Nr→i − Ni→r = Sj − Si        (2)

With those constraints, there are multiple possible ways to transit from Si to Sj. Since the possible transitions are mutually exclusive events, the probability of each state transition Pij is calculated as the sum of the probabilities of all possible transitions. With all entries of the transition matrix P obtained, we can calculate the steady-state vector of the Markov chain. This is done by finding the eigenvector π corresponding to the eigenvalue one of matrix P [31].

π = (γ0, γ1, ..., γW ) (3)

In Equation 3, γi is the probability that the SM stays in state Si in each round, i.e., the probability that there are i idle warps in one round. The duration of the time step is (W − i) cycles, since each of the (W − i) ready warps issues one instruction within the round. In the case i = W, the round duration is one cycle, indicating that no warp is ready and the SM experiences an idle cycle. Hence, the estimated IPC is the ratio of non-idle cycles given in equation 4,


where Σ_{i=0}^{W−1} γi(W − i) is the total non-idle cycles and γW is the total idle cycles:

IPCK = ( Σ_{i=0}^{W−1} γi(W − i) ) / ( Σ_{i=0}^{W−1} γi(W − i) + γW )        (4)
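The homogeneous model can be evaluated numerically as sketched below. The sketch fills in details the text leaves implicit: warps are assumed to transit independently (giving binomial transition counts), the memory latency is held constant, the steady state is obtained by power iteration, and the round is taken to last one cycle when every warp is idle. It is our illustration, not the authors' implementation.

#include <math.h>

#define MAXW 64   /* array-sizing bound on active warps per SM */

/* Binomial coefficient for small arguments. */
static double binom(int n, int k) {
    double c = 1.0;
    for (int i = 1; i <= k; i++)
        c = c * (n - k + i) / i;
    return c;
}

/* Probability of k successes in n independent trials of probability p. */
static double binom_pmf(int n, int k, double p) {
    return binom(n, k) * pow(p, k) * pow(1.0 - p, n - k);
}

/* Estimate the single-kernel IPC of one SM with the homogeneous Markov chain
 * model: W active warps, memory instruction ratio rm, average memory latency
 * lat (in cycles, held constant here; the paper additionally adjusts the
 * latency for contention). */
double homogeneous_ipc(int W, double rm, double lat) {
    static double P[MAXW + 1][MAXW + 1];
    double pi[MAXW + 1], next[MAXW + 1];

    /* Build the transition matrix; state i = number of idle warps. */
    for (int i = 0; i <= W; i++) {
        /* A round lasts (W - i) cycles (one cycle when every warp is idle), so
         * a stalled warp wakes within the round with probability round/lat. */
        double round_len = (i < W) ? (double)(W - i) : 1.0;
        double p_wake = round_len / lat;
        if (p_wake > 1.0) p_wake = 1.0;
        for (int j = 0; j <= W; j++) {
            double pij = 0.0;
            /* Sum over all feasible (idle->ready, ready->idle) counts whose
             * net effect moves the SM from state i to state j. */
            for (int n_wake = 0; n_wake <= i; n_wake++) {
                int n_stall = j - i + n_wake;
                if (n_stall < 0 || n_stall > W - i) continue;
                pij += binom_pmf(i, n_wake, p_wake) *
                       binom_pmf(W - i, n_stall, rm);
            }
            P[i][j] = pij;
        }
    }

    /* Steady-state distribution by power iteration: pi <- pi * P. */
    for (int i = 0; i <= W; i++) pi[i] = 1.0 / (W + 1);
    for (int it = 0; it < 2000; it++) {
        for (int j = 0; j <= W; j++) {
            next[j] = 0.0;
            for (int i = 0; i <= W; i++) next[j] += pi[i] * P[i][j];
        }
        for (int i = 0; i <= W; i++) pi[i] = next[i];
    }

    /* Eq. (4): state i < W contributes (W - i) issue cycles per round,
     * state W contributes one idle cycle. */
    double busy = 0.0;
    for (int i = 0; i < W; i++) busy += pi[i] * (W - i);
    return busy / (busy + pi[W]);
}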

Heterogeneous Workloads. When multiple kernels run concurrently, the model needs to keep track of the state of each workload. Although we only consider two concurrent kernels (K1 and K2) in scheduling, our model can be used to handle more than two kernels.

Suppose there are two kernels K1 and K2, where K1 has w1 active warps and K2 has w2 active warps (w1 + w2 = W). The number of possible states of the SM is then (w1 + 1) × (w2 + 1). The state space is represented as pairs (p, q) with 0 ≤ p ≤ w1 and 0 ≤ q ≤ w2, where p and q are the numbers of idle warps of K1 and K2, respectively. We can calculate the probability of transiting from state (pi, qi′) to state (pj, qj′) by first considering the individual workload state transition probabilities using the single-kernel model, and then calculating the SM state transition probability. The state transitions of different kernels are independent of each other, because the kernels are independent; thus the SM state transition probability is the product of the individual transition probabilities.

With the Markov chain approach, we obtain the steady-state vector π = {γ(0,0), γ(0,1), ..., γ(w1,w2)}. Next, we can obtain the IPC of each workload using the same method as in the single-kernel model, except that the parameters are defined and calculated in the context of two kernels. For example, the round duration is equal to the total number of ready warps of both kernels. The individual IPCs of K1 and K2 are calculated as the ratio of non-idle cycles for each workload, as shown in Eq. (5) and (6), respectively. The concurrent IPC is the sum of the individual IPCs, Eq. (7). Then CP can be obtained using Eq. (1).

IPCK1 = ( Σ_{i=0}^{w1−1} Σ_{i′=0}^{w2} γ(i,i′) × (w1 − i) ) / ( Σ_{i=0}^{w1} Σ_{j=0}^{w2} γ(i,j) × R(i,j) )        (5)

IPCK2 = ( Σ_{i=0}^{w1} Σ_{i′=0}^{w2−1} γ(i,i′) × (w2 − i′) ) / ( Σ_{i=0}^{w1} Σ_{i′=0}^{w2} γ(i,i′) × R(i,i′) )        (6)

C = IPCK1 + IPCK2        (7)

where R(i,j) denotes the round duration in state (i, j).
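Because the two kernels' warps evolve independently, the joint transition matrix is simply the product of the per-kernel transition probabilities. The snippet below sketches this construction for two hypothetical per-kernel matrices P1 and P2, stored row-major with (w1+1) and (w2+1) states respectively.

/* Joint SM-state transition probabilities for two co-scheduled kernels.
 * State (p, q) means p idle warps of K1 and q idle warps of K2; PJ must hold
 * (w1+1)*(w2+1) states per side, also row-major. */
static int joint_index(int p, int q, int w2) { return p * (w2 + 1) + q; }

void joint_transition(int w1, int w2,
                      const double *P1, const double *P2, double *PJ) {
    int n1 = w1 + 1, n2 = w2 + 1, n = n1 * n2;
    for (int p = 0; p < n1; p++)
        for (int q = 0; q < n2; q++)
            for (int pp = 0; pp < n1; pp++)
                for (int qq = 0; qq < n2; qq++)
                    /* Independence: P((p,q)->(pp,qq)) = P1(p->pp) * P2(q->qq). */
                    PJ[joint_index(p, q, w2) * n + joint_index(pp, qq, w2)] =
                        P1[p * n1 + pp] * P2[q * n2 + qq];
}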

With the estimated IPC and CP, we now discuss how to estimate the optimal slice size ratio for the two kernels. We define the slice ratio that minimizes the execution time difference of the co-scheduled slices as the balanced slice ratio. By minimizing the execution time difference, the kernel-level parallelism is maximized. The execution time difference is calculated as ∆T in Eq. (8).

∆T = | (1 / IPCK1) × IK1 × PK1 − (1 / IPCK2) × IK2 × PK2 |        (8)

IKi and PKi represent the number of instructions per thread block and the slice size (in number of thread blocks) of kernel Ki (i = 1, 2), respectively. Since PKi is less than the maximal number of active thread blocks, only a limited number of slice ratios need to be evaluated to obtain the balanced ratio.
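Since each slice size is bounded by the maximal number of active thread blocks, the balanced ratio can be found by exhaustively evaluating the small grid of candidate pairs, as sketched below; predicted_ipc is a hypothetical hook into the performance model that returns the per-kernel IPC estimates of Eq. (5) and (6) for a given pair of slice sizes.

#include <math.h>

/* Hypothetical hook into the performance model: predicted per-kernel IPCs
 * when K1 runs p1 thread blocks and K2 runs p2 thread blocks concurrently. */
void predicted_ipc(int p1, int p2, double *ipc1, double *ipc2);

/* Find the slice-size pair minimizing the execution-time difference of Eq. (8).
 * i1, i2: instructions per thread block; max1, max2: maximal numbers of
 * concurrently active thread blocks of the two kernels. */
void balanced_slice_ratio(double i1, double i2, int max1, int max2,
                          int *best1, int *best2) {
    double best_dt = INFINITY;
    for (int p1 = 1; p1 <= max1; p1++) {
        for (int p2 = 1; p2 <= max2; p2++) {
            double ipc1, ipc2;
            predicted_ipc(p1, p2, &ipc1, &ipc2);
            double dt = fabs(i1 * p1 / ipc1 - i2 * p2 / ipc2);
            if (dt < best_dt) { best_dt = dt; *best1 = p1; *best2 = p2; }
        }
    }
}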

Uncoalesced Access. So far, we have assumed that all memory accesses are coalesced and that each memory instruction results in the same number of memory requests. However, due to different address patterns, memory instructions may result in different numbers of memory requests. On Fermi GPUs, one memory instruction can generate 1 to 32 memory requests. Here we consider the two most common access patterns: fully coalesced access and fully uncoalesced access. We extend our model to handle both by defining three states for a warp: ready, stalled on a coalesced access (coalesced idle), and stalled on an uncoalesced access (uncoalesced idle). The memory operation latency depends on the memory access type: since an uncoalesced access generates more memory traffic, its latency is higher than that of a coalesced access. We also use the linear model to estimate the latency. By identifying the ratio of coalesced and uncoalesced memory instructions, we can easily extend the two-state model to handle three states and their state transitions, and the Markov chain performance model can be developed in a similar way. Distinguishing between coalesced and uncoalesced accesses increases the accuracy of our model.

Adaptation to GPUs with multiple warp schedulers. Our model assumes there is only one warp scheduler. New-generation GPUs can have more than one warp scheduler; the latest Kepler GPU features four warp schedulers per SMX (SMX is the Kepler terminology for SM) [28]. We extend our model to handle this case by deriving a single-pipeline virtual SM based on the parameters of the SMX. The virtual SM has one warp scheduler, and its parameters, such as the number of active thread blocks and the memory bandwidth, are obtained by dividing the corresponding parameters of the SMX by the number of warp schedulers. This virtual SM can still capture the memory and computation features of a kernel running on the SMX. Experimental results in Section 5 show that performance modeling on the virtual SM provides a good estimation on the Kepler architecture.

There are two more issues that are worth discussing.

The first issue is the efficiency of executing our model at runtime. We have developed mechanisms to make our model more efficient without significantly sacrificing the model accuracy. The O(N^3) complexity of calculating the steady state of the Markov chain makes it hard to meet the online requirement (N is the dimension of the transition matrix). To reduce the computational complexity, we consider the thread block, instead of individual warps, as the scheduling unit. In this way, the computational complexity is significantly reduced, and the time cost of our model is negligible compared to the GPU kernel execution time.

The second issue is obtaining the input for the model.


TABLE 2: GPU configurations.

 | C2050 | GTX680
Architecture | Fermi GF110 | Kepler GK104
Number of SMs | 14 | 8
Number of cores per SM | 32 | 192
Core frequency (MHz) | 1147 | 706
Global memory size (MB) | 3072 | 2048
Global memory bandwidth (GB/s) | 144 | 192

Our current approach is based on hardware profiling of a small number of thread blocks from a single kernel; thus, the pre-execution is only a very small part of the kernel execution. From profiling, we can obtain the number of memory instructions issued and the total number of instructions executed, and calculate Rm as their ratio.

In summary, our probabilistic model captures the inherent non-determinism in concurrent kernel executions. First, it requires only a small set of profiling inputs on the memory and computation characteristics of individual kernels. Second, with careful probabilistic modeling, we develop a performance model that is sufficiently accurate to guide our scheduling decision. The effectiveness of our model is evaluated in the experiments (Section 5).

5 EVALUATION

In this section, we present the experimental results of evaluating Kernelet on the latest GPU architectures.

5.1 Experimental Setup

We have conducted experiments on a workstation equipped with one NVIDIA Tesla C2050 GPU, one NVIDIA GTX680 GPU, two Intel Xeon E5645 CPUs and 24 GB RAM. Table 2 shows some architectural features of the C2050 and GTX680. We note that the C2050 and GTX680 are based on the Fermi and Kepler architectures, respectively. One C2050 SM has two warp schedulers, each of which can serve half a warp per cycle (with a theoretical IPC of one). In contrast, one GTX680 SMX features four warp schedulers, and each warp scheduler can serve one warp per cycle (with a theoretical IPC of eight considering its dual-issue capability). Our implementation is based on GCC 4.6.2 and the NVIDIA CUDA toolkit 4.2.

Workloads. We choose eight benchmark applications with different memory and computation intensiveness. Sources of the benchmarks include the CUDA SDK, the Parboil benchmark [13], the CUSP library [4] and our home-grown applications. Table 3 describes the details of each application, including the input settings and the thread configuration of the most time-consuming kernel on C2050.

Table 4 shows the memory and computation characteristics of the most time-consuming kernel of each application on both C2050 and GTX680. We observed that the PUR/MUR values are stable as we vary the input sizes (as long as the input size is sufficiently large to keep the GPU occupancy high).

TABLE 3: Specification of benchmark applications and thread configuration (#threads per thread block × #thread blocks).

Name | Description | Input settings | Thread configuration on C2050
Pointer Chasing (PC) | Traversing an array randomly | Index values for 40 million accesses | 256 × 16384
Sum of Absolute Differences (SAD) | An operation used in MPEG encoding | Image with 1920 × 1072 pixels | 32 × 8048
Sparse Matrix Vector Multiplication (SPMV) | Multiplying a sparse matrix with a dense vector | A 131072 × 81200 matrix with 16 non-zero elements per row on average | 256 × 16384
Stencil (ST) | Stencil operation on a regular 3-D grid | 3D grid with 134217728 points | 128 × 16384
Matrix Multiplication (MM) | Multiplying two dense matrices | One 8192 × 2048 matrix, the other 2048 × 2048 | 256 × 16384
Magnetic Resonance Imaging - Q (MRIQ) | A matrix operation in magnetic resonance imaging | 2097152 elements | 256 × 8192
Black Scholes (BS) | Black-Scholes option pricing | 40 million | 128 × 16384
Tiny Encryption Algorithm (TEA) | A block cipher | 20971520 elements | 128 × 16384

TABLE 4: Memory and computational characteristics of benchmark applications.

Benchmark | PUR (C2050) | MUR (C2050) | Occupancy (C2050) | PUR (GTX680) | MUR (GTX680) | Occupancy (GTX680)
PC | 0.0096 | 0.1404 | 100% | 0.0072 | 0.1746 | 100%
SAD | 0.1498 | 0.1120 | 16.7% | 0.1062 | 0.1351 | 25%
SPMV | 0.3464 | 0.003 | 100% | 0.3027 | 0.0043 | 100%
ST | 0.3629 | 0.1156 | 66.7% | 0.2016 | 0.1179 | 100%
MM | 0.5804 | 0.0161 | 67.7% | 0.5321 | 0.0569 | 100%
MRIQ | 0.8539 | 0.0002 | 83.3% | 1.6784 | 0.0007 | 100%
BS | 0.8642 | 0.0604 | 67.7% | 1.2007 | 0.1323 | 100%
TEA | 0.9978 | 0.0196 | 67.7% | 1.1417 | 0.0353 | 100%

To assess the impact of kernel scheduling under different mixes of kernels, we create four groups of kernels, namely CI, MI, MIX and ALL (as shown in Table 5). CI represents computation-intensive workloads including kernels with high PUR, whereas MI represents workloads with intensive memory accesses. MIX and ALL include a mix of CI and MI kernels; ALL has more kernels than MIX. In each workload, we assume the application arrivals conform to a Poisson distribution. The parameter λ of the Poisson distribution affects the weight of an application in the workload. For simplicity, we assume that all applications have the same λ. We also assume λ is sufficiently large so that at least two kernels are pending for execution at any time, for a high utilization of the GPU.

Comparisons. To evaluate the effectiveness of kernel scheduling in Kernelet, we have implemented the following scheduling techniques:

TABLE 5: Workload configurations.

Workload | Applications
CI | BS, MM, TEA, MRIQ
MI | PC, SPMV, ST, SAD
MIX | PC, BS, TEA, SAD
ALL | PC, SPMV, ST, BS, MM, TEA, MRIQ, SAD


Fig. 6: Sliced execution overhead (%) with varying slice size on both C2050 and GTX680, with one curve per benchmark (PC, SPMV, ST, BS, MM, TEA, MRIQ, SAD). (a) C2050. (b) GTX680.

• Kernel Consolidation (BASE): the kernel consolidation approach of concurrent kernel execution [34].

• Oracle (OPT): OPT uses the same scheduling algorithm as Kernelet, except that it pre-executes all possible slice ratios for all combinations to obtain the CP and then determines the best slice ratio and kernel combination. In other words, OPT is an offline algorithm and provides the optimal throughput for the greedy scheduling algorithm.

• Monte Carlo-based co-schedule (MC): we develop a Monte Carlo approach to generate the distribution of the performance of different co-schedule methods in the solution space. In each Monte Carlo simulation, we randomly pick the kernel pairs and slice ratios for co-scheduling. Through many Monte Carlo simulations, we can quantitatively understand how different co-schedules affect the performance. We denote the result of MC as MC(s), where s is the number of Monte Carlo simulations.

5.2 Results on Kernel Slicing

We first evaluate the overhead of sliced execution, which is defined as Ts/Tns − 1, where Ts and Tns are the sliced and unsliced execution times, respectively.

Figure 6 shows the overhead for executing individual kernels with varying slice sizes on C2050 and GTX680. Slice sizes are set to multiples of the number of SMs on the GPU and range from |SM| to the maximum number under the occupancy limit. Overall, as the slice size increases, the slicing overhead decreases. However, we observe quite different performance behaviors on C2050 and GTX680, due to their architectural differences. On C2050, when the slice size is small, the slicing overhead is very high (up to 66.7% for SAD). When the slice size is larger than or equal to 42 (three thread blocks per SM), the overhead is negligible for most kernels. The sliced execution overhead on GTX680 is much smaller than on C2050: almost all slice sizes lead to an overhead of less than 2% on GTX680. Regardless of the architectural differences, the negligible overhead of kernel slicing allows us to exploit it for co-scheduling slices from different kernels with little additional cost.

Fig. 7: Comparison between predicted and measured single kernel execution IPCs on two GPUs. (a) C2050. (b) GTX680.

5.3 Results on Model Prediction

We evaluate the accuracy of our performance model in different aspects, including the estimation of IPCs for single kernels and concurrent kernel executions, and CP prediction for concurrent kernel executions.

Single Kernel Performance Prediction. Figure 7 compares the measured and estimated IPC values for the eight benchmark applications on C2050 and GTX680. We also show the two lines (y = x ± 0.2 for C2050 and y = x ± 1.6 for GTX680) to highlight the band where the difference between measurement and estimation is within ±20% of the peak IPC. Note that the theoretical peak IPCs for C2050 and GTX680 are one and eight, respectively. If a result falls within this band, we consider that the estimation captures the trend of the measurement well. We can see that most results are within the band. We further define the absolute error as |e − e′|, where e and e′ are the measured and estimated IPC values, respectively. The average absolute error over the eight benchmark applications is 0.08 on C2050 and 0.21 on GTX680. Our probabilistic model thus achieves a reasonable accuracy in estimating the performance of single-kernel executions on the GPU.
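The accuracy metrics used above can be sketched as follows; the IPC values in the example call are made up for illustration only.

    # Minimal sketch: absolute error |e - e'| and whether a prediction lies
    # within +/-20% of the peak IPC (the band drawn in Figure 7).
    def prediction_accuracy(measured, predicted, peak_ipc):
        band = 0.2 * peak_ipc
        report = {}
        for kernel, e in measured.items():
            e_prime = predicted[kernel]
            err = abs(e - e_prime)
            report[kernel] = {"abs_error": err, "within_band": err <= band}
        return report

    # C2050 has a peak IPC of one; the IPC numbers below are illustrative.
    acc = prediction_accuracy({"BS": 0.62}, {"BS": 0.55}, peak_ipc=1.0)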

Concurrent Kernel Performance Prediction. For the eight benchmark applications, we run every possible combination of kernel pairs and measure the IPC for each combination. Figure 8 compares the measured and predicted IPCs with the suitable slice ratio given by our model. We have also studied other slicing ratios: Figure 9 compares the measured and predicted IPCs with a fixed slice ratio of 1:1.



Fig. 8: Comparison between predicted and measured concurrent kernel execution IPCs on two GPUs with the optimal slice ratio.


Fig. 9: Comparison between predicted and measured concurrent kernel execution IPCs on two GPUs with a fixed one-to-one slice ratio.

We observed similar results for other fixed ratios. Regardless of the kernel combinations and slicing ratios, our model captures the trend of concurrent executions well for both dynamic and static slice ratios.

Model Optimizations. We further evaluate the impact of incorporating coalesced/uncoalesced memory accesses and the number of warp schedulers on the GPU. Only two applications in our benchmark (PC and SPMV) have uncoalesced memory accesses. We conduct the single kernel execution prediction experiments by (wrongly) assuming that those two kernels have only coalesced memory accesses. The results are shown in Figure 10. Without considering uncoalesced accesses, the predicted IPC values are much larger than the measurements, since the assumption of coalesced accesses only underestimates the memory contention effects.

Figure 11 shows the results of concurrent execution IPC prediction on GTX680 without considering the multiple warp schedulers. The estimation without considering the number of warp schedulers severely underestimates the IPC on GTX680, in comparison with the results in Figure 8.

CP Prediction. We further evaluate the accuracy of CP prediction. Figure 12 shows the comparison between measured and predicted CP on C2050. We observe similar results on GTX680. The prediction is close to the measurement. With accurate prediction of IPC, the difference in CP between prediction and measurement is small. The results are sufficiently good to guide the scheduling decision, as shown in the next section.


Fig. 10: Comparison between predicted and measured concurrent kernel execution IPCs with/without considering uncoalesced accesses on C2050.


Fig. 11: Comparison between predicted and measured concurrent kernel execution IPCs without considering multiple warp schedulers on GTX680.


5.4 Results on Kernel Scheduling

In this section, we evaluate the effectiveness of our kernel scheduling algorithm by comparing it with BASE and OPT. To simulate the continuous kernel submission process, we initiate 1000 instances for each kernel in each kernel mix and submit them for execution according to Poisson distributions. Different scheduling algorithms are applied and the total kernel execution time is reported. Figure 13 shows the total execution time of those kernels on C2050 and GTX680.


Fig. 12: Comparison between predicted and measured CP on C2050.


[Figure 13: total execution time (ms) of Base, Kernelet and Oracle under the CI, MI, MIX and ALL workloads; (a) C2050, (b) GTX680.]

Fig. 13: Comparison between different scheduling methods on both C2050 and GTX680.

TABLE 6: Number of kernels pruned with varying αp and αm on C2050.

αm \ αp    0.1  0.2  0.3  0.4  0.5  0.6  0.7  0.8  0.9  1.0
0.015       0    0    2    2    3    4    4    4    4    4
0.03        0    2    5    6    7    8    9    9    9    9
0.045       0    3    7    8    9   10   11   11   11   11
0.06        1    4    8    9   12   13   15   15   15   15
0.075       1    4    8    9   12   13   15   15   15   15
0.09        1    4    8    9   12   13   15   15   16   16
0.105       1    4    9   10   14   15   18   18   20   20
0.12        2    6   11   12   17   18   21   22   24   24
0.135       2    6   11   12   17   19   22   23   25   26
0.15        2    6   11   13   18   20   23   24   27   28

On all four workloads with different memory and computation characteristics, Kernelet outperforms Base (with improvements of 5.0–31.1% on C2050 and 6.7–23.4% on GTX680). Kernelet achieves performance close to OPT (with differences of 0.7–3.1% on C2050 and 4.0–15.0% on GTX680). The performance improvement of Kernelet over Base is more significant on MIX and ALL, because Kernelet has more chances to select kernel pairs with complementary resource usage. Still, Kernelet outperforms Base on CI and MI, because slicing exposes scheduling opportunities (even though they are small on CI and MI).

Table 6 shows the number of kernels pruned with different pruning parameters αp and αm on C2050. Increasing αp and αm leads to more kernel combinations being pruned. A similar pruning result is observed on GTX680. Varying those two parameters affects both the pruning power and the optimization opportunities. Thus, we choose the default values of αp and αm as 0.4 and 0.1 on C2050, and 0.4 and 0.105 on GTX680, respectively, as a tradeoff between pruning power and optimization opportunities.

We finally study the execution time distribution of the scheduling candidate space. Figure 14 shows the CDF (cumulative distribution function) of the execution time of MC(1000). As we can see from the figure, none of the random schedules is better than Kernelet. This demonstrates that random co-schedules hurt the performance with high probability, due to the huge space of schedule plans.
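For reference, the empirical CDF in Figure 14 can be formed from the MC samples as in the following sketch (e.g., from the list returned by the monte_carlo() sketch earlier); the sample times shown are hypothetical.

    # Minimal sketch: empirical CDF of MC(1000) execution times.
    def empirical_cdf(samples):
        xs = sorted(samples)
        n = len(xs)
        # (execution time, probability in % that a random plan finishes within it)
        return [(x, 100.0 * (i + 1) / n) for i, x in enumerate(xs)]

    cdf = empirical_cdf([2671000, 2668000, 2690500])  # made-up times in ms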


Fig. 14: CDF (cumulative distribution function) of execution time of MC(1000).

6 RELATED WORK

In this section, we review the related work in two categories: 1) scheduling algorithms on CPUs, especially for CPUs with SMT (Simultaneous Multi-Threading), and 2) multi-kernel executions on GPUs.

6.1 CPU Scheduling and Performance Modeling

SMT has been an effective hardware technology to better utilize CPU resources. SMT allows instructions from multiple threads to be executed in the instruction pipeline at the same time. Various SMT-aware scheduling techniques have been proposed to increase CPU utilization [8], [24], [39]–[42]. The core idea of SMT-aware scheduling is to co-schedule threads with complementary resource requirements. Several mechanisms have been proposed to select the threads to be executed on the same SMT core, such as hardware counter feedback [24], [41], pre-execution of all combinations [39] and the probabilistic job symbiosis model [8]. Performance models, including Markov chain based ones, have also been adopted for modeling concurrent tasks on CPUs. Serrano et al. [36], [37] developed a model to estimate the instruction throughput of superscalar processors executing multiple instruction streams.


Chen et al. [6] proposed an analytical model for estimating the throughput of multi-threaded CPUs.

Despite the fruitful research results on SMT scheduling, those techniques are not directly applicable to GPUs due to the architectural differences. First, the main performance-affecting issues are different for CPU and GPU applications. The L2 data cache is a key consideration for SMT-aware thread scheduling, whereas thread parallelism is usually more important for the performance of GPGPU programs [2], [19], [38]. Second, scheduling on GPUs is not as flexible as on CPUs; current GPUs do not support task preemption. Third, unlike CPUs, which support the concurrent execution of a relatively small number of threads, each GPGPU kernel launches thousands of threads. Additionally, the maximum number of co-scheduled threads equals the number of hardware contexts on the CPU, while the number of active warps on GPUs is dynamic, depending on the resource usage of the thread blocks. The slicing, scheduling and performance models in Kernelet are specifically designed for GPUs, taking those issues into consideration.

6.2 GPU Multiple Kernel Execution and Sharing

In the past few years, GPU architectures have undergone significant and rapid improvements for GPGPU support. Due to the lack of concurrent kernel support in early GPU architectures, researchers initially proposed to merge two kernels at the source code level [12], [14]. In those methods, two kernels are combined into a single kernel with if-else branches at different granularities (e.g., thread blocks). They have three major disadvantages compared with our approach. First, combining the code of two kernels increases the resource usage of each thread block, leading to lower SM occupancy and performance degradation [29]. Second, those approaches require source code, which may not always be available in shared environments. Third, they require that two kernels with different block sizes avoid using barriers within the thread block; otherwise deadlock may occur.

Recently, new-generation GPUs like NVIDIA Fermi GPUs support concurrent kernel executions. Taking advantage of this new capability, a number of multi-kernel optimization techniques [15], [23], [35] have been developed to improve the utilization of GPUs. Ravi et al. [34] proposed kernel consolidation to enable space sharing (different kernels run on different SMs) and time sharing (multiple kernels reside on the same SM) on GPUs. Space sharing happens when the total number of thread blocks of all kernels does not exceed the number of SMs, so that each block can be executed on a dedicated SM. If the total number of thread blocks is larger than the number of SMs, while the SMs have sufficient resources to accommodate more thread blocks from different kernels, time sharing happens. That means kernel consolidation offers no space sharing and little time sharing when the launched kernels have sufficient thread blocks to occupy the GPU.

Furthermore, they determined the kernels to be consolidated with heuristics based on the number of thread blocks. In contrast, Kernelet utilizes slicing to create more opportunities for time sharing, and develops a performance model to guide the scheduling decision. Peters et al. [32] used a persistently running kernel to handle requests from multiple applications. GPU virtualization has also been investigated [15], [23].

Recent studies also address the problem of GPU scheduling when multiple users co-reside on one machine. Pegasus [16] coordinates computing resources such as accelerators and CPUs, and provides a uniform resource usage model. TimeGraph [22] and PTask [35] manage the GPU at the operating system level. Kato et al. [21] introduced the responsive GPGPU execution model (RGEM). None of those scheduling methods considers how to schedule concurrent kernels in order to fully utilize the GPU resources.

As for performance models on GPUs, Hong [19] and Kim [38] proposed analytical models based on the round-robin warp scheduling assumption. Baghsorkhi et al. [2] introduced the work flow graph interpretation of GPU kernels to estimate their execution time. All those models are designed for a single kernel. Moreover, they usually require extensive hardware profiling and/or simulation processes. In contrast, our performance model is designed for concurrent kernel executions on the GPU, and relies on a small set of key performance factors of individual kernels to predict the performance of concurrent kernel executions.

7 CONCLUSION

Recently, GPUs have been increasingly used in clusters and cloud environments, where many kernels are submitted to and executed on shared GPUs. This paper proposes Kernelet to improve the throughput of concurrent kernel executions in such shared environments. Kernelet creates more sharing opportunities with kernel slicing, and uses a probabilistic performance model to capture the non-deterministic performance features of multiple-kernel executions. We evaluate Kernelet on two NVIDIA GPUs, Tesla C2050 and GTX680, with Fermi and Kepler architectures respectively. Our experiments demonstrate the accuracy of our performance model, and the effectiveness of Kernelet in improving the throughput of concurrent kernel executions by 5.0–31.1% on C2050 and 6.7–23.4% on GTX680 on our workloads.

REFERENCES

[1] asfermi: An assembler for the NVIDIA Fermi instruction set. http://code.google.com/p/asfermi/, accessed on Dec 17th, 2012.

[2] S. S. Baghsorkhi, M. Delahaye, S. J. Patel, W. D. Gropp, and W.-m. W. Hwu. An adaptive performance modeling tool for GPU architectures. SIGPLAN Not., 45(5):105–114, Jan. 2010.

[3] S. S. Baghsorkhi, I. Gelado, M. Delahaye, and W.-m. W. Hwu. Efficient performance evaluation of memory hierarchy for highly multithreaded graphics processors. In Proc. of PPoPP '12.

[4] N. Bell and M. Garland. Cusp: Generic parallel algorithms for sparse matrix and graph computations version 0.3.0. http://cusp-library.googlecode.com, accessed on Dec 17th, 2012.


[5] G. Chaitin. Register allocation & spilling via graph coloring. In ACM SIGPLAN Notices, volume 17, pages 98–105. ACM, 1982.

[6] X. Chen and T. Aamodt. A first-order fine-grained multithreaded throughput model.

[7] J. Duato, A. Pena, F. Silla, R. Mayo, and E. Quintana-Orti.

[8] S. Eyerman and L. Eeckhout. Probabilistic job symbiosis modeling for SMT processor scheduling. In Proc. of ASPLOS '10.

[9] M. Garland and D. B. Kirk. Understanding throughput-oriented architectures. Commun. ACM, 53(11), Nov. 2010.

[10] N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: high performance graphics co-processor sorting for large database management. In Proc. of SIGMOD '06.

[11] N. K. Govindaraju, B. Lloyd, Y. Dotsenko, B. Smith, and J. Manferdelli. High performance discrete Fourier transforms on graphics processors. In ACM/IEEE SuperComputing '08, 2008.

[12] C. Gregg, J. Dorn, K. Hazelwood, and K. Skadron. Fine-grained resource sharing for concurrent GPGPU kernels. 0:389–398.

[13] I. R. Group et al. Parboil benchmark suite, 2007.

[14] M. Guevara, C. Gregg, K. Hazelwood, and K. Skadron. Enabling task parallelism in the CUDA scheduler. In Workshop on Programming Models for Emerging Architectures (PMEA), 2009, pages 69–76.

[15] V. Gupta, A. Gavrilovska, K. Schwan, H. Kharche, N. Tolia, V. Talwar, and P. Ranganathan. GViM: GPU-accelerated virtual machines. In Proceedings of the 3rd ACM Workshop on System-level Virtualization for High Performance Computing, HPCVirt '09, pages 17–24, New York, NY, USA, 2009. ACM.

[16] V. Gupta, K. Schwan, N. Tolia, V. Talwar, and P. Ranganathan. Pegasus: coordinated scheduling for virtualized accelerator-based systems. In Proc. of USENIX ATC '11.

[17] B. He, M. Lu, K. Yang, R. Fang, N. K. Govindaraju, Q. Luo, and P. V. Sander. Relational query coprocessing on graphics processors. ACM Trans. Database Syst., 34(4):1–39, 2009.

[18] B. He, K. Yang, R. Fang, M. Lu, N. Govindaraju, Q. Luo, and P. Sander. Relational joins on graphics processors. In Proc. of SIGMOD '08.

[19] S. Hong and H. Kim. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In Proc. of ISCA '09.

[20] Y. Jiang, X. Shen, J. Chen, and R. Tripathi. Analysis and approximation of optimal co-scheduling on chip multiprocessors. In Proc. of PACT '08.

[21] S. Kato, K. Lakshmanan, A. Kumar, M. Kelkar, Y. Ishikawa, and R. R. Rajkumar. RGEM: A responsive GPGPU execution model for runtime engines. In Proc. of RTSS '11.

[22] S. Kato, K. Lakshmanan, R. Rajkumar, and Y. Ishikawa. TimeGraph: GPU scheduling for real-time multi-tasking environments. In Proc. of USENIX ATC '11.

[23] T. Li, V. Narayana, E. El-Araby, and T. El-Ghazawi. In Proc. of ICPP '11.

[24] T. Moseley, D. Grunwald, J. L. Kihm, and D. A. Connors. Methods for modeling resource contention on simultaneous multithreading processors. In Proc. of ICCD '05.

[25] R. Nath, S. Tomov, T. T. Dong, and J. Dongarra. Optimizing symmetric dense matrix-vector multiplication on GPUs. In Proc. of SC '11.

[26] A. Nukada, Y. Ogata, T. Endo, and S. Matsuoka. Bandwidth intensive 3-D FFT kernel for GPUs using CUDA. In Proc. of SC '08.

[27] NVIDIA. NVIDIA Fermi Compute Architecture Whitepaper, v1.1 edition.

[28] NVIDIA. NVIDIA GeForce GTX 680 Whitepaper, v1.0 edition.

[29] NVIDIA. NVIDIA CUDA C Programming Guide 4.2, 2012.

[30] NVIDIA CUDA. http://developer.nvidia.com/object/cuda.html.

[31] A. Papoulis. Probability, Random Variables and Stochastic Processes, volume 3. McGraw-Hill, New York, 1991.

[32] H. Peters, M. Koper, and N. Luttenberger. Efficiently using a CUDA-enabled GPU as shared resource. In Proc. of CIT '10.

[33] M. Poletto and V. Sarkar. Linear scan register allocation. ACM Transactions on Programming Languages and Systems (TOPLAS), 21(5):895–913, 1999.

[34] V. T. Ravi, M. Becchi, G. Agrawal, and S. Chakradhar. Supporting GPU sharing in cloud environments with a transparent runtime consolidation framework. In Proc. of HPDC '11.

[35] C. J. Rossbach, J. Currey, M. Silberstein, B. Ray, and E. Witchel. PTask: operating system abstractions to manage GPUs as compute devices. In Proc. of SOSP '11.

[36] M. Serrano. Performance estimation in a simultaneous multithreading processor. In Proc. of MASCOTS '96.

[37] M. Serrano, W. Yamamoto, R. Wood, and M. Nemirovsky. A model for performance estimation in a multistreamed superscalar processor. In Computer Performance Evaluation Modelling Techniques and Tools, volume 794 of Lecture Notes in Computer Science, pages 213–230. 1994.

[38] J. Sim, A. Dasgupta, H. Kim, and R. Vuduc. A performance analysis framework for identifying potential benefits in GPGPU applications. In Proc. of PPoPP '12.

[39] A. Snavely and D. M. Tullsen. Symbiotic jobscheduling for a simultaneous multithreaded processor. In Proceedings of the Ninth International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS-IX, 2000.

[40] A. Snavely, D. M. Tullsen, and G. Voelker. Symbiotic jobscheduling with priorities for a simultaneous multithreading processor. In Proc. of SIGMETRICS '02.

[41] S. E. Sujay Parekh and H. Levy. Thread-sensitive scheduling for SMT processors. Technical report 2000-04-02, University of Washington.

[42] D. Tam, R. Azimi, and M. Stumm. Thread clustering: sharing-aware scheduling on SMP-CMP-SMT multiprocessors. In Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, EuroSys '07, 2007.

[43] V. Volkov and J. W. Demmel. Benchmarking GPUs to tune dense linear algebra. In Proc. of SC '08.

[44] Y. Zhang and J. D. Owens. A quantitative performance analysis model for GPU architectures. In Proc. of HPCA '11, 2011.

[45] Zillians. V-GPU: GPU virtualization. http://www.zillians.com/products/vgpu-gpu-virtualization/, accessed on Dec 17th, 2012.