
The Performance Potential for Single Application

Heterogeneous Systems∗

Henry Wong

University of Toronto

henry@eecg.utoronto.ca

Tor M. Aamodt

University of British Columbia

[email protected]

Abstract

A consideration of Amdahl’s Law [9] suggests a single-chip multiprocessor with asymmetric cores is a promising way to improve performance [16]. In this paper, we conduct a limit study of the potential benefit of the tighter integration of a fast sequential core designed for instruction level parallelism (e.g., an out-of-order superscalar) and a large number of smaller cores designed for thread-level parallelism (e.g., a graphics processor). We optimally schedule instructions across cores under assumptions used in past ILP limit studies. We measure sensitivity to the sequential performance (instruction read-after-write latency) of the low-cost parallel cores, and latency and bandwidth of the communication channel between these cores and the fast sequential core. We find that the potential speedup of traditional “general purpose” applications (e.g., those from SpecCPU) as well as a heterogeneous workload (game physics) on a CPU+GPU system is low (2.2× to 12.7×), due to poor sequential performance of the parallel cores. Communication latency and bandwidth have comparatively small performance impact (1.07× to 1.48×), calling into question whether integrating onto one chip both an array of small parallel cores and a larger core will, in practice, benefit the performance of these workloads significantly when compared to a system using two separate specialized chips.

1 Introduction

As the number of cores integrated on a single chip continues to increase, the question of how useful additional cores will be is of intense interest. Recently, Hill and Marty [16] combined Amdahl’s Law [9] and Pollack’s Rule [28] to quantify the notion that single-chip asymmetric multicore processors may provide better performance than using the same silicon area for a single core or some number of identical cores. In this paper we take a step towards refining this analysis by considering real workloads and their behavior scheduled on an idealized machine while modeling communication latency and bandwidth limits.

∗ Work done while the first author was at the University of British Columbia.

Heterogeneous systems typically use a traditional microprocessor core optimized for extracting instruction level parallelism (ILP) for serial tasks, while offloading parallel sections of algorithms to an array of smaller cores to efficiently exploit available data and/or thread level parallelism. The Cell processor [10] is a heterogeneous multicore system, where a traditional PowerPC core resides on the same die as an array of eight smaller cores. Existing GPU compute systems [22, 2] typically consist of a CPU with a discrete GPU attached via a card on a PCI Express bus. Although development of CPU-GPU single-chip systems has been announced [1], there is little published information quantifying the benefits of such integration.

One common characteristic of heterogeneous multicore systems employing GPUs is that the small multicores for exploiting parallelism are unable to execute a single thread of execution as fast as the larger sequential processor in the system. For example, recent GPUs from NVIDIA have a register to register read-after-write latency equivalent to 24 shader clock cycles [25].¹ This latency is due in part to the use of fine grained multithreading [32] to hide memory access and arithmetic logic unit latency [13]. Our limit study is designed to capture this effect.

¹ The CUDA programming manual indicates 192 threads are required to hide read-after-write latency within a single thread, there are 32 threads per warp, and each warp is issued over four clock cycles.

While there have been previous limit studies on parallelism in the context of single-threaded machines [7, 17, 15], and homogeneous multicore machines [21], a heterogeneous system presents a different set of trade-offs. It is no longer merely a question of how much parallelism can be extracted, but also whether the parallelism is sufficient considering the lower sequential performance (higher register read-after-write latency) and communication overheads between processors. Furthermore, applications with sufficient thread-level parallelism to hide communication latencies may diminish the need for a single-chip heterogeneous system except where system cost considerations limit total silicon area to that available on a single chip.

This paper makes the following contributions:

• We perform a limit study of an optimistic heterogeneous system consisting of a sequential processor and a parallel processor, modeling a traditional CPU and an array of simpler cores for exploiting parallelism. We use a dynamic programming algorithm to choose points along the instruction trace where mode switches should occur such that the total runtime of the trace, including the penalties incurred for switching modes, is minimized.

• We show the parallel processor array’s sequential performance (read-after-write latency) relative to the performance of the sequential processor (CPU core) is a significant limitation on achievable speedup for a set of general-purpose applications. Note this is not the same as saying performance is limited by the serial portion of the computation [9].

• We find that latency and bandwidth between the two processors have comparatively minor effects on speedup.

In the case of a heterogeneous system using a GPU-like parallel processor, speedup is limited to only 12.7× for SPECfp 2000, 2.2× for SPECint 2000, and 2.5× for PhysicsBench [31]. When connecting the GPU using an off-chip PCI Express-like bus, SPECfp achieves 74%, SPECint 94%, and PhysicsBench 82% of the speedup achievable without latency and bandwidth limitations.

We present our processor model in Section 2, methodology in Section 3, analyze our results in Section 4, review previous limit studies in Section 5, and conclude in Section 6.

2 Modeling a Heterogeneous System

We model heterogeneous systems as having two processors with different characteristics (Figure 1). The sequential processor models a traditional processor core optimized for ILP, while the parallel processor models an array of cores for exploiting thread-level parallelism. The parallel processor models an array of low-cost cores by allowing parallelism, but with a longer register read-after-write latency than the sequential processor. The two processors may communicate over a communication channel whose latency is high and bandwidth is limited when the two processors are on separate chips. We assume that the processors are attached to ideal memory systems. Specifically, for dependencies between instructions within a given core (sequential or parallel) we assume store-to-load communication has the same latency as communication via registers (register read-after-write latency) on the same core. Thus, the effects of long latency memory access for the parallel core (assuming GPU-like fine-grained multi-threading to tolerate cache misses) are captured in the long read-to-write delay. The effects of caches and prefetching on the sequential processor core are captured by its relatively short read-to-write delay. We model a single-chip system (Figure 1(a)) with shared memory by only considering synchronization latency, potentially accomplished via shared memory (➊) and on-chip coherent caches [33]. We model a system with private memory (Figure 1(b)) by limiting the communication channel’s bandwidth and imposing a latency when data needs to be copied across the link (➋) between the two processors.

Figure 1. Conceptual Model of a Heterogeneous System. Two processors with different characteristics (a) may, or (b) may not, share memory.

Sections 2.1 and 2.2 describe each portion of our model in more detail. In Section 2.3 we describe our algorithm for partitioning and scheduling an instruction trace to optimize its runtime on the sequential and parallel processors.

2.1 Sequential Processor

We model the sequential processor as being able to execute one instruction per cycle (CPI of one). This simple model has the advantage of having predictable performance characteristics that make the optimal scheduling (Section 2.3) of work between sequential and parallel processors feasible. It preserves the essential characteristic of high-ILP processors that a program is executed serially, while avoiding the complexity of a more detailed model. Although this simple model does not capture the CPI effects of a sequential processor which exploits ILP, we are mainly interested in the relative speeds between the sequential and parallel processors. We account for sequential processor performance due to ILP by making the parallel processor relatively slower. In the remainder of this paper, all time periods are expressed in terms of the sequential processor’s cycle time.

2.2 Parallel Processor

We model the parallel processor as a dataflow processor, where a data dependency takes multiple cycles to resolve. This dataflow model is driven by our trace-based limit study methodology described in Section 3.2, which assumes perfectly predicted branches to uncover parallelism. Using a dataflow model, we avoid the requirement of partitioning instructions into threads, as done in the thread-programming model. This allows us to model the upper bound of parallelism for future programming models that may be more flexible than threads.

The parallel processor can execute multiple instructions in parallel, provided data dependencies are satisfied. Slower sequential performance of the parallel processor is modeled by increasing the latency from the beginning of an instruction’s execution until the time its result is available for a dependent instruction. We do not limit the parallelism that can be used by the program, as we are interested in the amount of parallelism available in algorithms.
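To make the two timing models concrete, the sketch below (ours, not the authors’ tool) computes the execution time the model assigns to a dependence graph on each processor: the sequential processor issues one instruction per cycle (Section 2.1), while the parallel processor starts an instruction as soon as its operands are ready, with each result taking `latency` cycles to become usable. The `deps` structure is a hypothetical per-instruction list of producer indices.

```python
def execution_times(deps, latency):
    """deps[i]: indices of earlier instructions whose results
    instruction i consumes (true dependencies only).
    Returns (sequential_time, parallel_time) in sequential-core cycles."""
    n = len(deps)
    sequential_time = n  # CPI of one: one instruction per cycle

    # Parallel processor: unlimited parallelism, but each result becomes
    # visible to dependents only `latency` cycles after issue. Runtime is
    # therefore the critical path through the dependence graph.
    finish = [0] * n
    for i in range(n):
        ready = max((finish[j] for j in deps[i]), default=0)
        finish[i] = ready + latency
    parallel_time = max(finish, default=0)
    return sequential_time, parallel_time
```

Under this model the parallel processor wins only when the dependence graph is shallow relative to its length; raising `latency` scales the cost of every dependence chain proportionally.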

Our model can represent a variety of methods of building parallel hardware. In addition to an array of single-threaded cores, it can also model cores using fine-grain multithreading, like current GPUs. Note that modern GPUs from AMD [3] and NVIDIA [23, 24] provide a scalar multithreaded programming abstraction even though the underlying hardware is single-instruction, multiple-data (SIMD). This execution mode has been called single-instruction, multiple-thread (SIMT) [14].

In GPUs, fine-grain multithreading creates the illusion of a large amount of parallelism (>10,000s of threads) with low per-thread performance, although physically there is a lower amount of parallelism (100s of operations per cycle), high utilization of the ALUs, and frequent thread switching. GPUs use the large number of threads to “hide” register read-after-write latencies and memory access latencies by switching to a ready thread. From the perspective of the algorithm, a GPU appears as a highly-parallel, low-sequential-performance parallel processor.

To model current GPUs, we use a register read-after-write latency of 100 cycles. For example, current NVIDIA GPUs have a read-after-write latency of 24 shader clocks [25] and a shader clock frequency of 1.3-1.5 GHz [23, 24]. The 100 cycle estimate includes the effect of instruction latency (24×), the difference between the shader clock and current CPU clock speeds (about 2×), and the ability of current CPUs to extract ILP; we assume an average IPC of 2 on current CPUs, resulting in another factor of 2×. We ignore factors such as SIMT branch divergence [8].
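As a worked version of this estimate (the 2× factors are the approximations just stated):

```latex
\underbrace{24}_{\text{shader cycles}}
\times \underbrace{\approx 2}_{\text{clock-speed ratio}}
\times \underbrace{\approx 2}_{\text{CPU IPC}}
= 96 \approx 100 \ \text{sequential-core cycles.}
```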

We note that the SPE cores on the Cell processor have comparable read-after-write latency to the more general purpose PPE core. However, the SPE cores are not optimized for control-flow intensive code [10] and thus may potentially suffer a higher “effective” read-after-write latency on some general purpose code (although quantifying such effects is beyond the scope of this work).

2.3 Heterogeneity

We model a heterogeneous system by allowing an algorithm to choose between executing on the sequential processor or the parallel processor and to switch between them (which we refer to as a “mode switch”). We do not allow concurrent execution on both processors. This is a common paradigm, where a parallel section of work is spawned off to a co-processor while the main processor waits for the results. An optimal schedule that allows concurrent execution (e.g., as in the asymmetric multicore chips analysis given by Hill and Marty [16]) can be at most 2× faster than one that does not allow concurrency.
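The 2× figure follows from a one-line argument (our notation): if, over some interval, a schedule assigns $t_s$ cycles of work to the sequential processor and $t_p$ cycles to the parallel processor, a concurrent schedule needs at least $\max(t_s, t_p)$ cycles, while our serialized model needs $t_s + t_p$:

```latex
t_{\text{serialized}} = t_s + t_p \;\le\; 2\max(t_s, t_p) \;\le\; 2\,t_{\text{concurrent}}.
```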

We schedule an instruction trace for alternating execution on the two processors. Execution of a trace on each type of core was described in Sections 2.1 and 2.2. For each mode switch, we impose a “mode switch cost”, intuitively modeling synchronization time during which no useful work is performed. The mode switch cost is used to model communication latency and bandwidth as described in Sections 2.3.2 and 2.3.3, respectively. Next we describe our scheduling algorithm in more detail.

2.3.1 Scheduling Algorithm

Dynamic programming is often applied to find optimal solutions to optimization problems. The paradigm requires that an optimal solution to a problem be recursively decomposable into optimal solutions of smaller sub-problems, with the solutions to the sub-problems computed first and saved in a table to avoid re-computation [6].

In our dynamic programming algorithm, we aim to compute the set of mode switches (i.e., the scheduling) of the given instruction trace that will minimize execution time, given the constraints of our model. We decompose the optimal solution to the whole trace into sub-problems that are optimal solutions to shorter traces with the same beginning, with the ultimate sub-problem being the trace with only the first instruction, which can be trivially scheduled. We recursively define a solution to a longer trace by observing that a solution to a long trace is composed of a solution to a shorter sub-trace, followed by a decision on whether to perform a mode switch, followed by execution of the remaining instructions in the chosen mode.

The dynamic programming algorithm keeps an N×2 state table when given an input trace of N instructions. Each entry in the state table records the cost of an optimal scheduling for every sub-trace (N of them) and the mode that was last used in that sub-trace (2 modes). At each step of the algorithm, a solution for the next sub-trace requires examining all possible locations of the previous mode switch to find the one that gives the best schedule. For each possible mode switch location, the corresponding entry of the state table is examined to retrieve the optimal solution for the sub-trace that executes all instructions up to that entry in the corresponding state (execution on the sequential, or the parallel core, respectively). This value is used to compute a candidate state table entry for the current step by adding the mode switch cost (if switching modes) and the cost to execute the remaining section of the trace from the candidate switch point up to the current instruction in the current mode (sequential or parallel). The lowest cost candidate over all earlier sub-traces is chosen for the current sub-trace.
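A minimal sketch of this dynamic program follows, under stated assumptions: `segment_cost(mode, i, j)` is a hypothetical helper returning the time to run instructions `i..j-1` entirely in one mode (CPI of one for the sequential mode, the dataflow model of Section 2.2 for the parallel mode), and `switch_cost` is the mode switch cost of Sections 2.3.2 and 2.3.3. The `window` parameter implements the bounded lookback described next.

```python
SEQ, PAR = 0, 1

def schedule_cost(n, segment_cost, switch_cost, window=30_000):
    """best[j][m]: cost of an optimal schedule of the first j
    instructions whose final segment runs in mode m."""
    INF = float("inf")
    best = [[INF, INF] for _ in range(n + 1)]
    best[0][SEQ] = best[0][PAR] = 0.0  # an empty trace costs nothing
    for j in range(1, n + 1):
        for m in (SEQ, PAR):
            cand = INF
            # Consider every possible location i of the previous mode
            # switch, looking back at most `window` instructions.
            for i in range(max(0, j - window), j):
                run = segment_cost(m, i, j)  # run i..j-1 entirely in mode m
                for prev in (SEQ, PAR):
                    sw = switch_cost if prev != m else 0.0
                    cand = min(cand, best[i][prev] + sw + run)
            best[j][m] = cand
    return min(best[n])
```

Recording the minimizing `i` and `prev` alongside each table entry recovers the actual switch points by backtracking from `best[n]`.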

The naive optimal algorithm described above runs in quadratic time with respect to the instruction trace length. For traces of millions of instructions in length, quadratic time is too slow. We make an approximation to enable the algorithm to run in time linear in the length of the instruction trace. Instead of looking back at all past instructions for each potential mode switch point, we only look back 30,000 instructions. The modified algorithm is no longer optimal. We mitigate this sub-optimality by first reordering instructions before scheduling. We observed that the amount of sub-optimality using this approach is insignificant.

To overcome the limitation of looking back only 30,000 instructions in our algorithm, we reorder instructions into dataflow order before scheduling. Dataflow order is the order in which instructions would execute if scheduled with our optimal scheduling algorithm. This linear-time preprocessing step exposes parallelism found anywhere in the instruction trace by grouping together instructions that can execute in parallel.
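A sketch of this preprocessing step, reusing the hypothetical `deps` lists from the earlier sketch: each instruction’s timestamp is the earliest cycle it could begin on the parallel processor, and a stable sort by timestamp groups independent instructions together without violating dependencies.

```python
def dataflow_order(deps, latency):
    """Return instruction indices reordered into dataflow order: the
    order in which they would execute under the parallel schedule."""
    n = len(deps)
    start = [0] * n
    for i in range(n):
        # Earliest start: when the slowest producer's result is available.
        start[i] = max((start[j] + latency for j in deps[i]), default=0)
    # A stable sort keeps program order among instructions that become
    # ready in the same cycle, and producers always sort before consumers.
    return sorted(range(n), key=lambda i: start[i])
```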

We remove instructions from the trace that do not depend on the result of any other instruction. Most of these instructions are dead code created by our method of exposing loop- and function-level parallelism, described in Section 3.2. Since dead code can execute in parallel, we remove these instructions to avoid having them inflate the amount of parallelism we observe. Across our benchmark set, 27% of instructions are removed by this mechanism. Note that the dead code we are removing is not necessarily dynamically dead [5], but rather overhead related to sequential execution of parallel code. The large number of instructions removed results from, for example, the expansion of x86 push and pop instructions (for register spills/fills) into a load or store micro-op (which we keep) and a stack-pointer update micro-op (which we do not keep).

2.3.2 Latency

We model the latency of migrating tasks between processors by imposing a constant runtime cost for each mode switch. This cost is intended to model the latency of spawning a task, as well as transferring data between the processors. If the amount of data transferred is large relative to the bandwidth of the link between processors, this is not a good model for the cost of a mode switch. This model is reasonable when the mode switch is dominated by latency, for example in a heterogeneous multicore system where the memory hierarchy is shared (Figure 1(a)), so very little data needs to be copied between the processors.

As described in Section 2.3, our scheduling algorithm considers the cost of mode switches. A mode switch cost of zero would allow freely switching between modes, while a very high cost would constrain the scheduler to choose to run the entire trace on one processor or the other, whichever was faster.

2.3.3 Bandwidth

Bandwidth is a constraint that limits the rate at which data can be transferred between processors in our model. Note that this does not apply to each processor’s link to its memory (Figure 1), which we assume to be unconstrained. In our shared-memory model (Figure 1(a)), mode switches do not need to copy large amounts of data, so only latency (Section 2.3.2) is a relevant constraint. In our private-memory model (Figure 1(b)), bandwidth is consumed on the link connecting processors as a result of a mode switch.

If a data value is produced by an instruction in one processor and consumed by one or more instructions in the other processor, then that data value needs to be communicated to the other processor. A consequence of exceeding the imposed bandwidth limitation is the addition of idle computation cycles while an instruction waits for its required operand to be transferred. In our model, we assume opportunistic use of bandwidth, allowing communication of a value as soon as it is ready, in parallel with computation.

Each data value to be transferred is sent sequentially and occupies the communication channel for a specific amount of time. Data values can be sent any time after the instruction producing the value executes, but must arrive before the first instruction that consumes the value is executed. Data transfers are scheduled onto the communication channel using an “earliest deadline first” algorithm, which produces a schedule with the minimum number of added idle cycles.
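A simplified sketch of this transfer scheduling: each cross-processor value is a `(ready, deadline, size_cycles)` tuple, values are sent in deadline order, the channel is occupied for `size_cycles` per value, and a value arriving after its deadline stalls the consuming instruction by the difference. (The full model feeds these stalls back into scheduling iteratively, as described below; this sketch only totals them for a fixed transfer set.)

```python
def transfer_idle_cycles(transfers):
    """transfers: iterable of (ready, deadline, size_cycles), one per
    value produced on one processor and consumed on the other.
    Returns total idle cycles added by the channel's bandwidth limit."""
    busy_until = 0  # cycle at which the channel next becomes free
    idle = 0
    for ready, deadline, size_cycles in sorted(transfers, key=lambda t: t[1]):
        start = max(busy_until, ready)    # opportunistic: send once produced
        busy_until = start + size_cycles  # channel occupied for this value
        idle += max(0, busy_until - deadline)
    return idle
```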

Bandwidth constraints are applied by changing the amount of time each data value occupies the communication channel. Communication latency is applied by setting the deadline for a value some number of cycles after the value is produced.

Computing the bandwidth requirements and needed idle cycles, and thus the cost to switch modes, requires a schedule of the instruction trace, but the optimal instruction trace schedule is affected by the cost of switching modes. We approximate the ideal behavior iteratively: we schedule using a constant overhead for each mode switch, measure the average penalty due to bandwidth consumption across all mode switches, and use the new estimate of the average switch cost as input to the scheduling algorithm, repeating until convergence.
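The iteration can be written as a small fixed-point loop. `schedule_with_cost` and `average_bandwidth_penalty` below are hypothetical stand-ins for the scheduler of Section 2.3.1 and the channel model above; neither name comes from the paper.

```python
def converge_switch_cost(schedule_with_cost, average_bandwidth_penalty,
                         base_latency, tol=0.01, max_iters=50):
    """Alternate scheduling and penalty measurement until the assumed
    per-switch cost agrees with the measured bandwidth penalty."""
    cost = float(base_latency)
    for _ in range(max_iters):
        sched = schedule_with_cost(cost)
        new_cost = base_latency + average_bandwidth_penalty(sched)
        if abs(new_cost - cost) <= tol * max(1.0, cost):
            break  # converged: assumed and measured costs agree
        cost = new_cost
    return sched, cost
```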

3 Simulation Infrastructure

We evaluate performance using micro-op traces extracted from execution of a set of x86-64 benchmarks on the PTLsim [18] simulator. Each micro-op trace was then scheduled using our scheduling algorithm for execution on the heterogeneous system.

3.1 Benchmark Set

We chose our benchmarks with a focus towards general-purpose computing. We used the reference workloads for SPECint and SPECfp 2000 v1.3.1 (23 benchmarks, except 253.perlbmk and 255.vortex, which did not run in our simulation environment), PhysicsBench 2.0 [31] (8 benchmarks), SimpleScalar 3.0 [29] (used here as a benchmark), and four small microbenchmarks (described in Table 1).

Table 1. Microbenchmarks

linear: Compute the average of 9 input pixels for each output pixel. Each pixel is independent.

sepia: 3×3 constant matrix multiply on each pixel’s 3 components. Each pixel is independent.

serial: A long chain of dependent instructions; parallelism is approximately 1 (no parallelism).

twophase: Loops through two alternating phases, one with no parallelism, one with high parallelism. Needs to switch between processor types for high speedup.

We chose PhysicsBench because it contains both sequential and parallel phases, and would be a likely candidate to benefit from heterogeneity, as it would be unsatisfactory if both types of phases were constrained to one processor type [31].

Our SimpleScalar benchmark used the out-of-order processor simulator from SimpleScalar/PISA, running go from SPECint 95, compiled for PISA.

We used four microbenchmarks to observe behavior at the extremes of parallelism, as shown in Table 1. Linear and sepia are highly parallel, serial is entirely serial, and twophase has alternating highly parallel and serial phases.

Figure 2 shows the average parallelism present in our benchmark set. As expected, SPECfp has more parallelism (611) than SPECint (116) and PhysicsBench (83). Linear (4790) and sepia (6815) have the highest parallelism, while serial has essentially no parallelism.

3.2 Traces

Micro-op traces were collected from PTLsim running x86-64 benchmarks, compiled with gcc 4.1.2 -O2. Four microbenchmarks were run in their entirety, while the 32 real benchmarks were run through SimPoint [30] to choose representative sub-traces to analyze. Our traces are captured at the micro-op level, so in this paper instruction and micro-op are used interchangeably.

We used SimPoint to select simulation points of 10-million micro-ops in length from complete runs of benchmarks. As recommended [30], we allowed SimPoint to decide how many simulation points should be used to approximate the entire benchmark run. We averaged 12.9 simulation points per benchmark. This is a significant savings over the complete benchmarks, which were typically several hundred billion instructions long. The weighted average of the results over each set of SimPoint traces is presented for each benchmark.

Figure 2. Average Parallelism of Our Benchmark Set.

We assume branches are correctly predicted. Many branches, like loops, can often be easily predicted, speculated, or even restructured away during manual parallelization. As we are trying to evaluate the upper bound of parallelism in an algorithm, we avoid limiting parallelism by not imposing the branch-handling characteristics of sequential machines. This is somewhat optimistic, as true data-dependent branches would at least need to be converted into speculation or predicated instructions.

Each trace is analyzed for true data dependencies. Register dependencies are recognized if an instruction consumes a value produced by an earlier instruction (read-after-write). Dependencies on values carried by the instruction pointer register are ignored, to avoid dependencies due to instruction-pointer-relative data addressing. As in earlier limit studies [17, 15], stack pointer register manipulations are ignored, to extract parallelism across function calls. Memory disambiguation is perfect: dependencies are carried through memory only if an instruction loads a value from memory actually written by an earlier instruction.
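This analysis can be sketched as a single pass over the trace with last-writer maps. The micro-op fields (`srcs`, `dsts`, `loads`, `stores`) and the register names below are illustrative only, not PTLsim’s actual interface; per the rules above, the instruction pointer and stack pointer are excluded, and memory dependencies use perfect disambiguation via the last store to each address.

```python
IGNORED_REGS = {"rip", "rsp"}  # IP-relative addressing and stack ops ignored

def build_dependencies(trace):
    """trace: list of micro-op records with .srcs/.dsts (register names)
    and .loads/.stores (memory addresses). Returns deps[i], the set of
    earlier instruction indices that instruction i truly depends on."""
    last_reg_writer = {}  # register name -> index of last writer
    last_mem_writer = {}  # address -> index of last store
    deps = []
    for i, uop in enumerate(trace):
        d = set()
        for r in uop.srcs:
            if r not in IGNORED_REGS and r in last_reg_writer:
                d.add(last_reg_writer[r])  # register read-after-write
        for addr in uop.loads:
            if addr in last_mem_writer:    # value actually written earlier
                d.add(last_mem_writer[addr])
        for r in uop.dsts:
            if r not in IGNORED_REGS:
                last_reg_writer[r] = i
        for addr in uop.stores:
            last_mem_writer[addr] = i
        deps.append(d)
    return deps
```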

It is also important to be able to extract loop-level parallelism and avoid serialization of loops through the loop induction variable. We implemented a generic solution to prevent this type of serialization. We identify instructions that produce result values that are statically known, which are instructions that have no input operands (e.g. load constant). We then repeatedly look for instructions dependent only on values that are statically known and mark the values they produce as statically known as well. We then remove dependencies on all statically-known values. This is similar to repeatedly applying constant folding and constant propagation optimizations [20] to the instruction trace. The dead code that results is removed as described in Section 2.3.

A loop induction variable [20] is often initialized with a constant (e.g. 0). Incrementing the induction variable by a constant depends only on the initialization value of the induction variable, so the incremented value is also statically known. Each subsequent increment is likewise statically known. This removes serialization caused by the loop control variable, but preserves genuine data dependencies between loop iterations, including loop induction variable updates that depend on a variable computed value.
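A sketch of this propagation over the dependence sets built above: an instruction with no input operands produces a statically-known value, knownness propagates transitively, and dependencies on known values are then dropped. An induction variable initialized to a constant and incremented by constants is marked known at every step, exactly as described.

```python
def drop_static_dependencies(deps):
    """deps[i]: set of producer indices for instruction i. Returns new
    dependence sets with statically-known producers removed."""
    n = len(deps)
    known = [False] * n  # does instruction i produce a statically-known value?
    changed = True
    while changed:  # iterate to a fixed point, like repeated constant folding
        changed = False
        for i in range(n):
            # No inputs (e.g. load-constant), or every input already known.
            if not known[i] and all(known[j] for j in deps[i]):
                known[i] = True
                changed = True
    return [{j for j in d if not known[j]} for d in deps]
```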

4 Results

In this section, we present our analysis of our experimental results. First, we look at the speedup that can be achieved when adding a parallel co-processor to a sequential machine and show that the speedup is highly dependent on the parallel instruction latency. We define parallel instruction latency as the read-after-write latency of the parallel cores expressed in sequential-processor cycles (recall we assume a CPI of one for the sequential core), so it is also the ratio of the two processors’ sequential speeds. We then look at the effect of communication latency and bandwidth as parallel instruction latency is varied, and see that the effect, while significant, is comparatively small.
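In symbols (our notation, not the paper’s): since the sequential core executes one instruction per cycle, the parallel instruction latency $L$ is just the parallel core’s read-after-write latency measured in sequential-core cycles,

```latex
L \;=\; \frac{\text{read-after-write latency of the parallel cores}}{\text{sequential-core cycle time}},
```

so the GPU-like configuration of Section 2.2 corresponds to $L = 100$.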

Figure 3. Proportion of Instructions Scheduled on Parallel Core. Real benchmarks (a), microbenchmarks (b). (Fraction of instructions on the parallel processor vs. parallel instruction latency.)

4.1 Why Heterogeneous?

Figures 3 and 4 give some intuition for the characteristics of the scheduling algorithm. Figure 4 shows the parallelism of the instructions that are scheduled to use the parallel processor when our workloads are scheduled for best performance. Figure 3(a) shows the proportion of instructions that are assigned to execute on the parallel processor. As the instruction latency increases, sections of the workload where the benefit of parallelism does not outweigh the cost of slower sequential performance become scheduled onto the sequential processor, raising the average parallelism of those portions that remain on the parallel processor, while reducing the proportion of instructions that are scheduled on the parallel processor. The instructions that are scheduled to run on the sequential processor receive no speedup, but scheduling more instructions on the parallel processor in an attempt to increase parallelism will only decrease speedup.

The microbenchmarks in Figure 3(b) show our scheduling algorithm works as expected. Serial has nearly no instructions scheduled for the parallel core.

Figure 4. Parallelism on Parallel Processor. (Average parallelism in parallel phases vs. parallel instruction latency.)

Twophase has about 18.5% of instructions in its serial component that are scheduled on the sequential processor, leaving 81.5% on the parallel processor, while sepia and linear highly prefer the parallel processor.

We look at the potential speedup of adding a parallel processor to an existing sequential machine. Figures 5(a) and (b) show the speedup of our benchmarks for varying parallel instruction latency, as a speedup over a single sequential processor. Two plots for each benchmark group are shown: the solid plots show the speedup of a heterogeneous system where communication has no cost, while the dashed plots show speedup when communication is very expensive. We focus on the solid plots in this section.

It can be observed from Figures 5(a) and (b) that as the instruction latency increases, there is a significant loss in the potential speedup provided by the extra parallel processor, which becomes limited by the amount of parallelism that can be extracted from the workload, as seen in Figure 3. Since our parallel processor model is somewhat optimistic, the speedups shown here should be regarded as an upper bound of what can be achieved.

With a parallel processor with a GPU-like instruction latency of 100 cycles, SPECint would be limited to a speedup of 2.2×, SPECfp to 12.7×, and PhysicsBench to 2.5×, with 64%, 92%, and 72% of instructions scheduled on the parallel processor, respectively. The speedup is much lower than the peak relative throughput of a GPU compared to a sequential CPU (≈50×), which shows that if a GPU-like processor were used as the parallel processor in a heterogeneous system, the speedup on these workloads would be limited by the parallelism available in the workload, while still leaving much of the GPU hardware idle.

Figure 5. Speedup of Heterogeneous System: (a) real benchmarks, (b) microbenchmarks. Ideal communication (solid), communication forbidden (dashed, NoSwitch).

In contrast, for highly-parallel workloads, the speedups achieved at an instruction latency of 100 are similar to the peak throughput available in a GPU. The highly-parallel linear filter and sepia tone filter (Figure 5(b)) kernels have enough parallelism to achieve 50-70× speedup at an instruction latency of 100. A highly-serial workload (serial) does not benefit from the parallel processor.

Although current GPU compute solutions built with efficient low-complexity multi-threaded cores are sufficient to accelerate algorithms with large amounts of thread-level parallelism, general-purpose algorithms would be unable to utilize the large number of thread contexts provided by the GPU, leaving the available arithmetic hardware under-utilized.

4.2 Communication

In this section, we evaluate the impact of communication latency and bandwidth on the potential speedup, comparing performance between the extreme cases where communication is unrestricted and where communication is forbidden. The solid plots in Figure 5 show speedup when there are no limitations on communication, while the dashed plots (marked NoSwitch) have communication so expensive that the scheduler chooses to run the workload entirely on the sequential processor or the parallel processor, never switching between them. Figures 6(a) and (b) show the ratio between the solid and dashed plots in Figures 5(a) and (b), respectively, to highlight the impact of communication. At both extremes of instruction latency, where the workload is mostly sequential or mostly parallel, communication has little impact. It is in the moderate range around 100-200 where communication potentially matters most.

Figure 6. Slowdown of infinite communication cost (NoSwitch) compared to zero communication cost. Real benchmarks (a), microbenchmarks (b).

The potential impact of expensive (latency and bandwidth) communication is significant. For example, at a GPU-like instruction latency of 100, SPECint achieves only 56%, SPECfp 23%, and PhysicsBench 44% of the performance of no communication, as can be seen in Figure 6(a). From our microbenchmark set (Figures 5(b) and 6(b)), twophase is particularly sensitive to communication costs, and gets no speedup for instruction latency above 10. We look at more realistic constraints on latency and bandwidth in the following sections.

Figure 7. Slowdown due to 100,000 cycles of mode-switch latency. Real benchmarks.

4.2.1 Latency

Poor parallel performance is often attributed to high communication latency [21]. Heterogeneous processing adds a new communication requirement: the communication channel between sequential and parallel processors (Figure 1). In this section, we measure the impact of the latency of this communication channel.

We model this latency by requiring that switching modes between the two processor types causes a fixed amount of idle computation time. In this section, we do not consider the bandwidth of the data that needs to be transferred. This model represents a heterogeneous system with shared memory (Figure 1(a)), where migrating a task does not involve data copying, but only a pipeline flush, notifying the other processor of the work, and potentially flushing private caches if caches are not coherent.

Figure 7 shows the slowdown when we include 100,000 cycles of mode-switch latency in our performance model and scheduling, compared to a zero-latency mode switch.

Imposing a delay for every mode switch has only a minor effect on runtime. Although Figure 6(a) suggested that the potential for performance loss due to latency is great, even when each mode switch costs 100,000 cycles (greater than 10 µs at current clock rates), most of the speedup remains: we can achieve ≈85% of the performance of a heterogeneous system with zero-cost communication. Stated another way, reducing latency between sequential and parallel cores might provide an average ≈18% performance improvement.

To gain further insight into the impact of mode switch latency, Figure 8 illustrates the number of mode switches per 10 million instructions as we vary the cost of switching from zero to 1000 cycles. As the cost of a mode switch increases, the number of mode switches decreases. Also, more mode switches occur at intermediate values of parallel instruction latency, where the benefit of being able to use both types of processors outweighs the cost of switching modes.

Figure 8. Mode switches per 10M instructions as switch latency varies: (a) zero cycles, (b) 10 cycles, (c) 1000 cycles.

For systems with private memory (e.g. a discrete GPU), data copying is required when migrating a task between processors at mode switches. We consider bandwidth constraints in the next section.

4.2.2 Bandwidth

In the previous section, we saw that high communication latency had only a minor effect on achievable performance. Here, we place a bandwidth constraint on the communication between processors. Data that needs to be communicated between processors is restricted to a maximum rate, and the processors are forced to wait if data is not available in time for an instruction to use it, as described in Section 2.3.3. We also include 1,000 cycles of latency as part of the model.

Figure 9. Slowdown due to a bandwidth constraint of 8 cycles per 32-bit value and 1,000 cycles latency, similar to PCI Express x16. Real benchmarks.

Figure 10. Speedup over sequential processor for varying bandwidth constraints (PCI, AGP 4x, and PCI Express x16 marked). Real benchmarks.

We first construct a model to represent PCI Express, as discrete GPUs are often attached to the system this way. PCI Express x16 has a peak bandwidth of 4 GB/s and latency around 250 ns [4]. Assuming current processors perform about 4 billion instructions per second on 32-bit data values, we can model PCI Express using a latency of about 1,000 cycles and bandwidth of 4 cycles per 32-bit value. Being somewhat pessimistic to account for overheads, we use a bandwidth of 8 cycles per 32-bit value (about 2 GB/s).
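The constants follow from unit conversion; a worked version of the estimate (the 4-billion-instructions-per-second rate is the assumption stated above):

```latex
\frac{4\ \text{GB/s}}{4\ \text{bytes/value}} = 10^{9}\ \text{values/s}
\;\Rightarrow\;
\frac{4\times10^{9}\ \text{cycles/s}}{10^{9}\ \text{values/s}} = 4\ \text{cycles/value},
\qquad
250\ \text{ns}\times 4\times10^{9}\ \text{cycles/s} = 1000\ \text{cycles}.
```

Doubling the per-value cost to 8 cycles halves the modeled bandwidth to about 2 GB/s.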

Figure 9 shows the performance impact of restricting bandwidth to one 32-bit value every 8 clocks with 1,000 cycles of latency. Slowdown is worse than with 100,000 cycles of latency, but the benchmark set affected the most (SPECfp) can still achieve ≈67.4% of the ideal performance at a parallel instruction latency of 100. Stated another way, increasing bandwidth between sequential and parallel cores might provide an average 1.48× performance improvement for workloads like SPECfp. For workloads such as PhysicsBench and SPECint the potential benefits appear lower (1.33× and 1.07× potential speedup, respectively). Comparing latency (Figure 7) to bandwidth (Figure 9) constraints, SPECfp and PhysicsBench have more performance degradation than under a pure-latency constraint, but SPECint performs better, suggesting that SPECint is less sensitive to bandwidth.

The above plots suggest that a heterogeneous system connected by a modest communication channel, rather than a potentially expensive low-latency, high-bandwidth one, can still achieve much of the potential speedup.

To further evaluate whether GPU-like systems could be usefully attached using even lower-bandwidth interconnect, we measure the sensitivity of performance to bandwidth at an instruction latency of 100. Figure 10 shows the speedup for varying bandwidth. Bandwidth (x-axis) is normalized to 1 cycle per datum, equivalent to about 16 GB/s in today’s systems. Speedup (y-axis) is relative to the workload running on a sequential processor.

SPECfp and PhysicsBench have similar sensitivity to reduced bandwidth, while SPECint’s speedup loss at low bandwidth is less significant (Figure 10). Although there is some loss of performance at PCI Express speeds (normalized bandwidth = 1/8), about half of the potential benefit of heterogeneity remains at PCI-like speeds (normalized bandwidth = 1/128). At PCI Express x16 speeds, SPECint can achieve 92%, SPECfp 69%, and PhysicsBench 78% of the speedup achievable without latency and bandwidth limitations.

As can be seen from the above data, heterogeneous systems can potentially provide significant performance improvements on a wide range of applications, even when system cost sensitivity demands high-latency, low-bandwidth interconnect. However, it also shows that applications are not entirely insensitive to latency and bandwidth, so high-performance systems will still need to worry about increasing bandwidth and lowering latency.

The lower sensitivity to latency than to bandwidth suggests that a shared-memory multicore heterogeneous system would be of benefit, as sharing a single memory system avoids data copying when migrating tasks between processors, leaving only synchronization latency. This could increase costs, as die size would increase, and the memory system would then need to support the needs of both sequential and parallel processors. A high-performance off-chip interconnect like PCI Express or HyperTransport may be a good compromise.


5 Related Work

There have been many limit studies on the amount of parallelism within sequential programs.

Wall [7] studies parallelism in SPEC92 under various limitations in branch prediction, register renaming, and memory disambiguation. Lam and Wilson [17] study parallelism under branch prediction, control dependence analysis, and multiple-fetch. Postiff et al. [15] perform a similar analysis on the SPEC95 suite of benchmarks. These studies showed that significant amounts of parallelism exist in typical applications under optimistic assumptions, but they focused on extracting instruction-level parallelism on a single processor. As it becomes increasingly difficult to extract ILP from a single processor, performance increases often come from multicore systems.

As we move towards multicore systems, new constraints, such as communication latency, become applicable. Vachharajani et al. [21] study the speedup available on homogeneous multiprocessor systems. They use a greedy scheduling algorithm to assign instructions to cores. They also scale communication latency between cores in the array of cores and find that it is a significant limit on available parallelism.

In our study, we extend these analyses to heterogeneous systems, where there are two types of processors. Vachharajani et al. examined the impact of communication between processors within a homogeneous processor array. We examine the impact of communication between a sequential processor and an array of cores. In our model, we roughly account for communication latency between cores within the array by using a higher instruction read-after-write latency.

Heterogeneous systems are interesting because they are commercially available [10, 25, 2] and, for GPU compute systems, can leverage the existing software ecosystem by using the traditional CPU as the sequential processor. They have also been shown to be more area- and power-efficient [16, 26, 27] than homogeneous multicore systems.

Hill and Marty [16] use Amdahl’s Law to show that there are limits to parallel speedup, and make the case that when one must trade per-core performance for more cores, heterogeneous multiprocessor systems perform better than homogeneous ones: non-parallelizable fragments of code do not benefit from more cores, but do suffer when all cores are made slower to accommodate more cores. They indicate that more research should be done to explore “the scheduling and overhead challenges that Amdahl’s model doesn’t capture”. Our work can be viewed as an attempt to further quantify the impact of these challenges.

6 Conclusion

We conducted a limit study to analyze the behavior of a set of general purpose applications on a heterogeneous system consisting of a sequential processor and a parallel processor with higher instruction latency.

We showed that the instruction read-after-write latency of the parallel processor is a significant factor in performance. To be useful for applications without copious amounts of parallelism, we believe that the instruction read-after-write latencies of GPUs will need to decrease, and thus GPUs can no longer rely exclusively on fine-grain multithreading to keep utilization high. We note that VLIW or superscalar issue combined with fine-grained multithreading [3, 19] does not inherently mitigate this read-after-write latency, though adding forwarding [12] might. Our data shows that latency and bandwidth of communication between the parallel cores and the sequential core, while significant factors, have comparatively minor effects on performance. The latency and bandwidth characteristics of PCI Express were sufficient to achieve most of the available performance.

Note that since our results are normalized to the sequential processor, they scale as processor designs improve. As sequential processor performance improves in the future, the read-after-write latency of the parallel processor will also need to improve to match.

Manufacturers have built, and will likely continue to build, single-chip heterogeneous multicore processors. The data presented in this paper suggests the reasons for doing so may be other than to obtain higher performance from reduced communication overheads on general purpose workloads. A subject for future work is evaluating whether such conclusions hold under more realistic evaluation scenarios (limited hardware parallelism, detailed simulations, real hardware), along with exploration of a wider set of applications (ideally including real workloads carefully tuned specifically for a tightly coupled single-chip heterogeneous system). As well, this work does not quantify the effect that increasing problem size [11] may have on the question of the benefits of heterogeneous (or asymmetric) multicore performance.

Acknowledgements

We thank Dean Tullsen, Bob Dreyer, Hong Wang, Wilson Fung, and the anonymous reviewers for their comments on this work. This work was partly supported by the Natural Sciences and Engineering Research Council of Canada.


References

[1] AMD Inc. The Future is Fusion. http://sites.amd.com/us/Documents/AMD_fusion_Whitepaper.pdf, 2008.
[2] AMD Inc. ATI Stream Computing User Guide, 2009.
[3] AMD Inc. R700-Family Instruction Set Architecture, 2009.
[4] B. Holden. Latency Comparison Between HyperTransport and PCI-Express in Communications Systems. http://www.hypertransport.org/.
[5] J. A. Butts and G. Sohi. Dynamic Dead-Instruction Detection and Elimination. In Proc. Int’l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 199–210, 2002.
[6] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein. Introduction to Algorithms. McGraw Hill, 2nd edition, 2001.
[7] D. W. Wall. Limits of Instruction-Level Parallelism. Technical Report 93/6, DEC WRL, 1993.
[8] W. W. L. Fung, I. Sham, G. Yuan, and T. M. Aamodt. Dynamic Warp Formation: Efficient MIMD Control Flow on SIMD Graphics Hardware. To appear in: ACM Trans. Architec. Code Optim. (TACO), 2009.
[9] G. M. Amdahl. Validity of the single-processor approach to achieving large scale computing capabilities. In AFIPS Conf. Proc. vol. 30, pages 483–485, 1967.
[10] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maeurer, and D. Shippy. Introduction to the Cell multiprocessor. IBM J. R&D, 49(4/5):589–604, 2005.
[11] J. L. Gustafson. Reevaluating Amdahl’s Law. Communications of the ACM, 31(5):532–533, 1988.
[12] J. Laudon, A. Gupta, and M. Horowitz. Interleaving: A Multithreading Technique Targeting Multiprocessors and Workstations. In ASPLOS, pages 308–318, 1994.
[13] E. Lindholm, M. J. Kligard, and H. P. Moreton. A user-programmable vertex engine. In SIGGRAPH, pages 149–158, 2001.
[14] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. NVIDIA Tesla: A unified graphics and computing architecture. IEEE Micro, 28(2):39–55, March-April 2008.
[15] M. A. Postiff, D. A. Greene, G. S. Tyson, and T. N. Mudge. The Limits of Instruction Level Parallelism in SPEC95 Applications. ACM SIGARCH Comp. Arch. News, 27(1):31–34, 1999.
[16] M. D. Hill and M. R. Marty. Amdahl’s Law in the Multicore Era. IEEE Computer, 41(7):33–38, 2008.
[17] M. S. Lam and R. P. Wilson. Limits of control flow on parallelism. In Proc. 19th Int’l Symp. on Computer Architecture, pages 46–57, 1992.
[18] M. T. Yourst. PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural Simulator. In IEEE Int’l Symp. on Performance Analysis of Systems and Software (ISPASS), 2007.
[19] J. Montrym and H. Moreton. The GeForce 6800. IEEE Micro, 25(2):41–51, 2005.
[20] S. Muchnick. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
[21] N. Vachharajani, M. Iyer, C. Ashok, M. Vachharajani, D. I. August, and D. Connors. Chip multi-processor scalability for single-threaded applications. ACM SIGARCH Comp. Arch. News, 33(4):44–53, 2005.
[22] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with CUDA. Queue, 6(2):40–53, 2008.
[23] NVIDIA Corp. GeForce 8 Series. http://www.nvidia.com/page/geforce8.html.
[24] NVIDIA Corp. GeForce GTX 280. http://www.nvidia.com/object/geforce_gtx_280.html.
[25] NVIDIA Corp. CUDA Programming Guide, 2.2 edition, 2009.
[26] R. Kumar, D. M. Tullsen, P. Ranganathan, and N. P. Jouppi. Single-ISA Heterogeneous Multi-Core Architectures for Multithreaded Workload Performance. In Proc. 31st Int’l Symp. on Computer Architecture, page 64, 2004.
[27] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen. Single-ISA Heterogeneous Multi-Core Architectures: The Potential for Processor Power Reduction. In Proc. 36th IEEE/ACM Int’l Symp. on Microarchitecture, page 81, 2003.
[28] S. Borkar. Thousand Core Chips: A Technology Perspective. In Proc. 44th Annual Conf. on Design Automation, pages 746–749, 2007.
[29] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. IEEE Computer, 35(2), 2002.
[30] T. Sherwood, E. Perelman, G. Hamerly, and B. Calder. Automatically Characterizing Large Scale Program Behavior. In Proc. 10th Int’l Conf. on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2002.
[31] T. Y. Yeh, P. Faloutsos, S. J. Patel, and G. Reinman. ParallAX: An architecture for real-time physics. In Proc. 34th Int’l Symp. on Computer Architecture, pages 232–243, 2007.
[32] J. E. Thornton. Parallel operation in the Control Data 6600. In AFIPS Proc. FJCC, volume 26, pages 33–40, 1964.
[33] H. Wong, A. Bracy, E. Schuchman, T. M. Aamodt, J. D. Collins, P. H. Wang, G. Chinya, A. K. Groen, H. Jiang, and H. Wang. Pangaea: A Tightly-Coupled IA32 Heterogeneous Chip Multiprocessor. In Int’l Conf. on Parallel Architectures and Compilation Techniques (PACT 2008), pages 52–61, Oct. 2008.