
Princeton University Technical Report TR-811-08, January 2008

The PARSEC Benchmark Suite: Characterization and Architectural Implications

Christian Bienia†, Sanjeev Kumar‡, Jaswinder Pal Singh† and Kai Li†
† Department of Computer Science, Princeton University
‡ Microprocessor Technology Labs, Intel

[email protected]

ABSTRACT
This paper presents and characterizes the Princeton Application Repository for Shared-Memory Computers (PARSEC), a benchmark suite for studies of Chip-Multiprocessors (CMPs). Previously available benchmarks for multiprocessors have focused on high-performance computing applications and used a limited number of synchronization methods. PARSEC includes emerging applications in recognition, mining and synthesis (RMS) as well as systems applications which mimic large-scale multi-threaded commercial programs. Our characterization shows that the benchmark suite is diverse in working set, locality, data sharing, synchronization, and off-chip traffic. The benchmark suite has been made available to the public.

Categories and Subject Descriptors
D.0 [Software]: [benchmark suite]

General Terms
Performance, Measurement, Experimentation

Keywords
benchmark suite, performance measurement, multi-threading, shared-memory computers

1. INTRODUCTION
Benchmarking is the quantitative foundation of computer architecture research. Program execution time is the only accurate way to measure performance [18]. Without a program selection that provides a representative snapshot of the target application space, performance results can be misleading and no valid conclusions may be drawn from an experiment outcome. CMPs require a disruptive change in order for programs to benefit from their full potential. Future applications will have to be parallel, but due to the lack of a representative, multi-threaded benchmark suite most scientists have been forced to fall back on existing benchmarks. This usually meant the use of older High-Performance Computing (HPC) workloads, smaller suites with only a few programs, or unparallelized benchmarks. We consider this trend extremely dangerous for the whole discipline. Representative conclusions require representative experiments and, as we argue in this paper, existing benchmark suites cannot be considered adequate to describe future CMP applications.

Large processor manufacturers have already reacted and developed their own internal collections of workloads. An example is the Intel RMS benchmark suite [14]. However, these suites often include proprietary code and are not publicly available. To address this problem, we created the PARSEC benchmark suite in collaboration with Intel Corporation. It includes not only a number of important applications from the RMS suite but also several leading-edge applications from Princeton University, Stanford University and the open-source domain. The goal is to create a suite of emerging workloads that can drive CMP research.

This paper makes three contributions:

• We identify shortcomings of commonly used benchmark suites and explain why they should not be used to evaluate CMPs.

• We present and characterize PARSEC, a new benchmark suite for CMPs that is diverse enough to allow representative conclusions.

• Based on our characterization of PARSEC, we analyze what properties future CMPs must have in order to deliver scalable performance for emerging applications. In particular, our understanding of the behavior of future workloads allows us to quantify how CMPs must be built in order to mitigate the effects of the memory wall on the next generation of programs.

In Section 2 we describe why existing benchmark suites cannot be considered adequate to describe future CMP applications. In Section 3, we present the PARSEC benchmark suite and explain how it avoids the shortcomings of other collections of benchmarks. The methodology which we use to characterize our workloads is presented in Section 4. In Sections 5 to 8, we analyze the parallelization, working sets, communication behavior and off-chip traffic of the benchmark programs. We conclude our study in Section 9.

2. MOTIVATION
The goal of this work is to define a benchmark suite that can be used to design the next generation of processors. In this section, we first present the requirements for such a suite. We then discuss how existing benchmarks fail to meet these requirements.

2.1 Requirements for a Benchmark Suite
We have the following five requirements for a benchmark suite:


Multi-threaded Applications Shared-memory CMPs are already ubiquitous. The trend for future processors is to deliver large performance improvements through increasing core counts on CMPs while only providing modest serial performance improvements. Consequently, applications that require additional processing power will need to be parallel.

Emerging Workloads Rapidly increasing processing power is enabling a new class of applications whose computational requirements were beyond the capabilities of the earlier generation of processors [14]. Such applications are significantly different from earlier applications (see Section 3). Future processors will be designed to meet the demands of these emerging applications, and a benchmark suite should represent them.

Diverse Applications are increasingly diverse, run on a variety of platforms and accommodate different usage models. They include interactive applications like computer games, offline applications like data mining programs, and programs with different parallelization models. Specialized collections of benchmarks can be used to study some of these areas in more detail, but decisions about general-purpose processors should be based on a diverse set of applications. While a truly representative suite is impossible to create, reasonable effort should be made to maximize the diversity of the program selection. The number of benchmarks must be large enough to capture a sufficient amount of the characteristics of the target application space.

Employ State-of-the-Art Techniques A number of application domains have changed dramatically over the last decade and use very different algorithms and techniques. Visual applications, for example, have started to increasingly integrate physics simulations to generate more realistic animations [20]. A benchmark should not only represent emerging applications but also use state-of-the-art techniques.

Support Research A benchmark suite intended for research has additional requirements compared to one used for benchmarking real machines alone. Benchmark suites intended for research usually go beyond pure scoring systems and provide infrastructure to instrument, manipulate, and perform detailed simulations of the included programs in an efficient manner.

2.2 Limitations of Existing Benchmark Suites
In the remaining part of this section we analyze how existing benchmark suites fall short of the presented requirements and must thus be considered unsuitable for evaluating CMP performance.

SPLASH-2 SPLASH-2 is a suite composed of multi-threaded applications [44] and hence seems to be an ideal candidate to measure the performance of CMPs. However, its program collection is skewed towards HPC and graphics programs. It thus does not include parallelization models such as the pipeline model which are used in other application areas. SPLASH-2 should furthermore not be considered state-of-the-art anymore. barnes, for example, implements the Barnes-Hut algorithm for N-body simulation [8]. For galaxy simulations it has largely been superseded by the TreeSPH [19] method, which can also account for mass such as dark matter which is not concentrated in bodies. However, even for pure N-body simulation barnes must be considered outdated. In 1995 Xu proposed a hybrid algorithm which combines the hierarchical tree algorithm and the Fourier-based Particle-Mesh (PM) method into the superior TreePM method [45]. Our analysis shows that similar issues exist for a number of other applications of the suite, including raytrace and radiosity.

SPEC CPU2006 and OMP2001 SPEC CPU2006 and SPEC OMP2001 are two of the largest and most significant collections of benchmarks. They provide a snapshot of current scientific and engineering applications. Computer architecture research, however, commonly focuses on the near future and should thus also consider emerging applications. Workloads such as systems programs and parallelization models which employ the producer-consumer model are not included. SPEC CPU2006 is furthermore a suite of serial programs that is not intended for studies of parallel machines.

Other Benchmark Suites Besides these major benchmark suites, several smaller suites exist. They were usually designed to study a specific program area and are thus limited to a single application domain. Therefore they usually include a smaller set of applications than a diverse benchmark suite typically offers. Due to these limitations they are commonly not used for scientific studies which do not restrict themselves to the covered application domain. Examples of these types of benchmark suites are ALPBench [25], BioParallel [22], MediaBench [1], NU-MineBench [23] and PhysicsBench [46]. Because of their different focus we do not discuss these suites in more detail.

3. THE PARSEC BENCHMARK SUITE
One of the goals of the PARSEC suite was to assemble a program selection that is large and diverse enough to be sufficiently representative for scientific studies. It consists of 9 applications and 3 kernels which were chosen from a wide range of application domains. In Table 1 we present a qualitative summary of their key characteristics. PARSEC workloads were selected to include different combinations of parallel models, machine requirements and runtime behaviors.

PARSEC meets all the requirements outlined in Section 2.1:

• Each of the applications has been parallelized.

• The PARSEC benchmark suite is not skewed towards HPC programs, which are abundant but represent only a niche. It focuses on emerging workloads. The algorithms these programs implement are usually considered useful, but their computational demands are prohibitively high on contemporary platforms. As more powerful processors become available in the near future, they are likely to proliferate rapidly.


| Program | Application Domain | Parallelization Model | Granularity | Working Set | Data Sharing | Data Exchange |
|---|---|---|---|---|---|---|
| blackscholes | Financial Analysis | data-parallel | coarse | small | low | low |
| bodytrack | Computer Vision | data-parallel | medium | medium | high | medium |
| canneal | Engineering | unstructured | fine | unbounded | high | high |
| dedup | Enterprise Storage | pipeline | medium | unbounded | high | high |
| facesim | Animation | data-parallel | coarse | large | low | medium |
| ferret | Similarity Search | pipeline | medium | unbounded | high | high |
| fluidanimate | Animation | data-parallel | fine | large | low | medium |
| freqmine | Data Mining | data-parallel | medium | unbounded | high | medium |
| streamcluster | Data Mining | data-parallel | medium | medium | low | medium |
| swaptions | Financial Analysis | data-parallel | coarse | medium | low | low |
| vips | Media Processing | data-parallel | coarse | medium | low | medium |
| x264 | Media Processing | pipeline | coarse | medium | high | high |

Table 1: Qualitative summary of the inherent key characteristics of the PARSEC benchmarks. Working sets and data usage patterns are explained and quantified in later sections. The pipeline model is a data-parallel model which also uses functional partitioning. PARSEC workloads were chosen to cover different application domains, parallel models and runtime behaviors.

• The workloads are diverse and were chosen from many different areas such as computer vision, media processing, computational finance, enterprise servers and animation physics.

• Each of the applications chosen represents the state-of-the-art technique in its area.

• PARSEC supports computer architecture research in a number of ways. The most important one is that for each workload six input sets with different properties are defined. Three of these inputs are suitable for microarchitectural simulation. We explain the different types of input sets in more detail in Section 3.1.

3.1 Input Sets
PARSEC defines six input sets for each benchmark:

test A very small input set to test the basic functionality of the program.

simdev A very small input set which guarantees basic program behavior similar to the real behavior, intended for simulator test and development.

simsmall, simmedium and simlarge Input sets of different sizes suitable for microarchitectural studies with simulators.

native A large input set intended for native execution.

test and simdev are merely intended for testing and development and should not be used for scientific studies. The three simulator inputs for studies vary in size, but the general trend is that larger input sets contain bigger working sets and more parallelism. Finally, the native input set is intended for performance measurements on real machines and exceeds the computational demands which are generally considered feasible for simulation by orders of magnitude. From a scientific point of view, the native input set is the most interesting one because it resembles real program inputs most closely. The remaining input sets can be considered coarser approximations which sacrifice accuracy for tractability. Table 2 shows a breakdown of instructions and synchronization primitives of the simlarge input set which we used for the characterization study.

3.2 Workloads
The following workloads are part of the PARSEC suite:

3.2.1 blackscholes
The blackscholes application is an Intel RMS benchmark. It calculates the prices for a portfolio of European options analytically with the Black-Scholes partial differential equation (PDE) [10]:

$$\frac{\partial V}{\partial t} + \frac{1}{2}\sigma^2 S^2 \frac{\partial^2 V}{\partial S^2} + rS\frac{\partial V}{\partial S} - rV = 0$$

where V is an option on the underlying S with volatility σ at time t if the constant interest rate is r. There is no closed-form expression for the cumulative normal distribution that the solution of this equation requires, and as such it must be computed numerically [21]. The blackscholes benchmark was chosen to represent the wide field of analytic PDE solvers in general and their application in computational finance in particular. The program is limited by the amount of floating-point calculations a processor can perform.

blackscholes stores the portfolio with numOptions derivatives in array OptionData. The program includes the file optionData.txt which provides the initialization and control reference values for 1,000 options which are stored in array data_init. The initialization data is replicated if necessary to obtain enough derivatives for the benchmark.

The program divides the portfolio into a number of work units equal to the number of threads and processes them concurrently. Each thread iterates through all derivatives in its contingent and calls the function BlkSchlsEqEuroNoDiv for each of them to compute its price. If error checking was


| Program | Problem Size | Total Instr. (B) | FLOPS (B) | Reads (B) | Writes (B) | Locks | Barriers | Conditions |
|---|---|---|---|---|---|---|---|---|
| blackscholes | 65,536 options | 2.67 | 1.14 | 0.68 | 0.19 | 0 | 8 | 0 |
| bodytrack | 4 frames, 4,000 particles | 14.03 | 4.22 | 3.63 | 0.95 | 114,621 | 619 | 2,042 |
| canneal | 400,000 elements | 7.33 | 0.48 | 1.94 | 0.89 | 34 | 0 | 0 |
| dedup | 184 MB data | 37.1 | 0 | 11.71 | 3.13 | 158,979 | 0 | 1,619 |
| facesim | 1 frame, 372,126 tetrahedra | 29.90 | 9.10 | 10.05 | 4.29 | 14,541 | 0 | 3,137 |
| ferret | 256 queries, 34,973 images | 23.97 | 4.51 | 7.49 | 1.18 | 345,778 | 0 | 1,255 |
| fluidanimate | 5 frames, 300,000 particles | 14.06 | 2.49 | 4.80 | 1.15 | 17,771,909 | 0 | 0 |
| freqmine | 990,000 transactions | 33.45 | 0.00 | 11.31 | 5.24 | 990,025 | 0 | 0 |
| streamcluster | 16,384 points per block, 1 block | 22.12 | 11.6 | 9.42 | 0.06 | 191 | 129,600 | 127 |
| swaptions | 64 swaptions, 20,000 simulations | 14.11 | 2.62 | 5.08 | 1.16 | 23 | 0 | 0 |
| vips | 1 image, 2662 × 5500 pixels | 31.21 | 4.79 | 6.71 | 1.63 | 33,586 | 0 | 6,361 |
| x264 | 128 frames, 640 × 360 pixels | 32.43 | 8.76 | 9.01 | 3.11 | 16,767 | 0 | 1,056 |

Table 2: Breakdown of instructions and synchronization primitives for input set simlarge on a system with 8 cores. Instruction counts are given in billions. All numbers are totals across all threads. Numbers for synchronization primitives also include primitives in system libraries. 'Locks' and 'Barriers' are all lock-based resp. barrier-based synchronizations; 'Conditions' are all waits on condition variables.

enabled at compile time it also compares the result with the reference price.
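The per-option computation described above can be sketched as follows. This is a minimal illustration, not the benchmark's BlkSchlsEqEuroNoDiv code: it evaluates the price of a European call from the Black-Scholes solution, approximating the cumulative normal distribution with the C library's erfc.

```cpp
#include <cmath>

// Standard normal CDF via the complementary error function.
static double norm_cdf(double x) {
    return 0.5 * std::erfc(-x / std::sqrt(2.0));
}

// Price of a European call without dividends (hypothetical stand-in
// for the benchmark's per-option computation).
// S: spot price, K: strike, r: risk-free rate, sigma: volatility,
// T: time to expiry in years.
double black_scholes_call(double S, double K, double r,
                          double sigma, double T) {
    double d1 = (std::log(S / K) + (r + 0.5 * sigma * sigma) * T)
                / (sigma * std::sqrt(T));
    double d2 = d1 - sigma * std::sqrt(T);
    return S * norm_cdf(d1) - K * std::exp(-r * T) * norm_cdf(d2);
}
```

For S = K = 100, r = 0.05, σ = 0.2 and T = 1 year this yields a call price of about 10.45. Since each option is priced independently, partitioning the portfolio into per-thread work units requires no synchronization, which is why the benchmark needs only a handful of barriers (see Table 2).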

The input sets for blackscholes are sized as follows:

• test: 1 option

• simdev: 16 options

• simsmall: 4,096 options

• simmedium: 16,384 options

• simlarge: 65,536 options

• native: 10,000,000 options

3.2.2 bodytrack
The bodytrack computer vision application is an Intel RMS workload which tracks the 3D pose of a marker-less human body with multiple cameras through an image sequence [13, 6]. bodytrack employs an annealed particle filter to track the pose using edges and the foreground silhouette as image features, based on a 10-segment 3D kinematic tree body model. These two image features were chosen because they exhibit a high degree of invariance under a wide range of conditions and because they are easy to extract. An annealed particle filter was employed in order to be able to search high-dimensional configuration spaces without having to rely on any assumptions about the tracked body such as the existence of markers or constrained movements. This benchmark was included due to the increasing significance of computer vision algorithms in areas such as video surveillance, character animation and computer interfaces.

For every frame set $Z_t$ of the input videos at time step t, the bodytrack benchmark executes the following steps:

1. The image features of observation $Z_t$ are extracted. The features will be used to compute the likelihood of a given pose in the annealed particle filter.

2. Every time step t the filter makes an annealing run through all M annealing layers, starting with layer m = M.

3. Each layer m uses a set of N unweighted particles which are the result of the previous filter update step to begin with:
$$S_{t,m} = \{s^{(1)}_{t,m}, \ldots, s^{(N)}_{t,m}\}$$
Each particle $s^{(i)}_{t,m}$ is an instance of the multi-variate model configuration X which encodes the location and state of the tracked body.

4. Each particle $s^{(i)}_{t,m}$ is then assigned a weight $\pi^{(i)}_{t,m}$ by using the weighting function $\omega(Z_t, X)$ corresponding to the likelihood of X given the image features in $Z_t$, scaled by an annealing level factor:
$$\pi^{(i)}_{t,m} \propto \omega(Z_t, s^{(i)}_{t,m})$$
The weights are normalized so that $\sum_{i=1}^{N} \pi^{(i)}_{t,m} = 1$. The result is the weighted particle set
$$S^{\pi}_{t,m} = \{(s^{(1)}_{t,m}, \pi^{(1)}_{t,m}), \ldots, (s^{(N)}_{t,m}, \pi^{(N)}_{t,m})\}$$

5. N particles are randomly drawn from the set $S^{\pi}_{t,m}$ with a probability equal to their weight $\pi^{(i)}_{t,m}$ to obtain the temporary weighted particle set
$$\tilde{S}^{\pi}_{t,m} = \{(s^{(1)}_{t,m}, \pi^{(1)}_{t,m}), \ldots, (s^{(N)}_{t,m}, \pi^{(N)}_{t,m})\}$$
Each particle $s^{(i)}_{t,m}$ is then used to produce the particle
$$s^{(i)}_{t,m-1} = s^{(i)}_{t,m} + B_m$$
where $B_m$ is a multi-variate Gaussian random variable. The result is the particle set $S^{\pi}_{t,m-1}$ which is used to initialize layer m − 1.

6. The process is repeated until all layers have been processed and the final particle set $S^{\pi}_{t,0}$ has been computed.

7. $S^{\pi}_{t,0}$ is used to compute the estimated model configuration $\chi_t$ for time step t by calculating the weighted average of all configuration instances:
$$\chi_t = \sum_{i=1}^{N} s^{(i)}_{t,0}\,\pi^{(i)}_{t,0}$$

8. The set $S_{t+1,M}$ is then produced from $S^{\pi}_{t,0}$ using
$$s^{(i)}_{t+1,M} = s^{(i)}_{t,0} + B_0$$
In the subsequent time step t + 1 the set $S_{t+1,M}$ is used to initialize layer M.
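Steps 5 and 6 above amount to a standard resample-and-diffuse update, which can be sketched as follows. The sketch uses scalar particles for brevity; bodytrack's particles are full joint-angle vectors, and the function name and structure here are illustrative rather than taken from the benchmark.

```cpp
#include <random>
#include <vector>

// One resample-and-diffuse update of an annealing layer (steps 5-6),
// sketched for scalar particles. Illustrative only; not benchmark code.
std::vector<double> resample_and_diffuse(const std::vector<double>& particles,
                                         const std::vector<double>& weights,
                                         double sigma_m,  // spread of B_m
                                         std::mt19937& rng) {
    // Draw N particles with probability proportional to their weights.
    std::discrete_distribution<std::size_t> pick(weights.begin(),
                                                 weights.end());
    std::normal_distribution<double> noise(0.0, sigma_m);
    std::vector<double> next;
    next.reserve(particles.size());
    for (std::size_t i = 0; i < particles.size(); ++i)
        next.push_back(particles[pick(rng)] + noise(rng));  // s + B_m
    return next;
}
```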

The likelihood $\omega(Z_t, s^{(i)}_{t,m})$ which is used to determine the particle weights $\pi^{(i)}_{t,m}$ is computed by projecting the geometry of the human body model into the image observations $Z_t$ for each camera and determining the error based on the image features. The likelihood is a measure of the 3D body model alignment with the foreground and edges in the images. The body model consists of conic cylinders to represent 10 body parts (2 for each limb plus the torso and the head). Each cylinder is represented by a length and a radius for each end. The body parts are assembled into a kinematic tree based upon the joint angles. Each particle represents the set of joint angles plus a global translation. To evaluate the likelihood of a given particle, the geometry of the body model is first built in 3D space given the angles and translation. Next, each 3D body part is projected onto each of the 2D images as a quadrilateral. A likelihood value is then computed based on the two image features: the foreground map and the edge distance map. To compute the foreground term, samples are taken within the interior of each 2D body part projection and compared with the binary foreground map images. Samples that correspond to foreground increase the likelihood while samples that correspond to background are penalized. The edge map gives a measure of the distance from an edge in the image: values closer to an edge have a higher value. To compute the edge term, samples are taken along the axis-parallel edges of each 2D body part projection and the edge map values at each sample are summed together. In this way, samples that are closer to edges in the images increase the likelihood while samples farther from edges are penalized.

bodytrack has a persistent thread pool which is implemented in the class WorkPoolPthread. The main thread executes the program and sends a task to the thread pool with the method SignalCmd whenever it reaches a parallel kernel. It resumes execution of the program as soon as it receives the result from the worker threads. Possible tasks are encoded by the enumeration threadCommands in class WorkPoolPthread. The program has three parallel kernels:

Edge detection (Step 1) bodytrack employs a gradient-based edge detection mask to find edges. The result is compared against a threshold to eliminate spurious edges. Edge detection is implemented in the function GradientMagThreshold. The output of this kernel will be further refined before it is used to compute the particle weights.

Edge smoothing (Step 1) A separable Gaussian filter of size 7 × 7 pixels is used to smooth the edges in the function GaussianBlur. The result is remapped between 0 and 1 to produce a pixel map in which the value of each pixel is related to its distance from an edge. The kernel has two parallel phases, one to filter image rows and one to filter image columns.

Calculate particle weights (Step 4) This kernel evaluates the foreground silhouette and the image edges produced earlier to compute the weights for the particles. This kernel is executed once for every annealing layer during every time step, making it the computationally most intensive part of the body tracker.

The parallel kernels use tickets to distribute the work among threads and balance the load dynamically. The ticketing mechanism is implemented in the class TicketDispenser and behaves like a shared counter.
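A shared-counter ticket dispenser can be sketched in a few lines. This is an illustrative reduction of the idea, not the benchmark's TicketDispenser implementation: each thread atomically draws the index of the next work unit until none remain, so faster threads naturally pick up more work.

```cpp
#include <atomic>

// Illustrative shared-counter ticket dispenser (not bodytrack's code).
class TicketDispenser {
    std::atomic<int> next_{0};
    int end_;
public:
    explicit TicketDispenser(int num_units) : end_(num_units) {}
    // Returns the next work-unit index, or -1 when all units are taken.
    int getTicket() {
        int t = next_.fetch_add(1, std::memory_order_relaxed);
        return t < end_ ? t : -1;
    }
};
```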

The input sets for bodytrack are defined as follows:

• test: 4 cameras, 1 frame, 5 particles, 1 annealing layer

• simdev: 4 cameras, 1 frame, 100 particles, 3 annealing layers

• simsmall: 4 cameras, 1 frame, 1,000 particles, 5 annealing layers

• simmedium: 4 cameras, 2 frames, 2,000 particles, 5 annealing layers

• simlarge: 4 cameras, 4 frames, 4,000 particles, 5 annealing layers

• native: 4 cameras, 261 frames, 4,000 particles, 5 annealing layers

3.2.3 canneal
This kernel was developed by Princeton University. It uses cache-aware simulated annealing (SA) to minimize the routing cost of a chip design [7]. SA is a common method to approximate the global optimum in a large search space. canneal pseudo-randomly picks pairs of elements and tries


to swap them. To increase data reuse, the algorithm discards only one element during each iteration, which effectively reduces cache capacity misses. The SA method accepts swaps which increase the routing cost with a certain probability to make an escape from local optima possible. This probability continuously decreases during runtime to allow the design to converge. The program was included in the PARSEC program selection to represent engineering workloads, for the fine-grained parallelism with its lock-free synchronization techniques and due to its pseudo-random worst-case memory access pattern.
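The acceptance rule described here is the classic Metropolis criterion. A short sketch makes the temperature dependence concrete; the function name is hypothetical and this is not canneal's actual accept_move code:

```cpp
#include <cmath>
#include <random>

// Metropolis acceptance: cost-decreasing swaps are always taken,
// cost-increasing ones with probability exp(-delta/T), which shrinks
// toward zero as the temperature T falls. (Illustrative sketch.)
bool accept_swap(double delta_cost, double temperature, std::mt19937& rng) {
    if (delta_cost <= 0.0) return true;  // improvement: always accept
    std::uniform_real_distribution<double> u(0.0, 1.0);
    return u(rng) < std::exp(-delta_cost / temperature);
}
```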

canneal uses a very aggressive synchronization strategy that is based on data race recovery instead of avoidance. Pointers to the elements are dereferenced and swapped atomically, but no locks are held while a potential swap is evaluated. This can cause disadvantageous swaps if one of the relevant elements has been replaced by another thread during that time. This equals a higher effective probability of accepting swaps which increase the routing cost, and the SA method automatically recovers from it. The swap operation employs lock-free synchronization which is implemented with atomic instructions. An alternative implementation which relied on conventional locks turned out to be too inefficient due to excessive locking overhead. The synchronization routines with the atomic instructions are taken from the BSD kernel. Support for most new architectures can be added easily by copying the correct header file from the BSD kernel sources.

The annealing algorithm is implemented in the Run function of the annealer thread class. Each thread uses the function get_random_element to pseudo-randomly pick one new netlist element per iteration with a Mersenne Twister [31]. calculate_delta_routing_cost is called to compute the change of the total routing cost if the two elements are swapped. accept_move evaluates the change in cost and the current temperature and decides whether the change is to be committed. Finally, accepted swaps are executed by calling swap_locations.

canneal implements an AtomicPtr class which encapsulates a shared pointer to the location of a netlist element. The pointer is atomically accessed and modified with the Get and Set functions offered by the class. A special Swap member function executes an atomic swap of two encapsulated pointers. If an access is currently in progress, the functions spin until the operation can be completed. The implementation of Swap imposes a partial order to avoid deadlocks by processing the pointer at the lower memory location first.
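The address-ordered swap described above can be sketched with std::atomic. This is an illustration of the technique, not canneal's BSD-kernel-based implementation: nullptr serves as the "access in progress" marker that concurrent operations spin on, and claiming the two slots in ascending address order rules out deadlock between concurrent swaps.

```cpp
#include <atomic>
#include <functional>

// Illustrative address-ordered swap of two shared pointers
// (sketch of the idea behind AtomicPtr::Swap, not canneal's code).
template <typename T>
void atomic_swap(std::atomic<T*>& a, std::atomic<T*>& b) {
    if (&a == &b) return;
    std::less<std::atomic<T*>*> lower;
    std::atomic<T*>* first  = lower(&a, &b) ? &a : &b;  // lower address first
    std::atomic<T*>* second = (first == &a) ? &b : &a;
    T* x;
    do { x = first->exchange(nullptr); } while (x == nullptr);   // spin if busy
    T* y;
    do { y = second->exchange(nullptr); } while (y == nullptr);  // spin if busy
    first->store(y);
    second->store(x);
}
```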

We provide the following input sets for canneal:

• test: 5 swaps per temperature step, 100° start temperature, 10 netlist elements

• simdev: 100 swaps per temperature step, 300° start temperature, 100 netlist elements

• simsmall: 10,000 swaps per temperature step, 2,000° start temperature, 100,000 netlist elements

• simmedium: 15,000 swaps per temperature step, 2,000° start temperature, 200,000 netlist elements

• simlarge: 15,000 swaps per temperature step, 2,000° start temperature, 400,000 netlist elements

• native: 15,000 swaps per temperature step, 2,000° start temperature, 2,500,000 netlist elements

3.2.4 dedup

The dedup kernel was developed by Princeton University. It compresses a data stream with a combination of global compression and local compression in order to achieve high compression ratios. Such a compression is called 'deduplication'. The kernel was included because deduplication has become a mainstream method to compress storage footprints for new-generation backup storage systems[36] and to compress communication data for new-generation bandwidth-optimized networking appliances[39].

The kernel uses a pipelined programming model to parallelize the compression and to mimic real-world implementations. There are five pipeline stages, the intermediate three of which are parallel. In the first stage, dedup reads the input stream and breaks it up into coarse-grained chunks to get independent work units for the threads. The second stage anchors each chunk into fine-grained small segments with rolling fingerprinting[29, 11]. The third pipeline stage computes a hash value for each data segment. The fourth stage compresses each data segment with the Ziv-Lempel algorithm and builds a global hash table that maps hash values to data. The final stage assembles the deduplicated output stream consisting of hash values and compressed data segments.

Anchoring is a method which identifies brief sequences in a data stream that are identical with sufficiently high probability. It uses fast Rabin-Karp fingerprints[24] to detect identity. The data is then broken up into two separate blocks at the determined location. This method ensures that fragmenting a data stream is unlikely to obscure duplicate sequences, since duplicates are identified on a block basis.
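A minimal sketch of such content-defined chunking with a rolling hash, using the low-12-bits-zero anchor criterion described for dedup's fine-grained fragmentation stage (the window length and hash parameters are illustrative assumptions, not dedup's actual Rabin-Karp values):

```python
WINDOW = 48           # sliding window length (illustrative choice)
BASE = 257
MOD = 1 << 32
MASK = (1 << 12) - 1  # anchor when the low 12 bits are zero

def chunk(data):
    """Content-defined chunking: split after any byte where the rolling
    hash of the last WINDOW bytes has its low 12 bits equal to zero.
    Identical content produces identical boundaries, so duplicate
    sequences end up in identical blocks."""
    chunks, start, h = [], 0, 0
    power = pow(BASE, WINDOW - 1, MOD)  # weight of the outgoing byte
    for i, byte in enumerate(data):
        if i - start >= WINDOW:
            h = (h - data[i - WINDOW] * power) % MOD  # drop oldest byte
        h = (h * BASE + byte) % MOD                    # add newest byte
        if i - start + 1 >= WINDOW and (h & MASK) == 0:
            chunks.append(data[start:i + 1])
            start = i + 1
            h = 0
    if start < len(data):
        chunks.append(data[start:])
    return chunks
```

Because a boundary fires with probability roughly 1/4096 per byte, the expected block size is on the order of the figure cited for dedup.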

dedup uses a separate thread pool for each parallel pipeline stage. Each thread pool should have at least a number of threads equal to the number of available cores, so that the system can fully work on any single stage should the need arise. The operating system scheduler is responsible for a thread schedule which will maximize the overall throughput of the pipeline. In order to avoid lock contention, the number of queues is scaled with the number of threads, with a small group of threads sharing an input and output queue at a time.

dedup employs the following five kernels, one for each pipeline stage:

Coarse-grained fragmentation This serial kernel takes the input stream and breaks it up into work units which can be processed independently from each other by the parallel pipeline stages of dedup. It is implemented in function DataProcess. First, the kernel reads the input file from disk. It then determines the locations where the data is to be split up by jumping a fixed length in the buffer for each chunk. The resulting data blocks are enqueued in order to be further refined by the subsequent stage.

Fine-grained fragmentation This parallel kernel uses Rabin-Karp fingerprints to break a coarse-grained data chunk up into fine-grained fragments. It scans each input block, starting from the beginning. An anchor is found if the lowest 12 bits of the Rabin-Karp hash-sum are 0. The data is then split up at the location of the anchor. On average, this produces blocks of size 2¹²/8 = 512 bytes. The fine-grained data blocks are sent to the subsequent pipeline stage to compute their checksums. This kernel is implemented in function FindAllAnchor.

Hash computation To uniquely identify a fine-grained data block, this parallel kernel computes the SHA1 checksum of each chunk and checks for duplicate blocks with the use of a global database. It is implemented in function ChunkProcess. A hash table which is indexed with the SHA1 sum serves as the database. Each bucket of the hash table is associated with an independent lock in order to synchronize accesses. The large number of buckets and therefore locks makes the probability of lock contention very low in practice.

Once the SHA1 sum of a data block is available, the kernel checks whether a corresponding entry already exists in the database. If no entry is found, the data block is added to the hash table and sent to the compression stage. If an entry already exists, the block is classified as a duplicate. The compression stage is then omitted and the block is sent directly to the pipeline stage which assembles the output stream.

Compression This kernel compresses data blocks in parallel. It is implemented in function Compress. Once the compressed image of a data block is available, it is added to the database and the corresponding data block is sent to the next pipeline stage. Every data block is compressed only once because the previous stage does not send duplicates to the compression stage.

Assemble output stream This serial kernel reorders the data blocks and produces the compressed output stream. It is implemented in the SendBlock function. The stages which fragment the input stream into fine-grained data blocks add sequence numbers to allow a reconstruction of the original order. Because data fragmentation occurs in two different pipeline stages, two levels of sequence numbers have to be considered - one for each granularity level. SendBlock uses a search tree for the first level and a heap for the second level. The search tree allows rapid searches for the heap corresponding to the current first-level sequence number. For second-level sequence numbers only the minimum has to be found, and hence a heap is used.

Once the next data block in the sequence becomes available, it is removed from the reordering structures. If it has not been written to the output stream yet, its compressed image is emitted. Otherwise it is a duplicate and only its SHA1 signature is written as a placeholder. The kernel uses the global hash table to keep track of the output status of each data block.
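The interplay of the hash database and the reordering logic can be sketched as follows. This simplified serial version (the assemble function is a hypothetical stand-in) uses a heap keyed on a single level of sequence numbers, whereas SendBlock manages two granularity levels with a search tree and per-entry heaps:

```python
import hashlib
import heapq
import zlib

def assemble(blocks):
    """blocks: iterable of (sequence_number, data), possibly out of order.
    Emits, in sequence order, ('data', compressed) for first occurrences
    and ('dup', sha1_digest) placeholders for duplicates."""
    db = {}                 # SHA1 digest -> block already emitted
    pending, expected = [], 0
    out = []
    for seq, data in blocks:
        heapq.heappush(pending, (seq, data))
        # drain the heap while the next block in sequence is available
        while pending and pending[0][0] == expected:
            _, d = heapq.heappop(pending)
            digest = hashlib.sha1(d).digest()
            if digest in db:
                out.append(("dup", digest))          # placeholder only
            else:
                db[digest] = True
                out.append(("data", zlib.compress(d)))
            expected += 1
    return out
```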

Each input for dedup is an archive which contains a selection of files. The archives have the following sizes:

• test: 10 KB

• simdev: 1.1 MB

• simsmall: 10 MB

• simmedium: 31 MB

• simlarge: 184 MB

• native: 672 MB

3.2.5 facesim

This Intel RMS application was originally developed by Stanford University. It takes a model of a human face and a time sequence of muscle activations and computes a visually realistic animation of the modeled face by simulating the underlying physics[38, 40]. Certain effects such as inertial movements would have only a small visible effect and are not simulated[20]. The workload was included in the benchmark suite because an increasing number of computer games and other forms of animation employ physical simulation to create more realistic virtual environments. Human faces in particular are observed with more attention by users than other details of a virtual world, making their realistic presentation a key element of animations.

The parallelization uses a static partitioning of the mesh. Data that spans nodes belonging to more than one partition is replicated. Every time step, the partitions process all elements that contain at least one node owned by the partition, but only results for nodes which are owned by the partition are written.

The iteration which computes the state of the face mesh at the end of each time step is implemented in function Advance_One_Time_Step_Quasistatic. facesim employs the fork-join model to process computationally intensive tasks in parallel. It uses the following three parallel kernels for its computations:

Update state This kernel uses the Newton-Raphson method to solve the nonlinear system of equations in order to find the steady state of the simulated mesh. This quasi-static scheme achieves speedups of one to two orders of magnitude over explicit schemes by ignoring inertial effects. It is not suitable for the simulation of less constrained phenomena such as ballistic motion, but it is sufficiently accurate to simulate effects such as flesh deformation, where the material is heavily influenced by contact, collision and self-collision and inertial effects have only a minor impact on the state.

In each Newton-Raphson iteration, the kernel reduces the nonlinear system of equations to a linear system which is guaranteed to be positive definite and symmetric. These two properties allow the use of a fast conjugate gradient solver later on. One iteration step is computed by function Update_Position_Based_State. The matrix of the linear system is sparse and can hence be stored in two one-dimensional arrays - dX_full and R_full. The matrix is the sum of the contribution of each tetrahedron of the face mesh.

Add forces This module computes the velocity-independent forces acting on the simulation mesh. After the matrix of the linear system with the position-based state has been computed by the previous kernel, the right-hand side of that system has to be calculated. The kernel does this by iterating over all tetrahedra of the mesh, reading the positions of the vertices and computing the force contribution to each of the four nodes.

Conjugate gradient This kernel uses the conjugate gradient algorithm to solve the linear equation system assembled by the previous two modules. The two arrays dX_full and R_full which store the sparse matrix are sequentially accessed, and matrix-vector multiplication is employed to solve the system.
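The conjugate gradient method this kernel relies on can be sketched as follows. This is a textbook dense implementation for a small symmetric positive definite system, not facesim's sparse, flat-array version:

```python
def conjugate_gradient(A, b, iters=50, tol=1e-10):
    """Solve A x = b for symmetric positive definite A. A is a dense
    list-of-lists here; facesim instead stores its sparse matrix in
    two one-dimensional arrays."""
    n = len(b)
    x = [0.0] * n
    r = list(b)                      # residual b - A x, with x = 0
    p = list(r)                      # search direction
    rs = sum(v * v for v in r)
    for _ in range(iters):
        Ap = [sum(A[i][j] * p[j] for j in range(n)) for i in range(n)]
        alpha = rs / sum(p[i] * Ap[i] for i in range(n))
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        rs_new = sum(v * v for v in r)
        if rs_new < tol:             # residual small enough: converged
            break
        p = [r[i] + (rs_new / rs) * p[i] for i in range(n)]
        rs = rs_new
    return x
```

Positive definiteness and symmetry are exactly the properties the Update state kernel guarantees, and they are what makes this iteration converge.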

The input sets of facesim all use the same face mesh. Scaling down the resolution of the mesh to create more tractable input sizes is impractical: a reduction of the number of elements in the model would result in under-resolution of the muscle action and cause problems for collision detection[20]. Our input sets for facesim are defined as follows:

• test: Print out help message.

• simdev: 80,598 particles, 372,126 tetrahedra, 1 frame

• simsmall: Same as simdev

• simmedium: Same as simdev

• simlarge: Same as simdev

• native: Same as simdev, but with 100 frames

3.2.6 ferret

This application is based on the Ferret toolkit, which is used for content-based similarity search of feature-rich data such as audio, images, video and 3D shapes[27]. It was developed by Princeton University. It was included in the benchmark because it represents emerging next-generation desktop and Internet search engines for non-text document data types. In the benchmark, we have configured the Ferret toolkit for image similarity search. Ferret is parallelized using the pipeline model with six stages. The first and the last stage are for input and output. The middle four stages are for query image segmentation, feature extraction, indexing of candidate sets with multi-probe Locality Sensitive Hashing (LSH)[28] and ranking. Each stage has its own thread pool, and the basic work unit of the pipeline is a query image.

Segmentation is the process of decomposing an image into separate areas which display different objects. The rationale behind this step is that in many cases only parts of an image are of interest, such as the foreground. Segmentation allows the subsequent stages to assign a higher weight to image parts which are considered relevant and seem to belong together. After segmentation, ferret extracts a feature vector from every segment. A feature vector is a multi-dimensional mathematical description of the segment contents. It encodes fundamental properties such as color, shape and area. Once the feature vectors are known, the indexing stage can query the image database to obtain a candidate set of images. The database is organized as a set of hash tables which are indexed with multi-probe LSH[28]. This method uses hash functions which map similar feature vectors to the same hash bucket with high probability. Because the number of hash buckets is very high, multi-probe LSH first derives a probing sequence which considers the success probabilities for finding a candidate image in a bucket. It then employs a step-wise approach which indexes buckets with a higher success probability first. After a candidate set of images has been obtained by the indexing stage, it is sent to the ranking stage, which computes a detailed similarity estimate and orders the images according to their calculated rank. The similarity estimate is derived by analyzing and weighing the pair-wise distances between the segments of the query image and the candidate images. The underlying metric employed is the Earth Mover's Distance (EMD)[37]. For two images X and Y, it is defined as

EMD(X, Y) = min Σ_i Σ_j f_ij d(X_i, Y_j)

where X_i and Y_j denote segments of X and Y, and f_ij is the extent to which X_i is matched to Y_j.
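For the special case of equally weighted segments and equal segment counts, minimizing over all flows f_ij reduces to a minimum-cost one-to-one matching, which a brute-force sketch can compute for tiny inputs (illustrative only; ferret solves the general weighted problem):

```python
from itertools import permutations

def emd_uniform(xs, ys, dist):
    """EMD for the special case of equally weighted segments with
    len(xs) == len(ys): the optimal flow is a one-to-one matching,
    found here by brute force over all assignments (tiny inputs only)."""
    n = len(xs)
    best = min(
        sum(dist(xs[i], ys[perm[i]]) for i in range(n))
        for perm in permutations(range(n))
    )
    return best / n  # each f_ij carries weight 1/n, total flow 1
```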

The first and the last pipeline stage of ferret are serial. The remaining four modules are parallel:

Image segmentation This kernel uses computer vision techniques to break an image up into non-overlapping segments. The pipeline stage is implemented in function t_seg, which calls image_segment for every image. This function uses statistical region merging (SRM)[33] to segment the image. This method organizes the pixels of an image in sets, starting with a fine-granular decomposition. It repeatedly merges them until the final segmentation has been reached.

Feature extraction This module computes a 14-dimensional feature vector for each image segment. The features extracted are the bounding box of the segment (5 dimensions) and its color moments (9 dimensions). A bounding box is the minimum axis-aligned rectangle which includes the segment. Color moments are a compact representation of the color distribution, conceptually similar to a histogram but with fewer dimensions. Each segment is assigned a weight which is proportional to the square root of its size. This stage is implemented in function t_extract. It calls image_extract_helper to compute the feature vectors for every image.

Indexing The indexing stage queries the image database to obtain no more than twice the number of images which are allowed to appear in the final ranking. This stage is implemented in function t_vec. ferret manages image data in tables which have type cass_table_t. Tables can be queried with function cass_table_query. The indexing stage uses this function to access the database in order to generate the candidate set of type cass_result_t for the current query image. Indexing employs LSH for the probing, which is implemented in function LSH_query.

Ranking This module performs a detailed similarity computation. From the candidate set obtained by the indexing stage, it chooses the final set of images which are most similar to the query image and ranks them. The ranking stage is implemented in function t_rank. It employs cass_table_query to analyze the candidate set and to compute the final ranking with EMD. The type of query that cass_table_query is to perform can be described with a structure of type cass_query_t.

The number of query images determines the amount of parallelism. The working set size is dominated by the size of the image database. The input sets for ferret are sized as follows:

• test: 1 image query, database with 1 image, find top 1 image

• simdev: 4 image queries, database with 100 images, find top 5 images

• simsmall: 16 image queries, database with 3,544 images, find top 10 images

• simmedium: 64 image queries, database with 13,787 images, find top 10 images

• simlarge: 256 image queries, database with 34,973 images, find top 10 images

• native: 3,500 image queries, database with 59,695 images, find top 50 images

3.2.7 fluidanimate

This Intel RMS application uses an extension of the Smoothed Particle Hydrodynamics (SPH) method to simulate an incompressible fluid for interactive animation purposes[32]. Its output can be visualized by detecting and rendering the surface of the fluid. The force density fields are derived directly from the Navier-Stokes equation. fluidanimate uses special-purpose kernels to increase stability and speed. fluidanimate was included in the PARSEC benchmark suite because of the increasing significance of physics simulations for computer games and other forms of real-time animation.

A simplified version of the Navier-Stokes equation for incompressible fluids[35] which formulates conservation of momentum is

ρ(∂v/∂t + v · ∇v) = −∇p + ρg + µ∇²v

where v is a velocity field, ρ a density field, p a pressure field, g an external force density field and µ the viscosity of the fluid. The SPH method uses particles to model the state of the fluid at discrete locations and interpolates intermediate values with radially symmetric smoothing kernels. An advantage of this method is the automatic conservation of mass due to a constant number of particles, but it alone does not guarantee certain physical principles such as symmetry of forces, which have to be enforced separately. The SPH algorithm derives a scalar quantity A at location r by a weighted sum over all particles:

A_S(r) = Σ_j (m_j A_j / ρ_j) W(r − r_j, h).

In the equation, j iterates over all particles, m_j is the mass of particle j, r_j its position, ρ_j the density at its location and A_j the respective field quantity. W(r − r_j, h) is the smoothing kernel used for the interpolation, with core radius h. Smoothing kernels are employed in order to make the SPH method stable and accurate. Because each particle i represents a volume with constant mass m_i, the density ρ_i appears in the equation and has to be recomputed every time step. The density at a location r can be calculated by substituting A with ρ in the previous equation:

ρ_S(r) = Σ_j m_j W(r − r_j, h).

Applying the SPH interpolation equation to the pressure term −∇p and the viscosity term µ∇²v of the Navier-Stokes equation yields the equations for the pressure and viscosity forces, but in order to solve the force symmetry problems of the SPH method, fluidanimate employs slightly modified formulas:

f_i^pressure = −Σ_j m_j (p_i + p_j)/(2ρ_j) ∇W(r_i − r_j, h)

f_i^viscosity = µ Σ_j m_j (v_i − v_j)/ρ_j ∇²W(r_i − r_j, h)

Stability, accuracy and speed of fluidanimate are highly dependent on its smoothing kernels. In all cases but the pressure and viscosity computations, the program uses the following kernel:

W_poly6(r, h) = 315/(64πh⁹) · (h² − r²)³ for 0 ≤ r ≤ h, and 0 otherwise

One feature of this kernel is that the distance r only appears squared, so no square roots have to be computed to evaluate it. For pressure computations, fluidanimate uses Desbrun's spiky kernel W_spiky[12], and W_viscosity for viscosity forces:


W_spiky(r, h) = 15/(πh⁶) · (h − r)³ for 0 ≤ r ≤ h, and 0 otherwise

W_viscosity(r, h) = 15/(2πh³) · (−r³/(2h³) + r²/h² + h/(2r) − 1) for 0 ≤ r ≤ h, and 0 otherwise
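Putting the density equation and the W_poly6 kernel together, a direct all-pairs density estimate can be sketched in a few lines. fluidanimate avoids the quadratic loop with its spatial grid; this version is illustrative only:

```python
import math

def w_poly6(r2, h):
    """W_poly6 evaluated on the squared distance r2 = |r|^2. Because the
    kernel only needs r squared, no square root is taken."""
    if r2 > h * h:
        return 0.0
    return 315.0 / (64.0 * math.pi * h ** 9) * (h * h - r2) ** 3

def density(i, positions, masses, h):
    """rho_S(r_i) = sum_j m_j W(r_i - r_j, h), summed over all particles
    (including j = i, whose contribution is m_i W(0, h))."""
    xi, yi, zi = positions[i]
    rho = 0.0
    for (x, y, z), m in zip(positions, masses):
        r2 = (xi - x) ** 2 + (yi - y) ** 2 + (zi - z) ** 2
        rho += m * w_poly6(r2, h)
    return rho
```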

The scene geometry employed by fluidanimate is a box in which the fluid resides. All collisions are handled by adding forces in order to change the direction of movement of the involved particles, instead of modifying the velocity directly. The workload uses Verlet integration[42] to update the positions of the particles. This scheme does not store the velocity of the particles explicitly, but rather their previous location in addition to the current position. The current velocity can thus be deduced from the distance traveled since the last time step. The force and mass are then used to compute the acceleration and subsequently the new velocity. This scheme is more robust because the velocity is given implicitly.
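A single position-Verlet update as described above can be sketched as (one dimension, hypothetical helper name):

```python
def verlet_step(pos, prev_pos, accel, dt):
    """Position Verlet: x_new = 2x - x_prev + a*dt^2. The velocity is
    implicit in (pos - prev_pos), so only two positions are stored."""
    new_pos = 2.0 * pos - prev_pos + accel * dt * dt
    return new_pos, pos  # new current position, new previous position
```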

Every time step, fluidanimate executes five kernels, the first two of which were further broken up into several smaller steps:

Rebuild spatial index Because the smoothing kernels W(r − r_j, h) have finite support h, particles can only interact with each other up to the distance h. The program uses a spatial indexing structure in order to exploit proximity information and limit the number of particles which have to be evaluated. Functions ClearParticles and RebuildGrid build this acceleration structure, which is used by the subsequent steps.

Compute densities This kernel estimates the fluid density at the position of each particle by analyzing how closely particles are packed in its neighborhood. In a region in which particles are packed together more closely, the density will be higher. This kernel has three phases, which are implemented in the functions InitDensitiesAndForces, ComputeDensities and ComputeDensities2.

Compute forces Once the densities are known, they can be used to compute the forces. This step happens in function ComputeForces. The kernel evaluates pressure, viscosity and also gravity as the only external influence. Collisions between particles are also handled implicitly during this step.

Handle collisions with scene geometry The next kernel updates the forces in order to handle collisions of particles with the scene geometry. This step is implemented in function ProcessCollisions.

Update positions of particles Finally, the forces can be used to calculate the acceleration of each particle and to update its position. fluidanimate uses a Verlet integrator[42] for these computations, which is implemented in function AdvanceParticles.

The input sets for fluidanimate are sized as follows:

• test: 5,000 particles, 1 frame

• simdev: 15,000 particles, 3 frames

• simsmall: 35,000 particles, 5 frames

• simmedium: 100,000 particles, 5 frames

• simlarge: 300,000 particles, 5 frames

• native: 500,000 particles, 500 frames

3.2.8 freqmine

The freqmine application employs an array-based version of the FP-growth (Frequent Pattern-growth) method[15] for Frequent Itemset Mining (FIMI). It is an Intel RMS benchmark which was originally developed by Concordia University. FIMI is the basis of Association Rule Mining (ARM), a very common data mining problem which is relevant for areas such as protein sequences, market data or log analysis. The serial program this benchmark is based on won the FIMI'03 best implementation award for its efficiency. freqmine was included in the PARSEC benchmark suite because of the increasing demand for data mining techniques, driven by the rapid growth of the volume of stored information.

FP-growth stores all relevant frequency information of the transaction database in a compact data structure called an FP-tree (Frequent Pattern-tree)[16]. An FP-tree is composed of three parts. First, a prefix tree encodes the transaction data such that each branch represents a frequent itemset. The nodes along the branches are stored in decreasing order of frequency of the corresponding item. The prefix tree is a more compact representation of the transaction database because overlapping itemsets share prefixes of the corresponding branches. The second component of the FP-tree is a header table which stores the number of occurrences of each item in decreasing order of frequency. Each entry is also associated with a pointer to a node of the FP-tree. All nodes which are associated with the same item are linked in a list. The list can be traversed by looking up the corresponding item in the header table and following the links to the end. Each node furthermore contains a counter that encodes how often the itemset represented by the path from the root to the current node occurs in the transaction database. The third component of the FP-tree is a lookup table which stores the frequencies of all 2-itemsets. A row in the lookup table gives all occurrences of items in itemsets which end with the associated item. This information can be used during the mining phase to omit certain FP-tree scans and is the major improvement of the implemented algorithm. The lookup table is especially effective if the dataset is sparse, which is usually the case. The FP-trees are then very big due to the fact that only few prefixes are shared. In that case tree traversals are more expensive, and the benefit from being able to omit them is greater. The initial FP-tree can be constructed with only two scans of the original database: the first one constructs the header table and the second one computes the remaining parts of the FP-tree.
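The two database scans and the prefix-sharing insertion can be sketched as follows. This is a simplified FP-tree without the node links and the 2-itemset lookup table; all names are illustrative:

```python
from collections import Counter

class Node:
    def __init__(self, item):
        self.item, self.count, self.children = item, 0, {}

def build_fp_tree(transactions, min_support):
    """First scan: item frequencies (the header table). Second scan:
    insert each transaction with its frequent items sorted by descending
    frequency, so overlapping itemsets share prefixes of branches."""
    freq = Counter(i for t in transactions for i in t)
    header = {i: c for i, c in freq.items() if c >= min_support}
    order = sorted(header, key=lambda i: (-header[i], i))
    rank = {i: k for k, i in enumerate(order)}
    root = Node(None)
    for t in transactions:
        items = sorted((i for i in t if i in rank), key=rank.__getitem__)
        node = root
        for i in items:
            node = node.children.setdefault(i, Node(i))
            node.count += 1  # path count: occurrences of this prefix
    return root, header
```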

In order to mine the data for frequent itemsets, the FP-growth method traverses the FP-tree data structure and recursively constructs new FP-trees until the complete set of frequent itemsets is generated. To construct a new FP-tree T_{X∪{i}} for an item i in the header of an existing FP-tree T_X, the algorithm first obtains a new pattern base from the lookup table. The base is used to initialize the header of the new tree T_{X∪{i}}. Starting from item i in the header table of the existing FP-tree T_X, the algorithm then traverses the associated linked list of all item occurrences. The patterns associated with the visited branches are then inserted into the new FP-tree T_{X∪{i}}. The resulting FP-tree is less bushy because it was constructed from fewer itemsets. The recursion terminates when an FP-tree has been built which has only one path. The properties of the algorithm guarantee that this path is a frequent itemset.

freqmine has been parallelized with OpenMP. It employs three parallel kernels:

Build FP-tree header This kernel scans the transaction database and counts the number of occurrences of each item. It performs the first of the two database scans necessary to construct the FP-tree. The result of this operation is the header table for the FP-tree, which contains the item frequency information. This kernel has one parallelized loop and is implemented in function scan1_DB.

Construct prefix tree The next kernel builds the initial tree structure of the FP-tree. It performs the second and final scan of the transaction database necessary to build the data structures which will be used for the actual mining operation. The kernel has four parallelized loops. It is implemented in function scan2_DB, which contains two of them. The remaining two loops are in its helper function database_tiling.

Mine data The last kernel uses the previously computed data structures and mines them to recursively obtain the frequent itemset information. It is an improved version of the conventional FP-growth method[16]. This module has similarities with the previous two kernels which construct the initial FP-tree, because it builds a new FP-tree for every recursion step. The module is implemented in function FP_growth_first. It first derives the initial lookup table from the current FP-tree by calling first_transform_FPTree_into_FPArray. This call executes the first of two parallelized loops. After that, the second parallelized loop is executed, in which the recursive function FP_growth is called. It is the equivalent of FP_growth_first. Each thread calls FP_growth independently, so that a number of recursions up to the number of threads can be active.

The input sets for freqmine are defined as follows:

• test: Database with 3 synthetic transactions, minimum support 1.

• simdev: Database with 1,000 synthetic transactions, minimum support 3.

• simsmall: Database with 250,000 anonymized click streams from a Hungarian online news portal, minimum support 220.

• simmedium: Same as simsmall but with 500,000 click streams, minimum support 410.

• simlarge: Same as simsmall but with 990,000 click streams, minimum support 790.

• native: Database composed of a spidered collection of 250,000 web HTML documents[26], minimum support 11,000.

3.2.9 streamcluster

This RMS kernel was developed by Princeton University and solves the online clustering problem[34]: for a stream of input points, it finds a predetermined number of medians so that each point is assigned to its nearest center. The quality of the clustering is measured by the sum of squared distances (SSQ) metric. Stream clustering is a common operation in which large amounts of continuously produced data have to be organized under real-time conditions, for example in network intrusion detection, pattern recognition and data mining. The program spends most of its time evaluating the gain of opening a new center. This operation uses a parallelization scheme which employs static partitioning of the data points. The program is memory-bound for low-dimensional data and becomes increasingly computationally intensive as the dimensionality increases. Due to its online character, the working set size of the algorithm can be chosen independently from the input data. streamcluster was included in the PARSEC benchmark suite because of the importance of data mining algorithms and the prevalence of problems with streaming characteristics.

The parallel gain computation is implemented in function pgain. Given a preliminary solution, the function computes how much cost can be saved by opening a new center. For every new point, it weighs the cost of making it a new center and reassigning some of the existing points to it against the savings gained by minimizing the distance

d(x, y) = |x − y|²

between two points x and y for all points. The distance computation is implemented in function dist. If the heuristic determines that the change would be advantageous, the results are committed.
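The core of this gain computation can be sketched as follows (the gain function here is a hypothetical simplification; pgain additionally handles reassignment bookkeeping and parallel partial sums over the static partitions):

```python
def dist(x, y):
    """Squared Euclidean distance d(x, y) = |x - y|^2 (no square root)."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def gain(candidate, points, centers, facility_cost):
    """Cost saved by opening `candidate` as a new center: every point
    that is closer to the candidate than to its current nearest center
    saves the difference; opening the center costs `facility_cost`.
    A positive result means the change should be committed."""
    saved = 0.0
    for p in points:
        current = min(dist(p, c) for c in centers)
        better = dist(p, candidate)
        if better < current:
            saved += current - better
    return saved - facility_cost
```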

The amount of parallelism and the working set size of a problem are dominated by the block size. The input sets for streamcluster are defined as follows:

• test: 10 input points, block size 10 points, 1 point dimension, 2–5 centers, up to 5 intermediate centers allowed

• simdev: 16 input points, block size 16 points, 3 point dimensions, 3–10 centers, up to 10 intermediate centers allowed

• simsmall: 4,096 input points, block size 4,096 points, 32 point dimensions, 10–20 centers, up to 1,000 intermediate centers allowed


• simmedium: 8,192 input points, block size 8,192 points, 64 point dimensions, 10–20 centers, up to 1,000 intermediate centers allowed

• simlarge: 16,384 input points, block size 16,384 points, 128 point dimensions, 10–20 centers, up to 1,000 intermediate centers allowed

• native: 1,000,000 input points, block size 200,000 points, 128 point dimensions, 10–20 centers, up to 5,000 intermediate centers allowed

3.2.10 swaptions

The swaptions application is an Intel RMS workload which uses the Heath-Jarrow-Morton (HJM) framework to price a portfolio of swaptions. The HJM framework describes how interest rates evolve, for risk management and asset liability management[17], for a class of models. Its central insight is that there is an explicit relationship between the drift and volatility parameters of the forward-rate dynamics in a no-arbitrage market. Because HJM models are non-Markovian, the analytic approach of solving the PDE to price a derivative cannot be used. swaptions therefore employs Monte Carlo (MC) simulation to compute the prices. The workload was included in the benchmark suite because of the significance of PDEs and the wide use of Monte Carlo simulation.

The program stores the portfolio in the swaptions array. Each entry corresponds to one derivative. Swaptions partitions the array into a number of blocks equal to the number of threads and assigns one block to every thread. Each thread iterates through all swaptions in the work unit it was assigned and calls the function HJM_Swaption_Blocking for every entry in order to compute the price. This function invokes HJM_SimPath_Forward_Blocking to generate a random HJM path for each MC run. Based on the generated path the value of the swaption is computed.
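The block partitioning and per-swaption Monte Carlo loop described above can be sketched as follows. This is an illustrative reconstruction, not the actual PARSEC code: the payoff model inside the inner loop is a hypothetical stand-in for the HJM path simulation, and the "threads" are iterated sequentially to show only the data decomposition.

```python
import random

def price_swaptions(swaptions, num_threads, num_trials, seed=0):
    """Illustrative sketch of swaptions-style work partitioning.

    The portfolio array is split into one contiguous block per thread;
    each thread (simulated sequentially here) prices every swaption in
    its block by averaging num_trials Monte Carlo samples. The payoff
    model below is a hypothetical stand-in, not the HJM simulation."""
    rng = random.Random(seed)
    block = (len(swaptions) + num_threads - 1) // num_threads
    prices = [0.0] * len(swaptions)
    for tid in range(num_threads):
        lo, hi = tid * block, min((tid + 1) * block, len(swaptions))
        for i in range(lo, hi):            # this thread owns swaptions[lo:hi]
            strike = swaptions[i]
            total = 0.0
            for _ in range(num_trials):    # one random path per MC run
                rate = rng.gauss(0.05, 0.01)
                total += max(rate - strike, 0.0)
            prices[i] = total / num_trials
    return prices
```

Because each thread writes only to its own block of the prices array, no synchronization is needed during the pricing phase, which is why swaptions scales so cleanly.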

The following input sets are provided for swaptions:

• test: 1 swaption, 5 simulations

• simdev: 3 swaptions, 50 simulations

• simsmall: 16 swaptions, 5,000 simulations

• simmedium: 32 swaptions, 10,000 simulations

• simlarge: 64 swaptions, 20,000 simulations

• native: 128 swaptions, 1,000,000 simulations

3.2.11 vips

This application is based on the VASARI Image Processing System (VIPS)[30] which was originally developed through several projects funded by European Union (EU) grants. The benchmark version is derived from a print on demand service that is offered at the National Gallery of London, which is also the current maintainer of the system. The benchmark includes fundamental image operations such as an affine transformation and a convolution. It was chosen because image transformations are a common task on desktop computers and for the ability of the VASARI system to construct multi-threaded image processing pipelines transparently on the fly. Future libraries might use concepts such as the ones employed by VIPS to make multi-threaded functionality available to the user.

The image transformation pipeline of the vips benchmark has 18 stages. It is implemented in the VIPS operation im_benchmark. The stages can be grouped into the following kernels:

Crop  The first step of the pipeline is to remove 100 pixels from all edges with VIPS operation im_extract_area.

Shrink  Next, vips shrinks the image by 10%. This affine transformation is implemented as the matrix operation

f(\vec{x}) = \begin{pmatrix} 0.9 & 0 \\ 0 & 0.9 \end{pmatrix} \vec{x} + \begin{pmatrix} 0 \\ 0 \end{pmatrix}

in VIPS operation im_affine. The transformation uses bilinear interpolation to compute the output values.

Adjust white point and shadows  To improve the perceived visual quality of the image under the expected target conditions, vips brightens the image, adjusts the white point and pulls the shadows down. These operations require several linear transformations and a matrix multiplication, which are implemented in im_lintra, im_lintra_vec and im_recomb.

Sharpen  The last step slightly exaggerates the edges of the output image in order to compensate for the blurring caused by printing and to give the image a better overall appearance. This convolution employs a Gaussian blur filter with mask radius 11 and a subtraction in order to isolate the high-frequency signal component of the image. The intermediate result is transformed via a look-up table shaped as

f(x) = \begin{cases} 0.5x & |x| \le 2.5 \\ 1.5x + 2.5 & x < -2.5 \\ 1.5x - 2.5 & x > 2.5 \end{cases}

and added back to the original image to obtain the sharpened image. Sharpening is implemented in VIPS operation im_sharpen.

The VASARI Image Processing System fuses all image operations to construct an image transformation pipeline that can operate on subsets of an image. VIPS can automatically replicate the image transformation pipeline in order to process multiple image regions concurrently. This happens transparently for the user of the library. Actual image processing and any I/O is deferred as long as possible. Intermediate results are represented in an abstract way by partial image descriptors. Each VIPS operation can specify a demand hint which is evaluated to determine the work unit size of the combined pipeline. VIPS uses memory-mapped I/O to load parts of an input image on demand. After the requested part of a file has been loaded, all image operations


are applied to the image region before the output region is written back to disk.
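The operation fusion and deferred, region-at-a-time evaluation described above can be sketched in miniature. This is a conceptual model under stated assumptions, not the VIPS API: each operation records a per-pixel function instead of producing pixels, and work happens only when a region is finally requested.

```python
class LazyImage:
    """Minimal sketch of VIPS-style deferred evaluation: operations are
    fused into a single per-pixel pipeline, and pixels are computed only
    when a region is realized (hypothetical API, not actual VIPS)."""

    def __init__(self, generate):
        self.generate = generate          # (x, y) -> pixel value

    @classmethod
    def constant(cls, value):
        return cls(lambda x, y: value)

    def map(self, fn):
        # Fuse another stage onto the pipeline; no pixels are computed yet.
        prev = self.generate
        return LazyImage(lambda x, y: fn(prev(x, y)))

    def realize_region(self, x0, y0, w, h):
        # Only now does the fused pipeline run, and only for this region.
        # Independent regions could be handed to different threads.
        return [[self.generate(x, y) for x in range(x0, x0 + w)]
                for y in range(y0, y0 + h)]

# Shrink-like scale followed by a brighten, fused into one pass:
img = LazyImage.constant(100).map(lambda p: p * 0.9).map(lambda p: p + 10)
tile = img.realize_region(0, 0, 2, 2)
```

Because realization is region-based, the runtime can replicate the pipeline across threads and assign each thread a disjoint region, which is how VIPS parallelizes transparently.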

A VIPS operation is composed of the main function which provides the public interface employed by the users, the generate function which implements the actual image operation, as well as a start and a stop function. The main functions register the operation with the VIPS evaluation system. Start functions are called by the runtime system to perform any per-thread initialization. They produce a sequence value which is passed to all generate functions and the stop function. Stop functions handle the shutdown at the end of the evaluation phase and destroy the sequence value. The VIPS system guarantees the mutually exclusive execution of start and stop functions, which can thus be used to communicate between threads during the pipeline initialization or shutdown phase. The generate functions transform the image and correspond to the pipeline stages.
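The start/generate/stop lifecycle with its mutual-exclusion guarantee can be sketched as below. The class and method names are hypothetical, not the VIPS C interface; the sketch only shows the protocol: start and stop run under a lock and a per-thread sequence value flows from start through generate to stop.

```python
import threading

class Operation:
    """Sketch of the start/generate/stop protocol described above
    (hypothetical names, not the actual VIPS API)."""

    def __init__(self):
        self._lock = threading.Lock()   # models VIPS's mutual exclusion
        self.started = 0

    def start(self):
        with self._lock:                # per-thread initialization
            self.started += 1
            return {"scratch": []}      # the "sequence value"

    def generate(self, seq, work_unit):
        seq["scratch"].append(work_unit)   # the actual image operation
        return work_unit * 2

    def stop(self, seq):
        with self._lock:                # shutdown; destroy sequence value
            seq.clear()

op = Operation()
results = []

def worker(units, out):
    seq = op.start()
    for u in units:
        out.append(op.generate(seq, u))
    op.stop(seq)

threads = [threading.Thread(target=worker, args=([i, i + 1], results))
           for i in range(2)]
for t in threads: t.start()
for t in threads: t.join()
```

The lock around start and stop is what lets threads safely exchange information during pipeline setup and teardown, exactly the communication channel the text describes.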

The sizes of the images used for the input sets for vips are:

• test: 256 × 288 pixels

• simdev: 256 × 288 pixels

• simsmall: 1,600 × 1,200 pixels

• simmedium: 2,336 × 2,336 pixels

• simlarge: 2,662 × 5,500 pixels

• native: 18,000 × 18,000 pixels

3.2.12 x264

The x264 application is an H.264/AVC (Advanced Video Coding) video encoder. In the 4th annual video codec comparison[41] it was ranked 2nd best codec for its high encoding quality. It is based on the ITU-T H.264 standard which was completed in May 2003 and which is now also part of ISO/IEC MPEG-4. In that context the standard is also known as MPEG-4 Part 10. H.264 describes the lossy compression of a video stream[43]. It improves over previous video encoding standards with new features such as increased sample bit depth precision, higher-resolution color information, variable block-size motion compensation (VBSMC) or context-adaptive binary arithmetic coding (CABAC). These advancements allow H.264 encoders to achieve a higher output quality with a lower bit-rate at the expense of a significantly increased encoding and decoding time. The flexibility of H.264 allows its use in a wide range of contexts with different requirements, from video conferencing solutions to high-definition (HD) movie distribution. Next-generation HD DVD or Blu-ray video players already require H.264/AVC encoding. The flexibility and wide range of application of the H.264 standard and its ubiquity in next-generation video systems are the reasons for the inclusion of x264 in the PARSEC benchmark suite.

H.264 encoders and decoders operate on macroblocks of pixels which have the fixed size of 16 × 16 pixels. Various techniques are used to detect and eliminate data redundancy. The most important one is motion compensation. It is employed to exploit temporal redundancy between successive frames. Motion compensation is usually the most expensive operation that has to be executed to encode a frame. It has a very high impact on the final compression ratio. The compressed output frames can be encoded in one of three possible ways:

I-Frame  An I-Frame includes the entire image and does not depend on other frames. All its macroblocks are encoded using intra prediction. In intra mode, a prediction block is formed using previously encoded blocks. This prediction block is subtracted from the current block prior to encoding.

P-Frame  These frames include only the changed parts of an image from the previous I- or P-Frame. A P-Frame is encoded with intra prediction and inter prediction with at most one motion-compensated prediction signal per prediction block. The prediction model is formed by shifting samples from previously encoded frames to compensate for motion such as camera pans.

B-Frame  B-Frames are constructed using data from the previous and next I- or P-Frame. They are encoded like a P-Frame but using inter prediction with two motion-compensated prediction signals. B-Frames can be compressed much more than other frame types.

The enhanced inter and intra prediction techniques of H.264 are the main factors for its improved coding efficiency. The prediction schemes can operate on blocks of varying sizes and shapes which can be as small as 4 × 4 pixels.
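The core of motion compensation is block matching: for each macroblock, find the offset into a reference frame that predicts it best. The sketch below uses an exhaustive search with a sum-of-absolute-differences (SAD) cost; real encoders like x264 use much faster search strategies and sub-pixel refinement, so this is only a minimal illustration of the idea.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b) for ra, rb in zip(block_a, block_b)
               for a, b in zip(ra, rb))

def best_motion_vector(ref, cur, bx, by, size, search):
    """Exhaustive block-matching sketch: find the offset (dx, dy) into
    the reference frame that best predicts the current block at (bx, by).
    Frames are lists of pixel rows; `search` bounds the search window."""
    def block(frame, x, y):
        return [row[x:x + size] for row in frame[y:y + size]]
    target = block(cur, bx, by)
    best = (0, 0)
    best_cost = sad(block(ref, bx, by), target)   # zero-motion baseline
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            x, y = bx + dx, by + dy
            if 0 <= x <= len(ref[0]) - size and 0 <= y <= len(ref) - size:
                cost = sad(block(ref, x, y), target)
                if cost < best_cost:
                    best, best_cost = (dx, dy), cost
    return best, best_cost
```

The motion vector plus the (small) residual after subtraction is what gets entropy-coded, which is why motion compensation dominates both the compression ratio and the encoding time.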

The parallel algorithm of x264 uses the pipeline model with one stage per input video frame. This results in a virtual pipeline with as many stages as there are input frames. x264 processes a number of pipeline stages equal to the number of encoder threads in parallel, resulting in a sliding window which moves from the beginning of the pipeline to its end. For P- and B-Frames the encoder requires the image data and motion vectors from the relevant region of the reference frames in order to encode the current frame, and so each stage makes this information available as it is calculated during the encoding process. Fast upward movements can thus cause delays which can limit the achievable speedup of x264 in practice. In order to compensate for this effect, the parallelization model requires that x264 is executed with a number of threads greater than the number of cores to achieve maximum performance.
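A toy simulation can make the sliding-window dependency concrete. In this sketch, assumed and simplified from the description above, each in-flight frame encodes one row of macroblocks per time step, but a frame may not get ahead of the rows its reference frame has already made available (a fixed lag stands in for the motion search range).

```python
def sliding_window_progress(num_frames, rows, window, lag=1):
    """Simulate the sliding-window frame pipeline: up to `window` frames
    are in flight; frame f may encode row r only once frame f-1 has
    encoded row r + lag (or has finished). Returns the time step at
    which each frame completes. Illustrative model, not x264's code."""
    progress = [0] * num_frames
    done_at = [None] * num_frames
    step = 0
    while any(d is None for d in done_at):
        step += 1
        # The window holds the first `window` unfinished frames.
        active = [f for f in range(num_frames) if done_at[f] is None][:window]
        for f in active:
            if f == 0 or done_at[f - 1] is not None:
                limit = rows                            # no live reference
            else:
                limit = max(0, progress[f - 1] - lag)   # trail the reference
            if progress[f] < limit:
                progress[f] += 1
                if progress[f] == rows:
                    done_at[f] = step
    return done_at
```

With a window of 2 the second frame finishes almost immediately after the first; with a window of 1 the frames serialize entirely, which is the effect oversubscribing with extra threads is meant to hide.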

x264 calls function x264_encoder_encode to encode another frame. x264_encoder_encode uses function x264_slicetype_decide to determine which type the frame will be encoded as and calls all necessary functions to produce the correct output. It also manages the threading functionality of x264. Threads use the functions x264_frame_cond_broadcast and x264_frame_cond_wait to inform each other of the encoding progress and to make sure that no data is accessed while it is not yet available.

The videos used for the input sets have been derived from the uncompressed version of the short film "Elephants Dream"[3]. The number of frames determines the amount of parallelism. The exact characteristics of the input sets are:


• test: 32 × 18 pixels, 1 frame

• simdev: 64 × 36 pixels, 3 frames

• simsmall: 640 × 360 pixels (1/3 HDTV resolution), 8 frames

• simmedium: 640 × 360 pixels (1/3 HDTV resolution), 32 frames

• simlarge: 640 × 360 pixels (1/3 HDTV resolution), 128 frames

• native: 1,920 × 1,080 pixels (HDTV resolution), 512 frames

4. METHODOLOGY

In this section we explain how we characterized the PARSEC benchmark suite. We are interested in the following characteristics:

Parallelization  PARSEC benchmarks use different parallel models which have to be analyzed in order to know whether the programs can scale well enough for the analysis of CMPs of a certain size.

Working sets and locality  Knowledge of the cache requirements of a workload is necessary to identify benchmarks suitable for the study of CMP memory hierarchies.

Communication-to-computation ratio and sharing  The communication patterns of a program determine the potential impact of private caches and the on-chip network on performance.

Off-chip traffic  The off-chip traffic requirements of a program are important to understand how off-chip bandwidth limitations of a CMP can affect performance.

In order to characterize all applications, we had to make several trade-off decisions. Given a limited amount of computational resources, higher accuracy comes at the expense of a lower number of experiments. We followed the approach of similar studies[44, 22] and chose faster but less accurate execution-driven simulation to characterize the PARSEC workloads. This approach is feasible because we limit ourselves to studying fundamental program properties which should have a high degree of independence from architectural details. Where possible we supply measurement results from real machines. This methodology made it possible to gather the large amount of data which we present in this study. We preferred machine models comparable to real processors over unrealistic models which might have been a better match for the program needs.

4.1 Experimental Setup

We used CMP$im[22] for our workload characterization. CMP$im is a plug-in for Pin[2] that simulates the cache hierarchy of a CMP. Pin is similar to the ATOM toolkit for Compaq's Tru64 Unix on Alpha processors. It uses dynamic binary instrumentation to insert routines at arbitrary points in the instruction stream. For the characterization we simulate a single-level cache hierarchy of a CMP and vary its parameters. The baseline cache configuration was a shared 4-way associative cache with 4 MB capacity and 64 byte lines. By default the workloads used 8 cores. All experiments were conducted on a set of Symmetric Multiprocessor (SMP) machines with x86 processors and Linux. The programs were compiled with gcc 4.2.1.

Because of the large computational cost we could not perform simulations with the native input set. Instead we used the simlarge inputs for all simulations and analytically describe any differences between the two sets of which we know.

4.2 Methodological Limitations and Error Margins

For their characterization of the SPLASH-2 benchmark suite, Woo et al. fixed a timing model which they used for all experiments[44]. They give two reasons: First, nondeterministic programs would otherwise be difficult to compare because different execution paths could be taken, and second, the characteristics they study are largely independent from an architecture. They also state that they believe that the timing model should have only a small impact on the results. While we use similar characteristics and share this belief, we think a characterization study of multi-threaded programs should nevertheless analyze the impact of nondeterminism on the reported data. Furthermore, because our methodology is based on execution on real machines combined with dynamic binary instrumentation, it can introduce additional latencies, and a potential concern is that the nondeterministic thread schedule is altered in a way that might affect our results in unpredictable ways. We therefore conducted a sensitivity analysis to quantify the impact of nondeterminism.

Alameldeen and Wood studied the variability of nondeterministic programs in more detail and showed that even small pseudo-random perturbations of memory latencies are effective to force alternate execution paths[5]. We adopted their approach and modified CMP$im to add extra delays to its analysis functions. Because running all experiments multiple times as Alameldeen and Wood did would be prohibitively expensive, we instead decided to randomly select a subset of all experiments for each metric which we use and report its error margins.

The measured quantities deviated by no more than ±0.04% from the average, with the following two exceptions. The first exception is metrics of data sharing. In two cases (bodytrack and swaptions) the classification is noticeably affected by the nondeterminism of the program. This is partially caused because shared and thread-private data contend aggressively for a limited amount of cache capacity. The high frequency of evictions made it difficult to classify lines and accesses as shared or private. In these cases, the maximum deviation of the number of accesses from the average was as high as ±4.71%, and the amount of sharing deviated by as much as ±15.22%. We considered this uncertainty in our study and did not draw any conclusions where the variation of the measurements did not allow it. The second case of high variability is when the value of the measured quantity is very low (below 0.1% miss rate or corresponding ratio). In these cases the nondeterministic noise


Figure 1: Upper bound for speedup of PARSEC workloads with input set simlarge based on instruction count. Limitations are caused by serial sections, growing parallelization overhead and redundant computations.

made measurements difficult. We do not consider this a problem because in this study we focus on trends of ratios, and quantities that small do not have a noticeable impact. It is however an issue for the analysis of working sets if the miss rate falls below this threshold and continues to decrease slowly. Only a few programs are affected, and our estimate of their working set sizes might be slightly off in these cases. This is primarily an issue inherent to experimental working set analysis, since it requires well-defined points of inflection for conclusive results. Moreover, we believe that in these cases the working set size varies nondeterministically, and researchers should expect slight variations for each benchmark run.

The implications of these results are twofold: First, they show that our methodology is not susceptible to the nondeterministic effects of multi-threaded programs in a way that might invalidate our findings. Second, they also confirm that the metrics which we present in this paper are fundamental program properties which cannot be distorted easily. The reported application characteristics are likely to be preserved on a large range of architectures.

5. PARALLELIZATION

In this section we discuss the parallelization of the PARSEC suite. As we will see in Section 6, several PARSEC benchmarks (canneal, dedup, ferret and freqmine) have working sets so large they should be considered unbounded for an analysis. These working sets are only limited by the amount of main memory in practice and they are actively used for inter-thread communication. The inability to use caches efficiently is a fundamental property of these programs and affects their concurrent behavior. Furthermore, dedup and ferret use a complex, heterogeneous parallelization model in which specialized threads execute different functions with different characteristics at the same time. These programs employ a pipeline with dedicated thread pools for each parallelized pipeline stage. Each thread pool

Figure 2: Parallelization overhead of PARSEC benchmarks. The chart shows the slowdown of the parallel version on 1 core over the serial version.

has enough threads to occupy the whole CMP, and it is the responsibility of the scheduler to assign cores to threads in a manner that maximizes the overall throughput of the pipeline. Over time, the number of threads active for each stage will converge toward the inverse throughput ratios of the individual pipeline stages relative to each other.

Woo et al. use an abstract machine model with a uniform instruction latency of one cycle to measure the speedups of the SPLASH-2 programs[44]. They justify their approach by pointing out that the impact of the timing model on the characteristics which they measure, including speedup, is likely to be low. Unfortunately, this is not true in general for PARSEC workloads. While we have verified in Section 4.2 that fundamental program properties such as miss rate and instruction count are largely not susceptible to timing shocks, the synchronization and timing behavior of the programs is. Using a timing model with perfect caches significantly alters the behavior of programs with unbounded working sets, for example how long locks to large, shared data structures are held. Moreover, any changes of the timing model have a strong impact on the number of active threads of programs which employ thread specialization. It will thus affect the load balance and synchronization behavior of these workloads. We believe it is not possible to discuss the timing behavior of these programs without also considering for example different schedulers, which is beyond the scope of this paper. Similar dependencies of commercial workloads on their environment are already known[9, 4].

Unlike Woo et al., who measured actual concurrency on an abstract machine, we therefore decided to analyze inherent concurrency and its limitations. Our approach is based on the number of executed instructions in parallel and serial regions of the code. We neglect any delays due to blocking on contended locks and load imbalance. This methodology is feasible because we do not study performance; our interest is in fundamental program characteristics. The presented data is largely timing-independent and a suitable measure of the concurrency inherent in a workload. The results in


Figure 1 show the maximum achievable speedup measured that way. The numbers account for limitations such as unparallelized code sections, synchronization overhead and redundant computations. PARSEC workloads can achieve actual speedups close to the presented numbers. We verified on a large range of architectures that lock contention and other timing-dependent factors are not limiting factors, but we know of no way to show it in a platform-independent way given the complications outlined above. The maximum speedup of bodytrack, x264 and streamcluster is limited by serial sections of the code. fluidanimate is primarily limited by growing parallelization overhead. On real machines, x264 is furthermore bound by a data dependency between threads; however, this has only a noticeable impact on machines larger than the ones described here. It is recommended to run x264 with more threads than cores, since modeling and exposing these dependencies to the scheduler is a fundamental aspect of its parallel algorithm, comparable to the parallel algorithms of dedup and ferret. Figure 2 shows the slowdown of the parallel version on 1 core over the serial version. The numbers show that all workloads use efficient parallel algorithms which are not substantially slower than the corresponding serial algorithms.
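The instruction-count bound on speedup described above is an Amdahl-style calculation and can be written down directly. This is a sketch of the general idea under the stated simplifications: serial-region instructions execute on one core, parallel-region instructions divide perfectly across cores, and blocking, load imbalance and memory effects are ignored.

```python
def max_speedup(serial_insns, parallel_insns, cores):
    """Upper bound on speedup from instruction counts alone: time on one
    core over time with the parallel region split across `cores`."""
    t1 = serial_insns + parallel_insns          # single-core "time"
    tp = serial_insns + parallel_insns / cores  # ideal parallel "time"
    return t1 / tp

# Even 1% of instructions in serial regions caps 16 cores at about 13.9x:
bound = max_speedup(1, 99, 16)
```

This is why the text attributes the limited speedup of bodytrack, x264 and streamcluster to their serial sections: even a small serial fraction dominates the bound as the core count grows.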

PARSEC programs scale well enough to study CMPs. We believe they are also useful on machines larger than the ones analyzed here. The PARSEC suite exhibits a wider variety of parallelization models than previous benchmark suites, such as the pipeline model. Some of its workloads can adapt to different timing models and can use threads to hide latencies. It is important to analyze these programs in the context of the whole system.

6. WORKING SETS AND LOCALITY

The temporal locality of a program can be estimated by analyzing how the miss rate of a processor's cache changes as its capacity is varied. Often the miss rate does not decrease continuously as the size of a cache is increased, but stays on a certain level and then makes a sudden jump to a lower level when the capacity becomes large enough to hold the next important data structure. For CMPs an efficient functioning of the last cache level on the chip is crucial because a miss in the last level will require an access to off-chip memory.

To analyze the working sets of the PARSEC workloads we studied a cache shared by all processors. Our results are presented in Figure 3. In Table 3 we summarize the important characteristics of the identified working sets. Most workloads exhibit well-defined working sets with clearly identifiable points of inflection. Compared to SPLASH-2, PARSEC working sets are significantly larger and can reach hundreds of megabytes such as in the cases of canneal and freqmine.
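Reading working-set sizes off a miss-rate curve amounts to locating the sudden drops described above. The sketch below is one simple, assumed heuristic (the threshold and the sample curve are illustrative, not measured data): report the cache sizes at which the miss rate falls by a large relative amount versus the previous size.

```python
def working_set_sizes(cache_sizes_kb, miss_rates, drop=0.3):
    """Return the cache sizes at which the miss rate falls by more than
    `drop` (relative) compared to the previous size -- the points of
    inflection that mark a working set fitting into the cache."""
    sizes = []
    for prev, cur, size in zip(miss_rates, miss_rates[1:], cache_sizes_kb[1:]):
        if prev > 0 and (prev - cur) / prev > drop:
            sizes.append(size)
    return sizes

# Hypothetical curve with plateaus at ~5% and ~1% and a floor near 0.1%:
caches = [64, 128, 256, 512, 1024, 2048]
misses = [5.0, 4.9, 1.0, 0.95, 0.1, 0.1]
```

On this example curve the heuristic reports two working sets, at 256 KB and 1 MB, mirroring the WS1/WS2 pairs identified in Table 3. It also shows the failure mode the error-margin discussion warns about: when the curve decreases slowly with no sharp drop, no well-defined inflection point exists.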

Two types of workloads can be distinguished: The first group contains benchmarks such as bodytrack and swaptions which have working sets no larger than 16 MB. These workloads have a limited need for caches with a bigger capacity, and the latest generation of CMPs often already has caches sufficiently large to accommodate most of their working sets. The second group of workloads is composed of the benchmarks canneal, ferret, facesim, fluidanimate and freqmine. These programs have very large working sets of sizes 65 MB and more, and even with a relatively constrained input set such as simlarge, their working sets can reach hundreds of megabytes. Moreover, the need of those workloads for cache capacity is nearly insatiable and grows with the amount of data which they process. In Table 3 we give our estimates for the largest working set of each PARSEC workload for the native input set. In several cases they are significantly larger and can even reach gigabytes. These large working sets are often the consequence of an algorithm that operates on large amounts of collected input data. ferret for example keeps a database of feature vectors of images in memory to find the images most similar to a given query image. The cache and memory needs of these applications should be considered unbounded, as they become more useful to their users if they can work with increased amounts of data. Programs with unbounded working sets are canneal, dedup, ferret and freqmine.

In Figure 4 we present our analysis of the spatial locality of the PARSEC workloads. The data shows how the miss rate of a shared cache changes with line size. All programs benefit from larger cache lines, but to different extents. facesim, fluidanimate and streamcluster show the greatest improvement as the line size is increased, up to the maximum value of 256 bytes which we used. These programs have streaming behavior, and an increased line size has a prefetching effect which these workloads can take advantage of. facesim for example spends most of its time updating the position-based state of the model, for which it employs an iterative Newton-Raphson algorithm. The algorithm iterates over the elements of a sparse matrix which is stored in two one-dimensional arrays, resulting in a streaming behavior. All other programs also show good improvement of the miss rate with larger cache lines, but only up to line sizes of about 128 bytes. The miss rate is not substantially reduced with larger lines. This is due to the limited size of the basic data structures employed by the programs. They represent independent logical units, each of which is intensely worked with during a computational phase. For example, x264 operates on macroblocks of 8 × 8 pixels at a time, which limits the sizes of the used data structures. Processing a macroblock is computationally intensive and largely independent from other macroblocks. Consequently, the amount of spatial locality is bounded in these cases.

For the rest of our analysis we chose a cache capacity of 4 MB for all experiments. We could have used a matching cache size for each workload, but that would have made comparisons very difficult, and the use of very small or very large cache sizes is not realistic. Moreover, in the case of the workloads with an unbounded working set size, a working set which completely fits into a cache would be an artifact of the limited simulation input size and would not reflect realistic program behavior.

7. COMMUNICATION-TO-COMPUTATION RATIO AND SHARING

In this section we discuss how PARSEC workloads use caches to communicate. Most PARSEC benchmarks share data intensely. Two degrees of sharing can be distinguished: Shared data can be read-only during the parallel phase, in which case it is only used for lookups and analysis. Input data is frequently used in such a way. But shared data can also be used for communication between threads, in which


Figure 3: Miss rates versus cache size. Data assumes a shared 4-way associative cache with 64 byte lines. WS1 and WS2 refer to important working sets which we analyze in more detail in Table 3. Cache requirements of PARSEC benchmark programs can reach hundreds of megabytes.

Program        Working Set 1 (simlarge)            Working Set 2 (simlarge)              Native
               Structure(s)    Size    Growth      Structure(s)      Size    Growth      WS2 Estimate
blackscholes   options         64 KB   C           portfolio data    2 MB    C           same
bodytrack      edge maps       512 KB  const.      input frames      8 MB    const.      same
canneal        elements        64 KB   C           netlist           256 MB  DS          2 GB
dedup          data chunks     2 MB    C           hash table        256 MB  DS          2 GB
facesim        tetrahedra      256 KB  C           face mesh         256 MB  DS          same
ferret         images          128 KB  C           data base         64 MB   DS          128 MB
fluidanimate   cells           128 KB  C           particle data     64 MB   DS          128 MB
freqmine       transactions    256 KB  C           FP-tree           128 MB  DS          1 GB
streamcluster  data points     64 KB   C           data block        16 MB   user-def.   256 MB
swaptions      swaptions       512 KB  C           same as WS1       same    same        same
vips           image data      64 KB   C           image data        16 MB   C           same
x264           macroblocks     128 KB  C           reference frames  16 MB   C           same

Table 3: Important working sets and their growth rates. DS represents the data set size and C is the number of cores. Working set sizes are taken from Figure 3. Values for the native input set are analytically derived estimates. Working sets that grow proportionally to the number of cores C are aggregated private working sets and can be split up to fit into correspondingly smaller, private caches.


Figure 4: Miss rates as a function of line size. Data assumes 8 cores sharing a 4-way associative cache with 4 MB capacity. Miss rates are broken down to show the effect of loads and stores.

[Figure 5 plot omitted: shared-line fractions for blackscholes, bodytrack, dedup, ferret, freqmine, swaptions, vips and x264, broken down by number of sharers (2 up to more than 8); x-axis line size 8 to 256 bytes, y-axis shared lines (%).]

Figure 5: Portion of a 4-way associative cache with 4 MB capacity which is shared. The line size is varied from 8 to 256 bytes. Data assumes 8 cores and is broken down to show the number of threads sharing the lines.

case it is also modified during the parallel phase. In Figure 5 we show how the line size affects sharing. The data combines the effects of false sharing and the access pattern of the program due to constrained cache capacity. In Figure 6, we analyze how the program uses its data. The chart shows what data is accessed and how intensely it is used. The information is broken down in two orthogonal ways, resulting in four possible types of accesses: read and write accesses, and accesses to thread-private and shared data. Additionally, we give numbers for true shared accesses. An access is a true access if the last reference to that line came from another thread. True sharing does not count repeated accesses by the same thread. It is a useful metric to estimate the requirements for the cache coherence mechanism of a CMP: a true shared write can trigger a coherence invalidate or update, and a true shared read might require the replication of data. All programs exhibit very few true shared writes. Cache misses are not included in Figure 6; we analyze them separately when we discuss off-chip traffic in Section 8.
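The true-sharing rule defined above — count an access only when the last thread to touch that line was a different one — is simple enough to state as code. This is a sketch of the classification, not the paper's actual instrumentation; the trace format is an assumption.

```python
# Classify accesses per the true-sharing definition: an access is "true
# shared" when the previous toucher of its cache line was another thread,
# so repeated accesses by the same thread are not counted.
def classify_accesses(trace, line_size=64):
    """trace: iterable of (thread_id, address, is_write) tuples."""
    last_toucher = {}  # cache line -> last thread that touched it
    counts = {"true_shared_read": 0, "true_shared_write": 0, "other": 0}
    for tid, addr, is_write in trace:
        line = addr // line_size
        prev = last_toucher.get(line)
        if prev is not None and prev != tid:
            counts["true_shared_write" if is_write else "true_shared_read"] += 1
        else:
            counts["other"] += 1
        last_toucher[line] = tid
    return counts
```

Note how a second access by the same thread falls into "other": only hand-offs between threads reach the coherence-relevant counters.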

Four programs (canneal, facesim, fluidanimate and streamcluster) showed only trivial amounts of sharing. They have therefore not been included in Figure 5. In the case of canneal, this is a result of the small cache capacity. Most of its large working set is shared and actively worked with by all threads. However, only a minuscule fraction of it fits into the cache, and the probability that a line is accessed by more than one thread before it gets replaced is very small in practice. With a 256 MB cache, 58% of its cached data is shared. blackscholes shows a substantial amount of sharing, but almost all its shared data is only accessed by two threads. This is a side effect of the parallelization model: at the beginning of the program, the boss thread initializes the portfolio data before it spawns worker threads which process parts of it in a data-parallel way. As such, the entire portfolio is shared between the boss thread and its workers, but


[Figure 6 plot omitted: per-program traffic bars for blackscholes, bodytrack, canneal, dedup, facesim, ferret, fluidanimate, freqmine, streamcluster, swaptions, vips and x264, broken down into private reads, private writes, shared reads, shared writes, true shared reads and true shared writes; x-axis 1 to 16 cores, y-axis traffic (bytes/instr.), 0 to 5.]

Figure 6: Traffic from cache in bytes per instruction for 1 to 16 cores. Data assumes a shared 4-way associative cache with 64 byte lines. Results are broken down in accesses to private and shared data. True accesses do not count repeated accesses from the same thread.

the worker threads can process the options independently from each other and do not have to communicate with each other. ferret shows a modest amount of data sharing. Like the sharing behavior of canneal, this is caused by severely constrained cache capacity. ferret uses a database that is scanned by all threads to find entries similar to the query image. However, the size of the database is practically unbounded, and because threads do not coordinate their scans with each other it is unlikely that a cache line gets accessed more than once. bodytrack and freqmine exhibit substantial amounts of sharing due to the fact that threads process the same data. The strong increase of sharing of freqmine is caused by false sharing, as the program uses an array-based tree as its main data structure. Larger cache lines will contain more nodes, increasing the chance that the line is accessed by multiple threads. vips has some shared data which is mostly used by only two threads. This is also predominantly an effect of false sharing since image data is stored in a consecutive array which is processed in a data-parallel way by threads. x264 uses significant amounts of shared data, most of which is only accessed by a low number of threads. This data is the reference frames, since a thread needs this information from other stages in order to encode the frame it was assigned. Similarly, the large amount of shared data of dedup is the input which is passed from stage to stage.
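The false-sharing effect described for freqmine's array-based tree is a matter of simple arithmetic: the number of independent nodes that land on one cache line grows linearly with the line size, and with it the chance that two threads touch the same line without sharing any data. The 32-byte node size below is a made-up figure for illustration, not freqmine's actual node layout.

```python
# How many independent tree nodes share one cache line, as a function of the
# line size. node_size=32 is an illustrative assumption.
def nodes_per_line(line_size, node_size=32):
    return max(1, line_size // node_size)

# Line sizes swept in Figure 5: false-sharing opportunity grows 4x from
# 64-byte to 256-byte lines.
growth = [nodes_per_line(ls) for ls in (8, 16, 32, 64, 128, 256)]
```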

Most PARSEC workloads use a significant amount of communication, and in many cases the volume of traffic between threads can be so high that efficient data exchange via a shared cache is severely constrained by its capacity. An example for this is x264. Figure 6 shows a large amount of writes to shared data, but contrary to intuition its share diminishes rapidly as the number of cores is increased. This effect is caused by a growth of the working sets of x264: Table 3 shows that both working set WS1 and WS2 grow proportional to the number of cores. WS1 is mostly composed of thread-private data and is the one which is used more intensely. WS2 contains the reference frames and is used

for inter-thread communication. As WS1 grows, it starts to displace WS2, and the threads are forced to communicate via main memory. Two more programs which communicate intensely are dedup and ferret. Both programs use the pipeline parallelization model with dedicated thread pools for each parallel stage, and all data has to be passed from stage to stage. fluidanimate also shows a large amount of inter-thread communication, and its communication needs grow as the number of threads increases. This is caused by the spatial partitioning that fluidanimate uses to distribute the work to threads. Smaller partitions mean a worse surface-to-volume ratio, and communication grows with the surface.
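The surface-to-volume argument can be made concrete with a toy decomposition: an n x n x n grid of cells split into p equal slabs along one axis (an illustrative partitioning, not necessarily the one fluidanimate uses). Each slab computes on its volume but exchanges its boundary faces, so shrinking slabs raise the communication share.

```python
# Boundary-to-interior cell ratio for one interior slab of an n^3 domain
# split into p slabs along one axis. Illustrative geometry, not fluidanimate's
# actual decomposition.
def comm_to_comp_ratio(n, p):
    slab_depth = n // p
    faces = 2 if p > 1 else 0     # interior slabs exchange across both faces
    boundary = faces * n * n       # cells sent/received per step
    volume = slab_depth * n * n    # cells computed per step
    return boundary / volume
```

Doubling the partition count doubles the ratio: with n = 64, going from 4 to 8 slabs raises the communication share from 0.125 to 0.25, which is the scaling trend visible in Figure 6.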

Overall, most PARSEC workloads have complex sharing patterns and communicate actively. Pipelined programs can require a large amount of bandwidth between cores in order to communicate efficiently. Shared caches with insufficient capacity can limit the communication efficiency of workloads, since shared data structures might get displaced to memory.

8. OFF-CHIP TRAFFIC

In this section we analyze the off-chip bandwidth requirements of the PARSEC workloads. Our goal is to understand how the traffic of an application grows as the number of cores of a CMP increases and how the memory wall will limit performance. We again simulated a shared cache and analyzed how traffic develops as the number of cores increases. Our results are presented in Figure 7.
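The metric plotted in Figure 7 — off-chip bytes per instruction — can be reconstructed from standard simulator counters: every miss fetches one line from memory and every writeback evicts one dirty line to it. The counter names below are assumptions for illustration, not output of the paper's tools.

```python
# Off-chip traffic per instruction from cache-simulator counters. With a
# write-back, allocate-on-store cache, both load and store misses fetch a
# full line, and each writeback moves one dirty line out.
def offchip_bytes_per_instr(load_misses, store_misses, writebacks,
                            instructions, line_size=64):
    bytes_moved = (load_misses + store_misses + writebacks) * line_size
    return bytes_moved / instructions
```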

The data shows that the off-chip bandwidth requirements of blackscholes are small enough so that memory bandwidth is unlikely to be an issue. bodytrack, dedup, fluidanimate, freqmine, swaptions and x264 are more demanding. Moreover, these programs exhibit a growing bandwidth demand per instruction as the number of cores increases. In the case of bodytrack, most off-chip traffic happens in short, intense bursts since the off-chip communication predominantly


[Figure 7 plot omitted: per-program stacked bars for blackscholes, bodytrack, dedup, fluidanimate, freqmine, swaptions, vips, x264, canneal, facesim, ferret and streamcluster, broken down into loads, stores and writebacks, on two y-axis scales (0 to 0.1 and 0 to 1 bytes/instr.); x-axis 1 to 16 cores.]

Figure 7: Breakdown of off-chip traffic into bytes for loads, stores and write-backs per instruction. Results are shown for 1 to 16 cores. Data assumes a 4-way associative 4 MB cache with 64 byte lines, allocate-on-store and write-back policy.

takes place during the edge map computation. This phase is only a small part of the serial runtime, but on machines with constrained memory bandwidth it quickly becomes the limiting factor for scalability. The last group of programs is composed of canneal, facesim, ferret, streamcluster and vips. These programs have very high bandwidth requirements and also large working sets. canneal shows a decreasing demand for data per instruction with more cores. This behavior is caused by improved data sharing.

It is important to point out that these numbers do not take the increasing instruction throughput of a CMP into account as its number of cores grows. A constant traffic amount in Figure 7 means that the absolute bandwidth requirements of an application which scales linearly will grow in proportion to the number of cores. Since many PARSEC workloads have high bandwidth requirements and working sets which exceed conventional caches by far, off-chip bandwidth will be their most severe performance limitation. Substantial architectural improvements are necessary to allow emerging workloads to take full advantage of larger CMPs.
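The relationship between per-instruction traffic and absolute bandwidth can be made concrete with simple arithmetic: with constant bytes per instruction, the required off-chip bandwidth tracks aggregate instruction throughput, which grows with the core count. The IPC and clock figures below are illustrative assumptions, not measurements from this study.

```python
# Absolute off-chip bandwidth implied by a fixed bytes-per-instruction rate.
# ipc_per_core and clock_ghz are illustrative, not measured values.
def required_bandwidth_gbs(cores, bytes_per_instr, ipc_per_core=1.0,
                           clock_ghz=2.0):
    instrs_per_sec = cores * ipc_per_core * clock_ghz * 1e9
    return bytes_per_instr * instrs_per_sec / 1e9  # GB/s

demand_8 = required_bandwidth_gbs(8, 0.5)    # e.g. a streamcluster-like rate
demand_16 = required_bandwidth_gbs(16, 0.5)
```

Doubling the cores at constant bytes/instruction doubles the memory bandwidth a linearly scaling application needs, which is why the flat-looking bars in Figure 7 still spell trouble for larger CMPs.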

9. CONCLUSIONS

The PARSEC benchmark suite is designed to provide parallel programs for the study of CMPs. It focuses on emerging desktop and server applications and does not have the limitations of other benchmark suites: it is diverse enough to be considered representative, it is not skewed towards HPC programs, it uses state-of-the-art algorithms and it supports research. In this study we characterized the PARSEC workloads to provide the basic understanding necessary to allow other researchers the effective use of PARSEC for their studies. We analyzed the parallelization, the working sets and locality, the communication-to-computation ratio and the off-chip traffic of its workloads. The high cost of microarchitectural simulation forced us to use smaller machine and problem sizes than we would have liked to evaluate.

Our analysis shows that current CMPs are not sufficiently prepared for the demands of future applications. Many architectural problems still have to be solved to create microprocessors which are ready for the next generation of workloads. Large working sets and high off-chip bandwidth demands might limit the scalability of future programs. PARSEC can be used to drive such research efforts with the demands of emerging applications.

10. ACKNOWLEDGMENTS

First and foremost we would like to acknowledge the many authors of the PARSEC benchmark programs, who are too numerous to be listed here. The institutions that contributed the largest number of programs are Intel and Princeton University. Stanford University allowed us to use their code and data for facesim.

We would like to acknowledge the contribution of the following individuals: Justin Rattner, Pradeep Dubey, Tim Mattson, Jim Hurley, Bob Liang, Horst Haussecker, Yemin Zhang, and Ron Fedkiw. They convinced skeptics and supported us so that a project the size of PARSEC could succeed.

11. REFERENCES

[1] MediaBench II. http://euler.slu.edu/~fritts/mediabench/.

[2] Pin. http://rogue.colorado.edu/pin/.

[3] Elephants Dream. http://www.elephantsdream.org/, 2006.

[4] A. Alameldeen, C. Mauer, M. Xu, P. Harper, M. Martin, and D. Sorin. Evaluating Non-Deterministic Multi-Threaded Commercial Workloads. In Proceedings of the Computer Architecture Evaluation using Commercial Workloads, February 2002.

[5] A. Alameldeen and D. Wood. Variability in Architectural Simulations of Multi-threaded Workloads. In Proceedings of the 9th International Symposium on High-Performance Computer Architecture, February 2003.

[6] A. Balan, L. Sigal, and M. Black. A Quantitative Evaluation of Video-based 3D Person Tracking. In IEEE Workshop on VS-PETS, pages 349–356, 2005.

[7] P. Banerjee. Parallel algorithms for VLSI computer-aided design. Prentice-Hall, Inc., Upper Saddle River, NJ, USA, 1994.

[8] J. Barnes and P. Hut. A hierarchical O(N log N) force-calculation algorithm. Nature, 324:446–449, December 1986.

[9] L. Barroso, K. Gharachorloo, and E. Bugnion. Memory System Characterization of Commercial Workloads. In Proceedings of the 25th International Symposium on Computer Architecture, pages 3–14, June 1998.

[10] F. Black and M. Scholes. The Pricing of Options and Corporate Liabilities. Journal of Political Economy, 81:637–659, 1973.

[11] S. Brin, J. Davis, and H. Garcia-Molina. Copy Detection Mechanisms for Digital Documents. In Proceedings of the Special Interest Group on Management of Data, 1995.

[12] M. Desbrun and M.-P. Gascuel. Smoothed Particles: A new paradigm for animating highly deformable bodies. In Computer Animation and Simulation '96, pages 61–76, August 1996.

[13] J. Deutscher and I. Reid. Articulated Body Motion Capture by Stochastic Search. International Journal of Computer Vision, 61(2):185–205, February 2005.

[14] P. Dubey. Recognition, Mining and Synthesis Moves Computers to the Era of Tera. Technology@Intel Magazine, February 2005.

[15] G. Grahne and J. Zhu. Efficiently Using Prefix-trees in Mining Frequent Itemsets. November 2003.

[16] J. Han, J. Pei, and Y. Yin. Mining Frequent Patterns without Candidate Generation. In W. Chen, J. Naughton, and P. A. Bernstein, editors, 2000 ACM SIGMOD International Conference on Management of Data, pages 1–12. ACM Press, May 2000.

[17] D. Heath, R. Jarrow, and A. Morton. Bond Pricing and the Term Structure of Interest Rates: A New Methodology for Contingent Claims Valuation. Econometrica, 60(1):77–105, January 1992.

[18] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003.

[19] L. Hernquist and N. Katz. TreeSPH - A unification of SPH with the hierarchical tree method. The Astrophysical Journal Supplement Series, 70:419, 1989.

[20] C. J. Hughes, R. Grzeszczuk, E. Sifakis, D. Kim, S. Kumar, A. P. Selle, J. Chhugani, M. Holliman, and Y.-K. Chen. Physical Simulation for Animation and Visual Effects: Parallelization and Characterization for Chip Multiprocessors. SIGARCH Computer Architecture News, 35(2):220–231, 2007.

[21] J. C. Hull. Options, Futures, and Other Derivatives. Prentice Hall, 2005.

[22] A. Jaleel, M. Mattina, and B. Jacob. Last-Level Cache (LLC) Performance of Data-Mining Workloads on a CMP - A Case Study of Parallel Bioinformatics Workloads. In Proceedings of the 12th International Symposium on High Performance Computer Architecture, February 2006.

[23] J. Pisharath, Y. Liu, W.-k. Liao, A. Choudhary, G. Memik, and J. Parhi. NU-Minebench 2.0. Technical report, Center for Ultra-Scale Computing and Information Security, Northwestern University, August 2005.

[24] R. M. Karp and M. O. Rabin. Efficient Randomized Pattern-Matching Algorithms. IBM Journal of Research and Development, 31(2):249–260, 1987.

[25] M.-L. Li, R. Sasanka, S. V. Adve, Y.-K. Chen, and E. Debes. The ALPBench Benchmark Suite for Complex Multimedia Applications. In Proceedings of the 2005 International Symposium on Workload Characterization, October 2005.

[26] C. Lucchese, S. Orlando, R. Perego, and F. Silvestri. WebDocs: a real-life huge transactional dataset. In 2nd IEEE ICDM Workshop on Frequent Itemset Mining Implementations, November 2004.

[27] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Ferret: A Toolkit for Content-Based Similarity Search of Feature-Rich Data. In Proceedings of the 2006 EuroSys Conference, pages 317–330, 2006.

[28] Q. Lv, W. Josephson, Z. Wang, M. Charikar, and K. Li. Multi-Probe LSH: Efficient Indexing for High-Dimensional Similarity Search. In Proceedings of the 33rd International Conference on Very Large Data Bases, pages 950–961, 2007.

[29] U. Manber. Finding Similar Files in a Large File System. In Proceedings of the USENIX Winter 1994 Technical Conference, pages 1–10, San Francisco, CA, USA, October 1994.

[30] K. Martinez and J. Cupitt. VIPS - a highly tuned image processing software architecture. In Proceedings of the 2005 International Conference on Image Processing, volume 2, pages 574–577, September 2005.

[31] M. Matsumoto and T. Nishimura. Mersenne Twister: A 623-dimensionally equidistributed uniform pseudorandom number generator. In ACM Transactions on Modeling and Computer Simulation, volume 8, pages 3–30, January 1998.

[32] M. Müller, D. Charypar, and M. Gross. Particle-Based Fluid Simulation for Interactive Applications. In Proceedings of the 2003 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 154–159, Aire-la-Ville, Switzerland, 2003. Eurographics Association.

[33] R. Nock and F. Nielsen. Statistical Region Merging. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26:1452–1458, 2004.

[34] L. O'Callaghan, A. Meyerson, R. Motwani, N. Mishra, and S. Guha. High-Performance Clustering of Streams and Large Data Sets. In Proceedings of the 18th International Conference on Data Engineering, February 2002.

[35] D. Pnueli and C. Gutfinger. Fluid Mechanics. Cambridge University Press, 1992.

[36] S. Quinlan and S. Dorward. Venti: A New Approach to Archival Storage. In Proceedings of the USENIX Conference on File And Storage Technologies, January 2002.

[37] Y. Rubner, C. Tomasi, and L. J. Guibas. The Earth Mover's Distance as a Metric for Image Retrieval. International Journal of Computer Vision, 40:99–121, 2000.

[38] E. Sifakis, I. Neverov, and R. Fedkiw. Automatic Determination of Facial Muscle Activations from Sparse Motion Capture Marker Data. ACM Transactions on Graphics, 24(3):417–425, 2005.

[39] N. T. Spring and D. Wetherall. A Protocol-Independent Technique for Eliminating Redundant Network Traffic. In Proceedings of ACM SIGCOMM, August 2000.

[40] J. Teran, E. Sifakis, G. Irving, and R. Fedkiw. Robust Quasistatic Finite Elements and Flesh Simulation. In Proceedings of the 2005 ACM SIGGRAPH/Eurographics Symposium on Computer Animation, pages 181–190, New York, NY, USA, 2005. ACM Press.

[41] D. Vatolin, D. Kulikov, and A. Parshin. MPEG-4 AVC/H.264 Video Codecs Comparison. http://compression.ru/video/codec_comparison/pdf/msu_mpeg_4_avc_h264_codec_comparison_2007_eng.pdf, 2007.

[42] L. Verlet. Computer Experiments on Classical Fluids. I. Thermodynamical Properties of Lennard-Jones Molecules. Physical Review, 159:98–103, 1967.

[43] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003.

[44] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 Programs: Characterization and Methodological Considerations. In Proceedings of the 22nd International Symposium on Computer Architecture, pages 24–36, June 1995.

[45] G. Xu. A New Parallel N-Body Gravity Solver: TPM. The Astrophysical Journal Supplement Series, 98:355, 1995.

[46] T. Y. Yeh, P. Faloutsos, S. Patel, and G. Reinman. ParallAX: An Architecture for Real-Time Physics. In Proceedings of the 34th International Symposium on Computer Architecture, June 2007.