
Prototyping Parallel Simulations on Manycore Architectures Using Scala: A Case Study

Jonathan Passerat-Palmbach∗†‡ Romain Reuillon§ Claude Mazel∗†‡ David R.C. Hill∗†‡

{passerat, mazel}@isima.fr, [email protected], [email protected]
∗ Clermont Université, BP 10448, F-63000 CLERMONT-FERRAND
† CNRS, UMR 6158, Université Blaise Pascal, LIMOS, F-63173
‡ ISIMA, Institut Supérieur d’Informatique, de Modélisation et de leurs Applications, BP 10125, F-63177
§ Institut des Systèmes Complexes, 57-59 rue Lhomond, F-75005 PARIS

Abstract—In the manycore era, every simulation practitioner can take advantage of the computing horsepower delivered by the available high performance computing devices. From multicore CPUs (Central Processing Units) to thousand-thread GPUs (Graphics Processing Units), several architectures are now able to offer great speed-ups to simulations. However, it is often tricky to harness them properly, and even more complicated to implement several declinations of the same model to compare the parallelizations. Thus, simulation practitioners would mostly benefit from a simple way to evaluate the potential benefits of choosing one platform or another to parallelize their simulations. In this work, we study the ability of the Scala programming language to fulfill this need. We compare the features of two frameworks in this study: Scala Parallel Collections and ScalaCL. Both of them provide facilities to set up a data-parallelism approach on Scala collections. The capabilities of the two frameworks are benchmarked with three simulation models as well as a large set of parallel architectures. According to our results, these two Scala frameworks should be considered by the simulation community to quickly prototype parallel simulations, and to choose the target platform on which investing in an optimized development will be rewarding.

Keywords—Parallelization of Simulation, Threads, OpenCL, Scala, Automatic Parallelization, GPU, Manycore

I. Introduction

Simulations tend to become more and more accurate, but also more and more complex. At the same time, manycore architectures and hardware accelerators are now widespread and allow great speedups for applications that can be parallelized in a way that takes advantage of their overwhelming power. However, several problems arise when one tries to parallelize a simulation. First, you need to decide which hardware accelerator will best fit your application; second, you must master its architecture’s characteristics; and last but not least, you need to choose the programming language and Application Programming Interface (API) that will best suit this particular hardware.

Depending on their underlying models and algorithms, simulations will display greater speedups when parallelized on some architectures than on others. For example, models relying on cellular automata algorithms are likely to scale smoothly on GPU devices or any other vector architecture [1], [2].

The problem is that scientists sometimes champion a given architecture without trying to evaluate the potential benefits their applications could gain from other architectures. This can result in disappointing performance from the new parallel application and a speed-up far from the original expectations. Such hasty decisions can also lead to wrong conclusions, where an underexploited architecture would display worse performance than the chosen one [3]. One of the main examples of this circle of influence is GPUs, which have been at the heart of many publications for the last five years. Parallel applications using this kind of platform are often proved relevant by comparing them to their sequential counterparts. Now, the question is: is it fair to compare the performance of an optimized GPU application to that of a single CPU core? Wouldn’t we obtain far better performance by trying to make the most of all the cores of a modern CPU?

On the other hand, many platforms are available today, and in view of the time and workload involved in the development of a parallel application for a given architecture, it is nearly impossible to invest significant human resources in building several prototypes to determine which architecture will best support an application. As a matter of fact, parallel developers need programming facilities to help them build a reasonable set of parallel prototypes, each targeting a different parallel platform.

This proposition implies that developers master many parallel programming technologies if they want to be able to develop such a set of prototypes. Thus, programming facilities must also help them factor their code as much as possible. The ideal paradigm in this case is Write Once, Run Anywhere, which suggests that different parallel platforms understand the same binaries.



To do so, a standard called OpenCL (Open Computing Language) was proposed by the Khronos Group [4]. OpenCL aims at unifying developments on various kinds of architectures such as CPUs, GPUs and even FPGAs. It provides programming constructs based upon C99 to express the actual parallel code (called the kernel). These are enhanced by APIs (Application Programming Interfaces) used to control the device and the execution. The execution of OpenCL programs relies on specific drivers issued by the manufacturer of the hardware they run on. The point is that OpenCL kernels are not compiled with the rest of the application, but on the fly at runtime. This allows specific tuning of the binary for the current platform.

OpenCL solves the aforementioned obstacles: as a cross-platform standard, it allows developing simulations once and for all for any supported architecture. It is also a great abstraction layer that lets clients concentrate on the parallelization of their algorithm, and leave the device-specific mechanics to the driver. Still, OpenCL is not a silver bullet; designing parallel applications might still appear complicated to scientists from many domains. Indeed, OpenCL development can be a real hindrance to parallelization. We need high-level programming APIs to hide this complexity from scientists, while being able to automatically generate OpenCL kernels and initializations.

Another widespread cross-platform tool is Java, and especially its virtual machine execution platform. The latter makes Java-enabled solutions a perfect match for our needs, since any Java-based development is Write Once, Run Anywhere. Although the Java Virtual Machine (JVM) tends to become more and more efficient, the Java language itself evolves more slowly, and several new languages running on top of the JVM have appeared during the last few years, including Clojure, Groovy and Scala. These languages reduce code bloat and offer solutions to better handle concurrency and parallelism. Among JVM-based languages, we have chosen to focus on Scala in this study because of its hybrid aspect: it mixes both the functional and object-oriented paradigms. By doing so, Scala allows object-oriented developers to gently migrate to powerful functional constructs. Meanwhile, functional programming copes very well with parallelism thanks to its immutability principles. Having immutable objects and collections highly simplifies parallel source code, since no synchronization mechanism is required when accessing the original data. Like many other functional languages such as Haskell, F# or Erlang, Scala has leveraged these aspects to provide transparent parallelization facilities. In the standard library, these parallelization facilities use system threads and are limited to CPU parallelism. At the same time, the ScalaCL library generates OpenCL code at compile time and produces OpenCL-enabled binaries from pure Scala code. Consequently, it is easier to automatically generate parallelized functional programming constructs.

To sum up, OpenCL and Scala both appear to be relevant tools to quickly build a varied set of parallel prototypes for a given application. This study therefore benchmarks solutions where one or both of these technologies were used to build prototypes of parallel simulations on manycore CPU and GPU architectures. We concentrate on simulations that can benefit from data-parallelism in this study. As a matter of fact, the tools that we have chosen are designed to deal with this kind of parallelism, as opposed to task-parallelism. We intend to show that prototypes harnessing Scala’s automatic parallelization frameworks display the same trend in terms of speed-up as handcrafted parallel implementations. To do so, we will:

• Introduce Scala frameworks providing automatic parallelization facilities;

• Present the models at the heart of the study and their properties;

• Discuss the results of our benchmark when faced with a representative selection of parallel platforms;

• Show the relevancy of using Scala to quickly build parallel prototypes prior to any technological choice.

II. Automatic Parallelization with Scala

A. Why is Scala a good candidate for parallelization of simulations?

Many attempts to generate parallel code from sequential constructs can be found in the literature. For the sole case of GPU programming with CUDA (Compute Unified Device Architecture), a programming paradigm developed by NVIDIA that specifically runs on this manufacturer’s hardware, we can cite great tools like HMPP [5], FCUDA [6] and Par4all [7]. Other studies [8], as well as our own experience, show that CUDA displays far better performance on NVIDIA boards than OpenCL, since it is precisely optimized by NVIDIA for their devices. However, automatically generated CUDA cannot benefit from the same tuning quality. That is why we consider OpenCL code generation instead of CUDA: the former having been designed as a cross-platform technology, it is consequently better suited for generic and automatic code production. First of all, let us briefly recall what Scala is.

Scala is a programming language mixing two paradigms: object-oriented and functional programming. Its main feature is that it runs on top of the Java Virtual Machine (JVM). In our case, it means that Scala developments can interoperate like clockwork with Java, thus allowing the wide range of simulations developed in Java to integrate Scala parts without being modified.

The second asset of Scala is its compiler. In fact, Scalac (the Scala compiler) offers the possibility to enhance its behavior through plugins.



ScalaCL, which we will study later in this section, uses a compiler plugin to transform Scala code to OpenCL at compile time. This mechanism offers great opportunities to generate code, and the OpenCL proposal studied in this work is just a concrete example of what can be achieved when extending the Scala compiler.

Finally, Scala presents a collection framework that intrinsically facilitates parallelization. As a matter of fact, Scala default collections are immutable: every time a function is applied to an immutable collection, the collection remains unchanged and the result is a modified copy returned by the function. On the other hand, mutable collections are also available when explicitly summoned, although the Scala specification does not ensure thread-safe access to these collections. Such an approach appears to be very interesting and efficient when trying to parallelize an application, since no lock mechanisms are involved anymore. Thus, concurrent accesses to the collection elements are no longer a problem, as long as they are read-only accesses, and they do not introduce any overhead. It is a classical functional programming pattern to issue a new collection as the result of applying a function to an initial immutable collection. Although this procedure might appear costly, work has been done to optimize the way it is handled, and efficient implementations do not copy the entire immutable collection [9].
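As a minimal illustration of this principle (plain standard-library code, not taken from the studied models), applying map to an immutable collection leaves the original untouched:

    // The original collection is never modified; `map` returns a new one.
    val original = Vector(1, 2, 3)
    val doubled  = original.map(_ * 2)   // Vector(2, 4, 6)
    // `original` is still Vector(1, 2, 3), so it can be read concurrently without locks.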

B. Scala Parallel Collections

The Scala 2.9 release introduced a new set of parallel collections mirroring the classical ones. They have been described in [10]. These parallel collections offer the same methods as their sequential equivalents, but the method execution will be automatically parallelized by a framework implementing a divide and conquer algorithm.

The point is that they integrate seamlessly into already existing source code, because the parallel operations have the same names as their sequential variants. As the parallel operations are implemented in separate classes, they can be invoked if their data are in a parallel collection class. This is made possible thanks to a par method that returns a parallel equivalent of the sequential implementation of the collection, still pointing to the same data. Any subsequent operation invoked on an instance of the collection will benefit from a parallel execution without any other addition from the client. Instead of applying the selected function to each member of the collection sequentially, it is applied to each element in parallel. Such a function is referred to as a closure in functional programming jargon. It commonly designates an anonymous function embedded in the body of another function. A closure can also access the variables of the calling host function.
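For instance, the following sketch (standard-library code only, not part of the studied models) shows how the same closure runs sequentially or in parallel depending solely on the par conversion:

    val values = (1 to 1000000).toVector          // immutable sequential collection
    val seqResult = values.map(x => x * x)        // sequential execution of the closure
    val parResult = values.par.map(x => x * x)    // same call, spread over worker threads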

Scala Parallel Collections rely on the Fork/Join framework proposed by Doug Lea [11]. This framework was released with the latest Java 7 SDK. It is based upon the divide and conquer paradigm. Fork/Join introduces the notion of tasks, a task being basically a unit of work to be performed. Tasks are then assigned to worker threads waiting in a sleeping state in a thread pool. The pool of worker threads is created once and only once, when the framework is initialized. Thus, creating a new task does not suffer from the thread creation and initialization that could, depending on the task content, be slower than processing the task itself.

Fork/Join is implemented using work stealing. This adaptive scheduling technique offers efficient load balancing features to the Java framework, provided that work is split into tasks of a small enough granularity. Tasks are assigned to workers’ queues, but when a worker is idle, it can steal tasks from another worker’s queue, which helps achieve the whole computation faster. Scala Parallel Collections implement an exponential task splitting technique, detailed in [12], to determine the ideal task granularity.
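To make the divide and conquer mechanism more concrete, here is a minimal sketch of the pattern using the Java 7 Fork/Join classes directly from Scala. The SumTask class and the threshold value are illustrative assumptions, not part of Scala Parallel Collections:

    import java.util.concurrent.{ForkJoinPool, RecursiveTask}

    // Sums a slice of an array by divide and conquer, in the spirit of Fork/Join.
    class SumTask(xs: Array[Int], lo: Int, hi: Int, threshold: Int) extends RecursiveTask[Long] {
      override def compute(): Long =
        if (hi - lo <= threshold) {
          // Small enough: process the chunk sequentially
          var s = 0L
          var i = lo
          while (i < hi) { s += xs(i); i += 1 }
          s
        } else {
          // Otherwise split in two; an idle worker may steal the forked half
          val mid   = (lo + hi) / 2
          val left  = new SumTask(xs, lo, mid, threshold)
          val right = new SumTask(xs, mid, hi, threshold)
          left.fork()                      // schedule the left half asynchronously
          right.compute() + left.join()    // compute the right half here, then wait for the left
        }
    }

    val pool  = new ForkJoinPool()         // pool of worker threads, created once
    val data  = Array.tabulate(1 << 20)(_ % 10)
    val total = pool.invoke(new SumTask(data, 0, data.length, 1 << 14))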

C. ScalaCL

ScalaCL is a project, part of the free and open-source Nativelibs4java initiative, led by Olivier Chafik. The Nativelibs4java project is an ambitious bundle of libraries trying to allow users to take advantage of various native binaries in a Java environment.

From ScalaCL itself, two projects have recently emerged. The first one is named Scalaxy. It is a plugin for the Scala compiler that intends to optimize Scala code at compile time. Indeed, Scala functional constructs might run slower than their classical Java equivalents. Scalaxy deals with this problem by pre-processing Scala code to replace some constructs with more efficient ones. Basically, this plugin intends to transform Scala’s loop-like calls, such as map or foreach, into their while-loop equivalents. The main advantage of this tool is that it is applicable to any Scala code, without relying on any hardware.

The second element resulting from this fork is the ScalaCL collections. It consists of a set of collections that support a restricted set of Scala functions. However, these functions can be mapped at compile time to their OpenCL equivalents. Up to version 0.2, a compiler plugin dedicated to OpenCL generation was called at compile time to normalize the code of the closure applied to the collection. Version 0.3, the current one, has led to a whole rewriting of the library. ScalaCL now leverages the Scala macro system, introduced in the 2.10 release of the Scala language. Thanks to macros, ScalaCL modifies the program’s syntax tree during compilation to provide transparent parallelization on manycore architectures from high-level functional constructs. In both cases, the resulting source is then converted to an OpenCL kernel. At runtime, another part of ScalaCL comes into play, since the rest of the OpenCL code, like the initializations, is coupled to the previously generated kernel to form the whole parallel application.



The body of the closure will be computed by an OpenCL Processing Element (PE), which can be a thread or a core depending on the host where the program is being run.

The two parallelization approaches introduced in this section are compared in Figure 1.

Figure 1. Schema showing the similarities between the two approaches: Scala Parallel Collections spreads the workload among CPU threads (T) through Fork/Join (Collection.par), while ScalaCL spreads it among OpenCL Processing Elements (PEs) through OpenCL (Collection.cl)

III. Case study: three different simulation models

In this section, we will show how close the parallel Scala implementations of three different simulation models are to the sequential Scala implementation. We compare sequential Scala code with its parallel declinations using Scala, and highlight the automatic aspect of the two studied approaches, whose respective source codes remain very close to the original. Our benchmark consists in running several iterations of the automatically parallelized models, and in comparing them with handcrafted parallel implementations.

The three models used in the benchmark were carefully chosen so that they remain simple to describe, while being representative of several main modeling methods. They are: the Ising model [13], a forest Gap Model [14] and Schelling’s segregation model [15], and they will be described more thoroughly in this section. Three categories are considered to classify the models:

• Is the model stepwise or agent-based?
• Do the model outputs depend on stochasticity?
• Is the model performing more computations or data accesses?

Let us sum up the characteristics of our three models according to the three aforementioned categories in Table I.

A. The case of Discrete Event Simulations (DES)

Discrete-event based models do not cope well with data-parallelism approaches. In fact, they are more likely to fit a task-parallel framework. For instance, Odersky et al. introduce a discrete-event model of a circuit in [16]. The purpose of this model is to design and implement a simulator for digital circuits. Such a model is the perfect example of the difficulty of taking advantage of parallel collection implementations when the problem is not suited to them. Indeed, this parallel discrete-event model implies communications to broadcast the events. Furthermore, events lead the order in which tasks are executed, whereas data-parallel techniques rely on independent computations. Task parallelism approaches are not considered in this work, but Scala also provides ways to easily parallelize such problems through the Scala Actors framework [17]. The interested reader can find further details on the task-parallel implementation of the previously mentioned digital circuit model using Actors in [16].

B. Models description

1) Ising models: The Ising model is a mathematical model representing ferromagnetism in statistical physics [13]. Basically, Ising models deal with a lattice of points, each point holding a spin value, which is a quantum property of physical particles. At each step of the stochastic simulation process, and for every spin of the lattice, the algorithm determines whether it has to be flipped or not. The decision to flip or not the spin of a point in the lattice is taken according to two criteria. The first is the value of the spins belonging to the Von Neumann neighborhood of the current point, and the second is a random process called the Metropolis criterion.

Ising models have been studied in many dimensions, but for the purpose of this study, we consider a 2D toric lattice, solved using the Metropolis algorithm [18].
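As an illustration of the Metropolis criterion mentioned above, a minimal sketch of the acceptance test for a single spin flip could look as follows (deltaE, beta and the random generator are assumed parameters, not code from the studied implementation):

    import scala.util.Random

    // Accept the flip if it lowers the energy; otherwise accept it with
    // probability exp(-beta * deltaE) (Metropolis criterion).
    def acceptFlip(deltaE: Double, beta: Double)(implicit rng: Random): Boolean =
      deltaE <= 0 || rng.nextDouble() < math.exp(-beta * deltaE)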

2) Forest Gap Model: The second model to which we apply automatic parallelization techniques is a Forest Gap Model described in [14]. It depicts the dynamics of trees spawning and falling in a French Guiana rainforest area. The purpose of this model is to serve as a basis for a future agent-based model rendering the settling of ant nests in the area, depending on the locations of trees and gaps. At each simulation step, a random number of trees will fall, carrying a random number of their neighbors in their fall. At the same time, new trees spawn in the gap areas to repopulate them.

Table I. Summary of the studied models' characteristics

Model       Stepwise/Agent-based   Stochastic   Computational/Data
Ising       Stepwise               Yes          Computational
Gap Model   Stepwise               No           Data
Schelling   Agent-based            Yes          Data

Here, we focus on the bottleneck of the model: a method called several times at each simulation step that represents about 70% of the whole execution time. Although the whole model is stochastic, the part we consider in this work is purely computational. The idea is to figure out the boundaries of the gaps in the forest. The map is a matrix of booleans wherein cells carrying a true value represent parts of gaps, while any other cell carries a false value. According to the values of its neighbors, a cell can determine whether it is part of a gap boundary or not.
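A minimal sketch of this boundary test might look as follows; the exact rule used in the model is not detailed in the paper, so the neighborhood logic below is an illustrative assumption:

    // A cell is assumed to lie on a gap boundary when it belongs to a gap (true)
    // and at least one of its Von Neumann neighbors does not.
    def isBoundary(gap: Vector[Vector[Boolean]], x: Int, y: Int): Boolean = {
      val (h, w) = (gap.length, gap.head.length)
      val neighbors = Seq((x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1))
      gap(x)(y) && neighbors.exists { case (i, j) =>
        i >= 0 && i < h && j >= 0 && j < w && !gap(i)(j)
      }
    }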

3) Schelling’s Segregation Model: Schelling’s dynamic model of segregation [15] is a specialized individual-based model that simulates the dynamics of two populations of distinct colors. Each individual wants to live in an area where at least a certain ratio of individuals of the same color as his own are living. Individuals that are unhappy with their neighborhood move at random to another part of the map. This model converges toward a space segregated by colors, with huge clusters of individuals of the same color.

At the heart of the segregation model are all the decisions taken by the individuals to move from their place to another. Since this is the most computation-intensive part of the segregation model, we will concentrate our parallelization efforts on this part of the algorithm. Furthermore, it presents a naturally parallel aspect, since individuals decide whether they feel the need to move independently from each other.
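For illustration only, the per-individual decision described above can be sketched as follows (the tolerance parameter and the neighborhood encoding are assumptions, not the paper's code):

    // An individual wants to move when the ratio of same-color neighbors
    // falls below the required tolerance threshold.
    def wantsToMove(myColor: Int, neighborColors: Seq[Int], tolerance: Double): Boolean =
      neighborColors.nonEmpty &&
        neighborColors.count(_ == myColor).toDouble / neighborColors.size < tolerance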

C. Scala implementations

In this section we will focus on the implementation details of the Ising model. The other implementations would not bring more precision concerning the use of the two Scala parallelization frameworks studied in this work.

Our Scala implementation of the Ising model represents the spin lattice by an IndexedSeq. IndexedSeq is a trait, i.e. an enhanced interface in the Java sense that also allows method definitions; it abstracts all sorts of indexed sequences. Indexed sequences are collections that have a defined order of elements and provide efficient random access to the elements. The trait bears operations on the collection such as applying a given function to every element (map) or applying a binary operator to the collection, going through elements from left or right (foldLeft, foldRight). These operations are sufficient to express most of the algorithms. Moreover, they can be combined, since they build and return a new collection containing the new values of the elements.

For instance, computing the magnetization of the whole lattice consists in applying a foldLeft on the IndexedSeq to sum the values corresponding to the spins of the elements. Indeed, our lattice is a set of tuples contained in the IndexedSeq. Each tuple stores its coordinates in the 2D lattice, in order to easily build its Von Neumann neighborhood, and a boolean indicating the spin of the element (false is a negative spin, whereas true stands for a positive spin).
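A minimal sketch of this computation, assuming the element encoding just described (the Element alias and magnetization helper are illustrative, not the paper's code), could be:

    // Each lattice element is ((x, y), spin), with the spin encoded as a Boolean.
    type Element = ((Int, Int), Boolean)

    // Magnetization: sum of the +1/-1 values associated with the spins.
    def magnetization(lattice: IndexedSeq[Element]): Int =
      lattice.foldLeft(0) { case (acc, (_, spin)) => acc + (if (spin) 1 else -1) }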

To harness parallel architectures, we need to slightly rewrite this algorithm. The initial version consists in trying to flip the spin of a single point of the lattice at each step. This action is considered as a transformation between two configurations of the lattice. However, each point can be treated in parallel, provided that its neighbors are not updated at the same time. Therefore, the points cannot be chosen at random anymore, since this would necessarily lead to a biased configuration. A way to avoid this problem is to separate the lattice into two halves that will be processed sequentially. This technique is commonly referred to as the “checkerboard algorithm” [19]. In fact, the Von Neumann neighborhood of a point located on what will be considered a white square is only formed by points located on black squares, and vice versa. The whole lattice is then processed in two passes to obtain a result equivalent to the sequential process. The process is summed up in Figure 2.

Figure 2. Lattice updated in two passes following a checkerboard approach: the first half is updated, then the second half, and the results are merged into the updated lattice



The implementation lies in applying an operation that updates each spin through two successive map invocations on the two halves of the lattice. Not only is this approach crystal clear for the reader, but it is also quite easy to parallelize. Indeed, map can directly interact with the two Scala automatic parallelization frameworks presented earlier: Scala Parallel Collections and ScalaCL. Let us describe how the parallelization APIs integrate smoothly into already written Scala code with a concrete example.

Listing 1 is a snippet of our Ising model implementation in charge of updating the whole lattice in a sequential fashion:

def processLattice(_lattice: Lattice)(implicit rng: Random) =
  new Lattice {
    val size = _lattice.size
    val lattice =
      IndexedSeq.concat(
        _lattice.filter { case ((x, y), _) => isEven(x, y) }
                .map(spin => processSpin(_lattice, spin)),
        _lattice.filter { case ((x, y), _) => isOdd(x, y) }
                .map(spin => processSpin(_lattice, spin))
      )
  }

Listing 1. Sequential version of method processLattice from class IsingModel

Scala enables us to write both concise and expressive code, while keeping the most important parts of the algorithm exposed. Here, the two calls to map aiming at updating each half of the lattice can be noticed; they will process in turn all the elements within the subset they have received as input. This snippet suggests an obvious parallelization of this process thanks to the par method invocation. This call automatically provides the parallel equivalent of the collection, where all the elements will be treated in parallel by the mapped closure that processes the energy of the spins. The resulting code differs only by the extra call to the par method upstream of the map action, as Listing 2 shows:

def processLattice(_lattice: Lattice)(implicit rng: Random) =
  new Lattice {
    val size = _lattice.size
    val lattice =
      IndexedSeq.concat(
        _lattice.filter { case ((x, y), _) => isEven(x, y) }
                .par.map(spin => processSpin(_lattice, spin)),
        _lattice.filter { case ((x, y), _) => isOdd(x, y) }
                .par.map(spin => processSpin(_lattice, spin))
      )
  }

Listing 2. Parallel version of method processLattice from class IsingModel, using Scala Parallel Collections

An equivalent parallelization using ScalaCL is obtained just by replacing the par method with the cl one from the ScalaCL framework, thus enabling the code to run on GPUs.
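For instance, the first map of Listing 2 would be rewritten as sketched below. The scalacl import is an assumption about the library's packaging; only the cl conversion itself is described in this paper, and ScalaCL restricts which closures it can translate:

    import scalacl._   // assumed ScalaCL import

    _lattice.filter { case ((x, y), _) => isEven(x, y) }
            .cl                                         // OpenCL-backed collection
            .map(spin => processSpin(_lattice, spin))   // closure translated to an OpenCL kernel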

As automatic parallelization applies to determined parts of the initial code, the percentage of the sequential execution time affected by the parallel declinations can be computed with a profiler. Thanks to this tool, we were able to determine the theoretical benefits of attempting to parallelize part of a given model. These figures are presented in Table II, along with the intrinsic parameters at the heart of each model implementation. For more details about the purpose of these parameters, the interested reader can refer to the respective papers introducing each model thoroughly.

IV. Results

In this section, we will study how the different model implementations behaved when run on a set of different architectures. All the platforms that were used in the benchmark are listed hereafter:

• 2-CPU 8-core Intel Xeon E5630 (Westmere architecture) running at 2.53 GHz with ECC memory enabled
• 8-CPU 8-core Intel Xeon E7-8837 running at 2.67 GHz
• 4-core 2-thread Intel i7-2600K running at 3.40 GHz
• NVIDIA GeForce GTX 580 (Fermi architecture)
• NVIDIA Tesla C2075 (Fermi architecture)
• NVIDIA Tesla K20 (Kepler architecture)

We compare a sequential Scala implementation executed on a single core of the Intel Xeon E5630 with its Scala Parallel Collections and ScalaCL declinations executed on all the platforms. Each model had to process the same input configurations for this benchmark. Measures resulting from these runs are displayed in Table III.

Unfortunately, at the time of writing, ScalaCL is not able to translate an application as complex as Schelling’s segregation model to OpenCL yet. In the meantime, while ScalaCL managed to produce OpenCL versions of the Gap Model and the Ising model, these two declinations were killed by the system before being able to complete their execution.

As a consequence, we are not able to provide any result for a ScalaCL implementation of these models. This leads to the first result of this work: at the time of writing, ScalaCL cannot be used to build parallel prototypes of simulations on GPU.



Table II. Characteristics of the three studied models

Model       Part of the Sequential Execution Time    Intrinsic Parameters
            Subject to Parallelization
Gap Model   70%                                      Map of 584x492 cells
Ising       38%                                      Map of 2048x2048 cells; Threshold = 0.5
Schelling   50%                                      Map of 500x500 cells; Part of free cells at initialization = 2%;
                                                     Equally-sized B/W population; Neighbourhood = 2 cells

Table III. Execution times in seconds of the three models on several parallel platforms (ScalaPC stands for Scala Parallel Collections)

Model       Sequential     ScalaPC   ScalaPC          ScalaPC        ScalaCL            ScalaCL            ScalaCL
            (Xeon E5630)   (i7)      (Xeon E7-8837)   (Xeon E5630)   (GTX)              (C2075)            (Kepler)
Gap Model   150.04         75.67     128              109.42         killed by system   killed by system   killed by system
Ising       45.35          11.63     33               26.83          killed by system   killed by system   killed by system
Schelling   2961.52        525.63    344              412.86         X                  X                  X

It is just a proof of concept that Scala closures can be transformed to OpenCL code, but it is not reliable enough to achieve such transformations on real simulation codes yet.

On the other hand, the Scala Parallel Collections API succeeded in building a parallel declination of the three benchmarked models. These declinations have successfully run on the CPU architectures retained in the benchmark, and have shown a potential performance gain when parallelizing the three studied models.

Let us validate this trend by now comparing handcrafted implementations on CPU and GPU. For the sake of brevity, we will focus on the Forest Gap Model, whose Scala Parallel Collections implementation runs twice as fast as the sequential implementation on the Intel i7, according to Table III. The CPU declination uses a Java thread pool, whereas the GPU is leveraged using OpenCL. Figure 3 shows the speed-ups obtained with the different parallelizations of the Forest Gap Model.

Results from Figure 3 show that a handcrafted parallelization on CPU follows the trend of the parallel prototype in terms of performance gain. In view of the lower speed-up obtained on GPU using an OpenCL implementation, it is likely that an automatically built GPU-enabled prototype would have displayed worse performance than its CPU counterpart.

Not only does this last result show that the automatic prototyping approach gives significant results when using Scala Parallel Collections, but it also validates the relevancy of the whole approach when considering the characteristics of the involved simulation model. In Table I, the Forest Gap Model had been stated as a model performing more data accesses than computations.

Figure 3. Speed-up obtained for the Gap Model depending on the underlying technique of parallelization

Thus, it is logical that a parallelization attempt on GPU suffers from the latency related to memory accesses on this kind of architecture. Indeed, GPUs can only be fully exploited by applications with a high computational workload. Here, the OpenCL declination performs slightly faster than the sequential one (1.11X), but it is the perfect case of an unsuited architecture choice leading to disappointing performance, as exposed in the Introduction.

Automatically built parallel prototypes quickly highlight this weakness, and avoid wasting time developing a GPU implementation that would be less efficient than the parallel CPU version.

In the same way, we can see from Table III that the number of available cores is not the only thing to consider when selecting a target platform.



Depending on the characteristics of the model, which are summed up in Table I, parallel prototypes behave differently when faced with manycore architectures they cannot fully exploit due to frequent memory accesses. The perfect example of this trend is the results obtained when running the models on the Xeon E5630, the ECC-enabled host. As fast as it can be, this CPU-based host is significantly slowed down by its ECC memory. Indeed, ECC is known to impact execution times by about 20%.

V. Conclusion

In this work, we have benchmarked two Scala proposals aiming at automatically parallelizing applications, and their ability to help simulation practitioners figure out the best target platform to parallelize their simulation on. The two frameworks have been detailed and compared: Scala Parallel Collections, which automatically creates tasks and executes them in parallel thanks to a pool of worker threads; and ScalaCL, part of the Nativelibs4java project, which intends to produce OpenCL code from Scala closures and to run it on the most efficient device available, be it GPU or multicore CPU.

Our study has shown that ScalaCL is still in its infancy and can only translate a limited set of Scala constructs to OpenCL at the time of writing. Although it is not able to produce a parallel prototype for a simulation yet, it deserves to be regarded as a potentially great future language extension if it manages to improve its reliability. For its part, Scala Parallel Collections is a really satisfying framework that mixes ease of use and efficiency.

Considering data-parallel simulations, this work shows how simulation practitioners can easily determine the best parallel platform on which to concentrate their development efforts. In particular, this study puts forward the inefficiency of some architectures, in our case GPUs, when faced with problems they were not initially designed to process. In the literature, some architectures have often been favored because of their cutting-edge characteristics. However, they might not be the best solution to retain, and other solutions might be underrated [3]. Thus, being able to quickly benchmark several architectures without further code rewriting is a great decision support tool. Such an approach is very important when speed-up matters, to make sure that the underlying architecture is the most suited to speed up the problem at hand.

In the future, we plan to propose metrics to set out the benefits of automatically built prototypes using Scala compared to handcrafted ones, considering the human resources involved and the development time. In this way, we will evaluate the possibility of taking part in the development of ScalaCL, to close the gap between this tool and Scala Parallel Collections.

References

[1] J. Caux, D. Hill, P. Siregar et al., “Accelerating 3D cellular automata computation with GP-GPU in the context of integrative biology,” Cellular Automata - Innovative Modelling for Science and Engineering, pp. 411–426, 2011.
[2] P. Topa and P. Młocek, “GPGPU implementation of cellular automata model of water flow,” Parallel Processing and Applied Mathematics, pp. 630–639, 2012.
[3] V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund et al., “Debunking the 100x GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU,” in ACM SIGARCH Computer Architecture News, vol. 38, no. 3. ACM, 2010, pp. 451–460.
[4] Khronos OpenCL Working Group, “The OpenCL specification 1.2,” Khronos Group, Specification 1.2, November 2011.
[5] R. Dolbeau, S. Bihan, and F. Bodin, “HMPP: A hybrid multicore parallel programming environment,” in Workshop on General Purpose Processing on Graphics Processing Units (GPGPU 2007), 2007.
[6] A. Papakonstantinou, K. Gururaj, J. Stratton, D. Chen, J. Cong, and W. Hwu, “FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs,” in IEEE 7th Symposium on Application Specific Processors (SASP’09). IEEE, 2009, pp. 35–42.
[7] M. Amini, F. Coelho, F. Irigoin, and R. Keryell, “Static compilation analysis for host-accelerator communication optimization,” in The 24th International Workshop on Languages and Compilers for Parallel Computing, Fort Collins, Colorado, September 2011.
[8] K. Karimi, N. Dickson, and F. Hamze, “A performance comparison of CUDA and OpenCL,” August 2010, submitted. [Online]. Available: http://arxiv.org/abs/1005.2581v3
[9] C. Okasaki, Purely Functional Data Structures. Cambridge University Press, 1999.
[10] A. Prokopec, P. Bagwell, T. Rompf, and M. Odersky, “A generic parallel collection framework,” in Euro-Par 2011 Parallel Processing. Springer, 2011, pp. 136–147.
[11] D. Lea, “A Java fork/join framework,” in Proceedings of the ACM 2000 Conference on Java Grande. ACM, 2000, pp. 36–43.
[12] G. Cong, S. Kodali, S. Krishnamoorthy, D. Lea, V. Saraswat, and T. Wen, “Solving large, irregular graph problems using adaptive work-stealing,” in Parallel Processing, 2008 (ICPP’08), 37th International Conference on. IEEE, 2008, pp. 536–545.
[13] E. Ising, “Beitrag zur Theorie des Ferromagnetismus,” Zeitschrift für Physik A Hadrons and Nuclei, vol. 31, pp. 253–258, 1925, doi: 10.1007/BF02980577. [Online]. Available: http://dx.doi.org/10.1007/BF02980577
[14] J. Passerat-Palmbach, A. Forest, J. Pal, B. Corbara, and D. Hill, “Automatic parallelization of a gap model using Java and OpenCL,” in Proceedings of the European Simulation and Modeling Conference (ESM), 2012, to be published.
[15] T. C. Schelling, “Dynamic models of segregation,” The Journal of Mathematical Sociology, vol. 1, no. 2, pp. 143–186, 1971. [Online]. Available: http://www.tandfonline.com/doi/abs/10.1080/0022250X.1971.9989794
[16] M. Odersky, L. Spoon, and B. Venners, Programming in Scala. Artima Incorporated, 2008.
[17] P. Haller and M. Odersky, “Scala Actors: Unifying thread-based and event-based programming,” Theoretical Computer Science, vol. 410, no. 2, pp. 202–220, 2009.
[18] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, E. Teller et al., “Equation of state calculations by fast computing machines,” The Journal of Chemical Physics, vol. 21, no. 6, p. 1087, 1953.
[19] T. Preis, P. Virnau, W. Paul, and J. Schneider, “GPU accelerated Monte Carlo simulation of the 2D and 3D Ising model,” Journal of Computational Physics, vol. 228, no. 12, pp. 4468–4477, 2009.
