
A DSEL for Addressing the Problems Posed by Parallel Architectures

Jason McGuiness, Colin Egan

CTCA, School of Computer Science, University of Hertfordshire, Hatfield, Hertfordshire, UK
[email protected]

1. INTRODUCTION

Computers with multiple pipelines have become increasingly prevalent, hence a rise in the parallelism available to the programming community. Examples range from dual-core desktop workstations, through multi-core, multi-processor blade frames that may contain hundreds of pipelines in data centres, to state-of-the-art mainframes in the Top500 supercomputer list with thousands of cores, and the potential arrival of next-generation cellular architectures that may have millions of cores. This surfeit of hardware parallelism has apparently yet to be tamed in the software-architecture arena. Various attempts to meet this challenge have been made over the decades, taking such approaches as languages, compilers or libraries to enable programmers to enhance the parallelism within their various problem domains. Yet the common folklore in computer science is still that it is hard to program parallel algorithms correctly.

This paper examines what language features would need to be added to an existing imperative language that has little if any native support for implementing parallelism, apart from a simple library that exposes the OS-level threading primitives. The goal of the authors has been to create a minimal and orthogonal DSEL that would add the capabilities of parallelism to that target language. Moreover, the DSEL proposed will be demonstrated to have such useful guarantees as a correct, heuristically efficient schedule. In terms of correctness, the DSEL guarantees that it can provide deadlock-free and race-condition-free schedules. In terms of efficiency, the schedule produced will be shown to add no worse than a poly-logarithmic order to the algorithmic run-time of the schedule of the program on a CREW-PRAM (Concurrent-Read, Exclusive-Write, Parallel Random-Access Machine [19]) or EREW-PRAM (Exclusive-Read, Exclusive-Write PRAM [19]) computation model. Furthermore, the DSEL described assists the user with regard to debugging the resultant parallel program. An implementation of the DSEL in C++ exists: further details may be found in [12].

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Copyright 20XX ACM X-XXXXX-XX-X/XX/XX ...$15.00.

2. RELATED WORK

From a hardware perspective, the evolution of computer architectures has been heavily influenced by the von Neumann model. Given the relative increase in processor speed versus memory speed, the introduction of memory hierarchies [3] and out-of-order instruction scheduling has been highly successful. However, these extra levels increase the penalty associated with a miss in the memory subsystem, due to memory-access times, limiting the ILP (Instruction-Level Parallelism). There may also be an increase in the design complexity and power consumption of the overall system. An approach to avoiding this problem may be to fetch sets of instructions from different memory banks, i.e. introduce threads, which would allow an increase in ILP in proportion to the number of executing threads.

From a software perspective, the challenge presented to programmers by these parallel architectures has been the massive parallelism they expose. There has been much work done in the field of parallelizing software:

• Auto-parallelizing compilers: such as EARTH-C [17]. Much of the work developing auto-parallelizing compilers has derived from the data-flow community [16].

• Language support: such as Erlang [20], UPC [5], or Intel's [18] and Microsoft's C++ compilers based upon OpenMP.

• Library support: such as POSIX threads (pthreads) or Win32, MPI, OpenMP, Boost, Intel's TBB [14], Cilk [10] or various libraries targeting C++ [6, 2]. Intel's TBB provides higher-level threading constructs, but it has not supplied parallel algorithms, nor has it provided any guarantees regarding its library. It also suffers from mixing the code that generates the parallel schedule with the business logic, which makes testing more complex.

These have all had varying levels of success, as discussed in part in [11], with regard to addressing the issues of programming effectively for such parallel architectures.

3. MOTIVATION

The basic issues addressed by all of these approaches have been correctness or optimization. So far it has appeared that the compiler- and language-based approaches have been the only ones able to address both of those issues together. But the language-based approaches require that programmers re-implement their programs in a potentially novel language, a change that has been very hard for business to adopt, severely limiting the use of these approaches.

Amongst the criticisms raised [11, 13] regarding the use of libraries such as pthreads, Win32 or OpenMP have been:

• They have been too low-level, so using them to write correct multi-threaded programs has been very hard; they suffer from composition problems. This problem may be summarized as follows: atomic access to an object would be contained within each object (using classic OOD), thus when composing multiple objects, multiple separate locks, from the different objects, have to be manipulated to guarantee correct access. Even when this is done correctly, the usual outcome has been a serious reduction in scalability.

• A related issue has been that the programmer often intimately entangles the thread-safety, thread scheduling and business logic of their code. This means that each program is effectively bespoke, requiring re-testing of each program for threading issues as well as business-logic issues.

• Also, debugging such code has been found to be very hard. Debuggers for multi-threaded code have been an open area of research for some time.

Given that the language has to be immutable, a DSEL defined by a library that attempts to support the correctness and optimality of the language and compiler approaches, and yet somehow overcomes the limitations of the usual library-based approaches, would seem to be ideal. This DSEL will now be presented.

4. THE DSEL TO ASSIST PARALLELISM

We chose to address these issues by defining a carefully crafted DSEL, then examining its properties to demonstrate that the DSEL achieved the goals. The DSEL should have the following properties:

• The DSEL shall target what may be termed general-purpose threading, which the authors define to be scheduling in which the conditions or loop-bounds may not be computed at compile-time, nor could they be represented as monads, so could not be memoized¹. In particular, the DSEL shall support both data-flow and data-parallel constructs.

• By being implemented within an existing language, it avoids the necessity of re-implementing programs, so a more progressive approach to adoption could be taken.

• It shall be a reasonably small DSEL, but large enough to provide sufficient extensions to the host language to express parallel constructs in a manner that would be natural to a programmer using that language.

¹A compile- or run-time optimisation technique involving a space-time tradeoff. Re-computation of pure functions when provided with the same arguments may be avoided by caching the result; the result will be the same for each call with the same arguments, if the function has no side-effects.

• It shall assist in debugging any use of a conformingimplementation.

• It should provide guarantees regarding those banes of parallel programming: deadlocks and race-conditions.

• Moreover, it should provide guarantees regarding the algorithmic complexity of any parallel schedule it would generate.

Initially a description of the grammar will be given, followed by a discussion of some of the properties of the DSEL. Finally, some theoretical results derived from the grammar of the DSEL will be given.

4.1 Detailed Grammar of the DSEL

The various types, production rules and operations that define the DSEL will be given in this section. The basic types will be defined first, then the operations upon those types. C++ has been chosen as the target language in which to implement the DSEL, due to the rich ability within C++ to extend the type system at compile-time: primarily using templates, but also by overloading various operators. Hence the presentation of the grammar relies on the grammar of C++, so familiarity with that grammar would assist the reader, in particular Annex A of the ISO C++ Standard [8]. Although C++11 has some support for threading, it had not been widely implemented at the time of writing; moreover, the specification had not addressed the points of the DSEL in this paper.

Some clarifications:

• The subscript opt means that the keyword is optional.

• The subscript def means that the keyword is the default, and specifies the default value for the optional keyword.

4.1.1 Types

The primary types used within the DSEL are derived from the thread-pool type.

1. Thread pools can be composed with various subtypes that could be used to fundamentally affect the implementation and performance of any client software:

thread-pool-type:
    thread_pool work-policy size-policy pool-adaptor

• A thread pool would contain a collection of threads that may be more, fewer or the same as the number of processors on the target architecture. This allows implementations to virtualize the multiple cores available or make use of operating-system-provided thread implementations. An implementation may choose to enforce a synchronization of all threads within the pool when an instance of that pool is destroyed, to ensure that threads managed by the pool are appropriately destroyed and work in the process of mutation can be appropriately terminated.

work-policy: one of
    worker_threads_get_work one_thread_distributes

• The library should implement the classic work-stealing or master-slave work sharing algorithms.

Page 3: A DSEL for Addressing the Problems Posed by Parallel Architectures

Clearly the specific implementation of these could affect the internal queue containing unprocessed work within the thread_pool. For example, a worker_threads_get_work queue might be implemented such that the addition of work would be independent of the removal of work.

size-policy: one of
    fixed_size tracks_to_max infinite

• The size-policy, when used in combination with the threading-model, could be used to make considerable simplifications in the implementation of the thread-pool-type, which could make it faster on certain architectures.

• tracks_to_max would implement some model of the cost of re-creating and maintaining threads. If threads were cheap to create and destroy with little overhead, then an infinite size might be a reasonable approximation; conversely, threads with the opposite characteristics might be better maintained in a fixed_size pool.

pool-adaptor:
    joinability api-type threading-model priority-mode_opt comparator_opt GSS(k)-batch-size_opt

joinability: one of
    joinable nonjoinable

• The joinability has been provided to allow certain optimizations to be implementable. A thread-pool-type that is nonjoinable could have a number of simplifying details that would make it not only easier to implement but also faster in operation.

api-type: one of
    no_api MS_Win32 posix_pthreads IBM_cyclops

• Both MS_Win32 and posix_pthreads are examples of heavyweight_threading APIs, in which threading at the OS level would be used to implement the DSEL. IBM_cyclops would be an implementation of the DSEL using the lightweight_threading API implemented by IBM BlueGene/C Cyclops [1].

threading-model: one of
    sequential_mode heavyweight_threading lightweight_threading

• This specifier provides a coarse representation of the various implementations of threadable constructs in the multitude of architectures available. For example, pthreads would be considered to be heavyweight_threading, whereas Cyclops would be lightweight_threading. Separating the threading model from the API allows for the possibility that there may be multiple threading APIs on the same platform, which may have different properties; for example, if there were a GPU available in a multi-core computer, there could be two different threading models within the same program.

• The sequential_mode has been provided to allow implementations to remove all threading aspects of the implementing library, which would hugely reduce the burden on the programmer of identifying bugs within their code. If all threading is removed, then all remaining bugs should in principle reside in the user code, which, once determined to be bug-free, could then be trivially parallelized by modifying this single specifier and recompiling. Any further bugs introduced would then be due to bugs within the parallel aspects of their code, or within the library implementing this DSEL. If the user relies upon the library to provide threading, then there should be no further bugs in their code. We consider this feature of paramount importance, as it directly addresses the complex task of debugging parallel software, by separating the algorithm by which the parallelism should be implemented from the code implementing the mutations on the data.

priority-mode: one of
    normal_fifo_def prioritized_queue

• This is an optional parameter. The prioritized_queue would allow the user to specify whether specific instances of work to be mutated should be performed ahead of other instances of work, according to a user-specified comparator.

comparator:
    std::less_def

• A binary function-type that specifies a strict weak ordering on the elements within the prioritized_queue.

GSS(k)-batch-size:
    1_def

• A natural number specifying the batch size to be used within the queue specified by the priority-mode. The default is 1, i.e. no batching would be performed. An implementation would be likely to use this to enable GSS(k) scheduling [9].

2. Adapted collections to assist in providing thread-safety and also to specify the memory-access model of the collection:

safe-colln:
    safe_colln collection-type lock-type

• This adaptor wraps the collection-type and an instance of lock-type in one object, and provides a few thread-safe operations upon that collection, plus access to the underlying collection. This access might seem surprising, but it has been provided because locking the operations on collections has been shown not to be composable, and cross-cuts both object-orientated and functional-decomposition designs. This could be open to misuse, but otherwise excessive locking would have to be done in user code. This has not been an ideal design decision, but a simple one, with scope for future work. Note that this design choice within the DSEL does not invalidate the rest of the grammar, as it would just affect the overloads to the data-parallel-algorithms, described later.

• The adaptor also provides access to both read-lock and write-lock types, which may be the same, but allow the user to specify the intent of their operations more clearly.

lock-type: one of
    critical_section_lock_type read_write read_decaying_write

(a) A critical_section_lock_type would be a single-reader, single-writer lock, a simulation of EREW semantics. The implementation of this type of lock could be more efficient on certain architectures.

(b) A read_write lock is a multi-reader, single-writer lock, a simulation of CREW semantics.

(c) A read_decaying_write lock would be a specialization of a read_write lock that also implements atomic transformation of a write-lock into a read-lock.

(d) The lock should be used to govern the operations on the collection, and not operations on the items contained within the collection.

• The lock-type parameter may be used to specify whether EREW or CREW operations upon the collection are allowed. For example, if only EREW operations are allowed, then overlapped dereferences of the execution_context resultant from parallel-algorithms operating upon the same instance of a safe-colln should be strictly ordered by an implementation, to ensure EREW semantics are maintained. Alternatively, if CREW semantics were specified, then an implementation may allow read operations upon the same instance of the safe-colln to occur in parallel, assuming they were not blocked by a write operation.

collection-type:
    A standard collection such as an STL-style list or vector, etc.

3. The thread-pool-type defines further sub-types for convenience to the programmer:

create_direct:
    This adaptor, parametrized by the type of work to be mutated, contains certain sub-types. The input data and the mutation operation combined are termed the work to be mutated, which would be a type of closure. If the mutation operation does not change the state of any data external to the closure, then this would be a type of monad. More specifically, this work to be mutated should also be a type of functor that either:

(a) provides a type result_type to access the result of the mutation, and specifies the mutation member-function, or

(b) implements the function process(result_type &), in which case the library may determine the actual type of result_type.

The sub-types are:

joinable:
    A method of transferring work to be mutated into an instance of thread-pool-type. If the work to be mutated were transferred using this modifier, then the return result of the transfer would be an execution_context, which may subsequently be used to obtain the result of the mutation. Note that this implies that the DSEL implements a form of data-flow operation.

execution_context:
    This is the type of future that a transfer returns. It is also a type of proxy to the result_type that the mutation returns. Access via this proxy implicitly causes the calling thread to wait until the mutation has been completed. This is the other component of the DSEL that implements the data-flow model. Various sub-types of execution_context exist, specific to the result_types of the various operations that the DSEL supports. Note that the implementation of execution_context should specifically prohibit aliasing, copying and assigning instances of these types.

nonjoinable:
    Another method of transferring work to be mutated into an instance of thread-pool-type. If the work to be mutated were transferred using this modifier, then the transfer would return nothing. The mutation within the pool would occur at some indeterminate time, the result of which would, for example, be detectable by any side effects of the mutation within the result_type of the work to be mutated.

time_critical:
    This modifier ensures that when the work is mutated by a thread within the instance of thread-pool-type into which it has been transferred, it will be executed at an implementation-defined higher kernel priority. Other similar modifiers exist in the DSEL for other kernel priorities. This example demonstrates that specifying other modifiers, as extensions to the DSEL, would be possible.

cliques(natural_number n):
    This modifier is used with data-parallel-algorithms. It causes the instance of thread-pool-type to allow the data-parallel-algorithm to operate with ⌈p/n⌉ threads, where p is the number of threads in the instance.

4. The DSEL specifies a number of other utility types, such as shared_pointer, various exception types and exception-management adaptors, amongst others. The details of these important, but ancillary, types have been omitted for brevity.

Page 5: A DSEL for Addressing the Problems Posed by Parallel Architectures

4.1.2 Operators on the thread-pool-type

The various operations that are defined in the DSEL will now be given. These operations tie together the types and express the restrictions upon the generation of the control-flow graph that the DSEL may create.

1. The transfer of work to be mutated into an instance of thread-pool-type is defined as follows:

transfer-future:
    execution-context-result_opt thread-pool-type transfer-operation

execution-context-result:
    execution_context <<

• The token sequence "<<" is the transfer operation, and is also used in the definition of the transfer-modifier-operation, amongst other places.

• Note how an execution_context can only be created via a transfer of work to be mutated into a suitably defined thread_pool. It is an error to transfer work into a thread_pool that has been defined using the nonjoinable subtype. There is no way to create an execution_context without transferring work to be mutated, so every execution_context is guaranteed to eventually contain the result of a mutation.

transfer-operation:
    transfer-modifier-operation_opt transfer-data-operation

transfer-modifier-operation:
    << transfer-modifier

transfer-modifier: one of
    time_critical joinable nonjoinable cliques

transfer-data-operation:
    << transfer-data

transfer-data: one of
    work-to-be-mutated parallel-binary-operation data-parallel-algorithm

The details of the various parallel-binary-operations and data-parallel-algorithms will be given in the next section.

4.1.3 The Data-Parallel Operations and Algorithms

This section will describe the various parallel algorithms defined within the DSEL.

1. The parallel-binary-operations are defined as follows:

parallel-binary-operation: one of
    binary_fun parallel-logical-operation

parallel-logical-operation: one of
    logical_and logical_or

• It is likely that an implementation would not implement the usual short-circuiting of the operands, to allow them to be transferred into the thread pool and executed in parallel.

2. The data-parallel-algorithms are defined as follows:

data-parallel-algorithm: one of
    accumulate copy count count_if fill fill_n
    find find_if for_each min_element max_element
    reverse transform

• The style and arguments of the data-parallel-algorithms are similar to those of the STL in the C++ ISO Standard. Specifically, they all take a safe-colln as the argument to specify the ranges, and functors as necessary, as specified within the STL. Note that these algorithms all use run-time computed bounds; otherwise it would be more optimal to use techniques similar to those used in HPF or described in [9] to parallelize such operations. Whether the DSEL supports loop-carried dependencies in the functor argument is undefined.

• If the algorithms were to be implemented using techniques described in [7] and [4], then they would be optimal, with O(log(p)) complexity in distributing the work to the thread pool. Given that there are no loop-carried dependencies, each thread may operate independently upon a sub-range within the safe-colln, for an optimal algorithmic complexity of O(n/p − 1 + log(p)), where n is the number of items to be computed and p is the number of threads, ignoring the operation time of the mutations.

3. The binary_funs are defined as follows:

binary_fun:
    work-to-be-mutated work-to-be-mutated binary-functor

• A binary functor is just a functor that takes two arguments. The order of evaluation of the arguments is undefined. Whether the DSEL supports dependencies between the arguments is undefined. This implies that the arguments should refrain from modifying any external state.

4. Similarly, the logical operations are defined as follows:

logical-operation:
    work-to-be-mutated work-to-be-mutated binary-functor

• Note that no short-circuiting of the computation of the arguments occurs. The result of mutating the arguments must be boolean. Whether the DSEL supports dependencies between the arguments is undefined. This implies that the arguments should refrain from modifying any external state.

4.2 Properties of the DSEL

In this section some results will be presented that derive from the definitions above. The first demonstrates that the CFG (Control-Flow Graph) would be a tree, from which the other useful results directly derive.

Theorem 1. Using the DSEL described above, the parallel control-flow graph of any program that uses a conforming implementation of the DSEL must be an acyclic directed graph, comprised of at least one singly-rooted tree, but may contain multiple singly-rooted, independent trees.


Proof. From the definitions of the DSEL, the transfer of work to be mutated into the thread_pool may be done only once, according to the definition of transfer-future, the result of which is a single execution_context, according to the definition of execution-context-result, which is the only defined way to create execution_contexts. This implies that, from a node in the CFG, each transfer to the thread-pool-type represents a single forward-edge connecting the execution_context with the child-node that contains the mutation. The back-edge from the mutation to the parent-node is the edge connecting the result of the mutation with the dereference of the execution_context. The execution_context and the dereference occur in the same node, because execution_contexts cannot be passed between nodes, by definition. In summary: the parent-node has an edge from the execution_context it contains to the mutation, and a back-edge to the dereference in that parent-node. Each node may perform none, one or more transfers, resulting in none, one or more child-nodes. A node with no children is a leaf-node, containing only a mutation. Back-edges to multiple parent-nodes cannot be created, because execution_contexts cannot be aliased nor copied between nodes, by definition. So the only edges in this sub-graph are the forward and back edges from parent to children; therefore the sub-graph is not only acyclic, but a tree. Due to the definitions of transfer-future and execution-context-result, the only way to generate mutations is via the above technique. Therefore each child-node either returns via the back-edge immediately or generates a further sub-tree attaching to the larger tree that contains its parent. Now, if the entry-point of the program is the single thread that runs main(), i.e. the single root, this can only generate a tree, and each node in the tree can only return or generate a tree, so the whole CFG must be a tree. If there were more entry-points, each one can only generate a tree per entry-point, as the execution_contexts cannot be aliased nor copied between nodes, by definition.
According to the above theorem, one may appreciate that a conforming implementation of the DSEL would implement data-flow in software.

Theorem 2. If the user refrains from using any threading-related items or atomic objects other than those defined in the DSEL above, then they can be guaranteed to have a schedule free of race-conditions.

Proof. A race-condition occurs when two threads attempt to access the same datum at the same time, at least one of the accesses being a write. A race-condition in the CFG would be represented by a child-node with two parent-nodes, with forward-edges connecting the parents to the child. Since the CFG must be a tree, according to theorem 1, this sub-graph cannot occur within it, so the schedule must be race-condition free.

Theorem 3. If the user refrains from using any threading-related items or atomic objects other than those defined in the DSEL above, and the work they wish to mutate may not be aliased by any other object, then the user can be guaranteed to have a schedule free of deadlocks.

Proof. A deadlock may be defined thus: threads A and B wait upon atomic-objects C and D, such that A locks C and waits upon D to unlock C, whilst B locks D and waits upon C to unlock D. In terms of the DSEL, this implies that execution_contexts C and D are shared between the two threads, i.e. that an execution_context has been passed from a node A to a sibling node B, and vice versa for execution_context D. But aliasing execution_contexts between nodes has been explicitly forbidden in the DSEL by definition 3.

Corollary 1. If the user refrains from using any threading-related items or atomic objects other than those defined in the DSEL above, and the work they wish to mutate may not be aliased by any other object, then the user can be guaranteed to have a schedule free of race-conditions and deadlocks.

Proof. It must be proven that theorems 2 and 3 are not mutually exclusive. Suppose that a CFG exists that satisfies theorem 2 but not theorem 3. Then there must be either an edge formed by aliasing an execution_context, or a back-edge from the result of a mutation back to a dereference of an execution_context. The former has been explicitly forbidden by the definition of the execution_context (definition 3), the latter by the definition of transfer-future (definition 1). Both are contradictions, therefore such a CFG cannot exist, and any conforming CFG must satisfy both theorems 2 and 3.

Theorem 4. If the user refrains from using any threading-related items or atomic objects other than those defined in the DSEL above, then the schedule of work to be mutated by a conforming implementation of the DSEL would be executed in a time of at least O(log(p)) and at most O(n) units, where n is the number of work items to be mutated on p processors. The algorithmic order of the minimal time is poly-logarithmic, so within NC, therefore at least optimal.

Proof. Given that the schedule must be a tree, according to theorem 1, with at most n leaf-nodes, each node takes at most O(n/(p−1) + log(p)) computations, according to the definition of the parallel-algorithms. Also it has been proven in [7] that distributing n items of work onto p processors may be performed with an algorithmic complexity of O(log(n)). The fastest computation time would occur if the schedule were a balanced tree, where the computation time would be the depth of the tree, i.e. O(log(n)) in the same units. If the n items of work were greater in number than the p processors, then O(log(p)) ≤ O(log(n)), so the computation time would be slower than O(log(p)). The slowest computation time would occur if the tree were a chain, i.e. O(n) time. In these cases, a conforming implementation should add at most a constant order to the execution time of the schedule.

4.3 Some Example Usage

These are two toy examples, based upon an implementation in [12], of how the above DSEL might appear. The first is a data-flow example, showing how the DSEL could be used to mutate some work on a thread within the thread_pool, effectively demonstrating how the future would be waited upon. Note how the execution_context has been created via the transfer of work into the thread_pool.

Listing 1: Data-flow example of a Thread Pool and Future.


struct res_t {
  int i;
};
struct work_type {
  void process(res_t &) {}
};
typedef ppd::thread_pool<
  pool_traits::worker_threads_get_work,
  pool_traits::fixed_size,
  pool_adaptor<
    generic_traits::joinable, platform_api,
    heavyweight_threading
  >
> pool_type;
typedef pool_type::create_direct<work_type> creator_t;
typedef creator_t::execution_context execution_context;
typedef creator_t::joinable joinable;

pool_type pool(2);
execution_context context(pool << joinable() << work_type());
context->i;

The typedefs in this example implementation of the grammar are complex, but the typedef for the thread-pool-type would only be needed once and could, reasonably, be held in a configuration trait in a header file.

The second example shows how a data-parallel version of the C++ accumulate algorithm might appear.

Listing 2: Example of a parallel version of an STL algorithm.

typedef ppd::thread_pool<
  pool_traits::worker_threads_get_work,
  pool_traits::fixed_size,
  pool_adaptor<
    generic_traits::joinable, platform_api,
    heavyweight_threading,
    pool_traits::normal_fifo, std::less, 1
  >
> pool_type;
typedef ppd::safe_colln<
  vector<int>, lock_traits::critical_section_lock_type
> vtr_colln_t;
typedef pool_type::accumulate_t<
  vtr_colln_t
>::execution_context execution_context;

vtr_colln_t v;
v.push_back(1); v.push_back(2);
execution_context context(
  pool << joinable() << pool.accumulate(
    v, 1, std::plus<vtr_colln_t::value_type>()
  )
);
assert(*context == 4);

All of the parameters have been specified in the thread-pool-type to demonstrate the appearance of the typedef. Note that the example illustrates a map-reduce operation; an implementation might:

1. take sub-ranges within the safe-colln,

2. distribute those sub-ranges across the threads within the thread_pool,

3. perform the mutations upon the elements within each sub-range sequentially, their results combined via the accumulator functor, without locking any other thread's operation,

4. combine these sub-results in the final accumulation, the implementation providing suitable locking to avoid any race-condition,

5. make the total result available via the execution_context.

Moreover, the size of the input collection should be sufficiently large, or the time taken to execute the accumulator's operation sufficiently long, that the cost of the above operations would be reasonably amortized.

5. CONCLUSIONS

The goals of the paper have been achieved: a DSEL has been formulated:

• that may be used to express general-purpose parallelism within a language,

• that ensures there are no deadlocks or race-conditions within the program, if the programmer restricts themselves to using the constructs of the DSEL,

• and that does not preclude implementing optimal schedules on a CREW-PRAM or EREW-PRAM computation model.

Intuition suggests that this result should come as no surprise, considering the work done relating to auto-parallelizing compilers, which work within the ASTs and CFGs of the parsed program [17].

It is interesting to note that the results presented here would be applicable to all programming languages, compiled or interpreted, and that one need not be forced to re-implement a compiler. Moreover, the DSEL has been designed to directly address the problematic issue of debugging any such parallel program. Further advantages of this DSEL are that programmers would not need to learn an entirely new programming language, nor would they have to change to a novel compiler implementing the target language, which may not be available, or if it were, might be impossible to use for more prosaic business reasons.

6. FUTURE WORK

There are a number of avenues that could be investigated. For example, a conforming implementation of the DSEL, such as [12], could be presented. The properties of such an implementation could then be investigated by re-implementing a benchmark suite, such as SPEC2006 [15], and comparing and contrasting the performance of that implementation against the literature. The definition of safe-colln has not been an optimal design decision: a better approach would have been to define ranges that support locking upon the underlying collection. Extending the DSEL to admit memoization could also be investigated, such that a conforming implementation might implement not only inter- but also intra-procedural analysis.

7. REFERENCES

[1] Almasi, G., Cascaval, C., Castanos, J. G., Denneau, M., Lieber, D., Moreira, J. E., and Henry S. Warren, J. Dissecting Cyclops: a detailed analysis of a multithreaded architecture. SIGARCH Comput. Archit. News 31, 1 (2003), 26–38.

[2] Bischof, H., Gorlatch, S., Leshchinskiy, R., and Muller, J. Data Parallelism in C++ Template Programs: a Barnes-Hut Case Study. Parallel Processing Letters 15, 3 (2005), 257–272.

[3] Burger, D., Goodman, J. R., and Kagi, A. Memory Bandwidth Limitations of Future Microprocessors. In ISCA (1996), pp. 78–89.

[4] Casanova, H., Legrand, A., and Robert, Y. Parallel Algorithms. Chapman & Hall/CRC Press, 2008.


[5] El-Ghazawi, T. A., Carlson, W. W., and Draper, J. M. UPC language specifications v1.1.1. Tech. rep., 2003.

[6] Giacaman, N., and Sinnen, O. Parallel iterator for parallelising object oriented applications. In SEPADS'08: Proceedings of the 7th WSEAS International Conference on Software Engineering, Parallel and Distributed Systems (Stevens Point, Wisconsin, USA, 2008), World Scientific and Engineering Academy and Society (WSEAS), pp. 44–49.

[7] Gibbons, A., and Rytter, W. Efficient Parallel Algorithms. Cambridge University Press, New York, NY, USA, 1988.

[8] ISO. ISO/IEC 14882:2011 Information technology — Programming languages — C++. International Organization for Standardization, Geneva, Switzerland, Feb. 2012.

[9] Kennedy, K., and Allen, J. R. Optimizing Compilers for Modern Architectures: a Dependence-Based Approach. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2002.

[10] Leiserson, C. E. The Cilk++ concurrency platform. J. Supercomput. 51, 3 (Mar. 2010), 244–257.

[11] McGuiness, J. M. Automatic Code-Generation Techniques for Micro-Threaded RISC Architectures. Master's thesis, University of Hertfordshire, Hatfield, Hertfordshire, UK, July 2006.

[12] McGuiness, J. M. libjmmcg — implementing PPD. libjmmcg.sourceforge.net, July 2009.

[13] McGuiness, J. M., Egan, C., Christianson, B., and Gao, G. The Challenges of Efficient Code-Generation for Massively Parallel Architectures. In Asia-Pacific Computer Systems Architecture Conference (2006), pp. 416–422.

[14] Pheatt, C. Intel® Threading Building Blocks. J. Comput. Small Coll. 23, 4 (2008), 298–298.

[15] Reilly, J. Evolve or Die: Making SPEC's CPU Suite Relevant Today and Tomorrow. In IISWC (2006), p. 119.

[16] Snelling, D. F., and Egan, G. K. A Comparative Study of Data-Flow Architectures. Tech. Rep. UMCS-94-4-3, 1994.

[17] Tang, X. Compiling for Multithreaded Architectures. PhD thesis, University of Delaware, Delaware, USA, Fall 1999.

[18] Tian, X., Chen, Y.-K., Girkar, M., Ge, S., Lienhart, R., and Shah, S. Exploring the Use of Hyper-Threading Technology for Multimedia Applications with Intel® OpenMP* Compiler. In IPDPS (2003), p. 36.

[19] Tvrdik, P. Topics in parallel computing — PRAM models. http://pages.cs.wisc.edu/~tvrdik/2/html/Section2.html, January 1999.

[20] Virding, R., Wikstrom, C., and Williams, M. Concurrent Programming in ERLANG (2nd ed.). Prentice Hall International (UK) Ltd., Hertfordshire, UK, 1996.