ParaForming - Patterns and Refactoring for Parallel Programming
Paraforming: Forming Parallel (Functional) Programs from High-Level Patterns using Advanced Refactoring
Kevin Hammond, Chris Brown, Vladimir Janjic University of St Andrews, Scotland Build Stuff, Vilnius, Lithuania, December 10 2013
W: http://www.paraphrase-ict.eu
T: @paraphrase_fp7, @khstandrews E: [email protected]
The Present
Pound versus Dollar
The Future: “megacore” computers?
§ Hundreds of thousands, or millions, of (small) cores
What will “megacore” computers look like?
§ Probably not just scaled versions of today's multicore
§ Perhaps hundreds of dedicated lightweight integer units
§ Hundreds of floating-point units (enhanced GPU designs)
§ A few heavyweight general-purpose cores
§ Some specialised units for graphics, authentication, network etc
§ possibly soft cores (FPGAs etc)
§ Highly heterogeneous
What will “megacore” computers look like?
§ Probably not uniform shared memory
§ NUMA is likely, even hardware distributed shared memory
§ or even message-passing systems on a chip
§ shared memory will not be a good abstraction
int arr [x] [y];
Laki (NEC Nehalem Cluster) and hermit (XE6)
§ Laki: 700 dual-socket Xeon 5560 2.8GHz ("Gainestown") nodes
§ 12 GB DDR3 RAM / node
§ Infiniband (QDR)
§ 32 nodes with additional Nvidia Tesla S1070
§ Scientific Linux 6.0
§ hermit (phase 1 step 1): 38 racks with 96 nodes each
§ 96 service nodes and 3552 compute nodes
§ Each compute node has 2 sockets, AMD Interlagos @ 2.3GHz with 16 cores each, giving 113,664 cores in total
§ Nodes with 32GB and 64GB memory, reflecting different user needs
§ 2.7PB storage capacity @ 150GB/s IO bandwidth
§ External access nodes, pre- & post-processing nodes, remote visualisation nodes
4/6 :: HLRS in ParaPhrase :: Turin, 4th/5th October 2011 ::
The Biggest Computer in the World
Tianhe-2, Chinese National University of Defence Technology: 33.86 petaflop/s (June 17, 2013); 16,000 nodes, each with 2 Ivy Bridge multicores and 3 Xeon Phis; 3,120,000 x86 cores in total!!!
It’s not just about large systems
• Even mobile phones are multicore
§ Samsung Exynos 5 Octa has 8 cores, 4 of which are "dark"
• Performance/energy tradeoffs mean systems will be increasingly parallel
• If we don't solve the multicore challenge, then no other advances will matter!
ALL Future Programming will be
Parallel!
The Manycore Challenge
"Ultimately, developers should start thinking about tens, hundreds, and thousands of cores now in their algorithmic development and deployment pipeline." Anwar Ghuloum, Principal Engineer, Intel Microprocessor Technology Lab
"The dilemma is that a large percentage of mission-critical enterprise applications will not ``automagically'' run faster on multi-core servers. In fact, many will actually run slower. We must make it as easy as possible for applications programmers to exploit the latest developments in multi-core/many-core architectures, while still making it easy to target future (and perhaps unanticipated) hardware developments." Patrick Leonard, Vice President for Product Development, Rogue Wave Software
The ONLY important challenge in Computer Science (Intel)
Also recognised as thematic priorities by EU and national funding bodies
But Doesn’t that mean millions of threads on a megacore machine??
How to build a wall
(with apologies to Ian Watson, Univ. Manchester)
How to build a wall faster
How NOT to build a wall
Task identification is not the only problem… Must also consider coordination, communication, placement, scheduling, …
Typical CONCURRENCY Approaches require the Programmer to solve these
We need structure. We need abstraction. We don't need another brick in the wall.
Thinking Parallel
§ Fundamentally, programmers must learn to "think parallel"
§ this requires new high-level programming constructs
§ perhaps dealing with hundreds of millions of threads
§ You cannot program effectively while worrying about deadlocks etc.
§ they must be eliminated from the design!
§ You cannot program effectively while fiddling with communication etc.
§ this needs to be packaged/abstracted!
§ You cannot program effectively without performance information
§ this needs to be included as part of the design!
A Solution?
"The only thing that works for parallelism is functional programming"
Bob Harper, Carnegie Mellon University
Parallel Functional Programming
§ No explicit ordering of expressions
§ Purity means no side-effects
§ Impossible for parallel processes to interfere with each other
§ Can debug sequentially but run in parallel
§ Enormous saving in effort
§ Programmers concentrate on solving the problem
§ Not porting a sequential algorithm into an (ill-defined) parallel domain
§ No locks, deadlocks or race conditions!!
§ Huge productivity gains!
ParaPhrase Project: Parallel Patterns for Heterogeneous Multicore Systems (ICT-288570), 2011-2014, €4.2M budget, 13 partners, 8 European countries
UK, Italy, Germany, Austria, Ireland, Hungary, Poland, Israel
Coordinated by Kevin Hammond, St Andrews
The ParaPhrase Approach
§ Start bottom-up
§ identify (strongly hygienic) COMPONENTS
§ using semi-automated refactoring
§ Think about the PATTERN of parallelism
§ e.g. map(reduce), task farm, parallel search, parallel completion, ...
§ STRUCTURE the components into a parallel program
§ turn the patterns into concrete (skeleton) code
§ Take performance, energy etc. into account (multi-objective optimisation)
§ also using refactoring
§ RESTRUCTURE if necessary! (also using refactoring)
both legacy and new programs
Some Common Patterns
§ High-level abstract patterns of common parallel algorithms
Google map-reduce combines two of these!
Generally, we need to nest/combine patterns in arbitrary ways
The Skel Library for Erlang
§ Skeletons implement specific parallel patterns
§ Pluggable templates
§ Skel is a new (AND ONLY!) skeleton library in Erlang
§ map, farm, reduce, pipeline, feedback
§ instantiated using skel:run
§ Fully nestable
§ A DSL for parallelism: OutputItems = skel:run(Skeleton, InputItems).
chrisb.host.cs.st-andrews.ac.uk/skel.html
https://github.com/ParaPhrase/skel
Parallel Pipeline Skeleton
§ Each stage of the pipeline can be executed in parallel
§ The input and output are streams

skel:run([{pipe, [Skel1, Skel2, .., SkelN]}], Inputs).

[diagram: {pipe, [Skel1, Skel2, …, SkelN]} — stages Skel1 … SkelN transform the input stream Tn … T1 into the output stream T′n … T′1]
Inc = {seq, fun (X) -> X+1 end},
Double = {seq, fun (X) -> X*2 end},
skel:run({pipe, [Inc, Double]}, [1,2,3,4,5,6]).
% -> [4,6,8,10,12,14]
Farm Skeleton
§ Each worker is executed in parallel
§ A bit like a 1-stage pipeline

skel:run([{farm, Skel, M}], Inputs).

[diagram: {farm, Skel, M} — M parallel copies Skel1 … SkelM of the worker process the input stream Tn … T1, producing the output stream T′n … T′1]
Inc = {seq, fun (X) -> X+1 end},
skel:run({farm, Inc, 3}, [1,2,3,4,5,6]).
% -> [2,5,3,6,4,7] (workers run in parallel, so result order may vary)
[graph: Speedups for Matrix Multiplication on 1–24 cores, comparing Naive Parallel, Farm, and Farm with Chunk 16]
Using The Right Pattern Matters
The ParaPhrase Approach
[diagram: sequential C/C++, Erlang, Haskell, Java, … code is refactored, using costing/profiling and a generic pattern library, into parallel code targeting heterogeneous hardware: AMD Opteron, Intel Core, Nvidia GPU, Intel GPU, Nvidia Tesla, Intel Xeon Phi, Mellanox Infiniband, …]
Refactoring
§ Refactoring changes the structure of the source code
§ using well-defined rules
§ semi-automatically under programmer guidance
Refactoring: Farm Introduction
S1 ∘ S2 ≡ Pipe(S1, S2)   (pipe intro/elim)
Map(S1 ∘ S2, d, r) ≡ Map(S1, d, r) ∘ Map(S2, d, r)   (map fission/fusion)
S ≡ Farm(S)   (farm intro/elim)
Map(F, d, r) ≡ Pipe(Decomp(d), Farm(F), Recomp(r))   (data2stream)
Seq(S1) ≡ Map(S′1, d, r)   (map intro/elim)
Figure 3.3: Some Standard Skeleton Equivalences
The following describes each of the patterns in turn:
• a MAP is made up of three OPERATIONs: a worker, a partitioner, and a combiner, followed by an INPUT;
• a SEQ is made up of a single OPERATION denoting the sequential computation to be performed, followed by an INPUT;
• a FARM is made up of a single OPERATION denoting the worker, an INPUT and the number of workers (NW); and,
• a PIPE is made up of at least one OPERATION denoting each stage of the pipeline, followed by an INPUT.
Each pattern has an IDENT attribute, allowing other patterns to reference it. This flattens the XML structure instead of nesting, providing a structure that is easier to read and reason about. Finally, an OPERATION is either a COMPONENT (describing the module, name and language for a unit of computation) or another pattern (described by a reference to the pattern's identifier).
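For illustration only (the actual schema is not shown in this excerpt, so the element and attribute names below are assumptions based on the description above), a farm whose worker is a pipeline of two components might be written, flattened via IDENT references, as:

```xml
<FARM IDENT="farm1" NW="8">
  <OPERATION><PATTERN IDENT="pipe1"/></OPERATION>
  <INPUT>images</INPUT>
</FARM>
<PIPE IDENT="pipe1">
  <OPERATION><COMPONENT MODULE="convolution" NAME="generate" LANGUAGE="C++"/></OPERATION>
  <OPERATION><COMPONENT MODULE="convolution" NAME="filter" LANGUAGE="C++"/></OPERATION>
  <INPUT>stream</INPUT>
</PIPE>
```

Note how the farm's OPERATION refers to the pipeline by its identifier rather than nesting it, as the text describes.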
3.2 Language-Independent Rewrite Rules
In this section we introduce a number of language-independent skeleton rewrite rules that will be used to motivate the refactoring rules given in the remainder of the chapter. Figure 3.3 shows four well-known equivalences. Here Pipe, ∘ and Map refer to the skeletons for parallel pipeline, function composition and parallel map, respectively, S denotes any skeleton, and Seq(e) denotes that e is a sequential computation. All skeletons work over streams. Decomp and Recomp are primitive operations that work over non-streaming inputs. d and r are simple functions that specify a decomposition (d :: a → [b]) and a recomposition (r :: [b] → a), as with the Partition and Combine functions above. Working from left to right in these equivalences introduces parallelism, while working from right to left reduces parallelism. Reading from left to right, we can therefore interpret the rules as follows:
Image Processing Example
Read Image 1 Read Image 2
“White screening”
Merge Images
Write Image
[example images: "{ Build" and "Stuff }" white-screened and merged to give "{ Build Stuff }"]
Basic Erlang Structure
[ writeImage(convertImage(readImage(X))) || X <- images() ]

readImage({In1, In2, Out}) ->
    …
    {Image1, Image2, Out}.

convertImage({Image1, Image2, Out}) ->
    Image1P = whiteScreen(Image1),
    Image2P = mergeImages(Image1P, Image2),
    {Image2P, Out}.

writeImage({Image, Out}) -> …
Refactoring Demo
Speedup Results (Image Processing)
[graph: Speedups for Haar Transform (Skel task farm) with 1–24 farm workers: 1D Skel task farm, 1D Skel task farm with chunk size 4, 2D Skel task farm]
Large-Scale Demonstrator Applications
§ ParaPhrase tools are being used by commercial/end-user partners
§ SCCH (SME, Austria)
§ Erlang Solutions Ltd (SME, UK)
§ Mellanox (Israel)
§ ELTE-Soft, Hungary (SME)
§ AGH (University, Poland)
§ HLRS (High Performance Computing Centre, Germany)
Speedup Results (demonstrators)
[graph: Speedups for Convolution ∆1(G) ‖ ∆2(F) against the number of ∆2 workers (1–16), for ∆1 = 1, 2, 4, 6, 8, 10]
[graph: Speedups for Ant Colony Optimisation, BasicN2 and Graphical Lasso with 1–24 workers, refactored vs. manual versions]
Figure 3. Refactored Use Case Results in FastFlow
code and simply points the refactoring tool towards them. The actual parallelisation is then performed by the refactoring tool, supervised by the programmer. This can give significant savings in effort, of about one order of magnitude. This is achieved without major performance losses: as desired, the speedups achieved with the refactoring tool are approximately the same as for full-scale manual implementations by an expert. In future we expect to develop this work in a number of new directions, including adding advanced performance models to the refactoring process, thus allowing the user to accurately predict the parallel performance from applying a particular refactoring with a specified number of threads. This may be particularly useful when porting the applications to different architectures, including adding refactoring support for GPU programming in OpenCL. Also, once sufficient automation of the refactoring tool is achieved, the best parametrisation regarding parallel efficiency can be determined via optimisation, further facilitating this approach. In addition, we also plan to implement more skeletons, particularly in the fields of computer algebra and physics, and demonstrate the refactoring approach with these new skeletons on a wide range of realistic applications. This will add to the evidence that our approach is general, usable and scalable. Finally, we intend to investigate the limits of scalability that we have observed for some of our use-cases, aiming to determine whether the limits are hardware artefacts or algorithmic.
REFERENCES
[1] M. Aldinucci, M. Danelutto, P. Kilpatrick and M. Torquati. FastFlow: High-Level and Efficient Streaming on Multi-Core. Programming Multi-core and Many-core Computing Systems. Parallel and Distributed Computing. Chap. 13, 2013. Wiley.
[2] Michael P. Allen. Introduction to Molecular Dynamics Simulation. Computational Soft Matter: From Synthetic Polymers to Proteins, 23:1–28, 2004.
[3] M. den Besten, T. Stuetzle, M. Dorigo. Ant Colony Optimization for the Total Weighted Tardiness Problem. PPSN 6, p611–620, Sept. 2000.
[4] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick, and A. Elliott. Cost-Directed Refactoring for Parallel Erlang Programs. In International Journal of Parallel Processing, HLPP 2013 Special Issue. Springer. Paris, September 2013. DOI 10.1007/s10766-013-0266-5
[5] C. Brown, K. Hammond, M. Danelutto, and P. Kilpatrick. A Language-Independent Parallel Refactoring Framework. In Proc. of the Fifth Workshop on Refactoring Tools (WRT '12), pages 54–58. ACM, New York, USA. 2012.
[6] C. Brown, H. Li, and S. Thompson. An Expression Processor: A Case Study in Refactoring Haskell Programs. Eleventh Symp. on Trends in Func. Prog., May 2010.
[7] C. Brown, H. Loidl, and K. Hammond. Paraforming: Forming Haskell Programs using Novel Refactoring Techniques. 12th Symp. on Trends in Func. Prog., Spain, May 2011.
[8] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick, H. Schöner, and T. Breddin. Paraphrasing: Generating Parallel Programs Using Refactoring. In 10th International Symposium, FMCO 2011, Turin, Italy, October 3-5, 2011, Revised Selected Papers. Springer Berlin Heidelberg. Pages 237–256.
[9] R. M. Burstall and J. Darlington. A Transformation System for Developing Recursive Programs. J. of the ACM, 24(1):44–67, 1977.
[10] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computations. Research Monographs in Par. and Distrib. Computing. Pitman, 1989.
[11] M. Cole. Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming. Par. Computing, 30(3):389–406, 2004.
[12] D. Dig. A Refactoring Approach to Parallelism. IEEE Softw., 28:17–22, January 2011.
[13] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 9(3):432–441, July 2008.
[14] R. Loogen, Y. Ortega-Mallén, and R. Peña-Marí. Parallel Func. Prog. in Eden. J. of Func. Prog., 15(3):431–475, 2005.
[15] T. Mens and T. Tourwé. A Survey of Software Refactoring. IEEE Trans. Softw. Eng., 30(2):126–139, 2004.
[16] H. Partsch and R. Steinbruggen. Program Transformation Systems. ACM Comput. Surv., 15(3):199–236, 1983.
[17] K. Hammond, M. Aldinucci, C. Brown, F. Cesarini, M. Danelutto, H. Gonzalez-Velez, P. Kilpatrick, R. Keller, T. Natschlager, and G. Shainer. The ParaPhrase Project: Parallel Patterns for Adaptive Heterogeneous Multicore Systems. FMCO, Feb. 2012.
[18] K. Hammond, J. Berthold, and R. Loogen. Automatic Skeletons in Template Haskell. Parallel Processing Letters, 13(3):413–424, September 2003.
[19] W. Opdyke. Refactoring Object-Oriented Frameworks. PhD Thesis, Dept. of Comp. Sci., University of Illinois at Urbana-Champaign, Champaign, IL, USA (1992).
[20] T. Sheard and S. P. Jones. Template Meta-Programming for Haskell. SIGPLAN Not., 37:60–75, December 2002.
[21] D. B. Skillicorn and W. Cai. A Cost Calculus for Parallel Functional Programming. J. Parallel Distrib. Comput., 28(1):65–83, 1995.
[22] J. Wloka, M. Sridharan, and F. Tip. Refactoring for Reentrancy. In ESEC/FSE '09, pages 173–182, Amsterdam, 2009. ACM.
Speedup close to or better than manual optimisation
Bowtie2: most widely used DNA alignment tool
[graphs: Bowtie2 speedup vs. read length (20–110) and vs. quality (28–40), comparing the original Bowtie2 with the ParaPhrase FastFlow version (Bt2FF-pin+int)]
C. Misale. Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity. IEEE PDP 2014. To appear.
Comparison of Development Times
figure. The convolution is defined as a two-stage pipeline (‖), with the first stage being a farm (∆) that generates the images (G), and the second stage being a farm that filters the images (F). In the convolution, the maximum speedup obtained from the refactored version is 6.59 with 2 workers in the first farm and 8 workers in the second farm. There are also three workers for each pipeline stage (two for the farm stages, and one for the initial streaming stage), plus threads for the load balancers in the farm, giving a total of 15 threads. Here, the nature of the application may limit the scalability. The second stage of the pipeline dominates the computation: the first stage takes on average 0.6 seconds and the second stage takes around 7 seconds to process one image, resulting in a substantial bottleneck in the second stage.

The Ant Colony Optimisation scales linearly to give a speedup of 7.29 using 8 farm workers, after which the performance starts to drop noticeably. We observe only relatively modest speedups (between 3 and 4) with more than 12 farm workers. The Ant Colony Optimisation is quite a memory-intense application, and all of the farm workers need to access a large amount of shared data (processing times, deadlines and weights of jobs, together with the τ matrix), especially since we are considering instances where the number of jobs to be scheduled is large. We suspect that the drop in performance is due to expensive memory accesses for those farm workers that are placed on remote cores (i.e., not in the same processor package). The fact that the decrease in performance occurs at about 10 farm workers confirms this: this is exactly the point where not all of the farm workers can be placed on cores from one package³. However, more testing is needed to precisely pinpoint the reasons for the drop in performance.

For the BasicN2 use case (shown in Listing 4 in the right column), the refactored version achieves a speedup of 21.2 with 24 threads. The application scales well, with near-linear speedups (up to 11.15 with 12 threads). After 12 threads, the speedups decrease slightly, most likely because the refactored code dynamically allocates memory for the tasks during the computation, resulting in some small overhead. FastFlow also reserves two workers: one worker for the load balancer and one for the farm skeleton, so the maximum speedup achievable with 24 threads is only 21.2. BasicN2 gives scalable speedups due to its data-parallel structure, where each task is independent, and the main computation over each task dominates the computation. In Listing 4, we also show the manually parallelised version in FastFlow, which achieves comparable speedups of 22.68 with 24 threads. The manually parallelised code achieves slightly better speedups due to the fact that only one FastFlow farm is introduced in the code. However, in the refactored version, due to the refactoring tool's limitation, we introduce two FastFlow farms with an additional synchronisation point between them. The refactoring tool does not yet have provision to merge two routines into one, which can be achieved by an experienced C++ programmer.

The Graphical Lasso use case gives a scalable speedup of 9.8 for 16 cores, and stagnates afterwards. This is similar to manually ported FastFlow code, and to results obtained with an OpenMP port (where we achieved a maximum speedup of 11.3 on 16 cores). Although the tasks parallelised here are, in principle, independent, we expected significant deviation from

³ FastFlow reserves some cores for load balancing, the farm emitter/collector.
Use case          Man. time   Refac. time   LOC intro.
Convolution       3 days      3 hours       58
Ant Colony        1 day       1 hour        32
BasicN2           5 days      5 hours       40
Graphical Lasso   15 hours    2 hours       53

Table 3. Approximate manual implementation time of use-cases vs. refactoring time, with lines of code introduced by the refactoring tool
linear scaling for higher numbers of cores, because of cache synchronisation (disjunct but interleaving memory regions are updated in the tasks), and an uneven size combined with a limited number of tasks (48). At the end of the computation, some cores will wait idly for the completion of remaining tasks. Consequently, the observed performance matched our expectations, providing considerable speedup with a small investment in manual code changes.

Table 3 shows approximate porting metrics for each use case, with the time taken to implement the manual parallel FastFlow implementation by an expert, the time to parallelise the sequential version using the refactoring tool, and the lines of code introduced by the refactoring tool. Clearly the refactoring tool gives an enormous saving in effort over the manual implementation of the FastFlow code.
V. RELATED WORK
Refactoring has a long history, with early work in the field being described by Partsch and Steinbruggen in 1983 [16], and Mens and Tourwé producing a survey of refactoring tools and techniques in 2004 [15]. The first refactoring tool system was the fold/unfold system of Burstall and Darlington [7], which was intended to transform recursively defined functions. There has so far been only a limited amount of work on refactoring for parallelism [12]. We have previously [13] used Template Haskell [17] with explicit cost models to derive automatic farm skeletons for Eden [14]. Unlike the approach presented here, Template Haskell is compile-time, meaning that the programmer cannot continue to develop and maintain his/her program after the skeleton derivation has taken place. In [2], we introduced a parallel refactoring methodology for Erlang programs, demonstrating a refactoring tool that introduces and tunes parallelism for skeletons in Erlang. Unlike the work presented here, the technique is limited to Erlang, is demonstrated on a small and limited set of examples, and we did not evaluate reductions in development time. Other work on parallel refactoring has mostly considered loop parallelisation in Fortran [19] and Java [10]. However, these approaches are limited to concrete and fairly simple structural changes (such as loop unrolling) rather than applying high-level pattern-based rewrites as we have described here. We have recently extended HaRe, the Haskell refactorer [5], to deal with a limited number of parallel refactorings [6]. This work allows Haskell programmers to introduce data and task parallelism using small structural refactoring steps. However, it does not use pattern-based rewriting or cost-based direction, as discussed here. A preliminary proposal for a language-independent refactoring tool was presented in [3], for assisting programmers with introducing and tuning parallelism. However, that work focused on building a refactoring tool supporting multiple languages and paradigms, rather than on refactorings that introduce and tune parallelism using algorithm skeletons, as in this paper.
[diagram: heterogeneous refactoring workflow — 1. identify the initial structure of the application's code; 2. enumerate skeleton configurations (farm, pipeline, …); 3. filter configurations using a cost model; 4. apply MCTS, using profile information; 5. choose the optimal mapping/configuration; 6. refactor the application; 7. execute the optimal parallel configuration, with mappings, on the heterogeneous machine (CPUs and GPUs)]
Heterogeneous Parallel Programming
[RGU / USTAN]
Example: Enumerate Skeleton Configurations for Image Convolution
r p; Δ(r p); Δ(r) p; r Δ(p); Δ(r) Δ(p); r ‖ p; r ‖ Δ(p); …
(r: read image file; p: process image file)
Results on Benchmark: Image Convolution
MCTS Mapping (C, G): (6, 0) || (0, 3)
Speedup 39.12 Best Speedup: 40.91
Conclusions
§ The manycore revolution is upon us
§ Computer hardware is changing very rapidly (more than in the last 50 years)
§ The megacore era is here (aka exascale, BIG data)
§ Heterogeneity and energy are both important
§ Most programming models are too low-level
§ concurrency based
§ need to expose mass parallelism
§ Patterns and functional programming help with abstraction
§ millions of threads, easily controlled
Conclusions (2)
§ Functional programming makes it easy to introduce parallelism
§ No side effects means any computation could be parallel
§ Matches pattern-based parallelism
§ Much detail can be abstracted
§ Lots of problems can be avoided
§ e.g. freedom from deadlock
§ Parallel programs give the same results as sequential ones!
§ Automation is very important
§ Refactoring dramatically reduces development time (while keeping the programmer in the loop)
§ Machine learning is very promising for determining complex performance settings
But isn’t this all just wishful thinking?
Rampant-Lambda-Men in St Andrews
NO!
§ C++11 has lambda functions (and some other nice functional-inspired features)
§ Java 8 will have lambdas (closures)
§ Apple uses closures in Grand Central Dispatch
ParaPhrase Parallel C++ Refactoring
§ Integrated into Eclipse
§ Supports full C++(11) standard
§ Uses strongly hygienic components
§ functional encapsulation (closures)
Image Convolution
Step 1: Introduce Components

Component<ff_im> genStage(generate);
Component<ff_im> filterStage(filter);
for (int i = 0; i < NIMGS; i++) {
    r1 = genStage.callWorker(new ff_im(images[i]));
    results[i] = filterStage.callWorker(new ff_im(r1));
}

Step 2: Introduce Pipeline

ff_pipeline pipe;
StreamGen streamgen(NIMGS, images);
pipe.add_stage(&streamgen);
pipe.add_stage(new genStage);
pipe.add_stage(new filterStage);
pipe.run_and_wait_end();

Step 3: Introduce Farm

ff_farm<> gen_farm;
gen_farm.add_collector(NULL);
std::vector<ff_node*> gw;
for (int i = 0; i < nworkers; i++) gw.push_back(new gen_stage);
gen_farm.add_workers(gw);
ff_pipeline pipe;
StreamGen streamgen(NIMGS, images);
pipe.add_stage(&streamgen);
pipe.add_stage(&gen_farm);
pipe.add_stage(new filterStage);
pipe.run_and_wait_end();

Step 4: Introduce Farm

ff_farm<> gen_farm;
gen_farm.add_collector(NULL);
std::vector<ff_node*> gw;
for (int i = 0; i < nworkers; i++) gw.push_back(new gen_stage);
gen_farm.add_workers(gw);
ff_farm<> filter_farm;
filter_farm.add_collector(NULL);
std::vector<ff_node*> gw2;
for (int i = 0; i < nworkers2; i++) gw2.push_back(new CPU_Stage);
filter_farm.add_workers(gw2);
StreamGen streamgen(NIMGS, images);
ff_pipeline pipe;
pipe.add_stage(&streamgen);
pipe.add_stage(&gen_farm);
pipe.add_stage(&filter_farm);
pipe.run_and_wait_end();

Figure 2. Refactoring the Image Convolution Algorithm (Algorithm 1), first introducing parallel components (Step 1), then introducing a parallel pipeline (Step 2), before adding two levels of farm (Steps 3 and 4).
pipeline waits for the result of the computation, using a run_and_wait_end() method call. Any dependencies between the output of genStage and the input of filterStage are detected automatically.

Every refactoring that introduces new code has a corresponding inverse: for example, Farm Elimination inverts Introduce Farm; and Pipeline Elimination inverts Introduce Pipeline. This allows any combination of rules to be inverted (undone). The refactorings are also fully nestable, allowing, for example, a farm, such as ∆(f1 ∘ f2) (a task farm, ∆, where the worker is a function composition, ∘, of two components, f1 and f2), to be refactored into ∆(f1 ‖ f2) (transforming the composition into a parallel pipeline, ‖). In this way farms and pipelines may be formed and reformed in any way that is necessary for the parallel application.

Our refactoring prototype² is implemented in Eclipse, under the CDT plugin. The programmer is presented with a menu of possible refactorings to apply. The decision to apply a refactoring and introduce a skeleton is made by the programmer. Once a decision has been made, all required code transformations are performed automatically. We therefore rely on the programmer to make informed decisions, and can exploit any knowledge/expertise that they may have, but do not require him/her to have deep expertise with parallelism.

² Available at: http://www.cs.st-andrews.ac.uk/~chrisb/refactorer.zip
Algorithm 1 Sequential Convolution Before Component Introduction

for (int i = 0; i < NIMGS; i++) {
    r1 = generate(new ff_image(images[i]));
    results[i] = filter(new ff_image(r1));
}
III. USE CASES
In this section we illustrate the use of the refactorings above on a set of medium-scale realistic benchmarks. For each use case, we start with the original sequential implementation in C++, and apply the refactorings from Section II in order to obtain an optimised parallel version. The refactoring process demonstrated here relies on programmer knowledge to know when and where to apply the refactorings, based on profiling information and knowledge about the patterns from Section I-B, and following the methodology of Section I-A. In order to properly evaluate our refactoring tool, parallelisation was performed twice for each use-case: once using the refactoring tool and once on a purely manual basis. The refactoring-based parallelisation will be discussed in detail below. In all cases, both versions give almost identical performance. However, the development time using refactoring was much faster, giving a clear and significant advantage of nearly one order of magnitude over the manual implementation.
Refactoring C++ in Eclipse
Funded by
• ParaPhrase (EU FP7), Patterns for heterogeneous multicore, €4.2M, 2011-2014
• SCIEnce (EU FP6), Grid/Cloud/Multicore coordination, €3.2M, 2005-2012
• Advance (EU FP7), Multicore streaming, €2.7M, 2010-2013
• HPC-GAP (EPSRC), Legacy system on thousands of cores, £1.6M, 2010-2014
• Islay (EPSRC), Real-time FPGA streaming implementation, £1.4M, 2008-2011
• TACLE: European Cost Action on Timing Analysis, €300K, 2012-2015
Some of our Industrial Connections
Mellanox Inc.; Erlang Solutions Ltd; SAP GmbH, Karlsruhe; BAE Systems; Selex Galileo; BioId GmbH, Stuttgart; Philips Healthcare; Software Competence Centre, Hagenberg; Microsoft Research; Well-Typed LLC
ParaPhrase Needs You!
• Please join our mailing list and help grow our user community
§ news items
§ access to free development software
§ chat to the developers
§ free developer workshops
§ bug tracking and fixing
§ Tools for both Erlang and C++
• Subscribe at https://mailman.cs.st-andrews.ac.uk/mailman/listinfo/paraphrase-news
• We're also looking for open source developers...
• We also have 8 PhD studentships...
Further Reading
Chris Brown, Hans-Wolfgang Loidl and Kevin Hammond. "ParaForming: Forming Parallel Haskell Programs using Novel Refactoring Techniques". Proc. 2011 Trends in Functional Programming (TFP), Madrid, Spain, May 2011
Henrique Ferreiro, David Castro, Vladimir Janjic and Kevin Hammond. "Repeating History: Execution Replay for Parallel Haskell Programs". Proc. 2012 Trends in Functional Programming (TFP), St Andrews, UK, June 2012
Chris Brown, Marco Danelutto, Kevin Hammond, Peter Kilpatrick and Sam Elliot. "Cost-Directed Refactoring for Parallel Erlang Programs". To appear in International Journal of Parallel Programming, 2013
Chris Brown, Vladimir Janjic, Kevin Hammond, Mehdi Goli and John McCall. "Bridging the Divide: Intelligent Mapping for the Heterogeneous Parallel Programmer". Submitted to IPDPS 2014
Vladimir Janjic, Chris Brown, Max Neunhoffer, Kevin Hammond, Steve Linton and Hans-Wolfgang Loidl. "Space Exploration using Parallel Orbits". Proc. PARCO 2013: International Conf. on Parallel Computing, Munich, Sept. 2013
Ask me for copies! Many technical results also on the project web site: free for download!
In Preparation
THANK YOU!
http://www.paraphrase-ict.eu
@paraphrase_fp7
http://www.project-advance.eu