ParaForming - Patterns and Refactoring for Parallel Programming

Paraforming: Forming Parallel (Functional) Programs from High-Level Patterns using Advanced Refactoring. Kevin Hammond, Chris Brown, Vladimir Janjic, University of St Andrews, Scotland. Build Stuff, Vilnius, Lithuania, December 10 2013. W: http://www.paraphrase-ict.eu  T: @paraphrase_fp7, @khstandrews  E: [email protected]

description

Despite Moore's "law", uniprocessor clock speeds have now stalled. Rather than single processors running at ever higher clock speeds, it is common to find dual-, quad- or even hexa-core processors, even in consumer laptops and desktops. Future hardware will not be slightly parallel, however, as in today's multicore systems, but will be massively parallel, with manycore and perhaps even megacore systems becoming mainstream. This means that programmers need to start thinking parallel. To achieve this they must move away from traditional programming models where parallelism is a bolted-on afterthought. Rather, programmers must use languages where parallelism is deeply embedded into the programming model from the outset. By providing a high level model of computation, without explicit ordering of computations, declarative languages in general, and functional languages in particular, offer many advantages for parallel programming. One of the most fundamental advantages of the functional paradigm is purity. In a purely functional language, as exemplified by Haskell, there are simply no side effects: it is therefore impossible for parallel computations to conflict with each other in ways that are not well understood. ParaForming aims to radically improve the process of parallelising purely functional programs through a comprehensive set of high-level parallel refactoring patterns for Parallel Haskell, supported by advanced refactoring tools. By matching parallel design patterns with appropriate algorithmic skeletons using advanced software refactoring techniques and novel cost information, we will bridge the gap between fully automatic and fully explicit approaches to parallelisation, helping programmers "think parallel" in a systematic, guided way. This talk introduces the ParaForming approach, gives some examples and shows how effective parallel programs can be developed using advanced refactoring technology.

Transcript of ParaForming - Patterns and Refactoring for Parallel Programming

Page 1: ParaForming - Patterns and Refactoring for Parallel Programming

Paraforming: Forming Parallel (Functional) Programs from High-Level Patterns using Advanced Refactoring

Kevin Hammond, Chris Brown, Vladimir Janjic, University of St Andrews, Scotland
Build Stuff, Vilnius, Lithuania, December 10 2013

W: http://www.paraphrase-ict.eu

T: @paraphrase_fp7, @khstandrews
E: [email protected]

Page 2: ParaForming - Patterns and Refactoring for Parallel Programming

The  Present  


Pound  versus  Dollar  

Page 3: ParaForming - Patterns and Refactoring for Parallel Programming

The  Future:  “megacore”  computers?  

§  Hundreds  of  thousands,  or  millions,  of  (small)  cores  


[Diagram: a processor drawn as a large grid of identical small cores.]

Page 4: ParaForming - Patterns and Refactoring for Parallel Programming

What  will  “megacore”  computers  look  like?  

§ Probably not just scaled versions of today's multicore
§ Perhaps hundreds of dedicated lightweight integer units
§ Hundreds of floating-point units (enhanced GPU designs)
§ A few heavyweight general-purpose cores
§ Some specialised units for graphics, authentication, network etc.
§ Possibly soft cores (FPGAs etc.)
§ Highly heterogeneous


Page 5: ParaForming - Patterns and Refactoring for Parallel Programming

What  will  “megacore”  computers  look  like?  

§ Probably not uniform shared memory
§ NUMA is likely, even hardware distributed shared memory
§ or even message-passing systems on a chip
§ shared memory will not be a good abstraction


int arr[x][y];

Page 6: ParaForming - Patterns and Refactoring for Parallel Programming



Laki (NEC Nehalem Cluster) and hermit (XE6)

Laki:
- 700 dual-socket Xeon 5560 2.8 GHz ("Gainestown")
- 12 GB DDR3 RAM / node
- Infiniband (QDR)
- 32 nodes with additional Nvidia Tesla S1070
- Scientific Linux 6.0

hermit (phase 1, step 1):
- 38 racks with 96 nodes each
- 96 service nodes and 3552 compute nodes
- Each compute node will have 2 sockets of AMD Interlagos @ 2.3 GHz, 16 cores each, leading to 113,664 cores
- Nodes with 32 GB and 64 GB memory, reflecting different user needs
- 2.7 PB storage capacity @ 150 GB/s IO bandwidth
- External access nodes, pre- & postprocessing nodes, remote visualization nodes

(HLRS in ParaPhrase, Turin, 4th/5th October 2011)

Page 7: ParaForming - Patterns and Refactoring for Parallel Programming

The  Biggest  Computer  in  the  World  


Tianhe-2, Chinese National University of Defence Technology: 33.86 petaflop/s (June 17, 2013). 16,000 nodes, each with 2 Ivy Bridge multicores and 3 Xeon Phis: 3,120,000 x86 cores in total!!!

Page 8: ParaForming - Patterns and Refactoring for Parallel Programming

It’s  not  just  about  large  systems  

• Even mobile phones are multicore
  § Samsung Exynos 5 Octa has 8 cores, 4 of which are "dark"

• Performance/energy tradeoffs mean systems will be increasingly parallel

• If we don't solve the multicore challenge, then no other advances will matter!


ALL Future Programming will be Parallel!

Page 9: ParaForming - Patterns and Refactoring for Parallel Programming

The  Manycore  Challenge  

"Ultimately, developers should start thinking about tens, hundreds, and thousands of cores now in their algorithmic development and deployment pipeline." (Anwar Ghuloum, Principal Engineer, Intel Microprocessor Technology Lab)

"The dilemma is that a large percentage of mission-critical enterprise applications will not ``automagically'' run faster on multi-core servers. In fact, many will actually run slower. We must make it as easy as possible for applications programmers to exploit the latest developments in multi-core/many-core architectures, while still making it easy to target future (and perhaps unanticipated) hardware developments." (Patrick Leonard, Vice President for Product Development, Rogue Wave Software)

 The  ONLY  important  challenge  in  Computer  Science  (Intel)  

Also recognised as thematic priorities by EU and national funding bodies

Page 10: ParaForming - Patterns and Refactoring for Parallel Programming

But  Doesn’t  that  mean  millions  of  threads  on  a  megacore  machine??  


Page 11: ParaForming - Patterns and Refactoring for Parallel Programming

How  to  build  a  wall  

(with  apologies  to  Ian  Watson,  Univ.  Manchester)  

Page 12: ParaForming - Patterns and Refactoring for Parallel Programming

How  to  build  a  wall  faster  

Page 13: ParaForming - Patterns and Refactoring for Parallel Programming

How  NOT  to  build  a  wall  

Task identification is not the only problem… We must also consider coordination, communication, placement, scheduling, …

Typical  CONCURRENCY  Approaches  require  the  Programmer  to  solve  these  

Page 14: ParaForming - Patterns and Refactoring for Parallel Programming

We need structure. We need abstraction. We don't need another brick in the wall.


Page 15: ParaForming - Patterns and Refactoring for Parallel Programming

Thinking  Parallel  

§ Fundamentally, programmers must learn to "think parallel"
  § this requires new high-level programming constructs
  § perhaps dealing with hundreds of millions of threads

§ You cannot program effectively while worrying about deadlocks etc.
  § they must be eliminated from the design!

§ You cannot program effectively while fiddling with communication etc.
  § this needs to be packaged/abstracted!

§ You cannot program effectively without performance information
  § this needs to be included as part of the design!


Page 16: ParaForming - Patterns and Refactoring for Parallel Programming

A Solution?

"The only thing that works for parallelism is functional programming"

Bob  Harper,  Carnegie  Mellon  University  

Page 17: ParaForming - Patterns and Refactoring for Parallel Programming

λ  

Parallel Functional Programming

§ No explicit ordering of expressions
§ Purity means no side-effects
  § Impossible for parallel processes to interfere with each other
  § Can debug sequentially but run in parallel (see the sketch below)
  § Enormous saving in effort

§ Programmers concentrate on solving the problem
  § Not porting a sequential algorithm into an (ill-defined) parallel domain

§ No locks, deadlocks or race conditions!!
§ Huge productivity gains!
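To make the "debug sequentially, run in parallel" point concrete, here is a minimal Erlang sketch (not from the slides; module and function names are illustrative). Because the worker function has no side effects, a naive one-process-per-element parallel map returns exactly what lists:map would return:

-module(pmap_demo).
-export([pmap/2]).

%% Apply F to every element of Xs, one Erlang process per element.
%% For a side-effect-free F this gives the same result as lists:map(F, Xs),
%% so the program can be debugged sequentially and then run in parallel.
pmap(F, Xs) ->
    Parent = self(),
    Pids = [spawn(fun() -> Parent ! {self(), F(X)} end) || X <- Xs],
    [receive {Pid, Result} -> Result end || Pid <- Pids].

%% pmap(fun(X) -> X * X end, [1,2,3,4]) =:= lists:map(fun(X) -> X * X end, [1,2,3,4]).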

Page 18: ParaForming - Patterns and Refactoring for Parallel Programming

ParaPhrase Project: Parallel Patterns for Heterogeneous Multicore Systems (ICT-288570), 2011-2014, €4.2M budget

13 partners, 8 European countries: UK, Italy, Germany, Austria, Ireland, Hungary, Poland, Israel
Coordinated by Kevin Hammond, St Andrews

Page 19: ParaForming - Patterns and Refactoring for Parallel Programming

The  ParaPhrase  Approach  

§ Start bottom-up
  § identify (strongly hygienic) COMPONENTS
  § using semi-automated refactoring

§ Think about the PATTERN of parallelism
  § e.g. map(reduce), task farm, parallel search, parallel completion, ...

§ STRUCTURE the components into a parallel program
  § turn the patterns into concrete (skeleton) code
  § take performance, energy etc. into account (multi-objective optimisation)
  § also using refactoring

§ RESTRUCTURE if necessary! (also using refactoring)

both legacy and new programs

Page 20: ParaForming - Patterns and Refactoring for Parallel Programming

Some Common Patterns

§ High-level abstract patterns of common parallel algorithms

Google map-reduce combines two of these!

Generally, we need to nest/combine patterns in arbitrary ways

Page 21: ParaForming - Patterns and Refactoring for Parallel Programming

The  Skel  Library  for  Erlang  

§ Skeletons implement specific parallel patterns
  § Pluggable templates

§ Skel is a new (AND ONLY!) skeleton library in Erlang
  § map, farm, reduce, pipeline, feedback
  § instantiated using skel:run

§ Fully nestable

§ A DSL for parallelism:

OutputItems = skel:run(Skeleton, InputItems).

chrisb.host.cs.st-andrews.ac.uk/skel.html
https://github.com/ParaPhrase/skel

Page 22: ParaForming - Patterns and Refactoring for Parallel Programming

Parallel  Pipeline  Skeleton  

§ Each stage of the pipeline can be executed in parallel
§ The input and output are streams

skel:run([{pipe, [Skel1, Skel2, ..., SkelN]}], Inputs).

[Diagram: the pipe {pipe, [Skel1, Skel2, ..., Skeln]} transforms an input stream Tn, ..., T1 into an output stream T'n, ..., T'1 by passing each item through Skel1, Skel2, ..., Skeln in sequence.]

Inc    = {seq, fun(X) -> X + 1 end},
Double = {seq, fun(X) -> X * 2 end},
skel:run({pipe, [Inc, Double]}, [1,2,3,4,5,6]).
% -> [4,6,8,10,12,14]

Page 23: ParaForming - Patterns and Refactoring for Parallel Programming

Farm  Skeleton    

§ Each worker is executed in parallel
§ A bit like a 1-stage pipeline

skel:do([{farm, Skel, M}], Inputs).


[Diagram: the farm {farm, Skel, M} transforms an input stream Tn, ..., T1 into an output stream by distributing items across M parallel copies of Skel (Skel1 ... SkelM).]

Inc = {seq, fun(X) -> X + 1 end},
skel:run({farm, Inc, 3}, [1,2,3,4,5,6]).
% -> [2,5,3,6,4,7]   (same items, possibly reordered)
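Since skeletons nest freely, a farm worker can itself be another skeleton. A minimal sketch (not on the slides), reusing Inc and Double from the pipeline example: a farm of four workers where each worker is a two-stage pipeline.

Inc    = {seq, fun(X) -> X + 1 end},
Double = {seq, fun(X) -> X * 2 end},
%% Farm whose worker is a pipeline: 4 parallel copies of Inc followed by Double.
skel:run({farm, {pipe, [Inc, Double]}, 4}, [1,2,3,4,5,6]).
%% -> the same items as [4,6,8,10,12,14], possibly in a different order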

Page 24: ParaForming - Patterns and Refactoring for Parallel Programming

[Graph: Speedups for Matrix Multiplication, speedup vs. number of cores (1-24), comparing Naive Parallel, Farm, and Farm with Chunk 16.]

Using The Right Pattern Matters

Page 25: ParaForming - Patterns and Refactoring for Parallel Programming

The  ParaPhrase  Approach  

[Diagram: sequential code (Erlang, C/C++, Java, Haskell, ...) is transformed by refactoring, driven by a generic pattern library and costing/profiling information, into parallel code (Erlang, C/C++, Java, Haskell, ...) targeting heterogeneous hardware: AMD Opteron, Intel Core, Nvidia GPU, Intel GPU, Nvidia Tesla, Intel Xeon Phi, Mellanox Infiniband, ...]

Page 26: ParaForming - Patterns and Refactoring for Parallel Programming

Refactoring  

§ Refactoring changes the structure of the source code
  § using well-defined rules
  § semi-automatically, under programmer guidance

Page 27: ParaForming - Patterns and Refactoring for Parallel Programming

Refactoring: Farm Introduction


Farm  

S1 ∘ S2 ≡ Pipe(S1, S2)                               (pipe)
Map(S1 ∘ S2, d, r) ≡ Map(S1, d, r) ∘ Map(S2, d, r)   (map fission/fusion)
S ≡ Farm(S)                                          (farm intro/elim)
Map(F, d, r) ≡ Pipe(Decomp(d), Farm(F), Recomp(r))   (data2stream)
S1 ≡ Map(S′1, d, r)                                  (map intro/elim)

Figure 3.3: Some Standard Skeleton Equivalences
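Read in Skel terms, the farm intro/elim rule S ≡ Farm(S) says that farming a stage changes how a stream is processed, not what is computed. A hedged sketch (assuming the skel:run interface shown on the earlier slides): the farmed version should produce the same multiset of results as the single-stage pipeline, merely possibly reordered.

Square     = {seq, fun(X) -> X * X end},
Inputs     = lists:seq(1, 1000),
%% S: a single sequential stage ...
Sequential = skel:run({pipe, [Square]}, Inputs),
%% ... Farm(S): the same stage replicated across 8 workers.
Farmed     = skel:run({farm, Square, 8}, Inputs),
%% Same results up to the ordering of the output stream.
true       = (lists:sort(Sequential) =:= lists:sort(Farmed)).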

The following describes each of the patterns in turn:

• a MAP is made up of three OPERATIONs: a worker, a partitioner, and a combiner, followed by an INPUT;

• a SEQ is made up of a single OPERATION denoting the sequential computation to be performed, followed by an INPUT;

• a FARM is made up of a single OPERATION denoting the worker, an INPUT and the number of workers (NW); and,

• a PIPE is made up of at least one OPERATION denoting each stage of the pipeline, followed by an INPUT.

Each pattern has an IDENT attribute, allowing other patterns to reference it. This flattens the XML structure instead of nesting, providing a structure that is easier to read and reason about. Finally, an OPERATION is either a COMPONENT (describing the module, name and language for a unit of computation) or another pattern (described by a reference to the pattern's identifier).

3.2 Language-Independent Rewrite Rules

In this section we introduce a number of language-independent skeleton rewrite rules that will be used to motivate the refactoring rules given in the remainder of the chapter. Figure 3.3 shows four well-known equivalences. Here Pipe, ∘ and Map refer to the skeletons for parallel pipeline, function composition and parallel map, respectively, S denotes any skeleton, and Seq(e) denotes that e is a sequential computation. All skeletons work over streams. Decomp and Recomp are primitive operations that work over non-streaming inputs. d and r are simple functions that specify a decomposition (d :: a → [b]) and a recomposition (r :: [b] → a), as with the Partition and Combine functions above. Working from left to right in these equivalences introduces parallelism, while working from right to left reduces parallelism. Reading from left to right, we can therefore interpret the rules as follows:


Page 28: ParaForming - Patterns and Refactoring for Parallel Programming

Image  Processing  Example  


[Diagram: Read Image 1 and Read Image 2 → "white screening" → Merge Images → Write Image.]

[Example images: "{ Build" and "Stuff }", merged into "{ Build Stuff }".]

Page 29: ParaForming - Patterns and Refactoring for Parallel Programming

Basic  Erlang  Structure  

[ writeImage(convertImage(readImage(X))) || X <- images() ]

readImage({In1, In2, Out}) ->
    …
    {Image1, Image2, Out}.

convertImage({Image1, Image2, Out}) ->
    Image1P = whiteScreen(Image1),
    Image2P = mergeImages(Image1P, Image2),
    {Image2P, Out}.

writeImage({Image, Out}) -> …
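A hedged sketch of where the refactoring demo heads (not the demo code itself): the same stages expressed as a Skel workflow, with the convert/merge stage farmed. Function names follow the slide; the worker count is illustrative.

Read    = {seq, fun readImage/1},
Convert = {seq, fun convertImage/1},
Write   = {seq, fun writeImage/1},
%% Three-stage pipeline over the image stream, farming the expensive middle stage.
skel:run({pipe, [Read, {farm, Convert, 8}, Write]}, images()).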

   


Page 30: ParaForming - Patterns and Refactoring for Parallel Programming

Refactoring  Demo  


Page 31: ParaForming - Patterns and Refactoring for Parallel Programming

Refactoring  Demo  


Page 32: ParaForming - Patterns and Refactoring for Parallel Programming

Speedup  Results  (Image  Processing)  

[Graph: Speedups for the Haar Transform using a Skel task farm, speedup vs. number of farm workers (1-24), comparing a 1D Skel task farm, a 1D Skel task farm with chunk size 4, and a 2D Skel task farm.]

Page 33: ParaForming - Patterns and Refactoring for Parallel Programming

Large-Scale Demonstrator Applications

§ ParaPhrase tools are being used by commercial/end-user partners
  § SCCH (SME, Austria)
  § Erlang Solutions Ltd (SME, UK)
  § Mellanox (Israel)
  § ELTE-Soft, Hungary (SME)
  § AGH (University, Poland)
  § HLRS (High Performance Computing Centre, Germany)

Page 34: ParaForming - Patterns and Refactoring for Parallel Programming

Speedup  Results  (demonstrators)  

[Graphs: (left) Speedups for Convolution, a two-stage pipeline of farms, speedup vs. number of workers in the second farm (1-16), for 1, 2, 4, 6, 8 and 10 workers in the first farm; (right) Speedups for Ant Colony Optimisation, BasicN2 and Graphical Lasso, refactored vs. manual versions, speedup vs. number of workers (1-24). Figure 3: Refactored Use Case Results in FastFlow.]

code and simply points the refactoring tool towards them. The actual parallelisation is then performed by the refactoring tool, supervised by the programmer. This can give significant savings in effort, of about one order of magnitude. This is achieved without major performance losses: as desired, the speedups achieved with the refactoring tool are approximately the same as for full-scale manual implementations by an expert. In future we expect to develop this work in a number of new directions, including adding advanced performance models to the refactoring process, thus allowing the user to accurately predict the parallel performance from applying a particular refactoring with a specified number of threads. This may be particularly useful when porting the applications to different architectures, including adding refactoring support for GPU programming in OpenCL. Also, once sufficient automation of the refactoring tool is achieved, the best parametrisation regarding parallel efficiency can be determined via optimisation, further facilitating this approach. In addition, we also plan to implement more skeletons, particularly in the field of computer algebra and physics, and demonstrate the refactoring approach with these new skeletons on a wide range of realistic applications. This will add to the evidence that our approach is general, usable and scalable. Finally, we intend to investigate the limits of scalability that we have observed for some of our use cases, aiming to determine whether the limits are hardware artefacts or algorithmic.

REFERENCES

[1] M. Aldinucci, M. Danelutto, P. Kilpatrick and M. Torquati. FastFlow: High-Level and Efficient Streaming on Multi-Core. Programming Multi-core and Many-core Computing Systems. Parallel and Distributed Computing, Chap. 13, 2013. Wiley.

[2] Michael P. Allen. Introduction to Molecular Dynamics Simulation. Computational Soft Matter: From Synthetic Polymers to Proteins, 23:1–28, 2004.

[3] M. den Besten, T. Stuetzle, M. Dorigo. Ant Colony Optimization for the Total Weighted Tardiness Problem. PPSN 6, pp. 611–620, Sept. 2000.

[4] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick, and A. Elliott. Cost-Directed Refactoring for Parallel Erlang Programs. In International Journal of Parallel Programming, HLPP 2013 Special Issue. Springer, Paris, September 2013. DOI 10.1007/s10766-013-0266-5.

[5] C. Brown, K. Hammond, M. Danelutto, and P. Kilpatrick. A Language-Independent Parallel Refactoring Framework. In Proc. of the Fifth Workshop on Refactoring Tools (WRT '12), pages 54–58. ACM, New York, USA, 2012.

[6] C. Brown, H. Li, and S. Thompson. An Expression Processor: A Case Study in Refactoring Haskell Programs. Eleventh Symp. on Trends in Func. Prog., May 2010.

[7] C. Brown, H. Loidl, and K. Hammond. Paraforming: Forming Haskell Programs using Novel Refactoring Techniques. 12th Symp. on Trends in Func. Prog., Spain, May 2011.

[8] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick, H. Schöner, and T. Breddin. Paraphrasing: Generating Parallel Programs Using Refactoring. In 10th International Symposium, FMCO 2011, Turin, Italy, October 3-5, 2011. Revised Selected Papers. Springer Berlin Heidelberg, pages 237–256.

[9] R. M. Burstall and J. Darlington. A Transformation System for Developing Recursive Programs. J. of the ACM, 24(1):44–67, 1977.

[10] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computations. Research Monographs in Par. and Distrib. Computing. Pitman, 1989.

[11] M. Cole. Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming. Par. Computing, 30(3):389–406, 2004.

[12] D. Dig. A Refactoring Approach to Parallelism. IEEE Softw., 28:17–22, January 2011.

[13] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 9(3):432–441, July 2008.

[14] R. Loogen, Y. Ortega-Mallén, and R. Peña-Marí. Parallel Func. Prog. in Eden. J. of Func. Prog., 15(3):431–475, 2005.

[15] T. Mens and T. Tourwé. A Survey of Software Refactoring. IEEE Trans. Softw. Eng., 30(2):126–139, 2004.

[16] H. Partsch and R. Steinbruggen. Program Transformation Systems. ACM Comput. Surv., 15(3):199–236, 1983.

[17] K. Hammond, M. Aldinucci, C. Brown, F. Cesarini, M. Danelutto, H. Gonzalez-Velez, P. Kilpatrick, R. Keller, T. Natschlager, and G. Shainer. The ParaPhrase Project: Parallel Patterns for Adaptive Heterogeneous Multicore Systems. FMCO, Feb. 2012.

[18] K. Hammond, J. Berthold, and R. Loogen. Automatic Skeletons in Template Haskell. Parallel Processing Letters, 13(3):413–424, September 2003.

[19] W. Opdyke. Refactoring Object-Oriented Frameworks. PhD Thesis, Dept. of Comp. Sci., University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1992.

[20] T. Sheard and S. P. Jones. Template Meta-Programming for Haskell. SIGPLAN Not., 37:60–75, December 2002.

[21] D. B. Skillicorn and W. Cai. A Cost Calculus for Parallel Functional Programming. J. Parallel Distrib. Comput., 28(1):65–83, 1995.

[22] J. Wloka, M. Sridharan, and F. Tip. Refactoring for Reentrancy. In ESEC/FSE '09, pages 173–182, Amsterdam, 2009. ACM.

Speedup close to or better than manual optimisation

Page 35: ParaForming - Patterns and Refactoring for Parallel Programming

Bowtie2: most widely used DNA alignment tool

[Graphs: Bowtie2 speedups, comparing the original Bowtie2 (Bt2) with the ParaPhrase/FastFlow version (Bt2FF-pin+int): speedup vs. read length (20-110) and speedup vs. quality (28-40).]

C. Misale. Accelerating Bowtie2 with a lock-less concurrency approach and memory affinity. IEEE PDP 2014. To appear.

Page 36: ParaForming - Patterns and Refactoring for Parallel Programming

Comparison  of  Development  Times  


figure. The convolution is defined as a two-stage pipeline, with the first stage being a farm that generates the images (G), and the second stage a farm that filters the images (F). In the convolution, the maximum speedup obtained from the refactored version is 6.59 with 2 workers in the first farm and 8 workers in the second farm. There are also three workers for the pipeline stages (two for the farm stages, and one for the initial streaming stage), plus threads for the load balancers in the farm, giving a total of 15 threads. Here, the nature of the application may limit the scalability. The second stage of the pipeline dominates the computation: the first stage takes on average 0.6 seconds and the second stage takes around 7 seconds to process one image, resulting in a substantial bottleneck in the second stage.

The Ant Colony Optimisation scales linearly to give a speedup of 7.29 using 8 farm workers, after which the performance starts to drop noticeably. We observe only relatively modest speedups (between 3 and 4) with more than 12 farm workers. The Ant Colony Optimisation is quite a memory-intense application, and all of the farm workers need to access a large amount of shared data (processing times, deadlines and weights of jobs, together with the τ matrix), especially since we are considering instances where the number of jobs to be scheduled is large. We suspect that the drop in performance is due to expensive memory accesses for those farm workers that are placed on remote cores (i.e., not in the same processor package). The fact that the decrease in performance occurs at about 10 farm workers confirms this: this is exactly the point where not all of the farm workers can be placed on cores from one package³. However, more testing is needed to precisely pinpoint the reasons for the drop in performance.

For the BasicN2 use case (shown in Listing 4 in the right column), the refactored version achieves a speedup of 21.2 with 24 threads. The application scales well, with near-linear speedups (up to 11.15 with 12 threads). After 12 threads, the speedups decrease slightly, most likely because the refactored code dynamically allocates memory for the tasks during the computation, resulting in some small overhead. FastFlow also reserves two workers: one worker for the load balancer and one for the farm skeleton, so the maximum speedup achievable with 24 threads is only 21.2. BasicN2 gives scalable speedups due to its data-parallel structure, where each task is independent, and the main computation over each task dominates the computation. In Listing 4, we also show the manually parallelised version in FastFlow, which achieves comparable speedups of 22.68 with 24 threads. The manual version achieves slightly better speedups due to the fact that only one FastFlow farm is introduced in the code. However, in the refactored version, due to the refactoring tool's limitations, we introduce two FastFlow farms with an additional synchronisation point between them. The refactoring tool does not yet have provision to merge two routines into one, which can be achieved by an experienced C++ programmer.

The Graphical Lasso use case gives a scalable speedup of 9.8 for 16 cores, and stagnates afterwards. This is similar to the manually ported FastFlow code, and to results obtained with an OpenMP port (where we achieved a maximum speedup of 11.3 on 16 cores). Although the tasks parallelised here are, in principle, independent, we expected significant deviation from

³ FastFlow reserves some cores for load balancing, the farm emitter/collector.

Use case          Manual time   Refactoring time   LOC introduced
Convolution       3 days        3 hours            58
Ant Colony        1 day         1 hour             32
BasicN2           5 days        5 hours            40
Graphical Lasso   15 hours      2 hours            53

Figure 3. Approximate manual implementation time of use cases vs. refactoring time, with lines of code introduced by the refactoring tool.

linear scaling for higher numbers of cores, because of cache synchronisation (disjunct but interleaving memory regions are updated in the tasks), and an uneven size combined with a limited number of tasks (48). At the end of the computation, some cores will wait idly for the completion of remaining tasks. Consequently, the observed performance matched our expectations, providing considerable speedup with a small investment in manual code changes.

Table 3 shows approximate porting metrics for each use case, with the time taken to implement the manual parallel FastFlow implementation by an expert, the time to parallelise the sequential version using the refactoring tool, and the lines of code introduced by the refactoring tool. Clearly the refactoring tool gives an enormous saving in effort over the manual implementation of the FastFlow code.

V. RELATED WORK

Refactoring has a long history, with early work in the field being described by Partsch and Steinbruggen in 1983 [16], and Mens and Tourwé producing a survey of refactoring tools and techniques in 2004 [15]. The first refactoring tool system was the fold/unfold system of Burstall and Darlington [7], which was intended to transform recursively defined functions. There has so far been only a limited amount of work on refactoring for parallelism [12]. We have previously [13] used Template Haskell [17] with explicit cost models to derive automatic farm skeletons for Eden [14]. Unlike the approach presented here, Template Haskell is compile-time, meaning that the programmer cannot continue to develop and maintain his/her program after the skeleton derivation has taken place. In [2], we introduced a parallel refactoring methodology for Erlang programs, demonstrating a refactoring tool that introduces and tunes parallelism for skeletons in Erlang. Unlike the work presented here, the technique is limited to Erlang, is demonstrated on a small and limited set of examples, and we did not evaluate reductions in development time. Other work on parallel refactoring has mostly considered loop parallelisation in Fortran [19] and Java [10]. However, these approaches are limited to concrete and fairly simple structural changes (such as loop unrolling) rather than applying high-level pattern-based rewrites as we have described here. We have recently extended HaRe, the Haskell refactorer [5], to deal with a limited number of parallel refactorings [6]. This work allows Haskell programmers to introduce data and task parallelism using small structural refactoring steps. However, it does not use pattern-based rewriting or cost-based direction, as discussed here. A preliminary proposal for a language-independent refactoring tool was presented in [3], for assisting programmers with introducing and tuning parallelism. However, that work focused on building a refactoring tool supporting multiple languages and paradigms, rather than on refactorings that introduce and tune parallelism using algorithm skeletons, as in this paper.

Page 37: ParaForming - Patterns and Refactoring for Parallel Programming

Heterogeneous Parallel Programming [RGU / USTAN]

[Workflow diagram: 1. Identify the initial structure of the application code (e.g. int main() { ... for (int i = 0; i < N; i++) f(g(x)); ... }). 2. Enumerate skeleton configurations (farm, pipeline, ...). 3. Filter the configurations using a cost model. 4. Apply MCTS, using profile information, to map CPU and GPU components. 5. Choose the optimal mapping/configuration. 6. Refactor the application into the optimal parallel configuration with mappings (e.g. Farm1 = Farm(f, 8, 2); Pipe(farm1, GPU(g));). 7. Execute on the heterogeneous machine (CPUs and GPUs).]

Page 38: ParaForming - Patterns and Refactoring for Parallel Programming

Example: Enumerate Skeleton Configurations for Image Convolution

r : read image file
p : process image file

Candidate configurations: r p, Δ(r p), Δ(r) p, r Δ(p), Δ(r)Δ(p), r || p, r || Δ(p)
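As a hedged illustration (the component names are hypothetical), some of these configurations could be written down as Skel-style workflows, which is what the enumeration step would generate before the cost model filters them:

R  = {seq, fun read_image/1},       %% r : read image file
P  = {seq, fun process_image/1},    %% p : process image file
RP = {seq, fun(X) -> process_image(read_image(X)) end},   %% r p (composed)

Configurations =
  [ RP                              %% r p
  , {farm, RP, 4}                   %% Δ(r p)
  , {pipe, [R, P]}                  %% r || p
  , {pipe, [R, {farm, P, 4}]}       %% r || Δ(p)
  ].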

Page 39: ParaForming - Patterns and Refactoring for Parallel Programming

Results on Benchmark: Image Convolution

MCTS Mapping (C, G): (6, 0) || (0, 3)

Speedup: 39.12. Best Speedup: 40.91

Page 40: ParaForming - Patterns and Refactoring for Parallel Programming

Conclusions  

§ The manycore revolution is upon us
  § Computer hardware is changing very rapidly (more than in the last 50 years)
  § The megacore era is here (aka exascale, BIG data)

§ Heterogeneity and energy are both important

§ Most programming models are too low-level
  § concurrency based
  § need to expose mass parallelism

§ Patterns and functional programming help with abstraction
  § millions of threads, easily controlled

 

Page 41: ParaForming - Patterns and Refactoring for Parallel Programming

Conclusions  (2)  

§ Functional programming makes it easy to introduce parallelism
  § No side effects means any computation could be parallel
  § Matches pattern-based parallelism
  § Much detail can be abstracted

§ Lots of problems can be avoided
  § e.g. freedom from deadlock
  § Parallel programs give the same results as sequential ones!

§ Automation is very important
  § Refactoring dramatically reduces development time (while keeping the programmer in the loop)
  § Machine learning is very promising for determining complex performance settings

   

 

Page 42: ParaForming - Patterns and Refactoring for Parallel Programming

But  isn’t  this  all  just  wishful  thinking?  


Rampant-Lambda-Men in St Andrews

Page 43: ParaForming - Patterns and Refactoring for Parallel Programming

NO!  

§ C++11 has lambda functions (and some other nice functional-inspired features)

§ Java 8 will have lambda (closures)
§ Apple uses closures in Grand Central Dispatch


Page 44: ParaForming - Patterns and Refactoring for Parallel Programming

ParaPhrase  Parallel  C++  Refactoring  

§ Integrated into Eclipse
§ Supports the full C++(11) standard
§ Uses strongly hygienic components
  § functional encapsulation (closures)


Page 45: ParaForming - Patterns and Refactoring for Parallel Programming

Image Convolution


Step 1: Introduce Components

Component<ff_im> genStage(generate);
Component<ff_im> filterStage(filter);
for (int i = 0; i < NIMGS; i++) {
  r1 = genStage.callWorker(new ff_im(images[i]));
  results[i] = filterStage.callWorker(new ff_im(r1));
}

Step 2: Introduce Pipeline

ff_pipeline pipe;
StreamGen streamgen(NIMGS, images);
pipe.add_stage(&streamgen);
pipe.add_stage(new genStage);
pipe.add_stage(new filterStage);
pipe.run_and_wait_end();

Step 3: Introduce Farm

ff_farm<> gen_farm;
gen_farm.add_collector(NULL);
std::vector<ff_node*> gw;
for (int i = 0; i < nworkers; i++) gw.push_back(new gen_stage);
gen_farm.add_workers(gw);
ff_pipeline pipe;
StreamGen streamgen(NIMGS, images);
pipe.add_stage(&streamgen);
pipe.add_stage(&gen_farm);
pipe.add_stage(new filterStage);
pipe.run_and_wait_end();

Step 4: Introduce Farm

ff_farm<> gen_farm;
gen_farm.add_collector(NULL);
std::vector<ff_node*> gw;
for (int i = 0; i < nworkers; i++) gw.push_back(new gen_stage);
gen_farm.add_workers(gw);
ff_farm<> filter_farm;
filter_farm.add_collector(NULL);
std::vector<ff_node*> gw2;
for (int i = 0; i < nworkers2; i++) gw2.push_back(new CPU_Stage);
filter_farm.add_workers(gw2);
StreamGen streamgen(NIMGS, images);
ff_pipeline pipe;
pipe.add_stage(&streamgen);
pipe.add_stage(&gen_farm);
pipe.add_stage(&filter_farm);
pipe.run_and_wait_end();

Figure 2. Refactoring the Image Convolution Algorithm (Algorithm 1), first introducing parallel components (Step 1), then introducing a parallel pipeline (Step 2), before adding two levels of farm (Steps 3 and 4).

pipeline waits for the result of the computation, using a run_and_wait_end() method call. Any dependencies between the output of genStage and the input of filterStage are detected automatically.

Every refactoring that introduces new code has a corresponding inverse: for example, Farm Elimination inverts Introduce Farm; and Pipeline Elimination inverts Introduce Pipeline. This allows any combination of rules to be inverted (undone). The refactorings are also fully nestable, allowing, for example, a farm such as Farm(f1 ∘ f2) (a task farm whose worker is a function composition, ∘, of two components, f1 and f2) to be refactored into Farm(f1 ∥ f2) (transforming the composition into a parallel pipeline, ∥). In this way farms and pipelines may be formed and reformed in any way that is necessary for the parallel application.

Our refactoring prototype² is implemented in Eclipse, under the CDT plugin. The programmer is presented with a menu of possible refactorings to apply. The decision to apply a refactoring and introduce a skeleton is made by the programmer. Once a decision has been made, all required code transformations are performed automatically. We therefore rely on the programmer to make informed decisions, and can exploit any knowledge/expertise that they may have, but do not require him/her to have deep expertise with parallelism.

² Available at: http://www.cs.st-andrews.ac.uk/~chrisb/refactorer.zip

Algorithm 1: Sequential Convolution Before Component Introduction

for (int i = 0; i < NIMGS; i++) {
  r1 = generate(new ff_image(images[i]));
  results[i] = filter(new ff_image(r1));
}

III. USE CASES

In this section we illustrate the use of the refactorings above on a set of medium-scale realistic benchmarks. For each use case, we start with the original sequential implementation in C++, and apply the refactorings from Section II in order to obtain an optimised parallel version. The refactoring process demonstrated here relies on programmer knowledge to know when and where to apply the refactorings, based on profiling information and knowledge about the patterns from Section I-B, and following the methodology of Section I-A. In order to properly evaluate our refactoring tool, parallelisation was performed twice for each use case: once using the refactoring tool and once on a purely manual basis. The refactoring-based parallelisation will be discussed in detail below. In all cases, both versions give almost identical performance. However, the development time using refactoring was much faster, giving a clear and significant advantage of nearly one order of magnitude over the manual implementation.

Page 46: ParaForming - Patterns and Refactoring for Parallel Programming

Refactoring  C++  in  Eclipse  


Page 47: ParaForming - Patterns and Refactoring for Parallel Programming
Page 48: ParaForming - Patterns and Refactoring for Parallel Programming

Funded  by  

• ParaPhrase (EU FP7), Patterns for heterogeneous multicore, €4.2M, 2011-2014

• SCIEnce (EU FP6), Grid/Cloud/Multicore coordination, €3.2M, 2005-2012

• Advance (EU FP7), Multicore streaming, €2.7M, 2010-2013

• HPC-GAP (EPSRC), Legacy system on thousands of cores, £1.6M, 2010-2014

• Islay (EPSRC), Real-time FPGA streaming implementation, £1.4M, 2008-2011

• TACLE: European Cost Action on Timing Analysis, €300K, 2012-2015


Page 49: ParaForming - Patterns and Refactoring for Parallel Programming

Some of our Industrial Connections

Mellanox Inc.
Erlang Solutions Ltd
SAP GmbH, Karlsruhe
BAE Systems
Selex Galileo
BioID GmbH, Stuttgart
Philips Healthcare
Software Competence Centre, Hagenberg
Microsoft Research
Well-Typed LLP


Page 50: ParaForming - Patterns and Refactoring for Parallel Programming

ParaPhrase  Needs  You!  

• Please join our mailing list and help grow our user community
  § news items
  § access to free development software
  § chat to the developers
  § free developer workshops
  § bug tracking and fixing
  § tools for both Erlang and C++

• Subscribe at https://mailman.cs.st-andrews.ac.uk/mailman/listinfo/paraphrase-news

• We're also looking for open source developers...

• We also have 8 PhD studentships...

Page 51: ParaForming - Patterns and Refactoring for Parallel Programming

Further  Reading  

Chris Brown, Hans-Wolfgang Loidl and Kevin Hammond. "ParaForming: Forming Parallel Haskell Programs using Novel Refactoring Techniques". Proc. 2011 Trends in Functional Programming (TFP), Madrid, Spain, May 2011.

Henrique Ferreiro, David Castro, Vladimir Janjic and Kevin Hammond. "Repeating History: Execution Replay for Parallel Haskell Programs". Proc. 2012 Trends in Functional Programming (TFP), St Andrews, UK, June 2012.

Chris Brown, Marco Danelutto, Kevin Hammond, Peter Kilpatrick and Sam Elliot. "Cost-Directed Refactoring for Parallel Erlang Programs". To appear in International Journal of Parallel Programming, 2013.

Chris Brown, Vladimir Janjic, Kevin Hammond, Mehdi Goli and John McCall. "Bridging the Divide: Intelligent Mapping for the Heterogeneous Parallel Programmer". Submitted to IPDPS 2014.

Vladimir Janjic, Chris Brown, Max Neunhoffer, Kevin Hammond, Steve Linton and Hans-Wolfgang Loidl. "Space Exploration using Parallel Orbits". Proc. PARCO 2013: International Conf. on Parallel Computing, Munich, Sept. 2013.

Ask me for copies! Many technical results are also on the project web site: free for download!

Page 52: ParaForming - Patterns and Refactoring for Parallel Programming
Page 53: ParaForming - Patterns and Refactoring for Parallel Programming

In Preparation

Page 54: ParaForming - Patterns and Refactoring for Parallel Programming

THANK  YOU!  

http://www.paraphrase-ict.eu

@paraphrase_fp7

http://www.project-advance.eu
