ParaFormance™: Democratising Parallel Software

Chris Brown
@paraformance
www.paraformance.com
[email protected]

Transcript of ParaFormance™: Democratising Parallel Software

A Scottish Startup

• £600k Scottish Enterprise grant money so far…
• … built on over £7M of EU funding.
• Looking to spin out from the University of St Andrews
• A team of 4 full-time software developers
• Looking for pre-revenue investment
• Looking for triallists!

Cyber Physical Systems

• “Kilo-Core”

• 1000 independent programmable processors

• Designed by a team at the University of California, Davis

• Executes up to 1.78 trillion instructions per second and contains 621 million transistors

• Each processor is independently clocked and can shut itself down to further save energy

• The 1,000 processors execute 115 billion instructions per second using only 0.7 Watts

• Powered by a single AA battery

The world’s first 1,000-core processor

Multi-core Software is Difficult!

Multi-Threaded Programming

C++

ParaFormance™ Technology

Discovery

Refactoring

Repair

ParaFormance™


Windows, Mac OS X, Linux

Parallel Libraries

1. OpenMP: pragma based
2. Intel TBB: pattern based
3. Others: MPI, Pthreads, FastFlow, …
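To illustrate the difference in style between the first two, here is a minimal sketch (not ParaFormance output) of the same scaling loop parallelised once with an OpenMP pragma and once with the TBB parallel_for pattern; the function and variable names are invented for the example.

#include <cstddef>
#include <vector>
#include <tbb/parallel_for.h>

// OpenMP: keep the loop as it is and annotate it with a pragma.
void scale_openmp(std::vector<double>& v, double k) {
    #pragma omp parallel for
    for (long i = 0; i < static_cast<long>(v.size()); ++i)
        v[i] *= k;
}

// Intel TBB: express the same computation as a pattern (parallel_for),
// passing the iteration range and the loop body as arguments.
void scale_tbb(std::vector<double>& v, double k) {
    tbb::parallel_for(std::size_t(0), v.size(),
                      [&](std::size_t i) { v[i] *= k; });
}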


Parallelism Discovery

• Profiles the execution of the application
• Locates “hot spots” of computation
• The goal is to find instances of parallel patterns and inform the user of the “best” pattern to choose
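As an illustration of what a discovered pattern instance looks like, the hypothetical hot spots below contrast a loop whose iterations are independent (a natural parallel-map candidate) with one that carries a dependency between iterations and is better reported as a reduction. The examples are ours, not ParaFormance output.

#include <cstddef>
#include <vector>

// Independent iterations: each pix[i] depends only on pix[i], so this hot
// spot matches a parallel map (a "parallel for" / farm pattern).
void brighten(std::vector<float>& pix, float gain) {
    for (std::size_t i = 0; i < pix.size(); ++i)
        pix[i] *= gain;
}

// Loop-carried dependency: each iteration reads the value produced by the
// previous one, so a plain map does not apply; this is better treated as a
// reduction (or left sequential).
float running_total(const std::vector<float>& xs) {
    float acc = 0.0f;
    for (std::size_t i = 0; i < xs.size(); ++i)
        acc += xs[i];
    return acc;
}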

Safety Checking

Checks code for potential thread-safety violations using static analysis:

• Race conditions
• Array collisions
• Variable accesses
• Private variables
• Critical regions
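For example, the kind of race condition such a check is meant to flag can be sketched as follows (an illustrative example of ours, not tool output): if the loop below is naively run in parallel, all threads read and update the same shared variable without synchronisation.

#include <cstddef>
#include <vector>

// Under parallel execution, every thread reads and writes the shared
// variable max_val without synchronisation, so the final value depends on
// thread scheduling: a classic race condition.
float max_value(const std::vector<float>& xs) {
    float max_val = xs.empty() ? 0.0f : xs[0];
    for (std::size_t i = 1; i < xs.size(); ++i) {
        if (xs[i] > max_val)   // unsynchronised read of shared state
            max_val = xs[i];   // unsynchronised write of shared state
    }
    return max_val;
}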

Automatic Repairing

• Repairs code to make it ‘thread safe’
• Refactors code to remove potential sources of thread-safety violations
• Introduces local variables
• Repairs array collisions
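A hypothetical before/after sketch of the “introduces local variables” repair (the code is illustrative, not generated by the tool): moving the temporary into the loop body removes the shared state that made the loop unsafe to parallelise.

#include <cmath>
#include <cstddef>
#include <vector>

// Before: the temporary is declared outside the loop, so every iteration
// shares it; running the iterations in parallel would race on tmp.
void smooth_shared(std::vector<double>& v) {
    double tmp;
    for (std::size_t i = 0; i < v.size(); ++i) {
        tmp = std::sqrt(v[i]);
        v[i] = 0.5 * tmp;
    }
}

// After the repair: the temporary is local to each iteration, so the loop
// body no longer touches shared mutable state and can safely become a
// parallel map.
void smooth_private(std::vector<double>& v) {
    for (std::size_t i = 0; i < v.size(); ++i) {
        double tmp = std::sqrt(v[i]);
        v[i] = 0.5 * tmp;
    }
}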

Refactoring

• Rewrites code into a parallel version
• Portable across a range of different types of parallelism: TBB, OpenMP, Pthreads, etc.


Demonstration

Examples of ParaFormance

ParaFormance™ is designed to be general, and we have tried it on many different types of application: machine learning, ant colony optimisation, linear programming, image processing, CFD, …


Weather Forecasting

Initial results of ParaFormance™ on a weather forecasting application:

• 2.5 million lines of code
• 300+ files
• 1,200+ potential sources of parallelism
• ParaFormance narrows these down to 27 possible parallelism sites
• 1 month of manual effort reduced to only 5 minutes!

Comparison of Development Times

                  Manual Time    Refactoring Time
Convolution       24 hours       3 hours
Ant Colony        8 hours        1 hour
Basic N2          40 hours       5 hours
Graphical Lasso   15 hours       2 hours

Comparable Performance


[Figure 3. Refactored Use Case Results in FastFlow: speedup plots for Convolution (speedup vs. number of workers, for several worker configurations) and for Ant Colony Optimisation, BasicN2 and Graphical Lasso (refactored and manual versions, speedup vs. number of workers).]

… code and simply points the refactoring tool towards them. The actual parallelisation is then performed by the refactoring tool, supervised by the programmer. This can give significant savings in effort, of about one order of magnitude. This is achieved without major performance losses: as desired, the speedups achieved with the refactoring tool are approximately the same as for full-scale manual implementations by an expert. In future we expect to develop this work in a number of new directions, including adding advanced performance models to the refactoring process, thus allowing the user to accurately predict the parallel performance from applying a particular refactoring with a specified number of threads. This may be particularly useful when porting the applications to different architectures, including adding refactoring support for GPU programming in OpenCL. Also, once sufficient automation of the refactoring tool is achieved, the best parametrisation regarding parallel efficiency can be determined via optimisation, further facilitating this approach. In addition, we also plan to implement more skeletons, particularly in the fields of computer algebra and physics, and demonstrate the refactoring approach with these new skeletons on a wide range of realistic applications. This will add to the evidence that our approach is general, usable and scalable. Finally, we intend to investigate the limits of scalability that we have observed for some of our use cases, aiming to determine whether the limits are hardware artefacts or algorithmic.

REFERENCES

[1] M. Aldinucci, M. Danelutto, P. Kilpatrick, and M. Torquati. FastFlow: High-Level and Efficient Streaming on Multi-Core. In Programming Multi-core and Many-core Computing Systems, Parallel and Distributed Computing, Chap. 13. Wiley, 2013.

[2] Michael P. Allen. Introduction to Molecular Dynamics Simulation. Computational Soft Matter: From Synthetic Polymers to Proteins, 23:1–28, 2004.

[3] M. den Besten, T. Stuetzle, and M. Dorigo. Ant Colony Optimization for the Total Weighted Tardiness Problem. PPSN 6, pages 611–620, Sept. 2000.

[4] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick, and A. Elliott. Cost-Directed Refactoring for Parallel Erlang Programs. International Journal of Parallel Programming, HLPP 2013 Special Issue. Springer, September 2013. DOI 10.1007/s10766-013-0266-5.

[5] C. Brown, K. Hammond, M. Danelutto, and P. Kilpatrick. A Language-Independent Parallel Refactoring Framework. In Proc. of the Fifth Workshop on Refactoring Tools (WRT ’12), pages 54–58. ACM, New York, USA, 2012.

[6] C. Brown, H. Li, and S. Thompson. An Expression Processor: A Case Study in Refactoring Haskell Programs. Eleventh Symp. on Trends in Func. Prog., May 2010.

[7] C. Brown, H. Loidl, and K. Hammond. Paraforming: Forming Haskell Programs using Novel Refactoring Techniques. 12th Symp. on Trends in Func. Prog., Spain, May 2011.

[8] C. Brown, K. Hammond, M. Danelutto, P. Kilpatrick, H. Schöner, and T. Breddin. Paraphrasing: Generating Parallel Programs Using Refactoring. In 10th International Symposium, FMCO 2011, Turin, Italy, October 3–5, 2011, Revised Selected Papers, pages 237–256. Springer, Berlin/Heidelberg.

[9] R. M. Burstall and J. Darlington. A Transformation System for Developing Recursive Programs. J. of the ACM, 24(1):44–67, 1977.

[10] M. Cole. Algorithmic Skeletons: Structured Management of Parallel Computations. Research Monographs in Par. and Distrib. Computing. Pitman, 1989.

[11] M. Cole. Bringing Skeletons out of the Closet: A Pragmatic Manifesto for Skeletal Parallel Programming. Par. Computing, 30(3):389–406, 2004.

[12] D. Dig. A Refactoring Approach to Parallelism. IEEE Softw., 28:17–22, January 2011.

[13] Jerome Friedman, Trevor Hastie, and Robert Tibshirani. Sparse Inverse Covariance Estimation with the Graphical Lasso. Biostatistics, 9(3):432–441, July 2008.

[14] R. Loogen, Y. Ortega-Mallén, and R. Peña-Marí. Parallel Func. Prog. in Eden. J. of Func. Prog., 15(3):431–475, 2005.

[15] T. Mens and T. Tourwé. A Survey of Software Refactoring. IEEE Trans. Softw. Eng., 30(2):126–139, 2004.

[16] H. Partsch and R. Steinbruggen. Program Transformation Systems. ACM Comput. Surv., 15(3):199–236, 1983.

[17] K. Hammond, M. Aldinucci, C. Brown, F. Cesarini, M. Danelutto, H. Gonzalez-Velez, P. Kilpatrick, R. Keller, T. Natschlager, and G. Shainer. The ParaPhrase Project: Parallel Patterns for Adaptive Heterogeneous Multicore Systems. FMCO, Feb. 2012.

[18] K. Hammond, J. Berthold, and R. Loogen. Automatic Skeletons in Template Haskell. Parallel Processing Letters, 13(3):413–424, September 2003.

[19] W. Opdyke. Refactoring Object-Oriented Frameworks. PhD Thesis, Dept. of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, USA, 1992.

[20] T. Sheard and S. P. Jones. Template Meta-Programming for Haskell. SIGPLAN Not., 37:60–75, December 2002.

[21] D. B. Skillicorn and W. Cai. A Cost Calculus for Parallel Functional Programming. J. Parallel Distrib. Comput., 28(1):65–83, 1995.

[22] J. Wloka, M. Sridharan, and F. Tip. Refactoring for Reentrancy. In ESEC/FSE ’09, pages 173–182, Amsterdam, 2009. ACM.

Image Convolution – 20 Cores!


[Fig. 10. Image Convolution speedups on titanic, xookik and power8: speedup vs. number of threads for OpenMP, TBB and FastFlow (FF), each in the (s | m), (m | m) and m configurations, where | is a parallel pipeline, m is a parallel map and s is a sequential stage.]

1  for (j = 0; j < num_iter; j++) {
2    for (i = 0; i < num_ants; i++)
3      cost[i] = solve(i, p, d, w, t);
4    best_t = pick_best(&best_result);
5    for (i = 0; i < n; i++)
6      t[i] = update(i, best_t, best_result);
7  }

Since pick_best in Line 4 cannot start until all of the ants have computed their solutions, and the for loop that updates t cannot start until pick_best finishes, we have implicit ordering in the code above. Therefore, the structure can be described in the RPL with:

seq (solve) ; pick_best ; seq (update)

where ; denotes the ordering between computations. Due to the ordering between solve, pick_best and update, the only way to parallelise the sequential code is to convert seq (solve) and/or seq (update) into maps. Therefore, the possible parallelisations are:

1) map (solve) ; pick_best ; update
2) solve ; pick_best ; map (update)
3) map (solve) ; pick_best ; map (update)
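As a concrete illustration (not taken from the paper), parallelisation 1) could be realised with OpenMP, one of the target libraries, roughly as follows; the fragment mirrors the listing above, and the signatures of solve, pick_best and update are assumed for the sketch.

// Parallelisation 1): the solve loop becomes a parallel map, while
// pick_best and the update loop remain sequential.
for (j = 0; j < num_iter; j++) {
    #pragma omp parallel for            // map (solve): ant solutions are independent
    for (i = 0; i < num_ants; i++)
        cost[i] = solve(i, p, d, w, t);

    best_t = pick_best(&best_result);   // sequential: needs all costs first

    for (i = 0; i < n; i++)             // sequential update of t
        t[i] = update(i, best_t, best_result);
}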

Since solve dominates computing time, we are going to consider only parallelisations 1) and 3). Speedups for these two parallelisations, on titanic, xookik and power, with a varying number of CPU threads used, are given in Figure 11. In the figure, we denote a map by m and a sequential stage by s. Therefore, (m ; s ; s) denotes that solve is a parallel map, pick_best is sequential and update is also sequential. From Figure 11, we can observe (similarly to the Image Convolution example) that speedups are similar for all parallel libraries. The only exception is FastFlow on power, which gives slightly better speedup than the other libraries. Furthermore, both parallelisations give approximately the same speedups, with the (m ; s ; m) parallelisation using more resources (threads) altogether. This indicates that it is not always the best idea to parallelise everything that can be parallelised. Finally, we can note that none of the libraries is able to achieve linear speedups, and on each system speedups tail off after a certain number of threads is used. This is due to the fact that a lot of data is shared between threads, and data access is slower for cores that are farther from the data. The maximum speedups achieved are 12, 11 and 16 on titanic, xookik and power, respectively.

VI. RELATED WORK

Early work in refactoring has been described in [22]. A good survey (as of 2004) can be found in [17]. There has so far been only a limited amount of work on refactoring for parallelism [4]. In [5], a parallel refactoring methodology for Erlang programs, including a refactoring tool, is introduced for skeletons in Erlang. Unlike the work presented here, the technique is limited to Erlang and does not evaluate reductions in development time. Other work on parallel refactoring has mostly considered loop parallelisation in Fortran [21] and Java [9]. However, these approaches are limited to concrete and simple structural changes (e.g. loop unrolling).

Parallel design patterns are provided as algorithmic skeletons in a number of different parallel programming frameworks [11], and several authors have advocated the widespread use of patterns for writing parallel applications [16], [20] after the well-known Berkeley report [6] identified parallel design patterns as a viable way to solve the problems related to developing parallel applications with traditional (low-level) parallel programming frameworks.

Within algorithmic skeleton research, there is a lot of work on improving extra-functional features of parallel programs by using pattern rewriting rules [15], [2], [13]. We use these rules to support design-space exploration in our system. Other authors use rewriting/refactoring to support efficient code generation from skeletons/patterns [12], a concept similar to our approach. Finally, [14] proposed a “parallel” embedded DSL based on annotations, unlike the external DSL we use here. Other authors have proposed DSL approaches to parallel programming [18], [25] similar to what we propose here, although the DSLs proposed are embedded DSLs and mostly aim at targeting heterogeneous CPU/GPU hardware.

VII. CONCLUSIONS AND FUTURE WORK

In this paper, we presented a high-level domain-specific language, the Refactoring Pattern Language (RPL), that can be used to concisely and efficiently capture parallel patterns, and therefore describe the parallel structure of an application. RPL can …


ParaFormance…

• Saves time and money
• Gets products to market faster
• De-risks for multi-core
• Requires less specialised software teams
• Increases developer team productivity
• Produces reliable software/products
• Allows more easily maintained projects

Why not give ParaFormance a free trial today?

[email protected]

www.paraformance.com