Scaling hybrid coarray/MPI miniapps on ARCHER
L. Cebamanos (1), A. Shterenlikht (2), J.D. Arregui-Mena (3), L. Margetts (3)
(1) Edinburgh Parallel Computing Centre (EPCC), The University of Edinburgh, King's Buildings, Edinburgh EH9 3FD, UK. Email: l.cebamanos@epcc.ed.ac.uk
(2) Department of Mechanical Engineering, The University of Bristol, Bristol BS8 1TR, UK. Email: mexas@bris.ac.uk
(3) School of Mechanical, Aero and Civil Engineering, The University of Manchester, Manchester M13 9PL, UK. Emails: jose.arregui-mena@manchester.ac.uk, Lee.Margetts@manchester.ac.uk
CUG2016, London, 8-12 May 2016
CGPACK - cellular automata microstructure simulation library: https://sourceforge.net/projects/cgpack
- Solidification, recrystallisation, and fracture of polycrystalline microstructures.
- Fortran 2008 coarrays + TS 18508 [1] extensions.
- HECToR, ARCHER, Intel, OpenCoarrays/GCC systems.
- BSD license
[Figure: {100} and {110} micro-cracks in individual crystals merge into a macro-crack.]
CGPACK design
- CA space coarray - 4D array, 3D corank - structured grid [2, 3, 4].
- Integer cell states
- Fixed or self-similar boundaries
- Traditional halo exchange
CGPACK space coarray:
integer, allocatable :: space(:,:,:,:)[:,:,:]
- Discrete space, discrete time
- Mesh independent results require ~10^5 CA cells per crystal on average [5].
- Crystal (grain) is a cluster of cells of the same value.
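Below is a minimal sketch of how the space coarray declared above might be allocated and one halo face exchanged. The extents, the [2,2,*] image grid and the neighbour logic are illustrative assumptions, not the CGPACK API.

program space_sketch
   implicit none
   integer, allocatable :: space(:,:,:,:)[:,:,:]
   integer, parameter :: c = 10  ! cells per image per dimension (illustrative)

   ! The 4th dimension holds cell state layers, e.g. grain and
   ! fracture states; indices 0 and c+1 are halo cells.
   ! Run on a number of images that fills the 2 x 2 x * grid.
   allocate( space( 0:c+1, 0:c+1, 0:c+1, 2 )[2,2,*] )
   space = 0
   sync all
   ! Traditional halo exchange, one face shown: pull the top
   ! layer of the neighbour below along codimension 3.
   if ( this_image( space, 3 ) .gt. 1 ) &
      space( 1:c, 1:c, 0, 1 ) = space( 1:c, 1:c, c, 1 ) &
         [ this_image( space, 1 ), this_image( space, 2 ), &
           this_image( space, 3 ) - 1 ]
   sync all
end program space_sketch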
CGPACK IO - unresolved
- MPI/IO speeds up to 2.3 GB/s on HECToR (Cray XE6) [6].
- MPI/IO can reach 14 GB/s on ARCHER (Cray XC30) [7].
- NetCDF (not yet implemented) - higher level of abstraction, sits on top of MPI/IO [8].
10^6 grains, 10^11 cells - 400 GB dataset, > 4 hours on 1000 ARCHER nodes (24k cores).
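For illustration of the MPI/IO approach (a hedged sketch, not the CGPACK writer), each rank can write its chunk of cell states to a shared file with a single collective call; the file name, chunk size and 4-byte integer assumption are all illustrative.

program mpiio_sketch
   use mpi
   implicit none
   integer :: ierr, rank, fh, n
   integer( kind=MPI_OFFSET_KIND ) :: disp
   integer, allocatable :: chunk(:)

   call MPI_Init( ierr )
   call MPI_Comm_rank( MPI_COMM_WORLD, rank, ierr )
   n = 1000                      ! cells per rank (illustrative)
   allocate( chunk( n ) ); chunk = rank
   call MPI_File_open( MPI_COMM_WORLD, 'space.raw', &
      MPI_MODE_WRONLY + MPI_MODE_CREATE, MPI_INFO_NULL, fh, ierr )
   ! Each rank writes at its own offset; the collective call lets
   ! the MPI library aggregate requests for bandwidth.
   disp = int( rank, MPI_OFFSET_KIND ) * n * 4_MPI_OFFSET_KIND
   call MPI_File_write_at_all( fh, disp, chunk, n, MPI_INTEGER, &
      MPI_STATUS_IGNORE, ierr )
   call MPI_File_close( fh, ierr )
   call MPI_Finalize( ierr )
end program mpiio_sketch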
CGPACK scaling
- Up to 32k cores on HECToR and ARCHER for solidification problems.
- Scaling varies for different programs built with CGPACK, depending on which routines are called, in what order, and requirements for synchronisation.
[Figure: CGPACK speed-up vs number of cores (log-log, 8 to 32768 cores) on HECToR XE6; curves: sync all, sync images serial, sync images d&c, co_sum.]
ParaFEM - scalable general purpose finite element library
- http://parafem.org.uk
- Fortran 90 + MPI
- Highly portable, many users [9]
- Excellent scaling
- BSD license
[Figure: ParaFEM scaling: time in seconds vs number of MPI processes (log-log, 10 to 100000); curves: actual and ideal.]
Cellular Automata Finite Element (CAFE)
- Used for solidification [10], recrystallisation [11] and fracture [12, 13].
- FE - continuum mechanics - stress, strain, etc.
- CA - crystals, crystal boundaries, cleavage, grain boundary fracture
- FE → CA - stress, strain
- CA → FE - damage variables
CAFE design: structured CA grid + unstructured FE grid
[Diagram: multi-scale model. The FE body (domain) is partitioned over MPI processes 1-4, while the material microstructure region is a CA space partitioned over coarray images 1-4: FE + CA = CAFE.]
Example with 4 PEs (4 MPI processes, 4 coarray images). Arrows are FE ↔ CA comms.
FE → CA mapping via a private allocatable array of derived type:

type mcen
   integer :: image, elnum
   real :: centr(3)
end type mcen
type( mcen ), allocatable :: lcentr(:)
based on coordinates of FE centroids calculated by each MPI process and stored in the centroid_tmp coarray:

type rca
   real, allocatable :: r(:,:)
end type rca
type( rca ) :: centroid_tmp[*]
:
allocate( centroid_tmp%r( 3, nels_pp ) )

where nels_pp is the number of FE stored on this PE.
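A hedged sketch of how the all-to-all mapping (the cgca_pfem_cenc approach) might build lcentr: each image scans centroid_tmp on every image and keeps the elements whose centroids fall inside its own CA box. The inside() test and the growth of lcentr by reallocation are illustrative assumptions, not the CGPACK implementation.

! Illustrative fragment: assumes the types mcen and rca, the
! coarray centroid_tmp and the array lcentr declared as above,
! and a prior sync all so all images have allocated centroid_tmp%r.
integer :: img, e
type( mcen ), allocatable :: tmplc(:)

allocate( lcentr( 0 ) )
do img = 1, num_images()
   do e = 1, size( centroid_tmp[ img ]%r, dim=2 )
      ! inside() is a hypothetical test: does this centroid lie
      ! within the CA box owned by this image?
      if ( inside( centroid_tmp[ img ]%r( :, e ) ) ) then
         tmplc = [ lcentr, mcen( image=img, elnum=e, &
                   centr=centroid_tmp[ img ]%r( :, e ) ) ]
         call move_alloc( tmplc, lcentr )
      end if
   end do
end do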
lcentr arrays on images P and Q
[Diagram: the lcentr array on each PE lists the FE mapped onto its CA chunk; each entry stores the owning image, the element number elnum on that image, and the centroid coordinates centr. Entries can refer to local or remote elements, e.g. lcentr on image P holds element b of image P and element n of image Q, while lcentr on image Q holds element a of image P and element m of image Q.]
All-to-all vs nearest neighbour for lcentr
- cgca_pfem_cenc - all-to-all routine.
- cgca_pfem_map - nearest neighbour - uses temporary arrays and the coarray collectives CO_SUM and CO_MAX, described in TS 18508 [1] and to be included in the next revision of the Fortran standard, Fortran 2015. At the time of writing coarray collectives are available on Cray systems as an extension to the standard [14]. The two routines differ in their use of remote communications.
cgca_pfem_map
integer :: maxfe, start, pend, ctmpsize, j
real, allocatable :: tmp(:,:)
! Calc. the max num. of FE stored on this img
maxfe = size( centroid_tmp%r, dim=2 )
ctmpsize = maxfe
call co_max( source = maxfe )
allocate( tmp( maxfe*num_images(), 5 ), source=0.0 )
! Each image writes to a unique portion of tmp
start = ( this_image() - 1 )*maxfe + 1
pend = start + ctmpsize - 1
tmp( start:pend, 1 ) = real( this_image(), kind=4 )
! Write element number *as real*
tmp( start:pend, 2 ) = &
   real( (/ (j, j = 1, ctmpsize) /), kind=4 )
! Write centroid coord
tmp( start:pend, 3:5 ) = &
   transpose( centroid_tmp%r(:,:) )
call co_sum( source = tmp )
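The design trade-off: one co_sum replicates every image's centroid block on all images, replacing the all-to-all pattern of remote reads with a single collective, at the price of a tmp array sized for the worst-case maxfe on every image and of shipping image and element numbers as reals.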
Initial CAFE scaling
[Figure: runtime in seconds (left axis) and scaling (right axis) vs number of cores (log-log, 100 to 100000).]
ParaFEM/CGPACK MPI/coarray miniapp scaling on ARCHER XC30 for a 3D problem with 1M FE and 800M CA cells.
Initial profiling
Profiling function distribution for ParaFEM/CGPACK MPI/coarray miniapp with the all-to-all routine cgca_gcupda at 7200 cores.
Initial profiling
Raw profiling data for ParaFEM/CGPACK MPI/coarray miniapp with the all-to-all routine cgca_gcupda at 7200 cores.
cgca_gcupda - all-to-all
integer :: gcupd(100,3)[*], rndint, j, &
   img, gcupd_local(100,3)
real :: rnd
:
call random_number( rnd )
rndint = int( rnd*num_images() ) + 1
do j = rndint, rndint + num_images() - 1
   img = j
   if ( img .gt. num_images() ) &
      img = img - num_images()
   if ( img .eq. this_image() ) cycle
   :
   gcupd_local(:,:) = gcupd(:,:)[img]
   :
end do
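Each image starts its sweep at a randomly chosen image, staggering the remote reads of gcupd so that all images do not hit image 1 at once; even so, every image still reads from every other image, which is what limits scaling.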
cgca_gcupdn - nearest neighbour
do i = -1, 1
do j = -1, 1
do k = -1, 1
   ! Get the coindex set of the neighbour
   ncod = mycod + (/ i, j, k /)
   :
   gcupd_local(:,:) = &
      gcupd(:,:)[ ncod(1), ncod(2), ncod(3) ]
   :
end do
end do
end do
Note: the nearest neighbour routine must be called multiple times to propagate changes from every image to all other images.
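One hedged way to drive those repeated calls is a convergence loop with a global test. The changed flag and the cgca_gcupdn interface shown are assumptions for illustration; co_max is the TS 18508 collective [1].

! Illustrative driver: assumes a modified cgca_gcupdn that
! reports via "changed" whether any new data arrived this sweep.
integer :: changed

do
   changed = 0
   call cgca_gcupdn( changed )  ! hypothetical interface
   sync all
   call co_max( changed )       ! TS 18508 collective
   if ( changed .eq. 0 ) exit
end do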
Profiling cgca_gcupdn
Profiling function distribution for ParaFEM/CGPACK MPI/coarray miniapp with the nearest neighbour routine cgca_gcupdn at 7200 cores.
Profiling cgca_gcupdn
Raw profiling data for ParaFEM/CGPACK MPI/coarray miniapp with the nearest neighbour routine cgca_gcupdn at 7200 cores.
Scaling improvement with cgca_gcupdn over cgca_gcupda
[Figure: runtime in seconds (left axis) and scaling (right axis) vs number of cores (log-log, 100 to 100000); curves: cgca_gcupda runtime, cgca_gcupdn runtime, cgca_gcupdn scaling.]
Runtimes and scaling for ParaFEM/CGPACK MPI/coarray miniapp with the nearest neighbour, cgca_gcupdn, and all-to-all, cgca_gcupda, algorithms. Scaling limit increased from 2k to 7k cores.
Profiling with cgca_pfem_map
Profiling function distribution for ParaFEM/CGPACK MPI/coarray miniapp with cgca_gcupdn and cgca_pfem_map at 7200 cores.
Profiling with cgca_pfem_map
Raw profiling data for ParaFEM/CGPACK MPI/coarray miniapp with cgca_gcupdn and cgca_pfem_map at 7200 cores.
Profiling with cgca_pfem_map
[Figure: runtime in seconds (left axis) and scaling (right axis) vs number of cores (log-log, 100 to 100000); curves: cgca_pfem_map runtime, cgca_pfem_cenc runtime, cgca_pfem_map scaling.]
Runtimes and scaling for ParaFEM/CGPACK MPI/coarray miniapp with cgca_pfem_map and cgca_pfem_cenc. cgca_pfem_map or cgca_pfem_cenc is called only once during the execution of the miniapp, hence only a minor improvement is obtained, and only from about 1000 cores.
Issues with CrayPAT
cgca_gcupda is top in the sampling results, but is absent from tracing. It is called the same number of times as cgca_hxi.
Issues with CrayPAT
All profiling was done with a single thread.
Incorrect number of threads identified by CrayPAT in a tracing experiment of ParaFEM/CGPACK MPI/coarray miniapp with cgca_gcupda.
Future work - optimisation of coarray synchronisation
! ===>>> implicit sync all inside <<<===
call cgca_sld( cgca_space, .false., 0, 10, cgca_solid )
call cgca_igb( cgca_space )
call cgca_gbs( cgca_space )
call cgca_hxi( cgca_space )
sync all
call cgca_gcu( cgca_space )
sync all
- Some routines have sync inside.
- Other sync responsibility is left to the end user.
- Over-synchronisation?
- Enough sync is required by the standard. A standard conforming Fortran coarray program will not deadlock or suffer races.
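One candidate optimisation, sketched under the assumption that only face-neighbour ordering matters for the halo exchange, replaces a global sync all with pairwise sync images over the (at most six) face neighbours; everything except the space coarray is illustrative. Because every image executes the same loop, the sync images pairs are mutual, as the standard requires.

! Illustrative fragment: sync only with face neighbours on the
! 3D image grid of the space coarray, instead of all images.
integer :: nei(6), cnt, me(3), ncod(3), d, s, idx

me  = this_image( space )
cnt = 0
do d = 1, 3
   do s = -1, 1, 2
      ncod = me
      ncod( d ) = ncod( d ) + s
      if ( ncod( d ) .lt. lcobound( space, d ) .or. &
           ncod( d ) .gt. ucobound( space, d ) ) cycle
      idx = image_index( space, ncod )
      if ( idx .eq. 0 ) cycle  ! cosubscripts map to no image
      cnt = cnt + 1
      nei( cnt ) = idx
   end do
end do
sync images( nei( 1:cnt ) )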
References
[1] ISO/IEC JTC1/SC22/WG5 N2074, TS 18508 Additional Parallel Features in Fortran, 2015.
[2] A. Shterenlikht and L. Margetts, "Three-dimensional cellular automata modelling of cleavage propagation across crystal boundaries in polycrystalline microstructures," Proc. Roy. Soc. A, vol. 471, p. 20150039, 2015. Available: http://eis.bris.ac.uk/~mexas/pub/2015prsa.pdf
[3] A. Shterenlikht, L. Margetts, L. Cebamanos, and D. Henty, "Fortran 2008 coarrays," ACM Fortran Forum, vol. 34, pp. 10-30, 2015. Available: http://eis.bris.ac.uk/~mexas/pub/2015acmff.pdf
[4] A. Shterenlikht, "Fortran coarray library for 3D cellular automata microstructure simulation," in Proc. 7th PGAS Conf., 3-4 October 2013, Edinburgh, Scotland, UK, M. Weiland, A. Jackson, and N. Johnson, Eds. The University of Edinburgh, 2014, pp. 16-24. Available: http://www.pgas2013.org.uk/sites/default/files/pgas2013proceedings.pdf
[5] J. Phillips, A. Shterenlikht, and M. J. Pavier, "Cellular automata modelling of nano-crystalline instability," in Proc. 20th UK ACME Conf., 27-28 March 2012, The University of Manchester, UK, 2012. Available: http://eis.bris.ac.uk/~mexas/pub/2012_ACME.pdf
[6] A. Shterenlikht, "Fortran 2008 coarrays," invited talk at the Fortran specialist group meeting, BCS/IoP joint meeting, 2014. Available: http://eis.bris.ac.uk/~mexas/pub/coar_bcs.pdf
[7] D. Henty, A. Jackson, C. Moulinec, and V. Szeremi, "Performance of Parallel IO on ARCHER, version 1.1," ARCHER White Papers, 2015. Available: http://archer.ac.uk/documentation/white-papers/parallelIO/ARCHER_wp_parallelIO.pdf
[8] T. Collins, "Using NetCDF with Fortran on ARCHER, version 1.1," ARCHER White Papers, 2016. Available: http://archer.ac.uk/documentation/white-papers/fortanIO_netCDF/fortranIO_netCDF.pdf
[9] I. M. Smith, D. V. Griffiths, and L. Margetts, Programming the Finite Element Method, 5th ed. Wiley, 2014.
[10] C. A. Gandin and M. Rappaz, "A coupled finite element-cellular automaton model for the prediction of dendritic grain structures in solidification processes," Acta Met. and Mat., vol. 42, no. 7, pp. 2233-2246, 1994.
[11] C. Zheng and D. Raabe, "Interaction between recrystallization and phase transformation during intercritical annealing in a cold-rolled dual-phase steel: A cellular automaton model," Acta Materialia, vol. 61, pp. 5504-5517, 2013.
[12] A. Shterenlikht and I. C. Howard, "The CAFE model of fracture - application to a TMCR steel," Fatigue Fract. Eng. Mater. Struct., vol. 29, pp. 770-787, 2006.
[13] S. Das, A. Shterenlikht, I. C. Howard, and E. J. Palmiere, "A general method for coupling microstructural response with structural performance," Proc. Roy. Soc. A, vol. 462, pp. 2085-2096, 2006.
[14] ISO/IEC 1539-1:2010, Fortran - Part 1: Base language, International Standard, 2010.