Tutorial: Migrating an application to use OPS - GitHub Pages · 2020-04-23 · Tutorial: Migrating...


Tutorial: Migrating an application to use OPS

István Reguly

Pázmány Péter Catholic University, Faculty of Information Technology and Bionics

[email protected]

Apr 12, 2018

Istvan Reguly (PPCU ITK) OPS Tutorial Apr 12, 2018 1 / 24


Overview

1 Introduction
  OPS concepts


OPS Abstraction

OPS is a Domain Specific Language embedded in C/C++ and Fortran, targeting multi-block structured mesh computations

The abstraction has two distinct components: the definition of the mesh, and operations over the mesh

Defining a number of 1-3D¹ blocks, and on them a number of datasets, which have specific extents in the different dimensions

Describing a parallel loop over a given block, with a given iteration range, executing a given "kernel function" at each grid point, and describing what datasets are going to be accessed and how

Additionally, one needs to declare stencils (access patterns) that will be used in parallel loops to access datasets, and any global constants (read-only global scope variables)

Data and computations expressed this way can be automaticallymanaged and parallelised by the OPS library.

¹ Higher dimensions are supported in the backend, but not by the code generators yet


Example application

Our example application is a simple 2D iterative Laplace equation solver.

Go to the OPS/apps/c/laplace2d_tutorial/original directory

Open the laplace2d.cpp file

It uses an imax ∗ jmax grid, with an additional 1-cell layer of boundary cells on all sides

There are a number of loops that set the boundary conditions along the four edges

The bulk of the simulation is spent in a while loop, repeating a stencil kernel with a maximum reduction, and a copy kernel

Compile and run the code!


Original - Initialisation

// Size along y
int jmax = 4094;
// Size along x
int imax = 4094;
int iter_max = 100;

double pi = 2.0 * asin(1.0);
const double tol = 1.0e-6;
double error = 1.0;

double *A;
double *Anew;
double *y0;

A    = (double *)malloc((imax+2)*(jmax+2)*sizeof(double));
Anew = (double *)malloc((imax+2)*(jmax+2)*sizeof(double));
y0   = (double *)malloc((imax+2)*sizeof(double));

memset(A, 0, (imax+2)*(jmax+2)*sizeof(double));


Original - Boundary loops

// set boundary conditions
for (int i = 0; i < imax+2; i++)
  A[(0)*(imax+2)+i] = 0.0;

for (int i = 0; i < imax+2; i++)
  A[(jmax+1)*(imax+2)+i] = 0.0;

for (int j = 0; j < jmax+2; j++) {
  A[(j)*(imax+2)+0] = sin(pi * j / (jmax+1));
}

for (int j = 0; j < jmax+2; j++) {
  A[(j)*(imax+2)+imax+1] = sin(pi * j / (jmax+1)) * exp(-pi);
}

Note how in the latter two loops the loop index is used


Original - Main iteration

while (error > tol && iter < iter_max) {
  error = 0.0;
  for (int j = 1; j < jmax+1; j++) {
    for (int i = 1; i < imax+1; i++) {
      Anew[(j)*(imax+2)+i] = 0.25f * (
          A[(j)*(imax+2)+i+1] + A[(j)*(imax+2)+i-1]
        + A[(j-1)*(imax+2)+i] + A[(j+1)*(imax+2)+i] );
      error = fmax(error, fabs(Anew[(j)*(imax+2)+i] - A[(j)*(imax+2)+i]));
    }
  }
  for (int j = 1; j < jmax+1; j++) {
    for (int i = 1; i < imax+1; i++) {
      A[(j)*(imax+2)+i] = Anew[(j)*(imax+2)+i];
    }
  }
  if (iter % 10 == 0) printf("%5d, %0.6f\n", iter, error);
  iter++;
}


Building OPS

To build OPS, we need a C/C++ compiler at the minimum, and for advanced parallelisations we will need an MPI compiler, a CUDA compiler, and an OpenCL compiler. To support parallel file I/O we need the parallel HDF5 library as well. We do not require all of the above to build and use OPS (but don't be surprised if some build targets fail).

The OPS Makefiles rely on a set of environment flags to be set appropriately:

export OPS_COMPILER=gnu                     # or intel, cray, pgi, clang
export OPS_INSTALL_PATH=~/OPS/ops
export MPI_INSTALL_PATH=/opt/openmpi        # MPI root folder
export CUDA_INSTALL_PATH=/usr/local/cuda    # CUDA root folder
export OPENCL_INSTALL_PATH=/usr/local/cuda  # OpenCL root folder
export HDF5_INSTALL_PATH=/opt/hdf5          # HDF5 root folder
export NV_ARCH=Pascal                       # if you use GPUs, their generation

Having set these, you can type make in the ops/c subdirectory


Preparing to use OPS - step 1

First off, we need to include the appropriate header files, then initialise OPS, and at the end finalise it.

Define that this application is 2D, include the OPS header file, and create a header file where the outlined "user kernels" will live

#define OPS_2D
#include <ops_seq.h>
#include "laplace_kernels.h"

Initialise and finalise OPS

// Initialise the OPS library, passing runtime args,
// and setting diagnostics level to low (1)
ops_init(argc, argv, 1);
...
ops_exit();

By this point you need OPS set up - take a look at the Makefile in step1, and observe that the include and library paths are added, and we link against ops_seq.


OPS declarations - step 2

Now we can declare a block, and the datasets

ops_block block = ops_decl_block(2, "my_grid");
int size[] = {imax, jmax};
int base[] = {0, 0};
int d_m[]  = {-1, -1};
int d_p[]  = {1, 1};
ops_dat d_A    = ops_decl_dat(block, 1, size, base,
                              d_m, d_p, A, "double", "A");
ops_dat d_Anew = ops_decl_dat(block, 1, size, base,
                              d_m, d_p, Anew, "double", "Anew");

Datasets have a size (number of grid points in each dimension). There is padding for halos or boundaries in the positive (d_p) and negative (d_m) directions; here we use a 1-thick boundary layer. A base index can be defined, as it may be different from 0 (e.g. in Fortran).

With a 0 base index and a 1-wide halo, these datasets can be indexed from −1 to size + 1


OPS declarations - step 2

OPS supports gradual conversion of applications to its API, but in this case the described data sizes will need to match: the allocated memory and its extents need to be correctly described to OPS. In this case we have two (imax + 2) ∗ (jmax + 2) sized arrays, and the total size in each dimension needs to match size[i] + d_p[i] − d_m[i]. This is only supported for the sequential and OpenMP backends

If a NULL pointer is passed, OPS will allocate the data internally


OPS declarations - step 2

We also need to declare the stencils that will be used - most loops use a simple 1-point stencil, and one uses a 5-point stencil

// Two stencils, a 1-point, and a 5-point
int s2d_00[] = {0, 0};
ops_stencil S2D_00 =
    ops_decl_stencil(2, 1, s2d_00, "0,0");
int s2d_5pt[] = {0,0, 1,0, -1,0, 0,1, 0,-1};
ops_stencil S2D_5pt =
    ops_decl_stencil(2, 5, s2d_5pt, "5pt");

Different names may be used for stencils in your code, but we suggestusing some convention


First parallel loop - step 3

It’s time to convert the first loop to use OPS:

for (int i = 0; i < imax+2; i++)
  A[(0)*(imax+2)+i] = 0.0;

This is a loop on the bottom boundary of the domain, which is at the −1 index for our dataset; therefore our iteration range will be over the entire domain, including halos, in the X direction, and the bottom boundary in the Y direction. The iteration range is given as beginning (inclusive) and end (exclusive) indices in the x, y, etc. directions.

int bottom_range[] = {-1, imax+1, -1, 0};

Next, we need to outline the "user kernel" into laplace_kernels.h, and place the appropriate access macro - OPS_ACCN(i, j), where N is the index of the argument in the kernel's formal parameter list, and (i, j) are the stencil offsets in the X and Y directions respectively.

void set_zero(double *A) {
  A[OPS_ACC0(0,0)] = 0.0;
}


First parallel loop - step 3

The OPS parallel loop can now be written as follows:

ops_par_loop(set_zero, "set_zero", block, 2, bottom_range,
    ops_arg_dat(d_A, 1, S2D_00, "double", OPS_WRITE));

The loop will execute set_zero at each grid point defined in the iteration range, and write the dataset d_A with the 1-point stencil

The ops_par_loop implies that the order in which grid points will be executed will not affect the end result (within machine precision)


More parallel loops - step 3

There are three more loops which set values to zero; they can be trivially replaced with the code above, only altering the iteration range.

In the main while loop, the second, simpler loop simply copies data from one array to another, this time on the interior of the domain:

int interior_range[] = {0, imax, 0, jmax};
ops_par_loop(copy, "copy", block, 2, interior_range,
    ops_arg_dat(d_A,    1, S2D_00, "double", OPS_WRITE),
    ops_arg_dat(d_Anew, 1, S2D_00, "double", OPS_READ));

And the corresponding outlined user kernel is as follows. Observe how for Anew the OPS_ACC1 macro is used, as it is the second argument:

void copy(double *A, const double *Anew) {
  A[OPS_ACC0(0,0)] = Anew[OPS_ACC1(0,0)];
}


Indexes and global constants - step 4

There are two sets of boundary loops which use the loop variable j - this is a common technique to initialise data, such as coordinates (x = i ∗ dx).

OPS has a special argument, ops_arg_idx, which gives us a globally coherent (over MPI) iteration index - between the bounds supplied in the iteration range

int left_range[] = {-1, 0, -1, jmax+1};
ops_par_loop(left_bndcon, "left_bndcon", block, 2,
    left_range,
    ops_arg_dat(d_A, 1, S2D_00, "double", OPS_WRITE),
    ops_arg_idx());

And the corresponding outlined user kernel is as follows. Observe the idx argument and the +1 offset due to the difference in indexing:

void left_bndcon(double *A, const int *idx) {
  A[OPS_ACC0(0,0)] = sin(pi * (idx[1]+1) / (jmax+1));
}


Indexes and global constants - step 4

This kernel also uses two variables, jmax and pi, that do not depend on the iteration index - they are iteration space invariant. OPS has two ways of supporting this:

Global scope constants, through ops_decl_const, as done in this case: we need to move the declaration of the imax, jmax and pi variables to global scope (outside of main), and call the OPS API:

// declare and define global constants
ops_decl_const("imax", 1, "int", &imax);
ops_decl_const("jmax", 1, "int", &jmax);
ops_decl_const("pi", 1, "double", &pi);

These variables do not need to be passed in to the user kernel; they are accessible in all user kernels.

The other option is to explicitly pass the value to the user kernel with ops_arg_gbl: this is for scalars and small arrays that should not be in global scope.


Complex stencils and reductions - step 5

There is only one loop left, which uses a 5-point stencil and a reduction. It can be outlined as usual, and for the stencil, we will use S2D_5pt

ops_par_loop(apply_stencil, "apply_stencil", block, 2,
    interior_range,
    ops_arg_dat(d_A,    1, S2D_5pt, "double", OPS_READ),
    ops_arg_dat(d_Anew, 1, S2D_00,  "double", OPS_WRITE),
    ops_arg_reduce(h_err, 1, "double", OPS_MAX));

And the corresponding outlined user kernel is as follows. Observe the stencil offsets used to access the adjacent 4 points:

void apply_stencil(const double *A, double *Anew, double *error) {
  Anew[OPS_ACC1(0,0)] = 0.25f * ( A[OPS_ACC0(1,0)]  + A[OPS_ACC0(-1,0)]
                                + A[OPS_ACC0(0,-1)] + A[OPS_ACC0(0,1)] );
  *error = fmax(*error, fabs(Anew[OPS_ACC1(0,0)] - A[OPS_ACC0(0,0)]));
}


Complex stencils and reductions - step 5

The loop also has a special argument for the reduction, ops_arg_reduce. As the first argument, it takes a reduction handle, which has to be defined separately:

ops_reduction h_err = ops_decl_reduction_handle(
    sizeof(double), "double", "error");

Reductions may be increment (OPS_INC), min (OPS_MIN) or max (OPS_MAX). The user kernel will have to perform the reduction operation, reducing the passed-in value as well as the computed value

The result of the reduction can be queried from the handle as follows:

ops_reduction_result(h_err, &error);

Multiple parallel loops may use the same handle, and their results will be combined, until the result is queried by the user. Parallel loops that only have the reduction handle in common are semantically independent


Handing it all to OPS - step 6

We have now successfully converted all computations on the grid to OPS parallel loops

In order for OPS to manage data and parallelisations better, we should let OPS allocate the datasets - instead of passing in the pointers to memory allocated by us, we just pass in NULL (for A and Anew)

Parallel I/O can be done using HDF5 - see the ops_hdf5.h header

All data and parallelisation is now handed to OPS. We can now also compile the developer MPI version of the code - see the Makefile, and try building laplace2d_mpi


Code generation - step 7

Now that the developer versions of our code work, it's time to generate code. On the console, type:

$OPS_INSTALL_PATH/../ops_translator/c/ops.py laplace2d.cpp

We have provided a makefile which can use several different compilers (intel, cray, pgi, clang); we suggest modifying it for your own application

Try building CUDA, OpenMP, MPI+CUDA, MPI+OpenMP, andother versions of the code

You can take a look at the generated kernels for different parallelisations under the appropriate subfolders

If you add the -OPS_DIAGS=2 runtime flag, at the end of execution OPS will report timings and achieved bandwidth for each of your kernels

For more, see the user guide...


Code generated versions

OPS will generate and compile a large number of different versions

laplace2d_dev_seq and laplace2d_dev_mpi: these versions do not use code generation, they are intended for development only

laplace2d_seq and laplace2d_mpi: baseline sequential and MPI implementations

laplace2d_openmp: baseline OpenMP implementation

laplace2d_cuda, laplace2d_opencl, laplace2d_openacc: implementations targeting GPUs

laplace2d_mpi_inline: optimised implementation with MPI+OpenMP

laplace2d_tiled: optimised implementation with OpenMP that improves spatial and temporal locality


Optimisations

Try the following performance tuning options

laplace2d_cuda, laplace2d_opencl: you can set the OPS_BLOCK_SIZE_X and OPS_BLOCK_SIZE_Y runtime arguments to control thread block or work group sizes

laplace2d_mpi_cuda, laplace2d_mpi_openacc: add the -gpudirect runtime flag to enable GPU Direct communications

laplace2d_tiled, laplace2d_mpi_tiled: add the OPS_TILING runtime flag, and add -OPS_DIAGS=3 to see the cache blocking tiling at work. For some applications, such as this one, the initial guess gives too large tiles; try setting OPS_CACHE_SIZE to a lower value (in MB, for L3 size). Thread affinity control and using 1 process per socket is strongly recommended. E.g. OMP_NUM_THREADS=20 numactl --cpunodebind=0 ./laplace2d_tiled -OPS_DIAGS=3 OPS_TILING OPS_CACHE_SIZE=5. Over MPI, you will have to set OPS_TILING_MAXDEPTH to extend halo regions.


Optimisations - tiling

Tiling uses lazy execution: as parallel loops follow one another, they are not executed, but put in a queue, and only once some data needs to be returned to the user (e.g. the result of a reduction) do these loops have to be executed

With a chain of loops queued, OPS can analyse them together and come up with a tiled execution schedule

Works over MPI too: OPS extends the halo regions, and does one bighalo exchange instead of several smaller ones

In the current laplace2d code, every stencil application loop is also doing a reduction, therefore only two loops are queued. Try modifying the code so the reduction only happens every 10 iterations!

On a Xeon E5-2650, one can get a 2.5x speedup
