Introduction to OpenACC and OpenMP GPU

www.idris.fr - Institut du développement et des ressources en informatique scientifique (IDRIS)

Pierre-François Lavallée, Thibaut Véry

1 Introduction: Definitions; Short history; Architecture; Execution models; Porting strategy; PGI compiler; Profiling the code

2 Parallelism: Offloaded Execution (Add directives; Compute constructs: serial, kernels, teams, parallel; Work sharing: parallel loop, reductions, coalescing; Routines; Data management; Asynchronism)

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry (slide 2/77)


Introduction - Useful definitions

• Gang (OpenACC) / Teams (OpenMP): coarse-grain parallelism
• Worker (OpenACC): fine-grain parallelism
• Vector: group of threads executing the same instruction (SIMT)
• Thread: execution entity
• SIMT: Single Instruction Multiple Threads
• Device: accelerator on which execution can be offloaded (e.g. a GPU)
• Host: machine hosting one or more accelerators and in charge of execution control
• Kernel: piece of code that runs on an accelerator
• Execution thread: sequence of kernels to be executed on an accelerator

Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2×20 = 40 cores
• Accelerated with 4 Nvidia V100: 32×80×4 = 10240 cores

Introduction - The ways to GPU

From least to most programming effort and technical expertise required:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code; maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness; still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex; non-portable; optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers: PGI; Cray (for Cray hardware); GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers: Cray (for Cray hardware); GCC (since 7); CLANG; IBM XL; PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 V100 GPUs): the host (CPU) and the devices (GPU) are connected through PLX switches to each other and to the network, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s.]

Introduction - Architecture CPU-GPU

[Figure: comparison of CPU and GPU architectures.]

Introduction - Execution model OpenACC

[Figure: the host (CPU) successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (e.g. !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped onto the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs.
• The compiler can decide to synchronize the threads of a gang (all or part of them).
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).
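As a quick check of the formula above, here is a small C helper (the function name is ours, purely illustrative) applied to the configuration num_gangs(2), num_workers(5), vector_length(8) that appears in a later example:

```c
/* Total number of threads for a kernel launch:
   nbthreads = nbgangs * nbworkers * vectorlength */
int total_threads(int nbgangs, int nbworkers, int vectorlength) {
    return nbgangs * nbworkers * vectorlength;
}
```

With 2 gangs of 5 workers of size 8, total_threads(2, 5, 8) gives 80 threads when all three levels are partitioned.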

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1.
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024.
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
• The size of a worker (vectorlength) should be a multiple of 32.
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.

Introduction - Execution model

[Figure: number of active threads (2, 10 or 80) inside a region created with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8), depending on the directives met: !$ACC parallel (starts gangs), !$ACC loop worker, !$ACC loop vector, !$ACC loop worker vector.]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: porting cycle: Pick a kernel, Analyze, Parallelize loops, Optimize data, Optimize loops.]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler will do implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation: https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

(exemples/reduction.f90)

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities: 53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
• Only the GPU is profiled by default.
• The option "--cpu-profiling on" activates CPU profiling.
• The options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Figure: screenshot of the NVVP graphical interface.]

1 Introduction

2 Parallelism: Offloaded Execution (Add directives; Compute constructs: serial, kernels, teams, parallel; Work sharing: parallel loop, reductions, coalescing; Routines; Data management; Asynchronism)

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
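The C/C++ syntax can be illustrated with a small sketch (the function name is ours, not from the course; with a compiler that does not know OpenACC, or without the right options, the pragma is treated as a plain comment and the loop runs sequentially on the host):

```c
/* y = a*x + y, offloaded with a merged "kernels loop" construct.
   Without OpenACC support the pragma is ignored and this is an
   ordinary sequential C loop. */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc kernels loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With PGI this would be compiled with pgcc -acc -Minfo=accel, following the options presented earlier.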

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2              ! gang-redundant
!$ACC loop         ! work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important note:
• The number of gangs, of workers and the vector size are constant inside the region.

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, 'Hello! I am a gang'
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Directive      | Work sharing (loop iterations) | Activation of new threads
loop gang      | X                              |
loop worker    | X                              | X
loop vector    | X                              | X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Note:
• You might see the directive do (kept for compatibility).
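As a sketch of the collapse clause (our own C example, not from the course; the pragma is ignored by non-OpenACC compilers and the loops then run sequentially):

```c
/* Fill an n-by-m matrix stored row-major in a flat array.
   collapse(2) merges the two tightly nested loops into a single
   iteration space of n*m iterations to distribute, which helps
   when the individual loops are too short to fill the device. */
void init_matrix(int n, int m, double *a) {
    #pragma acc parallel loop collapse(2)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i * m + j] = (double)(i + j);
}
```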

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)
• Active threads: 1
• Number of operations: nx

Execution GRWSVS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)
• Iterations are shared among the active workers of each gang
• Active threads: 10 x nbworkers
• Number of operations: nx

Execution GPWSVP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)
• Iterations are shared among the threads of the single worker of all gangs
• Active threads: 10 x vectorlength
• Number of operations: nx

Execution GPWPVP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x nbworkers x vectorlength
• Number of operations: nx

Execution GRWPVS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x nbworkers
• Number of operations: 10 x nx

Execution GRWSVP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)
• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vectorlength
• Number of operations: 10 x nx

Execution GRWPVP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)
• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x nbworkers x vectorlength
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code placed after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)
• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to the private copies to get the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
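The reduction.f90 example shown earlier sums an array filled with a(i) = i; the same reduction written in C reads as follows (illustrative sketch; without OpenACC support the pragma is ignored and the loop falls back to a sequential sum):

```c
/* Sum 1 + 2 + ... + n. "sum" is privatized for each thread of the
   loop and the partial results are combined with + at the end. */
long sum_first_n(long n) {
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (long i = 1; i <= n; ++i)
        sum = sum + i;
    return sum;
}
```

For n = 10000 this returns 50005000, i.e. n(n+1)/2, the value the compiler feedback examples compute.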

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged; this optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: examples of access patterns with and without coalescing.]
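In C the storage order is row-major, the opposite of Fortran, so the rule flips: the loop over the last index should carry the vector parallelism. A sketch (our own example, not from the course; the pragmas are ignored without an OpenACC compiler):

```c
/* The inner (vector) loop walks the last index j, so consecutive
   threads of a worker touch consecutive addresses a[i*n + j] and
   their accesses can be coalesced. */
void fill_coalesced(int n, double *a) {
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j)
            a[i * n + j] = 11.4;
    }
}
```

Swapping the two loops (vector on i, the first index) would make each thread stride by n elements and break coalescing, as the Fortran timings below show.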

Parallelism - Memory accesses and coalescing

Without coalescing (the vector loop runs over j, the second index, so consecutive threads access non-contiguous elements of A):

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

With coalescing (the vector loop runs over i, the first index, which is contiguous in Fortran):

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

          Without coalescing | With coalescing
Tps (ms)  43.9               | 1.6

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: step-by-step memory access patterns of threads T0 to T7 for the three versions: without coalescing (gang(i), vector(j)), with coalescing (gang(j), vector(i)) and gang-vector(i) with seq(j). In the coalesced version the threads of a worker access consecutive memory locations at each step.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check if the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
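As a minimal sketch of how these clauses combine on a single compute construct (the arrays a, b, c and the bound n are hypothetical, not from the course examples):

```fortran
! Hypothetical example: a is only read, b is only written, c is device scratch.
!$ACC parallel loop copyin(a) copyout(b) create(c)
do i = 1, n
   c(i) = 2.0_8 * a(i)   ! c exists only on the device, never transferred
   b(i) = c(i) + 1.0_8   ! b is copied back D->H when the region ends
enddo
```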

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

Here the parallel region implicitly opens a data region of the same extent.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal.

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal?

Data on device - Local data regions

(Same example as on the previous slide.)

Notes
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No: you have to move the beginning of the data region before the first loop, so that the initialization kernel also runs inside it.

Data on device

Naive version: two successive parallel regions (each with its own implicit data region) both carry copyin, copyout, copy and create clauses for the arrays A, B, C and D.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing the two parallel regions, carrying the same copyin/copyout/copy/create clauses. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Since the clauses check for data presence, you can use the present clause on the inner parallel regions to make the code clearer. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
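The slide gives no code for this case; a minimal sketch (the array b and bound n are hypothetical, not from the course examples) could look like this:

```fortran
!$ACC data create(b)
b(:) = 42                 ! b is modified on the host
!$ACC update device(b)    ! push the new host values H->D
!$ACC parallel loop
do i = 1, n
   b(i) = b(i) + 1        ! the device now works on up-to-date values
enddo
!$ACC end data
```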

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Scope
• Module: the program
• Function: the function
• Subroutine: the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred to the device and b and d are allocated there. Kernels then update the device copies; update self(b) copies b back D→H inside the region; at region exit, only d is copied back to the host (copyout(d)), so the host copies of a, b and c keep their last host-side or updated values.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, kernel 1, the data transfer and kernel 2 run back to back; with kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: kernel 1 and the data transfer run concurrently; kernel 2 waits for the transfer.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition.
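As an illustration, a typical compile line with the flag quoted above might look like this (the compiler driver and source file name are assumptions, not from the slides):

```
# Build with GPU offload and autocompare enabled (flag from the slide;
# source file name is illustrative)
pgfortran -ta=tesla:cc60:autocompare -Minfo=accel race_condition.f90 -o race_condition
```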


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
   ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly).
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only).
• OpenACC KERNELS (offloading + automatic parallelization) ↔ no equivalent in OpenMP.

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables.
• OpenMP: scalar variables, non-scalar variables.

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective).

Data movement clauses
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• declare() (adds data to the global data region).
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit).
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code).
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region).
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region).

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP.
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS).

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• OpenACC routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism.

Asynchronism
• OpenACC async()/wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ OpenMP nowait/depend: manage asynchronism and task dependencies.

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2×20 = 40 cores
• Accelerated, with 4 NVIDIA V100: 32×80×4 = 10240 cores

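The core counts quoted above are simple products; as a sanity check of the slide's arithmetic:

```python
# Sanity check of the core counts quoted on the slide
cpu_cores = 2 * 20       # non-accelerated partition: 2 sockets x 20 cores
gpu_cores = 32 * 80 * 4  # 4 V100 GPUs, 80 SMs each, 32 cores per SM (slide figures)
print(cpu_cores, gpu_cores)  # -> 40 10240
```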

Introduction - The ways to GPU

From minimum to maximum programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance.
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness; still efficient.
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable; optimal performance.

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVIDIA, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers: PGI; Cray (for Cray hardware); GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers: Cray (for Cray hardware); GCC (since 7); CLANG; IBM XL; PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100). The host (CPU) is connected to the devices (GPU) through PLX switches and to the network (réseau). Bandwidths indicated on the figure: 900 GB/s (GPU memory), 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s for the various links.]

Introduction - Architecture CPU-GPU

[Figure-only slide.]

Introduction - Execution model OpenACC

[Figure: execution controlled by the host; the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocations are managed by the host.

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active, in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs groups of 32 threads are formed)
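As a quick sanity check, the formula above can be evaluated for the configuration used a few slides later (2 gangs, 5 workers, vector length 8). A minimal sketch in C; the helper name total_threads is ours, not part of OpenACC:

    #include <stdio.h>

    /* Total number of threads launched for a kernel:
       nbthreads = nbgangs * nbworkers * vectorlength */
    long total_threads(long nbgangs, long nbworkers, long vectorlength)
    {
        return nbgangs * nbworkers * vectorlength;
    }

    int main(void)
    {
        /* num_gangs(2) num_workers(5) vector_length(8) -> 80 threads */
        printf("%ld\n", total_threads(2, 5, 8));
        return 0;
    }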

Introduction - Execution model

NVidia P100: restrictions on Gang/Worker/Vector
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads at each point of a region launched with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). In gang-redundant code (!$ACC parallel) 2 threads are active; inside !$ACC loop worker, 10 threads (2 gangs x 5 workers); inside a nested !$ACC loop vector, 80 threads (2 x 5 x 8); a combined !$ACC loop worker vector also activates 80 threads.]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation. The compiler will do implicit operations that you want to be aware of.

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1.
The grid is the number of gangs. The block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   53.51%  2.70662s    500  5.4132ms     448ns  6.8519ms  [CUDA memcpy HtoD]
                   36.72%  1.85742s    300  6.1914ms     384ns  9.2907ms  [CUDA memcpy DtoH]
                    6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%  4.0968ms    100  40.968us  40.122us  41.725us  gol_52_gpu
                    0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Figure: NVVP timeline screenshot.]

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels, Teams, Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran / OpenACC:
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ / OpenACC:
#pragma acc directive <clauses>
{ ... }

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.
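To illustrate the C/C++ syntax, here is a minimal hedged sketch (the saxpy routine and array names are ours): a single combined parallel loop construct with data clauses, which are introduced later in the course. With a compiler that does not support OpenACC the pragma is simply ignored and the loop runs on the host:

    #include <stdio.h>

    #define N 8

    /* y <- a*x + y, offloaded with one combined "parallel loop" construct */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        float x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = i; y[i] = 1.0f; }
        saxpy(N, 2.0f, x, y);
        printf("%g %g\n", y[0], y[N-1]);  /* 1 and 15 */
        return 0;
    }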

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
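The same behavior can be sketched in C: two loop nests inside one kernels region become two kernels executed one after the other. This is a hedged illustration (function and array names are ours; without OpenACC support the pragma is ignored and the loops run on the host):

    /* Two loops inside one kernels region: each becomes its own kernel,
       launched consecutively, with parameters chosen by the compiler. */
    void fill_and_double(int n, int *a, int *b)
    {
        #pragma acc kernels copyout(a[0:n], b[0:n])
        {
            for (int i = 0; i < n; ++i)
                a[i] = i;            /* first kernel  */
            for (int i = 0; i < n; ++i)
                b[i] = 2 * a[i];     /* second kernel */
        }
    }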

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
  a = 2          ! Gang-redundant
  !$ACC loop     ! Work sharing
  do i = 1, n
    ...
  enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (compiled with acc=noautopar):
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                X
loop worker              X                               X
loop vector              X                               X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility
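A hedged C sketch combining two of these clauses: collapse merges the two nested loops into one iteration space, and reduction gives each thread a private partial sum (the function and array names are ours; without OpenACC support the pragma is ignored and the result is unchanged):

    #include <stdio.h>

    #define ROWS 4
    #define COLS 6

    /* Sum a 2D array; the two loops are collapsed into one 24-iteration
       space shared among threads, each keeping a private partial sum. */
    long sum2d(int m[ROWS][COLS])
    {
        long sum = 0;
        #pragma acc parallel loop collapse(2) reduction(+:sum)
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS; ++c)
                sum += m[r][c];
        return sum;
    }

    int main(void)
    {
        int m[ROWS][COLS];
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS; ++c)
                m[r][c] = 1;
        printf("%ld\n", sum2d(m));  /* 24 */
        return 0;
    }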

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
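Besides the sum used in the examples, any operator from the table can be applied. A short hedged C sketch of a max reduction over an array (function name is ours; without OpenACC support the pragma is ignored):

    #include <stdio.h>

    /* Each thread keeps a private running maximum; the partial maxima
       are combined with the "max" operator at the end of the loop. */
    int array_max(int n, const int *a)
    {
        int m = a[0];
        #pragma acc parallel loop reduction(max:m)
        for (int i = 1; i < n; ++i)
            if (a[i] > m) m = a[i];
        return m;
    }

    int main(void)
    {
        int a[6] = {3, 41, 5, 26, 9, 14};
        printf("%d\n", array_max(6, a));  /* 41 */
        return 0;
    }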

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: memory access patterns with and without coalescing.]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory coalescing version is 27 times faster.
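Note that the contiguous index depends on the language: Fortran arrays are column-major (first index contiguous), while C arrays are row-major (last index contiguous). A hedged C sketch of the coalescing-friendly ordering, with the vector loop on the last index (array name is ours; without OpenACC the pragmas are ignored):

    #define NX 256

    double A[NX][NX];

    /* In C, A[i][j] and A[i][j+1] are adjacent in memory, so the vector
       loop should run over j for the accesses of a worker to be merged. */
    void fill_coalesced(void)
    {
        #pragma acc parallel loop gang
        for (int i = 0; i < NX; ++i) {
            #pragma acc loop vector
            for (int j = 0; j < NX; ++j)
                A[i][j] = 1.14;
        }
    }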

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three versions. Without coalescing: gang(i)/vector(j). With coalescing: gang(j)/vector(i) and gang-vector(i)/seq(j). In the coalesced versions, consecutive threads T0..T7 access consecutive memory locations.]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
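The same pattern in C uses #pragma acc routine on the called function. A minimal hedged sketch (names are ours; without OpenACC support the pragmas are ignored and the code runs sequentially):

    #include <stdio.h>

    #define N 4

    /* Declare that a device (sequential) version of this function
       must be generated, so it can be called from an offloaded loop. */
    #pragma acc routine seq
    int doubled(int v)
    {
        return 2 * v;
    }

    int main(void)
    {
        int a[N];
        #pragma acc parallel loop copyout(a[0:N])
        for (int i = 0; i < N; ++i)
            a[i] = doubled(i);
        printf("%d\n", a[N-1]);  /* 6 */
        return 0;
    }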

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
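A hedged C sketch of the movement clauses: the input is copied in, the result is copied out, and a scratch array is only created on the device (names are ours; without OpenACC the pragma is ignored and everything stays on the host):

    #include <stdio.h>

    #define N 6

    /* b gets a[i] squared; tmp never moves between host and device.
       Assumes n <= N. */
    void squares(int n, const int *a, int *b)
    {
        int tmp[N];
        #pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n]) create(tmp[0:n])
        for (int i = 0; i < n; ++i) {
            tmp[i] = a[i];
            b[i] = tmp[i] * tmp[i];
        }
        /* tmp's device copy is freed here; after an actual offloaded
           run its host values are unspecified. */
    }

    int main(void)
    {
        int a[N] = {1, 2, 3, 4, 5, 6}, b[N];
        squares(N, a, b);
        printf("%d\n", b[5]);  /* 36 */
        return 0;
    }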

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1)
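The C/C++ restriction above can be sketched as follows (the helper name `iota` is ours). For a `malloc`'d pointer the compiler cannot know the extent, so `[first:count]` is mandatory in the clause; without OpenACC the pragma is ignored and the loop runs on the host.

```c
#include <stdlib.h>

/* Fill a dynamically allocated vector; the shape [0:n] (first index,
   element count) must be given since the size is unknown statically. */
double *iota(int n) {
    double *v = malloc(n * sizeof *v);
    #pragma acc parallel loop copyout(v[0:n])   /* shape required here */
    for (int i = 0; i < n; ++i)
        v[i] = i + 1.0;
    return v;
}
```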

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l
!$ACC parallel copyout(a(1:1000))   ! opens the data region
!$ACC loop                          ! parallel region
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Optimal? No! You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region. (Figure: variables A, B, C and D go through the copyin, copyout, copy and create clauses at each of the two regions.)

• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: with only the two parallel regions (and their implicit data regions), A, B, C and D are still moved twice. Let's add a data region.

Data on device

Optimize data transfers: with an enclosing data region, A, B and C are now transferred at entry and exit of the data region.

• Transfers: 4
• Allocation: 1

Data on device

Optimize data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocation: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
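The two update directions can be sketched together in C (the helper name `add_on_device` is ours). Without an OpenACC compiler the pragmas are ignored and the host and device "copies" are one and the same array, so the result is identical.

```c
/* Sketch: a data region with explicit updates. update device pushes a
   host-side change H→D; update self pulls the result D→H before the
   region ends. Pragmas are ignored when compiled without OpenACC. */
void add_on_device(int *b, int n, int inc) {
    #pragma acc data copyin(b[0:n])
    {
        b[0] += 100;                       /* host-side modification  */
        #pragma acc update device(b[0:n])  /* make the device see it  */

        #pragma acc parallel loop present(b[0:n])
        for (int i = 0; i < n; ++i)
            b[i] += inc;

        #pragma acc update self(b[0:n])    /* bring results back D→H */
    }
}
```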

Data on device - Global data regions: enter data/exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
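The constructor/destructor pattern the notes mention can be sketched in C with hypothetical helper names (`field_new`, `field_fill`, `field_delete`): the device allocation outlives the function that created it. Without OpenACC the pragmas are ignored and only the host allocation remains, with the same observable behaviour.

```c
#include <stdlib.h>

/* Unstructured device lifetime: allocation and use live in different
   functions, as with C++ constructors and destructors. */
double *field_new(int n) {
    double *f = malloc(n * sizeof *f);
    #pragma acc enter data create(f[0:n])  /* device copy persists after return */
    return f;
}

void field_fill(double *f, int n, double v) {
    #pragma acc parallel loop present(f[0:n])
    for (int i = 0; i < n; ++i)
        f[i] = v;
    #pragma acc update self(f[0:n])        /* results visible on the host */
}

void field_delete(double *f, int n) {
    #pragma acc exit data delete(f[0:n])   /* end of the device lifetime */
    free(f);
}
```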

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the programming unit where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

(Figure: evolution of the host and device copies of a, b, c and d along a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b and d are allocated on the device; the device kernels then compute new values; update self(b) copies b D→H inside the region; later host-side changes to a are not seen by the device; at exit of the data region only d is copied D→H.)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Figure: without async, kernel 1, the data transfer and kernel 2 run one after the other; with async on kernel 1 and the transfer, or on everything, they overlap.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Figure: kernel 2 waits for the transfer on execution thread 2 before starting.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
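The schedule on this slide can be sketched in C (the helper name `pipeline` is ours): three queues, and the kernel that consumes `b` waits only on the transfer queue. Compiled without OpenACC everything runs in program order, which is one legal (serial) schedule of the same dependencies.

```c
/* Sketch of async queues plus a targeted wait. */
void pipeline(double *a, double *b, double *c, int n) {
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) a[i] = 42.0;

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop wait(2)   /* needs the transfer, not queue 1 */
    for (int i = 0; i < n; ++i) b[i] += 11.0;

    #pragma acc parallel loop async(3)
    for (int i = 0; i < n; ++i) c[i] = 11.0;

    #pragma acc wait                    /* join all queues before returning */
}
```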


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Debugging - PGI autocompare

• To activate this feature just add autocompare to the compilation (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
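The transfer pattern of this final version can be sketched in C (the helper name `generations_loop` is ours, and the stencil body is elided): one data region wraps the whole generation loop, and the kernels inside only assert presence. Without OpenACC the pragmas are ignored and a plain host version remains.

```c
/* Sketch of the optimized structure: data region outside the time loop. */
int generations_loop(int *world, int *oworld, int ncell, int gens) {
    int cells = 0;
    #pragma acc data copy(world[0:ncell]) create(oworld[0:ncell])
    for (int g = 0; g < gens; ++g) {
        #pragma acc parallel loop present(world[0:ncell], oworld[0:ncell])
        for (int i = 0; i < ncell; ++i)
            oworld[i] = world[i];          /* save previous generation */

        /* ... the neighbour-count stencil updating world goes here ... */

        cells = 0;
        #pragma acc parallel loop reduction(+:cells) present(world[0:ncell])
        for (int i = 0; i < ncell; ++i)
            cells += world[i];             /* population count */
    }
    return cells;
}
```

With the stencil elided, the function just counts the live cells of the unchanged grid; the point is where the data region sits, not the physics.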


Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode; each gang executes instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 ↔ OpenMP 5.0

Data region
• declare (add data to the global data region)
• data ... end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) ... END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 ↔ OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:
• Non-accelerated: 2×20 = 40 cores
• Accelerated with 4 Nvidia V100: 32×80×4 = 10240 cores


Introduction - The ways to GPU

Programming effort and technical expertise increase from libraries to directives to programming languages.

Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA)
• Minimum change in the code
• Maximum performance

Directives (OpenACC, OpenMP 5.0)
• Portable, simple, low intrusiveness
• Still efficient

Programming languages (CUDA, OpenCL)
• Complete rewriting, complex
• Non-portable
• Optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

(Figure: architecture of a converged node on Jean Zay: 40 cores, 192 GB of memory, 4 GPU V100. The host (CPU) is connected to the devices (GPU) through PLX switches and to the network; the links are annotated with their bandwidths: 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s.)

Introduction - Architecture CPU-GPU

(Figure: schematic comparison of CPU and GPU architectures.)

Introduction - Execution model OpenACC

(Figure: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.)

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocations are managed by the host.

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
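The thread-count formula above can be written as a one-line helper (ours, for illustration), checked against the num_gangs(2) num_workers(5) vector_length(8) configuration used a few slides later:

```c
/* nbthreads = nbgangs * nbworkers * vectorlength */
int nbthreads(int nbgangs, int nbworkers, int vectorlength) {
    return nbgangs * nbworkers * vectorlength;
}
```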

Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 − 1
• The thread-grid size (i.e. nbworkers × vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

(Figure: number of active threads at each construct, for num_gangs(2) num_workers(5) vector_length(8): a parallel construct that starts the gangs has 2 active threads; an !$ACC loop worker activates 10; an !$ACC loop vector inside it activates 80; an !$ACC loop worker vector also activates 80; between the loops the count falls back to 2.)

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:
1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

(Figure: cycle — pick a kernel → analyze → parallelize loops → optimize loops → optimize data.)

PGI compiler

During this training course we will use PGI compilers for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
!$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
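The same fill-then-reduce pattern can be sketched in C (the helper name `fill_and_sum` is ours): a data region keeps the array on the device between the two kernels. Without an OpenACC compiler the pragmas are ignored and the pair of host loops gives the same sum.

```c
#include <stdlib.h>

/* Fill a[i] = i+1 on the device, then reduce it, without the array
   ever travelling back to the host. */
long fill_and_sum(int n) {
    int *a = malloc(n * sizeof *a);
    long sum = 0;
    #pragma acc data create(a[0:n])
    {
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = i + 1;

        #pragma acc parallel loop reduction(+:sum) present(a[0:n])
        for (int i = 0; i < n; ++i)
            sum += a[i];
    }
    free(a);
    return sum;
}
```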

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength × nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP
(screenshot of the NVVP interface)


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between Fortran and C/C++ code.

Fortran/OpenACC
  !$ACC directive <clauses>
  ...
  !$ACC end directive

C/C++/OpenACC
  #pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation, the examples are in Fortran.
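The "treated as comments" property can be seen with a minimal C/OpenACC sketch (a hypothetical helper, not one of the course's examples): compiled with a plain C compiler such as gcc, the pragma is ignored and the loop simply runs sequentially on the host, producing the same result as the offloaded version.

```c
#include <stddef.h>

/* Hypothetical helper: doubles b into a. Built without OpenACC
   support, the pragma below is treated as a comment and the loop
   runs sequentially on the host. */
void scale_by_two(const double *b, double *a, size_t n) {
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}
```

This is what makes directive-based porting incremental: the same source remains a valid sequential program.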

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
  !$ACC serial
  do i=1,n
    do j=1,n
      ...
    enddo
    do j=1,n
      ...
    enddo
  enddo
  !$ACC end serial

Offloaded execution - Directive serial

  !$ACC serial
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
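The update rule used throughout these Game of Life listings can be isolated as a small C helper (a sketch with an illustrative grid size, not the course's Fortran file): the neighbour count and the life/death conditions mirror those of the loop above.

```c
/* One Game of Life update for an interior cell, mirroring the rules
   above: a live cell with fewer than 2 or more than 3 live neighbours
   dies; a dead cell with exactly 3 live neighbours becomes alive. */
#define N 8   /* illustrative grid size */

int next_state(const int oworld[N][N], int r, int c) {
    int neigh = oworld[r-1][c-1] + oworld[r][c-1] + oworld[r+1][c-1]
              + oworld[r-1][c]                    + oworld[r+1][c]
              + oworld[r-1][c+1] + oworld[r][c+1] + oworld[r+1][c+1];
    if (oworld[r][c] == 1 && (neigh < 2 || neigh > 3))
        return 0;              /* under- or over-population */
    else if (neigh == 3)
        return 1;              /* birth */
    return oworld[r][c];       /* unchanged otherwise */
}
```

Because each cell reads only the previous generation (oworld) and writes only the new one (world), all cells of one generation are independent, which is what makes the nest a good offloading candidate.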

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran
  !$ACC kernels
  do i=1,n
    do j=1,n
      ...
    enddo
    do j=1,n
      ...
    enddo
  enddo
  !$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

  !$ACC kernels
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

  !$ACC parallel
  a = 2           ! Gang-redundant
  !$ACC loop      ! Work sharing
  do i=1,n
    ...
  enddo
  !$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

  !$ACC parallel
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

  program param
    integer :: a(1000)
    integer :: i
    !$ACC parallel num_gangs(10) num_workers(1) &
    !$ACC& vector_length(128)
    print *, "Hello! I am a gang"
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel
  end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

               Work sharing (loop iterations)   Activation of new threads
  loop gang              X
  loop worker            X                                X
  loop vector            X                                X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility
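The collapse clause can be sketched in C/OpenACC syntax (names and sizes are illustrative, and a plain C compiler ignores the pragma): the two tightly nested loops are merged into a single iteration space of n*n independent iterations, which gives the compiler more parallelism to distribute.

```c
#include <stddef.h>

/* Fills an n x n matrix stored row-major. collapse(2) asks the
   compiler to merge both loops into one 0..n*n-1 iteration space. */
void fill_matrix(double *a, size_t n) {
    #pragma acc parallel loop collapse(2) copyout(a[0:n*n])
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            a[i*n + j] = (double)(i + j);
}
```

collapse is most useful when the outer loop alone has too few iterations to fill the device.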

Work sharing - Loops

  !$ACC parallel
  do g=1,generations
    !$ACC loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial
  !$acc serial
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end serial
exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GRWSVS
  !$acc parallel num_gangs(10)
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS
  !$acc parallel num_gangs(10)
  !$acc loop gang
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS
  !$acc parallel num_gangs(10)
  !$acc loop gang worker
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP
  !$acc parallel num_gangs(10)
  !$acc loop gang vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP
  !$acc parallel num_gangs(10)
  !$acc loop gang worker vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS
  !$acc parallel num_gangs(10)
  !$acc loop worker
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP
  !$acc parallel num_gangs(10)
  !$acc loop vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP
  !$acc parallel num_gangs(10)
  !$acc loop worker vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GRWPVP.f90
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

  !$acc parallel
  !$acc loop gang
  do i=1,nx
    A(i)=1.0_8
  end do
  !$acc loop gang reduction(+:somme)
  do i=nx,1,-1
    somme=somme+A(i)
  end do
  !$acc end parallel
exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

  !$acc parallel
  !$acc loop worker vector
  do i=1,nx
    A(i)=1.0_8
  end do
  !$acc loop worker vector reduction(+:somme)
  do i=nx,1,-1
    somme=somme+A(i)
  end do
  !$acc end parallel
exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)
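The safe accumulation pattern can be sketched in C/OpenACC syntax (illustrative names; a plain C compiler ignores the pragma and runs the loop sequentially): the reduction clause gives each thread a private partial sum, combined only when the loop ends, so no thread ever races on the shared variable.

```c
#include <stddef.h>

/* Sums an array with an OpenACC reduction: each thread accumulates
   into a private copy of 'sum', and the partial sums are combined
   at the end of the loop. */
double array_sum(const double *a, size_t n) {
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```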

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

  program fused
    integer :: a(1000), b(1000), i
    !$ACC kernels <clauses>
    !$ACC loop <clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
    !$ACC end kernels
    ! The construct above is equivalent to
    !$ACC kernels loop <kernels and loop clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
  end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators:

  Operation  Effect       Language(s)       Operation  Effect       Language(s)
  +          Sum          Fortran/C(++)     iand       Bitwise and  Fortran
  *          Product      Fortran/C(++)     ior        Bitwise or   Fortran
  max        Maximum      Fortran/C(++)     ieor       Bitwise xor  Fortran
  min        Minimum      Fortran/C(++)     .and.      Logical and  Fortran
  &          Bitwise and  C(++)             .or.       Logical or   Fortran
  |          Bitwise or   C(++)
  &&         Logical and  C(++)
  ||         Logical or   C(++)

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
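The max operator from the table can be sketched the same way as the sum (illustrative names; the pragma is ignored by a plain C compiler): each thread keeps a private running maximum, and the per-thread maxima are combined at the end of the loop.

```c
#include <stddef.h>

/* Maximum of an array using the 'max' reduction operator from the
   table above; n is assumed to be at least 1. */
double array_max(const double *a, size_t n) {
    double m = a[0];
    #pragma acc parallel loop reduction(max:m) copyin(a[0:n])
    for (size_t i = 1; i < n; ++i)
        if (a[i] > m) m = a[i];
    return m;
}
```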

Parallelism - Reductions example

  !$ACC parallel
  do g=1,generations
    !$ACC loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    !$ACC loop reduction(+:cells)
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Figures: example with coalescing / example without coalescing)

Parallelism - Memory accesses and coalescing

  !$acc parallel
  !$acc loop gang
  do i=1,nx
    !$acc loop vector
    do j=1,nx
      A(i,j)=1.14_8
    end do
  end do
  !$acc end parallel
exemples/loop_nocoalescing.f90

  !$acc parallel
  !$acc loop gang
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j)=1.14_8
    end do
  end do
  !$acc end parallel
exemples/loop_coalescing.f90

  !$acc parallel
  !$acc loop gang vector
  do i=1,nx
    !$acc loop seq
    do j=1,nx
      A(i,j)=1.14_8
    end do
  end do
  !$acc end parallel
exemples/loop_coalescing.f90

            Without coalescing   With coalescing
  Tps (ms)  43.9                 1.6

The memory coalescing version is 27 times faster.
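The same experiment can be sketched in C (illustrative names; a plain C compiler ignores the pragmas, so both versions compute the same matrix). Note that C arrays are row-major, the opposite of Fortran: here the contiguous, coalescing-friendly index is the last one, so the vector loop should run over j.

```c
#include <stddef.h>

enum { NX = 32 };  /* illustrative size */

/* Stride-NX accesses: the vector loop runs over the FIRST index,
   so consecutive threads touch non-contiguous memory. */
void fill_strided(double a[NX][NX]) {
    #pragma acc parallel loop gang
    for (size_t j = 0; j < NX; ++j) {
        #pragma acc loop vector
        for (size_t i = 0; i < NX; ++i)
            a[i][j] = 1.14;
    }
}

/* Stride-1 accesses: the vector loop runs over the LAST index
   (row-major), so consecutive threads touch contiguous memory. */
void fill_contiguous(double a[NX][NX]) {
    #pragma acc parallel loop gang
    for (size_t i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (size_t j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```

Both produce identical results; on a GPU only the second keeps the memory accesses of a worker mergeable.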

Parallelism - Memory accesses and coalescing

(Diagrams: mapping of threads T0..T7 onto the elements of A for the three variants)
• Without coalescing, gang(i)/vector(j): consecutive threads of a worker access elements separated by a full column stride, so their memory accesses cannot be merged
• With coalescing, gang(j)/vector(i): consecutive threads access contiguous elements (Fortran arrays are column-major), so their accesses are merged
• gang-vector(i)/seq(j): each thread walks sequentially along j while consecutive threads stay contiguous along i, so the accesses are also merged

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

  program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel
    !$ACC loop
    do i=1,s
      call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
  contains
    subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
        array(i,j) = 2
      enddo
    end subroutine fill
  end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

  program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel copyout(array)
    !$ACC loop
    do i=1,s
      call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
  contains
    subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
        array(i,j) = 2
      enddo
    end subroutine fill
  end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
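A sketch combining the movement clauses on one data region (C/OpenACC syntax, illustrative names; a plain C compiler ignores the pragmas and runs the loops on the host): b is only read (copyin), a is only produced (copyout), and tmp is scratch that, on a GPU, would never travel between host and device (create).

```c
#include <stddef.h>

/* b: input only (copyin), a: output only (copyout),
   tmp: device-only scratch (create). */
void transform(const double *b, double *a, double *tmp, size_t n) {
    #pragma acc data copyin(b[0:n]) copyout(a[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            tmp[i] = 2.0 * b[i];    /* intermediate stays on the device */
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            a[i] = tmp[i] + 1.0;
    }
}
```

Choosing the weakest sufficient clause (create instead of copy, copyin instead of copy) is the main lever for cutting transfer time.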

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

  ! We copy a 2d array on the GPU for matrix a
  ! In Fortran we can omit the shape: copy(a)
  !$ACC parallel loop copy(a(1:1000,1:1000))
  do i=1,1000
    do j=1,1000
      a(i,j) = 0
    enddo
  enddo
  ! We copy columns 100 to 199 included
  !$ACC parallel loop copy(a(:,100:199))
  do i=1,1000
    do j=100,199
      a(i,j) = 42
    enddo
  enddo
exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

  // We copy the array a by giving the first element and the size of the array
  #pragma acc parallel loop copy(a[0:1000][0:1000])
  for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
      a[i][j] = 0;
  // Copy columns 99 to 198
  #pragma acc parallel loop copy(a[0:1000][99:100])
  for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
      a[i][j] = 42;
exemples/arrayshape.c
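The C [start:count] convention is a frequent source of off-by-one errors, so here is a 1D sketch (sizes illustrative; the pragma is ignored by a plain C compiler): a[100:50] designates the 50 elements at indices 100..149, not "indices 100 to 50".

```c
/* The data clause mirrors the touched slice with C's [start:count]
   syntax: a[100:50] = 50 elements starting at index 100. */
void fill_slice(int *a) {
    #pragma acc parallel loop copy(a[100:50])
    for (int i = 100; i < 150; ++i)
        a[i] = 42;
}
```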

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

  program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))   ! opens both a data region and a parallel region
    !$ACC loop
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel
  end program para

exemples/parallel_data.f90

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

  program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel

    do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
        a(i) = a(i) + 1
      enddo
      !$ACC end parallel
    enddo
  end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  !$ACC data copy(a(1:1000))
  do l=1,100
    !$ACC parallel
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
  !$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.
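The single-data-region pattern, with the region opened before the first kernel as the slide recommends, can be sketched in C (illustrative names; with the pragmas ignored by a plain C compiler the arithmetic is unchanged): on a GPU, 'a' would stay resident on the device across all 101 kernels, with transfers only at the region boundaries.

```c
enum { N_ELEMS = 1000 };  /* illustrative size */

/* One data region enclosing the initialization kernel and the 100
   increment kernels: 'a' is transferred once, at region exit. */
void init_and_increment(int *a) {
    #pragma acc data copyout(a[0:N_ELEMS])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N_ELEMS; ++i)
            a[i] = i;
        for (int l = 0; l < 100; ++l) {
            #pragma acc parallel loop
            for (int i = 0; i < N_ELEMS; ++i)
                a[i] = a[i] + 1;
        }
    }
}
```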

Data on device

Naive version: two successive parallel regions, each with its own implicit data region carrying copyin(A), copyout(B), copy(C) and create(D).
• Transfers: 8
• Allocations: 2

Optimizing data transfers: let's add an explicit data region enclosing both parallel regions, with copyin(A), copyout(B), copy(C) and create(D). A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause on the inner parallel regions. To make sure the updates are done, use the update directive (update device, update self).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

  !$ACC data copyout(a)
  !$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo
  do j=1,42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
  ! print *, "before update self", a(42)
  ! !$ACC update self(a(42:42))
  ! print *, "after update self", a(42)
  !$ACC serial
  a(42) = 42
  !$ACC end serial
  print *, "before end data", a(42)
  !$ACC end data
  print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the array a is not initialized on the host before the end of the data region.

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

The code is the same as on the previous slide, with the update directive active this time:

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in a device clause are updated H→D.


Data on device - Global data regions: enter data / exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the program unit where it appears. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

Figure: contents of a, b, c and d on the host and on the device over the lifetime of a region opened with copyin(a,c) create(b) copyout(d). Starting from host values a=10, b=-4, c=42, d=0: at entry, a and c are copied H→D while b and d are only allocated on the device. Kernels then compute b=26 and d=158 on the device; update self(b) refreshes the host copy of b (D→H) during the region. At exit, only d is copied back (copyout, D→H), so the host ends with a=10, b=26, c=42, d=158.


Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

Figure: executed synchronously, Kernel 1, the data transfer and Kernel 2 follow each other on a single time line; with async on Kernel 1 and the transfer, or with everything async, they overlap.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

Figure: the transfer runs on its own time line and Kernel 2 waits for the transfer before starting.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).

• Example of code that has a race condition:


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)            +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs:

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)            +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life
Let's optimize data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)            +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses with notes)

GPU offloading
  PARALLEL (GR/WS/VS mode, each gang executes the instructions redundantly)  ↔  TEAMS (creates teams of threads that execute redundantly)
  SERIAL (only 1 thread executes the code)  ↔  TARGET (GPU offloading, initial thread only)
  KERNELS (offloading + automatic parallelization)  ↔  does not exist in OpenMP

Implicit transfers H→D or D→H
  scalar variables, non-scalar variables  ↔  scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
  clauses of PARALLEL/SERIAL/KERNELS  ↔  clauses of TARGET (data movement defined from the host perspective)
  copyin() (copy H→D when entering the construct)  ↔  map(to:)
  copyout() (copy D→H when leaving the construct)  ↔  map(from:)
  copy() (copy H→D when entering and D→H when leaving)  ↔  map(tofrom:)
  create() (created on device, no transfers)  ↔  map(alloc:)
  present(), delete()


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses with notes)

Data region
  declare (adds data to the global data region)
  data / end data (inside the same programming unit)  ↔  TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
  enter data / exit data (opening and closing anywhere in the code)  ↔  TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
  update self() (update D→H)  ↔  UPDATE FROM() (copy D→H inside a data region)
  update device() (update H→D)  ↔  UPDATE TO() (copy H→D inside a data region)

Loop parallelization
  kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops)  ↔  does not exist in OpenMP
  loop gang/worker/vector (share iterations among gangs, workers and vectors)  ↔  distribute (share iterations among TEAMS)


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses with notes)

Routines
  routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
  async()/wait (execute kernels and transfers asynchronously, with explicit synchronization)  ↔  nowait/depend (manage asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 =

32times80times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 = 32

times80times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 = 32times80

times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 = 32times80times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - The ways to GPU

Applications

Programming effort and technical expertise

Libraries bull cuBLAS bull cuSPARSE bull cuRAND bull AmgX bull MAGMA

bull Minimum change in the code bull Maximum performance

Directives bull OpenACC bull OpenMP 50

bull Portable simple low intrusiveness

bull Still efficient

Programming languages bull CUDA bull OpenCL

bull Complete rewriting complex bull Non-portable bull Optimal performance

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 6 77

Introduction - Short history

OpenACCbull httpwwwopenaccorgbull Cray NVidia PGI CAPSbull First standard 10 (112011)bull Latest standard 30 (112019)bull Main compilers

minus PGIminus Cray (for Cray hardware)minus GCC (since 57)

OpenMP targetbull httpwwwopenmporgbull First standard OpenMP 45 (112015)bull Latest standard OpenMP 50 (112018)bull Main compilers

minus Cray (for Cray hardware)minus GCC (since 7)minus CLANGminus IBM XLminus PGI support announced

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 7 77

Introduction - Architecture of a converged node on Jean Zay

PLX PLX

900 GBs

16 GBs

128 GBs

25 GBs

124 GBs

1 co

nve

rged

no

de

on

Je

an Z

ay

Ho

st (CP

U)

Device

(GP

U)

Reacuteseau

Architecture of a converged node on Jean Zay (40 cores 192 GB of memory 4 GPU V100)

208 GBs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 8 77

Introduction - Architecture CPU-GPU

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 9 77

Introduction - Execution model OpenACC

Execution controlled by the host

Launch kernel1

Launch kernel2

Launch kernel3

CPU

Acc

In OpenACC the execution is controlled by the host (CPU) It means that kernels data transfersand memory allocation are managed by the host

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 10 77

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded executionbull Coarse grain Gangbull Fine grain workerbull Vectorization vectorbull Sequential seq

By specifying clauses within OpenACC directives your code will be run with a combination of thefollowing modes bull Gang-Redundant (GR) All gangs run the same instructions redundantelybull Gang-Partitioned (GP) Work is shared between the gangs (ex $ACC loop gang)bull Worker-Single (WS) One worker is active in GP or GR modebull Worker-Partitioned (WP) Work is shared between the workers of a gangbull Vector-Single (VS) One vector channel is activebull Vector-Partitioned (VP) Several vector channels are active

These modes must be combined in order to get the best performance from the acclerator

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 11 77

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of theaccelerator These threads are grouped within teams of the same size with one master thread perteam (this defines a gang) Each team is spread on a 2D thread-grid (worker x vector) Oneworker is actually a vector of vectorlength threads So the total number of threads is

nbthreads = nbgangs lowast nbworkers lowast vectorlength

Important notesbull No synchronization is possible between

gangsbull The compiler can decide to synchronize the

threads of a gang (all or part of them)bull The threads of a worker run in SIMT mode

(all threads run the same instruction at thesame time for example on NVidia GPUsgroups of 32 threads are formed)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77

Introduction - Execution model

NVidia P100 restrictions on GangWorkerVectorbull The number of gangs is limited to 231 minus 1bull The thread-grid size (ie nbworkers x vectorlength) is limited to 1024bull Due to register limitations the size of the grid should be less than or equal to 256 if the

programmer wants to be sure that a kernel can be launchedbull The size of a worker (vectorlength) should be a multiple of 32bull PGI limitation In a kernel that contains calls to external subroutines (not seq) the size of a

worker is set at to 32

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77

Introduction - Execution model

ACC parallel (starts gangs)

ACC loop worker

ACC loop vector

ACC parallel (starts gangs)

ACC loop worker vector

ACC kernels num_gangs(2) num_workers(5) vector_length(8)

Active threads2

10

80

10

2

2

80

2

80

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77

Introduction - Porting strategy

When using OpenACC it is recommended tofollow this strategy

1 Identify the compute intensive loops(Analyze)

2 Add OpenACC directives (Parallelization)

3 Optimize data transfers and loops(Optimizations)

4 Repeat steps 1 to 3 until everything is onGPU

Optimize data

Analyze

Parallelizeloops

Optimize loops

Pick a kernel

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77

PGI compiler

During this training course we will use PGI forthe pieces of code containing OpenACCdirectives

Informationbull Company founded in 1989bull Acquired by NVidia in 2013bull Develops compilers debuggers and

profilers

Compilersbull Latest version 1910bull Hands-on version 1910bull C pgccbull C++ pgc++bull Fortran pgf90 pgfortran

Activate OpenACCbull -acc Activates OpenACC supportbull -ta=ltoptionsgt OpenACC optionsbull -Minfo=accel USE IT Displays information

about compilation The compiler will doimplicit operations that you want to be awareof

Toolsbull nvprof CPUGPU profilerbull nvvp nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta

Compute capabilityEach GPU generation has increased capabilitiesFor NVidia hardware this is reflected by anumberbull K80 cc35bull P100 cc60bull V100 cc70

For example if you want to compile for V100 -ta=teslacc70

Memory managementbull pinned The memory location on the host is

pinned It might improve data transfersbull managed The memory of both the host and

the device(s) is unified

Complete documentationhttpswwwpgroupcomresourcesdocs1910x86pvf-user-guideindexhtmta

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

1program loopi n t e g e r a (10000)i n t e g e r i$ACC p a r a l l e l loopdo i =1 10000

6 a ( i ) = ienddo

end program loop

exemplesloopf90

program reduc t ion2 i n t e g e r a (10000)

i n t e g e r ii n t e g e r sum=0$ACC p a r a l l e l loopdo i =1 10000

7 a ( i ) = ienddo$ACC p a r a l l e l loop reduc t ion ( + sum)do i =1 10000

sum = sum + a ( i )12 enddo

end program reduc t ion

exemplesreductionf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

$ACC p a r a l l e l loop2 do i =1 10000

a ( i ) = ienddo$ACC p a r a l l e l loop reduc t ion ( + sum)do i =1 10000

7 sum = sum + a ( i )enddo

exemplesreductionf90

To avoid CPU-GPU communication a data region is added

$ACC data create ( a (1 10000) )$ACC p a r a l l e l loop

3 do i =1 10000a ( i ) = i

enddo$ACC p a r a l l e l loop reduc t ion ( + sum)do i =1 10000

8 sum = sum + a ( i )enddo$ACC end data

exemplesreduction_data_regionf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code

PGI_ACC_TIMEbull Command line toolbull Gives basic information

aboutminus Time spent in kernelsminus Time spent in data

transfersminus How many times a kernel

is executedminus The number of gangs

workers and vector sizemapped to hardware

nvprofbull Command line toolbull Options which can give you

a fine view of your code

nvvppgprofNSightGraphics Profilerbull Graphical interface for

nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers it ispossible to have an automatic profiling of thecode by setting the following environmentvariable PGI_ACC_TIME=1The grid is the number of gangs The block is thesize of one gang ([vectorlength times nbworkers])

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 21 77

Analysis - Code profiling nvprofNvidia provides a command line profiler nvprofIt gives the time spent in the kernels and data transfers for all GPU regionsCaution PGI_ACC_TIME and nvprof are incompatible Ensure that PGI_ACC_TIME=0 beforeusing the tool

$ nvprof ltoptionsgt executable_file ltarguments of executable filegt

1 Type Time() Time Ca l l s Avg Min Max NameGPU a c t i v i t i e s 5351 270662s 500 54132ms 448ns 68519ms [CUDA memcpy HtoD ]

3672 185742s 300 61914ms 384ns 92907ms [CUDA memcpy DtoH ]634 32052ms 100 32052ms 31372ms 36247ms gol_39_gpu262 13239ms 100 13239ms 12727ms 13659ms gol_33_gpu

6 081 40968ms 100 40968us 40122us 41725us gol_52_gpu001 35927us 100 35920us 33920us 40640us gol_55_gpu__red

exemplesprofilsgol_opti2prof

Important notesbull Only the GPU is profiled by defaultbull Option ldquo--cpu-profiling onrdquo activates CPU profilingbull Options ldquo--metrics flop_count_dp --metrics dram_read_throughput --metrics

dram_write_throughputrdquo give respectively the number of operations memory readthroughput and memory write throughput for each kernel running on the CPU

bull To get a file readable by nvvp you have to specify ldquo-o filenamerdquo (add -f in case you want tooverwrite a file)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 22 77

Analysis - Graphical profiler nvvpNvidia also provides a GUI for nvprof nvvpIt gives the time spent in the kernels and data transfers for all GPU regionsCaution PGI_ACC_TIME and nvprof are incompatible Ensure that PGI_ACC_TIME=0 beforeusing the tool

$ nvvp ltoptionsgt executable lt--args arguments of executablegt

or if you want to open a profile generated with nvprof

$ nvvp ltoptionsgt progprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 23 77

Analysis - Graphical profiler NVVP

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 24 77

1 Introduction

2 ParallelismOffloaded Execution

Add directivesCompute constructsSerialKernelsTeamsParallel

Work sharingParallel loopReductionsCoalescing

RoutinesRoutines

Data ManagementAsynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 25 77

Add directives

To activate OpenACC or OpenMP features you need to add directives If the right compileroptions are not set the directives are treated as commentsThe syntax is different for a Fortran or a CC++ code

FortranOpenACC

$ACC d i r e c t i v e ltclauses gt $ACC end d i r e c t i v e

CC++OpenACC

pragma acc d i r e c t i v e ltclauses gt2

Some exampleskernels loop parallel data enter data exit dataIn this presentation the examples are in Fortran

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 26 77

Execution offloading

To run on the device you need to specify one of the 3 following directives (called computeconstruct)

serial

The code runs one singlethread (1 gang with 1 worker ofsize 1) Only one kernel is gen-erated

kernels

The compiler analyzes the codeand decides where parallelismis generatedOne kernel is generated foreach parallel loop enclosed inthe region

parallel

The programmer creates a par-allel region that runs on the de-vice Only one kernel is gener-ated The execution is redun-dant by defaultbull Gang-redundantbull Worker-Singlebull Vector-Single

The programmer has to sharethe work manually

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The directive serial tells the compiler that theenclosed region of code have to be offloaded tothe GPUInstructions are executed only by one threadThis means that one gang with one worker ofsize one is generatedThis is equivalent to a parallel region with thefollowing parameters num_gangs(1)num_workers(1) vector_length(1)

Data Default behaviorbull Arrays present in the serial region not

specified in a data clause (present copyincopyout etc) or a declare directive areassigned to a copy They are SHARED

bull Scalar variables are implicitly assigned to afirstprivate clause The are PRIVATE

Fortran1 $ACC s e r i a l

do i =1 ndo j =1 n enddo

6do j =1 n enddo

enddo11 $ACC end s e r i a l

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77

Offloaded execution - Directive serial

$ACC s e r i a ldo g=1 generat ions

do r =1 rowsdo c=1 co ls

35 oworld ( r c ) = wor ld ( r c )enddo

enddodo r =1 rows

do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

45 else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

enddo50 c e l l s = 0

do r =1 rowsdo c=1 co ls

c e l l s = c e l l s + wor ld ( r c )enddo

55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end s e r i a l

exemplesgol_serialf90

Testbull Size 1000x1000bull Generations 100bull Elapsed time 123640 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that theregion contains instructions to be offloaded onthe deviceEach loop nest will be treated as an independantkernel with its own parameters (number ofgangs workers and vector size)

Data Default behaviorbull Data arrays present inside the kernels

region and not specified inside a dataclause (present copyin copyout etc) orinside a declare are assigned to a copyclause They are SHARED

bull Scalar variables are implicitly assigned to acopy clause They are SHARED

Fortran$ACC kerne ls

2do i =1 ndo j =1 n enddo

7 do j =1 n enddo

enddo$ACC end kerne ls

12

Important notesbull The parameters of the parallel regions are

independent (gangs workers vector size)bull Loop nests are executed consecutively

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77

Execution offloading - Kernels

$ACC kerne lsdo g=1 generat ions

do r =1 rowsdo c=1 co ls

35 oworld ( r c ) = wor ld ( r c )enddo

enddodo r =1 rows

do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

45 else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

enddo50 c e l l s = 0

do r =1 rowsdo c=1 co ls

c e l l s = c e l l s + wor ld ( r c )enddo

55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end kerne ls

exemplesgol_kernelsf90

Testbull Size 1000x1000bull Generations 100bull Execution time 1951 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region onthe device and generates one or more gangs Allgangs execute redundantly the instructions metin the region (gang-redundant mode) Onlyparallel loop nests with a loop directive mighthave their iterations spread among gangs

Data default behaviorbull Data arrays present inside the region and

not specified in a data clause (presentcopyin copyout etc) or a declare directiveare assigned to a copy clause They areSHARED

bull Scalar variables are implicitely assigned to afirstprivate clause They are PRIVATE

$ACC p a r a l l e l2a = 2 Gangminusredundant

$ACC loop Work shar ingdo i = 1 n enddo

7 $ACC end p a r a l l e l

Important notesbull The number of gangs workers and the vector size are constant inside the region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC&         vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared among those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared among those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing        Activation of
              (loop iterations)   new threads
loop gang     X
loop worker   X                   X
loop vector   X                   X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.
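The clauses above can be combined on a single loop directive. A minimal C sketch (not from the course; the function name is illustrative, and the pragma is ignored by a non-OpenACC compiler, so the code also runs sequentially with the same result):

```c
/* collapse(2) merges the two loops into one iteration space, tmp is
   privatized per thread, and total is combined with a + reduction. */
long weighted_sum(int rows, int cols)
{
    long total = 0;
    int tmp;
    #pragma acc parallel loop collapse(2) private(tmp) reduction(+:total)
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            tmp = r * cols + c;   /* private scratch value per thread */
            total += tmp;         /* accumulated across all iterations */
        }
    }
    return total;
}
```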

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of each gang
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at the gang level, especially at the end of a loop directive, so there is a risk of race conditions. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait for the end of the iterations they execute before starting the portion of code located after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in the table below is executed to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
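A minimal sum-reduction sketch in C (not from the course; the function name is illustrative, and the pragma is ignored by a non-OpenACC compiler, so the code also runs sequentially with the same result):

```c
/* Each gang/worker/vector thread gets a private copy of 'cells'
   initialized for the + operator; the copies are combined at the end of
   the region. */
int count_alive(const int *world, int n)
{
    int cells = 0;
    #pragma acc parallel loop reduction(+:cells)
    for (int i = 0; i < n; ++i)
        cells += world[i];
    return cells;
}
```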

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs. example without coalescing]
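The same rule can be sketched in C, keeping in mind that C is row-major (the opposite of Fortran's column-major layout), so the contiguous index is the last one. This sketch is not from the course; the name is illustrative, and the pragmas are ignored by a non-OpenACC compiler, so it also runs sequentially:

```c
#define NX 64

/* The loop with vector parallelism runs over j, the contiguous (last)
   index in C: threads i and i+1 of a worker then touch adjacent memory
   locations, so their accesses can be merged (coalesced). */
void init_coalesced(double a[NX][NX])
{
    #pragma acc loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector   /* j is contiguous in memory */
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```

Swapping the pragmas (vector on i, gang on j) would give each vector thread a stride of NX doubles between accesses, which prevents coalescing.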

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mappings of the array A for the three loop orderings. Without coalescing, gang(i)/vector(j): the vector threads T0-T7 of a worker access elements of the same row A(i,:), which are nx apart in memory, so their accesses cannot be merged. With coalescing, gang(j)/vector(i): the vector threads T0-T7 access consecutive elements of a column A(:,j), so their accesses are merged. gang-vector(i)/seq(j): each thread sweeps its row sequentially, and at every step of the j loop the threads access consecutive elements A(i:i+7,j), so the accesses are also merged.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
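A minimal C sketch combining copyin, create and copyout (not from the course; names are illustrative, and without an OpenACC compiler the pragmas are ignored and the arrays are simply shared, so the result is the same):

```c
#define N 100

/* b is copied H->D at region entry, c is only allocated on the device
   (its host values are never transferred), and a is copied D->H at
   region exit. */
void scale(double *a, const double *b)
{
    double c[N];   /* device-only scratch thanks to create */
    #pragma acc data copyin(b[0:N]) create(c[0:N]) copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            c[i] = 2.0 * b[i];     /* written on the device only */
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = c[i] + 1.0;     /* a is copied back at region exit */
    }
}
```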

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region is enclosed in its own implicit data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version
[Figure: two successive parallel regions, each with its own implicit data region; A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of each region.]
• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Figure: the same two parallel regions, each still moving A (copyin), B (copyout), C (copy) and D (create) separately.]
Let's add a data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Figure: an enclosing data region now holds both parallel regions; A (copyin), B (copyout) and C (copy) are handled at the entry and exit of the data region, and D (create) is allocated once.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Figure: inside the data region, the parallel regions now use present for A, B, C and D; an update device refreshes A and an update self retrieves C during the region.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC&              copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC&              copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
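A minimal C sketch of update device (not from the course; names are illustrative, and without an OpenACC compiler the pragmas are ignored and host and device share the data, so the result is identical):

```c
#define N 10

/* The host changes 'scale' between two kernels; update device pushes the
   new value H->D before the second kernel uses it. */
void run(double *a)
{
    double scale = 2.0;
    #pragma acc data copyout(a[0:N]) copyin(scale)
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = scale * i;
        scale = 10.0;                      /* modified on the host */
        #pragma acc update device(scale)   /* push the new value H->D */
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] += scale;                 /* uses the updated device copy */
    }
}
```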

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device; kernels then modify the device copies; update self(b) copies b D→H during the region; at exit of the data region, d is copied D→H while the host copies of a, b and c keep their current values.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, kernel 1, the data transfer and kernel 2 run one after the other; with kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
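The same idea can be sketched in C (not from the course; the name is illustrative, and without an OpenACC compiler the pragmas are ignored and everything runs in order, so the result is unchanged):

```c
#define N 1000

/* Two independent kernels are launched on execution threads (queues)
   1 and 2; wait blocks until both complete before the host reads the
   results. */
void fill_async(int *a, int *c)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < N; ++i)
        a[i] = 42;
    #pragma acc parallel loop async(2)
    for (int i = 0; i < N; ++i)
        c[i] = 11;
    #pragma acc wait(1, 2)   /* join both execution threads */
}
```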

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the data transfer to complete before starting.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of a code that has a race condition:

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Translation table: OpenACC 2.7 / OpenMP 5.0 (continued)

Feature: data regions
• OpenACC declare (adds data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Feature: loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)

Translation table: OpenACC 2.7 / OpenMP 5.0 (continued)

Feature: routines
• OpenACC routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism

Feature: asynchronism
• OpenACC async()/wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ OpenMP nowait/depend: manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
      • Serial
      • Kernels/Teams
      • Parallel
    • Work sharing
      • Parallel loop
      • Reductions
      • Coalescing
    • Routines
    • Data Management
    • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts
Page 5: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2×20 = 40 cores
• Accelerated with 4 Nvidia V100: 32×80×4 = 10240 cores


Introduction - The ways to GPU

Porting an application requires increasing programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness; still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable; optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100), showing the host (CPU), the devices (GPU) and the network, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s and 12.4 GB/s]

Introduction - Architecture CPU-GPU


Introduction - Execution model OpenACC

[Figure: timeline showing the CPU successively launching kernel1, kernel2 and kernel3 on the accelerator]

In OpenACC the execution is controlled by the host (CPU): kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (e.g. !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)

Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2³¹ − 1
• The thread-grid size (i.e. nbworkers × vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads along the execution of !$ACC kernels num_gangs(2) num_workers(5) vector_length(8): 2 threads while the gangs execute redundantly, 10 inside !$ACC loop worker, 80 inside !$ACC loop vector and inside !$ACC loop worker vector, back to 2 between the parallel loops]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: porting cycle: pick a kernel → analyze → parallelize loops → optimize data → optimize loops]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1, 10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum = 0
   !$ACC parallel loop
   do i=1, 10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1, 10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler:
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vectorlength × nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler NVVP


1 Introduction

2 ParallelismOffloaded Execution

Add directivesCompute constructsSerialKernelsTeamsParallel

Work sharingParallel loopReductionsCoalescing

RoutinesRoutines

Data ManagementAsynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>
...

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000×1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000×1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2        ! gang-redundant
!$ACC loop   ! work sharing
do i=1, n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000×1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility

Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial:

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS:

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vectorlength
• Number of operations: nx

Execution GP/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vectorlength
• Number of operations: nx

Execution GR/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vectorlength
• Number of operations: 10 × nx

Execution GR/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × workers × vectorlength
• Number of operations: 10 × nx

Work sharing - Loops

Reminder There is no thread synchronization at gang level especially at the end of a loopdirective There is a risk of race condition Such risk does not exist for loop with worker andorvector parallelism since the threads of a gang wait until the end of the iterations they execute tostart a new portion of the code after the loopInside a parallel region if several parallel loops are present and there are some dependenciesbetween them then you must not use gang parallelism

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry — 41/77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both original constructs.

For example with kernels directive

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in table 1 is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C/C++
*           Product       Fortran, C/C++
max         Maximum       Fortran, C/C++
min         Minimum       Fortran, C/C++
&           Bitwise and   C/C++
|           Bitwise or    C/C++
&&          Logical and   C/C++
||          Logical or    C/C++
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)               +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»


Parallelism - Memory accesses and coalescing

Without coalescing:

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

With coalescing:

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)         43.9                1.6

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0..T7 to the elements of A for the three variants]
• Without coalescing: gang(i), vector(j)
• With coalescing: gang(j), vector(i)
• With coalescing: gang-vector(i), seq(j)



Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
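As a minimal sketch of how these clauses combine on one data region (the names a, b, tmp and n are illustrative, not from the course examples):

```fortran
! a is only read, b is only written, tmp never leaves the device
!$ACC data copyin(a(1:n)) copyout(b(1:n)) create(tmp(1:n))
!$ACC parallel loop
do i = 1, n
   tmp(i) = 2.0_8 * a(i)    ! uses the device copy of a (copied H->D at entry)
   b(i)   = tmp(i) + 1.0_8  ! b is copied D->H when leaving the region
end do
!$ACC end data
```

With this split, only a is transferred at entry and only b at exit; tmp is allocated on the device and never transferred.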


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1).
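A small sketch of the omitted-first-index shorthand (the array a and its bounds are illustrative): the two directives below request the same 100 elements.

```fortran
!$ACC data copyin(a(1:100))   ! explicit bounds
!$ACC end data

!$ACC data copyin(a(:100))    ! first index omitted: defaults to 1 in Fortran
!$ACC end data
```

In C/C++ the equivalent shorthand a[:100] starts at index 0.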


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

(The compute construct opens a parallel region and, implicitly, an associated data region.)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!



Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.


Data on device

Naive version: two successive parallel regions (each with its own implicit data region) handle four variables with the clauses copyin(A), copyout(B), copy(C) and create(D), repeated for each region.
• Transfers: 8
• Allocations: 2


Data on device

Optimizing data transfers: the same clauses (copyin(A), copyout(B), copy(C), create(D)) are still attached to each parallel region. Let's add an enclosing data region.


Data on device

With an enclosing data region carrying copyin(A), copyout(B), copy(C) and create(D), A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1


Data on device

Inside the data region, the parallel regions now use the present clause for A, B, C and D, with update device and update self directives where refreshes are needed. The clauses check for data presence anyway, so using present makes the code clearer; to make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update:

The a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update:

The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
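A minimal sketch of this clause (the array b and the size n are illustrative): after modifying b on the host inside a data region, push the new values to the device with update device before the next kernel reads them.

```fortran
!$ACC data copyin(b(1:n))
   b(1:n) = b(1:n) + 1        ! host-side modification: the device copy is now stale
!$ACC update device(b(1:n))   ! refresh the device copy H->D
!$ACC parallel loop
   do i = 1, n
      b(i) = 2 * b(i)         ! operates on the updated values
   end do
!$ACC end data
```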


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: host/device time line for a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b is allocated on the device; the device then computes new values; update self(b) copies b D→H; at exit of the region, d is copied D→H while the host copies of a, b and c keep their last values.]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: time lines of kernel 1, a data transfer and kernel 2 for three cases — fully synchronous; kernel 1 and the transfer asynchronous; everything asynchronous]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the data transfer to complete before starting]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90



Debugging - PGI autocompare

• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:



Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)               +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ! ... (listing truncated in the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)               +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)               +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.



Feature: GPU offloading

OpenACC 2.7                                              OpenMP 5.0
PARALLEL — GR-WS-VS mode; each gang executes             TEAMS — creates teams of threads that
instructions redundantly                                 execute redundantly
SERIAL — only 1 thread executes the code                 TARGET — GPU offloading, initial thread only
KERNELS — offloading + automatic parallelization         (does not exist in OpenMP)

Feature: implicit transfer H→D or D→H

OpenACC 2.7: scalar variables, non-scalar variables
OpenMP 5.0: scalar variables, non-scalar variables

Feature: explicit transfer H→D or D→H inside explicit parallel regions
(OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective)

OpenACC 2.7                                              OpenMP 5.0
copyin() — copy H→D when entering the construct          map(to:)
copyout() — copy D→H when leaving the construct          map(from:)
copy() — copy H→D when entering and D→H when leaving     map(tofrom:)
create() — created on device, no transfers               map(alloc:)
present(), delete()


Feature: data region

OpenACC 2.7                                              OpenMP 5.0
declare() — add data to the global data region
data / end data — inside the same programming unit       TARGET DATA MAP(...) / END TARGET DATA —
                                                         inside the same programming unit
enter data / exit data — opening and closing             TARGET ENTER DATA / TARGET EXIT DATA —
anywhere in the code                                     opening and closing anywhere in the code
update self() — update D→H                               UPDATE FROM — copy D→H inside a data region
update device() — update H→D                             UPDATE TO — copy H→D inside a data region

Feature: loop parallelization

OpenACC 2.7                                              OpenMP 5.0
kernels — offloading + automatic parallelization of      (does not exist in OpenMP)
suitable loops, sync at the end of loops
loop gang/worker/vector — share iterations among         distribute — share iterations among TEAMS
gangs, workers and vectors


Feature: routines

OpenACC 2.7                                              OpenMP 5.0
routine [seq|vector|worker|gang] — compile a device
routine with the specified level of parallelism

Feature: asynchronism

OpenACC 2.7                                              OpenMP 5.0
async()/wait — execute kernels/transfers                 nowait/depend — manages asynchronism and
asynchronously, with explicit sync                       task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels
    • Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts
Page 6: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism


Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 = 32times80times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - The ways to GPU

Several approaches are available to port applications, with increasing programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA):
  − Minimum change in the code
  − Maximum performance
• Directives (OpenACC, OpenMP 5.0):
  − Portable, simple, low intrusiveness
  − Still efficient
• Programming languages (CUDA, OpenCL):
  − Complete rewriting, complex
  − Non-portable
  − Optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI support announced

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100), showing the host (CPU), the devices (GPU) and the network, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s.]

Introduction - Architecture CPU-GPU

[Figure: comparison of CPU and GPU architectures.]

Introduction - Execution model OpenACC

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocation are managed by the host.

[Figure: timeline of the host (CPU) launching kernel1, kernel2 and kernel3 on the accelerator.]

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs × nbworkers × vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2³¹ − 1
• The thread-grid size (i.e. nbworkers × vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads in a region opened with num_gangs(2), num_workers(5), vector_length(8). In gang-redundant mode only the 2 gang leaders are active; an !$ACC loop worker activates the workers of each gang (2×5 = 10 active threads); an !$ACC loop worker vector (or a loop vector nested in a loop worker) activates the full grid (2×5×8 = 80 active threads).]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! It displays information about compilation; the compiler will do implicit operations that you want to be aware of.

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

(exemples/reduction.f90)

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vectorlength × nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   4.0968ms  100    40.968us  40.122us  41.725us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Screenshot: the NVVP graphical interface.]

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>
{ ... }

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test:
• Size: 1000×1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test:
• Size: 1000×1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1,n
  ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test:
• Size: 1000×1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

Without work sharing, the sequential code is executed redundantly by all gangs. The compiler option noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang!"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                               X
loop vector                X                               X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)
• Active threads: 1
• Number of operations: nx

Execution GRWSVS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vectorlength
• Number of operations: nx

Execution GPWPVP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vectorlength
• Number of operations: nx

Execution GRWPVS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vectorlength
• Number of operations: 10 × nx

Execution GRWPVP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vectorlength
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)
• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)
• Result: somme = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous access to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: memory access patterns with and without coalescing.]
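The examples that follow are in Fortran, where arrays are column-major, so the vector loop should run over the first index. In C the layout is row-major, so the preferred loop order flips; a hedged sketch (the function name and size are illustrative; with a non-OpenACC compiler the pragmas are ignored and the loops run sequentially):

```c
#define NX 64

/* Coalesced in C: the vector loop runs over j, the last index, so
   consecutive threads touch m[i][j], m[i][j+1], ... which are
   contiguous in row-major memory */
void fill_coalesced(double m[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; i++) {
        #pragma acc loop vector
        for (int j = 0; j < NX; j++)
            m[i][j] = 1.14;
    }
}
```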

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three versions. Without coalescing — gang(i)/vector(j): the threads T0..T7 of a worker access elements that are far apart in memory, so the accesses cannot be merged. With coalescing — gang(j)/vector(i): threads T0..T7 access consecutive elements. gang-vector(i)/seq(j): each thread walks sequentially over the second index while consecutive threads stay on consecutive elements of the first index.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. It gives the information to the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine_wrong.f90)

Parallelism - Routines

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine.f90)

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: The variable is copied H→D; the memory is allocated when entering the region
• copyout: The variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: The memory is allocated when entering the region
• present: The variable is already on the device
• delete: Frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version

[Figure: two successive parallel regions, each carrying its own implicit data region. In each region A is copyin, B is copyout, C is copy and D is create, so every transfer/allocation happens twice.]

• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers

[Figure: the same two parallel regions; A (copyin), B (copyout), C (copy) and D (create) are still handled once per region.]

Let's add a data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers

[Figure: a data region now encloses the two parallel regions; A (copyin), B (copyout), C (copy) and D (create) are handled at the boundaries of the data region.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers

[Figure: inside the enclosing data region the parallel regions now see A, B, C and D as present; update device and update self directives are inserted where the host and device copies must be synchronized.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where they are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Scope
Module: the program
Function: the function
Subroutine: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of host and device copies across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are only allocated on the device; the kernels then update the device copies; update self(b) copies b D→H in the middle of the region; at the exit of the data region only d is copied D→H, so the other host variables keep their last host-side values.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is created. The kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlaps between:
• Computation and data transfers
• kernel/kernel if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. Kernel 1 and transfer async: Kernel 1 and the data transfer overlap. Everything is async: Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 runs on execution thread 1 while the data transfer runs on execution thread 2; Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60,autocompare)

• Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) ↔ OpenMP TEAMS (create teams of threads; execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses of PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• OpenACC declare() (add data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops; sync at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously; explicit sync) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 7: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2x20 = 40 cores
• Accelerated with 4 Nvidia V100: 32x80x4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77


Introduction - The ways to GPU

[Figure: three ways to port applications to GPUs, ordered by increasing programming effort and technical expertise]

Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA)
• Minimum change in the code
• Maximum performance

Directives (OpenACC, OpenMP 5.0)
• Portable, simple, low intrusiveness
• Still efficient

Programming languages (CUDA, OpenCL)
• Complete rewriting, complex
• Non-portable
• Optimal performance

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 6 77

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 7 77

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100). The host (CPU) is linked to the devices (GPU) through PLX switches and to the network; the indicated bandwidths are 900 GB/s (GPU memory), 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s and 12.4 GB/s.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 8 77

Introduction - Architecture CPU-GPU

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 9 77

Introduction - Execution model OpenACC

Execution controlled by the host

[Figure: the host (CPU) successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocations are managed by the host.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 10 77

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): All gangs run the same instructions redundantly
• Gang-Partitioned (GP): Work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): One worker is active in GP or GR mode
• Worker-Partitioned (WP): Work is shared between the workers of a gang
• Vector-Single (VS): One vector channel is active
• Vector-Partitioned (VP): Several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 11 77

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77

Introduction - Execution model

[Figure: number of active threads inside a region launched with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). When !$ACC parallel starts the gangs, 2 threads are active; !$ACC loop worker raises the count to 10; !$ACC loop vector to 80; !$ACC loop worker vector activates 80 at once; between the loops the count falls back to 2.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: porting cycle - Pick a kernel, Analyze, Parallelize loops, Optimize data, Optimize loops]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: Activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: The memory location on the host is pinned. It might improve data transfers
• managed: The memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vector length x nb workers]).

Pierre-François Lavallée, Thibaut Véry 21/77

Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
      Serial
      Kernels
      Teams
      Parallel
    Work sharing
      Parallel loop
      Reductions
      Coalescing
    Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):

!$ACC directive <clauses,...>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses,...>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation, the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop).
It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                              X
loop vector                 X                              X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility
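As a sketch of how these clauses combine (the array a, bounds n and m, and the scalars tmp and total are assumptions for illustration, not taken from the course examples), a tightly nested loop pair can be collapsed into one iteration space while privatizing a temporary and reducing a sum:

```fortran
! Sketch, assuming a(n,m), tmp and total are declared:
! collapse(2) merges the i and j loops, private(tmp) gives each
! iteration its own copy, reduction(+:total) combines partial sums.
!$ACC parallel loop collapse(2) private(tmp) reduction(+:total)
do i=1,n
  do j=1,m
    tmp = a(i,j)*2
    total = total + tmp
  enddo
enddo
```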

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: There is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Operation  Effect       Language(s)
+          Sum          Fortran/C(++)
*          Product      Fortran/C(++)
max        Maximum      Fortran/C(++)
min        Minimum      Fortran/C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Reduction operators

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)  439                  16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three variants. Without coalescing: gang(i), vector(j). With coalescing: gang(j), vector(i) and gang-vector(i), seq(j).]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading:
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime:
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
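A minimal sketch combining these clauses (the arrays a, b, c and the bound n are assumptions for illustration, not from the course examples):

```fortran
! Sketch, assuming a, b, c are arrays of size n already set up on the host:
! a is only read on the device (copyin), b is only produced there (copyout),
! c is a device-only temporary (create, never transferred).
!$ACC data copyin(a(1:n)) copyout(b(1:n)) create(c(1:n))
!$ACC parallel loop present(a, b, c)
do i=1,n
  c(i) = a(i)*2    ! c lives only on the device
  b(i) = c(i)+1    ! b is copied D->H when the data region ends
enddo
!$ACC end data
```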

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran:
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++:
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region carries its own implicit data region.]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached: 100 times
• Data region reached: 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached: 100 times
• Data region reached: 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Optimal? No! You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two parallel regions, each with its own implicit data region. On each region, A is in copyin, B in copyout, C in copy and D in create.

• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the two parallel regions (with their implicit data regions) still handle A (copyin), B (copyout), C (copy) and D (create) separately. Let's add a data region.

Data on device

Optimize data transfers: with an enclosing data region, A, B and C are now transferred at entry and exit of the data region, and D is allocated once.

• Transfers: 4
• Allocation: 1

Data on device

Optimize data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocation: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0.
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

(Diagram: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are allocated on the device without transfer. Kernels then update the device copies while the host values stay unchanged; update self(b) copies the device value of b back to the host during the region; d is only copied back D→H by copyout(d) when the data region ends.)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; it is possible to specify a number inside the clause to create several execution threads.

(Diagram: without async, Kernel 1, the data transfer and Kernel 2 run one after the other; with async on Kernel 1 and the transfer, they overlap; with async everywhere, all three overlap.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1.
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of the wait directive or clause is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Diagram: Kernel 2 waits for the completion of the transfer before starting, while Kernel 1 runs asynchronously.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1.
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2×20 = 40 cores
• Accelerated, with 4 NVidia V100: 32×80×4 = 10240 cores

Introduction - The ways to GPU

Applications can use a GPU in three ways, with increasing programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness, still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable, optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org/
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
• http://www.openmp.org/
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

(Diagram: one converged node on Jean Zay, 40 cores, 192 GB of memory and 4 GPU V100, showing the host (CPU), the devices (GPU) and the network, with the bandwidths of the different links: from 16 GB/s for PCIe up to 900 GB/s for the GPU memory.)

Introduction - Architecture CPU-GPU


Introduction - Execution model: OpenACC

(Diagram: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.)

In OpenACC the execution is controlled by the host (CPU): kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model: OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vector_length threads. So the total number of threads is:

nb_threads = nb_gangs * nb_workers * vector_length

Important notes:
• No synchronization is possible between gangs.
• The compiler can decide to synchronize the threads of a gang (all or part of them).
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).

Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 - 1.
• The thread-grid size (i.e. nb_workers × vector_length) is limited to 1024.
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
• The size of a worker (vector_length) should be a multiple of 32.
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.

Introduction - Execution model

(Diagram: number of active threads inside regions created with num_gangs(2), num_workers(5), vector_length(8): 2 threads while only the gangs are active (gang-redundant mode, WS/VS); 10 when a loop worker shares the work among the workers; 80 when a loop vector or a loop worker vector activates all the vector channels; and back to 2 after each loop.)

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

(Diagram: cycle: Pick a kernel → Analyze → Parallelize loops → Optimize data → Optimize loops.)

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
!$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector_length × nb_workers]).

Analysis - Code profiling: nvprof

NVidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions. Caution: PGI_ACC_TIME and nvprof are incompatible; ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type            Time(%)   Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s    500  5.4132ms     448ns  68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s    300  6.1914ms     384ns  92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

NVidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions. Caution: PGI_ACC_TIME and nvprof are incompatible; ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
    Serial
    Kernels/Teams
    Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs with one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: in the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, the number of workers and the vector size are constant inside the region.

Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause        Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.

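As a hypothetical C sketch of the collapse and private clauses (the function name and sizes are invented; with a non-OpenACC compiler the pragmas are ignored and the loops run sequentially):

```c
/* collapse(2) fuses the loop nest into one iteration space before
   distributing it; 'tmp' is privatized so each iteration works on
   its own copy instead of a shared variable. */
void scale_matrix(int rows, int cols, double *m, double factor)
{
    double tmp;
    #pragma acc parallel loop collapse(2) private(tmp) copy(m[0:rows*cols])
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            tmp = m[r * cols + c] * factor;
            m[r * cols + c] = tmp;
        }
    }
}
```

Collapsing is mostly useful when the outer loop alone does not expose enough iterations to fill the gangs.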

Work sharing - Loops

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

Execution serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

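A minimal C sketch of a sum reduction (hypothetical function name; with a compiler that ignores the pragma, the loop simply runs sequentially and produces the same result):

```c
/* Each thread accumulates into a private copy of 'total'; the partial
   sums are combined with '+' when the loop finishes. */
double acc_sum(int n, const double *v)
{
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(v[0:n])
    for (int i = 0; i < n; ++i)
        total += v[i];
    return total;
}
```

This is the portable replacement for the race-prone gang-level accumulation shown on the previous slide.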

Parallelism - Reductions example

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs. example without coalescing]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
!$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
!$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
!$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Tps (ms)   439                  16

The memory-coalescing version is 27 times faster.

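C stores arrays row-major, the opposite of Fortran's column-major layout, so the roles of the indices flip: for coalesced accesses the vector loop should run over the last index. A hypothetical sketch (function name invented; without an OpenACC compiler the pragmas are ignored and the loops run sequentially):

```c
#define N 512

/* Row-major C array: consecutive j elements are contiguous in memory,
   so putting vector parallelism on j gives coalesced accesses
   (a[i][j], a[i][j+1], ... touched by consecutive threads). */
void init_coalesced(double a[N][N])
{
    #pragma acc parallel loop gang copyout(a[0:N][0:N])
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}
```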

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three variants: without coalescing, gang(i)/vector(j); with coalescing, gang(j)/vector(i); gang-vector(i)/seq(j)]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. It informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. It informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

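The same pattern in C, as a hypothetical sketch (function names invented; with a non-OpenACC compiler the pragmas are ignored and the code runs sequentially):

```c
/* routine seq asks the compiler to also generate a device version of
   'saxpy_elem'; seq states that it contains no further parallelism. */
#pragma acc routine seq
static double saxpy_elem(double alpha, double x, double y)
{
    return alpha * x + y;
}

void saxpy(int n, double alpha, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = saxpy_elem(alpha, x[i], y[i]);
}
```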

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

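A small hypothetical C illustration of the movement clauses (names invented; the pragma is ignored by a non-OpenACC compiler, so the loop runs sequentially with the same result):

```c
/* b is only read on the device -> copyin (H->D, no copy back);
   a is only produced on the device -> copyout (allocated on entry,
   copied D->H on exit). A pure scratch array would use create. */
void add_offset(int n, double *a, const double *b, double offset)
{
    #pragma acc parallel loop copyout(a[0:n]) copyin(b[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + offset;
}
```

Choosing the weakest clause that is correct (copyin instead of copy, create instead of copyin) removes useless transfers.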

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language default (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its implicit data region]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1, 100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
!$ACC parallel
!$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.



Data on device

Naive version

[Figure: two successive parallel regions, each with its implicit data region; the arrays A (copyin), B (copyout), C (copy) and D (create) are handled at each region]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers

[Figure: the same two parallel regions; A (copyin), B (copyout), C (copy) and D (create) are still handled at each region]

Let's add a data region.


Data on device

Optimize data transfers

[Figure: a data region now encloses both parallel regions; A (copyin), B (copyout), C (copy) and D (create) are handled once at its boundaries]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers

[Figure: inside the data region the parallel regions see A, B, C and D as present; update device (for A) and update self (for C) refresh the copies where needed]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1


Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.


Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.


Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

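The constructor/destructor style mentioned above can be sketched in C as well (hypothetical helper names; with a non-OpenACC compiler the pragmas are ignored and the code runs sequentially on the host copy):

```c
#include <stdlib.h>

/* Unstructured data lifetime: the device allocation is decoupled from
   the compute region, as a C++ constructor/destructor pair would do. */
double *vec_create(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])   /* allocate on the device */
    return v;
}

void vec_fill(double *v, int n, double value)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = value;
    #pragma acc update self(v[0:n])         /* bring the values back to the host */
}

void vec_destroy(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])    /* free the device copy */
    free(v);
}
```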

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: host/device time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b is allocated on the device; during the region, update self(b) copies b D→H; at exit of the data region, copyout(d) copies d D→H.]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines for kernel 1, the data transfer and kernel 2: fully synchronous execution; kernel 1 and the transfer asynchronous; everything asynchronous]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90

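A hypothetical C sketch combining async queues with the wait clause described on the next slide (function name invented; without an OpenACC compiler everything simply runs in order and yields the same values):

```c
/* Two independent kernels on queues 1 and 2, a dependent kernel that
   waits on queue 2, and a final wait before copying results back. */
void async_demo(int n, double *a, double *b)
{
    #pragma acc enter data create(a[0:n]) copyin(b[0:n])

    #pragma acc parallel loop async(1)         /* queue 1: independent */
    for (int i = 0; i < n; ++i)
        a[i] = 42.0;

    #pragma acc parallel loop async(2)         /* queue 2: independent of queue 1 */
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 1.0;

    #pragma acc parallel loop wait(2) async(2) /* needs the updated b */
    for (int i = 0; i < n; ++i)
        b[i] = 2.0 * b[i];

    #pragma acc wait                           /* block until all queues finish */
    #pragma acc exit data copyout(a[0:n], b[0:n])
}
```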

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution-thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait directive or wait clause is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: time line where kernel 2 waits for the data transfer to complete before starting]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld,world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 (directives/clauses, notes) vs OpenMP 5.0 (directives/clauses, notes)

GPU offloading:
• PARALLEL — GR/WS/VS mode: each team executes instructions redundantly | TEAMS — create teams of threads, execute redundantly
• SERIAL — only 1 thread executes the code | TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization | X — does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses of PARALLEL/SERIAL/KERNELS, data movement defined from the host perspective | OpenMP: clauses of TARGET):
• copyin() — copy H→D when entering construct | map(to:)
• copyout() — copy D→H when leaving construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()
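As an illustration of the copy-clause correspondence above, a minimal sketch (array names and sizes are hypothetical, not from the original slides):

```fortran
! Hypothetical sketch: the same explicit H<->D transfers written with
! OpenACC 2.7 data clauses and their OpenMP 5.0 map equivalents.
real(8) :: a(1000), b(1000)

! OpenACC: a is copied H->D on entry, b is copied D->H on exit
!$ACC data copyin(a) copyout(b)
! ... offloaded kernels reading a and writing b ...
!$ACC end data

! OpenMP: the same data movement expressed with map clauses
!$OMP target data map(to:a) map(from:b)
! ... offloaded kernels reading a and writing b ...
!$OMP end target data
```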

Features: OpenACC 2.7 (directives/clauses, notes) vs OpenMP 5.0 (directives/clauses, notes)

Data region:
• declare() — add data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM — copy D→H inside a data region
• update device() — update H→D | UPDATE TO — copy H→D inside a data region

Loop parallelization:
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | X — does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS

Features: OpenACC 2.7 (directives/clauses, notes) vs OpenMP 5.0 (directives/clauses, notes)

• routines — routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism — async()/wait — execute kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=114_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=114_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=114_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1D_OMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=114_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2D_OMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=114_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=114_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3D_OMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 9: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - The ways to GPU

Applications

(Programming effort and technical expertise increase from libraries to programming languages.)

• Libraries: cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA
  − Minimum change in the code
  − Maximum performance
• Directives: OpenACC, OpenMP 5.0
  − Portable, simple, low intrusiveness
  − Still efficient
• Programming languages: CUDA, OpenCL
  − Complete rewriting, complex
  − Non-portable
  − Optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100) — host (CPU), devices (GPU) and network; the bandwidths shown on the links are 900 GB/s, 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s and 12.4 GB/s]

Introduction - Architecture CPU-GPU


Introduction - Execution model: OpenACC

[Figure: execution controlled by the host — the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model: OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
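The formula above can be made concrete with the values used later in this course (num_gangs(2), num_workers(5), vector_length(8)); this is only an illustrative sketch:

```fortran
! nbthreads = nbgangs * nbworkers * vectorlength
!           = 2       * 5         * 8            = 80
!$ACC parallel num_gangs(2) num_workers(5) vector_length(8)
! ... at full worker+vector parallelism, 80 threads are active here;
! in gang-redundant parts only the 2 gang leaders execute ...
!$ACC end parallel
```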

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 − 1
• The thread-grid size (ie nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads for a region created with num_gangs(2), num_workers(5), vector_length(8) — 2 active threads in gang-redundant parts (!$ACC parallel), 10 when worker parallelism is active (loop worker), 80 when worker and vector parallelism are both active (loop worker vector, or nested loop worker / loop vector)]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: iterative cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability:
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler:
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
      Serial
      Kernels/Teams
      Parallel
    Work sharing
      Parallel loop
      Reductions
      Coalescing
    Routines
    Data Management
    Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs with one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU.
Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device.
Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
  a = 2           ! Gang-redundant
  !$ACC loop      ! Work sharing
  do i=1,n
    ...
  enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *,"Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (ie the iterations of the associated loop).
It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do (kept for compatibility)

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x #workers
• Number of operations: nx

Execution GP/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vectorlength
• Number of operations: nx

Execution GP/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x #workers x vectorlength
• Number of operations: nx

Execution GR/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x #workers
• Number of operations: 10 x nx

Execution GR/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vectorlength
• Number of operations: 10 x nx

Execution GR/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 x #workers x vectorlength
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels

  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation  Effect        Language(s)
+          Sum           Fortran/C(++)
*          Product       Fortran/C(++)
max        Maximum       Fortran/C(++)
min        Minimum       Fortran/C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged (coalesced). This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with coalescing / example without coalescing]

Pierre-François Lavallée, Thibaut Véry 45/77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46/77

Parallelism - Memory accesses and coalescing

[Figure: thread-to-iteration mapping for the three versions. Without coalescing: gang(i), vector(j). With coalescing: gang(j), vector(i), or gang-vector(i), seq(j).]

Pierre-François Lavallée, Thibaut Véry 47/77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on a device, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) have an associated local data region.

Global region
An implicit data region is opened for the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called, and lives for the duration of the procedure. To make data visible there, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77

Data on device - Data clauses

H: Host, D: Device

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already present on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check whether the variable is already on the device; if so, no action is taken. You may also see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50/77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses; you must give the first and the last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Annotations on the slide: the construct opens a data region enclosing the parallel region.]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device - Local data regions

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device

Naive version: two successive parallel regions, each with its own implicit data region. In each region, A is copyin, B is copyout, C is copy and D is create.

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: the same two parallel regions, each still carrying its own clauses (A copyin, B copyout, C copy, D create). Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: with an enclosing data region carrying the clauses (A copyin, B copyout, C copy, D create), A, B and C are now transferred at the entry and exit of the data region.

• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: the enclosing data region keeps the clauses (A copyin, B copyout, C copy, D create); inside it, the parallel regions now use present clauses, together with update device for A and update self for C where the copies must be refreshed.

The data clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo

! print *, "before update self", a(42)
! $ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Time-line figure: a data region is opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D while b and d are allocated on the device; during the region, update self(b) copies b D→H; at exit of the data region, d is copied D→H.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel constructs, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device:

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel execution, if the kernels are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagrams: fully synchronous execution (kernel 1, data transfer, kernel 2 one after the other); kernel 1 and the transfer asynchronous; everything asynchronous.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Diagram: kernel 2 waits for the data transfer to complete.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition (shown as a screenshot on the original slide).

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
...

exemples/optigol_opti1.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Feature: GPU offloading

OpenACC 2.7   Notes                                     OpenMP 5.0   Notes
PARALLEL      GR/WS/VS mode; each gang executes         TEAMS        creates teams of threads that
              the instructions redundantly                           execute redundantly
SERIAL        only 1 thread executes the code           TARGET       GPU offloading, initial thread only
KERNELS       offloading + automatic parallelization    (none)       does not exist in OpenMP

Feature: implicit transfers H→D or D→H

Both models: scalar variables and non-scalar variables.

Feature: explicit transfers H→D or D→H inside explicit parallel regions

OpenACC: clauses of the PARALLEL/SERIAL/KERNELS constructs. OpenMP: clauses of the TARGET construct (data movement defined from the host perspective).

OpenACC 2.7           Notes                                          OpenMP 5.0
copyin()              copy H→D when entering the construct           map(to:)
copyout()             copy D→H when leaving the construct            map(from:)
copy()                copy H→D when entering and D→H when leaving    map(tofrom:)
create()              allocated on the device, no transfers          map(alloc:)
present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Feature: data regions

OpenACC 2.7              Notes                                       OpenMP 5.0                              Notes
declare()                add data to the global data region
data / end data          inside the same programming unit            TARGET DATA MAP(...) /                  inside the same
                                                                     END TARGET DATA                         programming unit
enter data / exit data   opening and closing anywhere in the code    TARGET ENTER DATA /                     opening and closing
                                                                     TARGET EXIT DATA                        anywhere in the code
update self()            update D→H                                  UPDATE FROM                             copy D→H inside a data region
update device()          update H→D                                  UPDATE TO                               copy H→D inside a data region

Feature: loop parallelization

OpenACC 2.7               Notes                                                        OpenMP 5.0   Notes
kernels                   offloading + automatic parallelization of suitable loops;    (none)       does not exist in OpenMP
                          synchronization at the end of loops
loop gang/worker/vector   share iterations among gangs, workers and vectors            distribute   share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72/77

Feature: routines and asynchronism

OpenACC 2.7                        Notes                                                   OpenMP 5.0      Notes
routine [seq|vector|worker|gang]   compile a device routine with the specified level
                                   of parallelism
async()/wait                       execute kernels/transfers asynchronously,               nowait/depend   manages asynchronism and
                                   with explicit synchronization                                           task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP.

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

Page 10: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Short history

OpenACC
• http://www.openacc.org/
• Initiated by Cray, NVidia, PGI and CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org/
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI (support announced)

Pierre-François Lavallée, Thibaut Véry 7/77

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100). The diagram shows the host (CPU) and the devices (GPU) connected through PLX switches, plus the network ("Réseau"), with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s.]

Pierre-François Lavallée, Thibaut Véry 8/77

Introduction - Architecture CPU-GPU

[Figure: comparison of CPU and GPU architectures.]

Pierre-François Lavallée, Thibaut Véry 9/77

Introduction - Execution model: OpenACC

Execution controlled by the host

[Diagram: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). This means that kernel launches, data transfers and memory allocations are managed by the host.

Pierre-François Lavallée, Thibaut Véry 10/77

Introduction - Execution model: OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Pierre-François Lavallée, Thibaut Véry 11/77

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads, so the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs.
• The compiler can decide to synchronize the threads of a gang (all or part of them).
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
- The number of gangs is limited to 2^31 - 1
- The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
- Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
- The size of a worker (vectorlength) should be a multiple of 32
- PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads inside a region created with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8): 2 threads during gang-redundant execution (!$ACC parallel starts the gangs), 10 when worker parallelism is activated (!$ACC loop worker), 80 when vector parallelism is activated (!$ACC loop vector or !$ACC loop worker vector)]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: porting cycle: Analyze, Pick a kernel, Parallelize loops, Optimize data, Optimize loops]

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information:
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers

Compilers:
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90, pgfortran

Activate OpenACC:
- -acc: Activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools:
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
- pinned: The memory location on the host is pinned; it might improve data transfers
- managed: The memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

(exemples/reduction.f90)

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
- Command line tool
- Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and vector size mapped to hardware

nvprof
- Command line tool
- Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
- Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
- Only the GPU is profiled by default
- Option "--cpu-profiling on" activates CPU profiling
- Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
- To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Screenshot of the NVVP graphical interface]

1. Introduction
2. Parallelism
   - Offloaded Execution: Add directives; Compute constructs (Serial, Kernels, Teams, Parallel)
   - Work sharing: Parallel loop; Reductions; Coalescing
   - Routines
   - Data Management
   - Asynchronism
3. GPU Debugging
4. Annex: Optimization example
5. Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6. Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:
!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-Redundant
- Worker-Single
- Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
- The parameters of the parallel regions are independent (gangs, workers, vector size)
- Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
- Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
  a = 2            ! Gang-redundant
  !$ACC loop       ! Work sharing
  do i=1,n
    ...
  enddo
!$ACC end parallel

Important notes:
- The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
- num_gangs: The number of gangs
- num_workers: The number of workers
- vector_length: The vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang!"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
- The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses: level of parallelism
- gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
- worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
- vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
- seq: the iterations are executed sequentially on the accelerator
- auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

               Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                              X
loop vector                 X                              X

Work sharing - Loops

Clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do for compatibility

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)

- Active threads: 1
- Number of operations: nx

Execution GRWSVS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)

- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GPWSVS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx

Execution GPWPVS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)

- Iterations are shared among the active workers of each gang
- Active threads: 10 x nbworkers
- Number of operations: nx

Execution GPWSVP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)

- Iterations are shared among the threads of the worker of all gangs
- Active threads: 10 x vectorlength
- Number of operations: nx

Execution GPWPVP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x nbworkers x vectorlength
- Number of operations: nx

Execution GRWPVS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)

- Each gang is assigned all the iterations, which are shared among the workers
- Active threads: 10 x nbworkers
- Number of operations: 10 x nx

Execution GRWSVP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)

- Each gang is assigned all the iterations, which are shared on the threads of the active worker
- Active threads: 10 x vectorlength
- Number of operations: 10 x nx

Execution GRWPVP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)

- Each gang is assigned all the iterations, and the grid of threads distributes the work
- Active threads: 10 x nbworkers x vectorlength
- Number of operations: 10 x nx

Work sharing - Loops

Reminder: There is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)

- Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)

- Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles:
- Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing / example without coalescing]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: mapping of the threads T0..T7 onto the array elements for the three variants above: gang(i)/vector(j) (without coalescing, each thread walks a row, strided accesses), gang(j)/vector(i) (with coalescing, consecutive threads touch consecutive elements of a column), and gang-vector(i)/seq(j) (each thread walks a row sequentially)]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. It gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine_wrong.f90)

Parallelism - Routines

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine.f90)

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
- serial
- parallel
- kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement:
- copyin: The variable is copied H->D; the memory is allocated when entering the region
- copyout: The variable is copied D->H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement:
- create: The memory is allocated when entering the region
- present: The variable is already on the device
- delete: Frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ (or p_) for OpenACC 2.0 compatibility.

Other clauses:
- no_create
- deviceptr
- attach
- detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

(exemples/arrayshape.f90)

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

(exemples/arrayshape.c)

Data on device - Shape of arrays

Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified

Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0

Fortran1)

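The C/C++ restriction can be sketched as follows (hypothetical make_iota helper; the compiler cannot see the size of a malloc'd array, so the element count must be given explicitly — the pragma is ignored without an OpenACC compiler):

```c
#include <stdlib.h>
#include <assert.h>

/* For a dynamically allocated array, write a[0:n]: first index and
   number of elements. Plain "copy(a)" would only copy the pointer. */
double *make_iota(int n)
{
    double *a = malloc(n * sizeof *a);
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = i;
    return a;
}
```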

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

    program para
      integer :: a(1000)
      integer :: i, l
    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
      do i=1,1000
        a(i) = i
      enddo
    !$ACC end parallel
    end program para

exemples/parallel_data.f90

(figure: the parallel region is enclosed in an implicit data region)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

    program para
      integer :: a(1000)
      integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
      do i=1,1000
        a(i) = i
      enddo
    !$ACC end parallel

      do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
        do i=1,1000
          a(i) = a(i) + 1
        enddo
    !$ACC end parallel
      enddo
    end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
      do i=1,1000
        a(i) = i
      enddo
    !$ACC end parallel

    !$ACC data copy(a(1:1000))
      do l=1,100
    !$ACC parallel
    !$ACC loop
        do i=1,1000
          a(i) = a(i) + 1
        enddo
    !$ACC end parallel
      enddo
    !$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version
(figure: two parallel regions, each with its own implicit data region, acting on four arrays A, B, C and D with the clauses copyin, copyout, copy and create repeated on each region)
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers
(figure: the same two parallel regions, each still carrying its own copyin, copyout, copy and create clauses for A, B, C and D)
Let's add a data region.

Data on device

Optimize data transfers
(figure: a data region now encloses the two parallel regions and carries the copyin, copyout, copy and create clauses)
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers
(figure: the enclosing data region keeps the data clauses; the parallel regions inside only use present, with an update device before the first region and an update self after the second)
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
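The pattern above can be sketched in C (hypothetical step routine; one enclosing data region, kernels that only assert presence, and an explicit update — without an OpenACC compiler the pragmas are ignored and the code runs on the host):

```c
#include <assert.h>

/* One H->D/D->H round trip for the whole iteration loop instead of
   one per kernel launch. */
void step(double *a, int n, int iters)
{
    #pragma acc data copy(a[0:n])
    {
        for (int it = 0; it < iters; ++it) {
            #pragma acc parallel loop present(a[0:n])
            for (int i = 0; i < n; ++i)
                a[i] += 1.0;
        }
        #pragma acc update self(a[0:n])  /* D->H inside the region */
    }
}
```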

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
      do i=1,1000
        a(i) = 0
      enddo
      do j=1,42
        call random_number(test)
        rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
        do i=1,1000
          a(i) = a(i) + rng
        enddo
      enddo
      print *, "before update self:", a(42)
    ! !$ACC update self(a(42:42))   ! directive commented out
      print *, "after update self:", a(42)
    !$ACC serial
      a(42) = 42
    !$ACC end serial
      print *, "before end data:", a(42)
    !$ACC end data
      print *, "after end data:", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
      do i=1,1000
        a(i) = 0
      enddo
      do j=1,42
        call random_number(test)
        rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
        do i=1,1000
          a(i) = a(i) + rng
        enddo
      enddo
      print *, "before update self:", a(42)
    !$ACC update self(a(42:42))
      print *, "after update self:", a(42)
    !$ACC serial
      a(42) = 42
    !$ACC end serial
      print *, "before end data:", a(42)
    !$ACC end data
      print *, "after end data:", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
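A minimal C sketch of update device (hypothetical refresh routine; the host writes new values, then pushes them H→D before the next kernel — pragmas are ignored without an OpenACC compiler, so the code also runs sequentially):

```c
#include <assert.h>

void refresh(double *b, int n)
{
    #pragma acc data create(b[0:n])
    {
        for (int i = 0; i < n; ++i)   /* host-side initialization */
            b[i] = i;
        #pragma acc update device(b[0:n])  /* H->D */
        #pragma acc parallel loop present(b[0:n])
        for (int i = 0; i < n; ++i)
            b[i] *= 2.0;
        #pragma acc update self(b[0:n])    /* D->H */
    }
}
```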

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

    program enterdata
      integer :: s = 10000
      real*8, allocatable, dimension(:) :: vec
      allocate(vec(s))
    !$ACC enter data create(vec(1:s))
      call compute
    !$ACC exit data delete(vec(1:s))
    contains
      subroutine compute
        integer :: i
    !$ACC parallel loop
        do i=1,s
          vec(i) = 12
        enddo
      end subroutine
    end program enterdata

exemples/enter_data.f90
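The constructor/destructor style mentioned above can be sketched in C (hypothetical vec_* helpers; mapping happens in one place, release in another, kernels elsewhere — pragmas are ignored without an OpenACC compiler):

```c
#include <stdlib.h>
#include <assert.h>

double *vec_create(int n)            /* "constructor": allocate + map */
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])
    return v;
}

void vec_fill(double *v, int n, double x)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = x;
    #pragma acc update self(v[0:n])
}

void vec_destroy(double *v, int n)   /* "destructor": unmap + free */
{
    #pragma acc exit data delete(v[0:n])
    free(v);
}
```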

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

    Zone         Scope
    Module       the program
    Function     the function
    Subroutine   the subroutine

    module declare
      integer :: rows = 1000
      integer :: cols = 1000
      integer :: generations = 100
    !$ACC declare copyin(rows, cols, generations)
    end module

exemples/declare.f90

Data on device - Time line

(figure: time line of host and device values across a data region opened with copyin(a,c) create(b) copyout(d) — the device copies of a and c are initialized at entry, b exists only on the device until update self(b) copies it D→H, and d reaches the host only at exit through copyout(d))

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

    ! Update the value of a located on the host with the value on the device
    !$ACC update self(a)
    ! Update the value of b located on the device with the value on the host
    !$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(figure: synchronous execution runs kernel 1, the data transfer and kernel 2 in sequence; with "kernel 1 and transfer async" the transfer overlaps kernel 1; with everything async all three overlap)

    ! b is initialized on host
    do i=1,s
      b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
      a(i) = 42
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop async(3)
    do i=1,s
      c(i) = 11
    enddo

exemples/async_slides.f90
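A C sketch of the same idea (hypothetical independent_ops routine; two independent kernels on separate async queues, then a bare wait to synchronize — without an OpenACC compiler the pragmas are ignored and everything simply runs in order):

```c
#include <assert.h>

void independent_ops(double *a, double *c, int n)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i)
        a[i] = 42.0;
    #pragma acc parallel loop async(2)   /* independent of queue 1 */
    for (int i = 0; i < n; ++i)
        c[i] = 11.0;
    #pragma acc wait                     /* join all queues */
}
```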

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(figure: kernel 2 waits for the transfer before starting; kernel 1 runs concurrently)

    ! b is initialized on host
    do i=1,s
      b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
      a(i) = 42
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop wait(2)
    do i=1,s
      b(i) = b(i) + 11
    enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

    do g=1,generations
      cells = 0
    !$ACC parallel
      do r=1,rows
        do c=1,cols
          oworld(r,c) = world(r,c)
        enddo
      enddo
    !$ACC end parallel
    !$ACC parallel
      do r=1,rows
        do c=1,cols
          neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                  oworld(r-1,c)            +oworld(r+1,c)+ &
                  oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
          if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
          else if (neigh == 3) then
            world(r,c) = 1
          end if
        enddo
      enddo
    !$ACC end parallel
    !$ACC parallel
      do r=1,rows
        do c=1,cols
          ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life
Let's share the work among the active gangs.

    do g=1,generations
      cells = 0
    !$ACC parallel loop
      do r=1,rows
        do c=1,cols
          oworld(r,c) = world(r,c)
        enddo
      enddo
    !$ACC parallel loop
      do r=1,rows
        do c=1,cols
          neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                  oworld(r-1,c)            +oworld(r+1,c)+ &
                  oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
          if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
          else if (neigh == 3) then
            world(r,c) = 1
          end if
        enddo
      enddo
    !$ACC parallel loop reduction(+:cells)
      do r=1,rows
        do c=1,cols
          cells = cells + world(r,c)
        enddo
      enddo
      print *, "Cells alive at generation ", g, ": ", cells
    enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life
Let's optimize data transfers.

    cells = 0
    !$ACC data copy(oworld, world)
    do g=1,generations
      cells = 0
    !$ACC parallel loop
      do c=1,cols
        do r=1,rows
          oworld(r,c) = world(r,c)
        enddo
      enddo
    !$ACC parallel loop
      do c=1,cols
        do r=1,rows
          neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                  oworld(r-1,c)            +oworld(r+1,c)+ &
                  oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
          if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
          else if (neigh == 3) then
            world(r,c) = 1
          end if
        enddo
      enddo
    !$ACC parallel loop reduction(+:cells)
      do c=1,cols
        do r=1,rows
          cells = cells + world(r,c)
        enddo
      enddo
      print *, "Cells alive at generation ", g, ": ", cells
    enddo
    !$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Features: OpenACC 2.7 | OpenMP 5.0

GPU offloading
  PARALLEL  (GR/WS/VS mode; each gang executes instructions redundantly) | TEAMS  (creates teams of threads that execute redundantly)
  SERIAL    (only 1 thread executes the code)                            | TARGET (GPU offloading, initial thread only)
  KERNELS   (offloading + automatic parallelization)                     | X      (does not exist in OpenMP)

Implicit transfer H→D or D→H
  scalar variables, non-scalar variables                                 | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
  clauses on PARALLEL/SERIAL/KERNELS                                     | clauses on TARGET (data movement defined from the host perspective)
  copyin()   (copy H→D when entering the construct)                      | map(to:)
  copyout()  (copy D→H when leaving the construct)                       | map(from:)
  copy()     (copy H→D when entering and D→H when leaving)               | map(tofrom:)
  create()   (created on device, no transfers)                           | map(alloc:)
  present(), delete()                                                    |

Features: OpenACC 2.7 | OpenMP 5.0

Data region
  declare()               (add data to the global data region)           |
  data / end data         (inside the same programming unit)             | TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
  enter data / exit data  (opening and closing anywhere in the code)     | TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
  update self()           (update D→H)                                   | UPDATE FROM (copy D→H inside a data region)
  update device()         (update H→D)                                   | UPDATE TO   (copy H→D inside a data region)

Loop parallelization
  kernels                 (offloading + automatic parallelization of     | X (does not exist in OpenMP)
                           suitable loops; sync at the end of loops)
  loop gang/worker/vector (share iterations among gangs, workers and     | distribute (share iterations among TEAMS)
                           vectors)

Features: OpenACC 2.7 | OpenMP 5.0

Routines
  routine [seq|vector|worker|gang]  (compile a device routine with the specified level of parallelism)

Asynchronism
  async()/wait  (execute kernels/transfers asynchronously, with explicit sync) | nowait/depend (manages asynchronism and task dependencies)
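The first table row can be illustrated with the same loop written both ways (hypothetical fill_acc/fill_omp names; without -acc / -fopenmp both pragmas are ignored, so the loops run sequentially and produce identical results):

```c
#include <assert.h>

void fill_acc(double *a, int n)   /* OpenACC version */
{
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}

void fill_omp(double *a, int n)   /* OpenMP translation */
{
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}
```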

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

    !$acc parallel
    !$acc loop gang worker vector
    do i=1,nx
      A(i) = 1.14_8*i
    end do
    !$acc end parallel

exemples/loop1D.f90

    !$acc parallel
    !$acc loop gang worker
    do j=1,nx
    !$acc loop vector
      do i=1,nx
        A(i,j) = 1.14_8
      end do
    end do
    !$acc end parallel

exemples/loop2D.f90

OpenMP

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i=1,nx
      A(i) = 1.14_8*i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

exemples/loop1DOMP.f90

    !$OMP TARGET TEAMS DISTRIBUTE
    do j=1,nx
    !$OMP PARALLEL DO SIMD
      do i=1,nx
        A(i,j) = 1.14_8
      end do
    !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

    !$acc parallel
    !$acc loop gang
    do k=1,nx
    !$acc loop worker
      do j=1,nx
    !$acc loop vector
        do i=1,nx
          A(i,j,k) = 1.14_8
        end do
      end do
    end do
    !$acc end parallel

exemples/loop3D.f90

OpenMP

    !$OMP TARGET TEAMS DISTRIBUTE
    do k=1,nx
    !$OMP PARALLEL DO
      do j=1,nx
    !$OMP SIMD
        do i=1,nx
          A(i,j,k) = 1.14_8
        end do
    !$OMP END SIMD
      end do
    !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Introduction - Architecture of a converged node on Jean Zay

(figure: architecture of a converged node on Jean Zay — 40 cores, 192 GB of memory, 4 GPU V100 — host, devices and network connected through PLX switches, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s and 12.4 GB/s)

Introduction - Architecture CPU-GPU


Introduction - Execution model: OpenACC

(figure: execution controlled by the host — the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator)

In OpenACC, the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model: OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vector_length threads. So the total number of threads is:

    nb_threads = nb_gangs * nb_workers * vector_length

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
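The formula above can be checked with a trivial helper (hypothetical name), using the launch parameters of the example two slides below, num_gangs(2) num_workers(5) vector_length(8):

```c
#include <assert.h>

/* nb_threads = nb_gangs * nb_workers * vector_length */
int total_threads(int nb_gangs, int nb_workers, int vector_length)
{
    return nb_gangs * nb_workers * vector_length;
}
```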

Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nb_workers x vector_length) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vector_length) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

(figure: number of active threads at each point of a region launched with num_gangs(2), num_workers(5) and vector_length(8) — 2 active threads after !$ACC parallel starts the gangs (gang-redundant mode), 10 inside an !$ACC loop worker, 80 inside an !$ACC loop vector or an !$ACC loop worker vector, falling back to 2 between the loops)

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

(figure: cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops)

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation. The compiler will do implicit operations that you want to be aware of.

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

    program loop
      integer :: a(10000)
      integer :: i
    !$ACC parallel loop
      do i=1,10000
        a(i) = i
      enddo
    end program loop

exemples/loop.f90

    program reduction
      integer :: a(10000)
      integer :: i
      integer :: sum=0
    !$ACC parallel loop
      do i=1,10000
        a(i) = i
      enddo
    !$ACC parallel loop reduction(+:sum)
      do i=1,10000
        sum = sum + a(i)
      enddo
    end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
    !$ACC parallel loop reduction(+:sum)
    do i=1,10000
      sum = sum + a(i)
    enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

    !$ACC data create(a(1:10000))
    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
    !$ACC parallel loop reduction(+:sum)
    do i=1,10000
      sum = sum + a(i)
    enddo
    !$ACC end data

exemples/reduction_data_region.f90
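The reduction pattern can be sketched in C as well (hypothetical sum_first_n routine; without a reduction clause the concurrent updates to sum would race on the device — with the pragma ignored, the loop runs sequentially):

```c
#include <assert.h>

/* Each thread accumulates a private partial sum that the runtime
   combines with the + operator at the end of the loop. */
long sum_first_n(int n)
{
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```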

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1.
The grid is the number of gangs. The block is the size of one gang ([vector_length x nb_workers]).

Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

    $ nvprof <options> executable_file <arguments of executable file>

    Type             Time(%)  Time      Calls  Avg       Min       Max       Name
    GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                     36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                      6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                      2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                      0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                      0.01%   359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, memory read throughput and memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

    $ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

    $ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

(screenshot of the NVVP timeline)

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
  Work sharing
    Parallel loop, Reductions, Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between a Fortran and a C/C++ code.

Fortran (OpenACC)

    !$ACC directive <clauses>
    !$ACC end directive

C/C++ (OpenACC)

    #pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation, the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU.
Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1), num_workers(1), vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

    !$ACC serial
    do i=1,n
      do j=1,n
        ...
      enddo
      do j=1,n
        ...
      enddo
    enddo
    !$ACC end serial

Offloaded execution - Directive serial

    !$ACC serial
    do g=1,generations
      do r=1,rows
        do c=1,cols
          oworld(r,c) = world(r,c)
        enddo
      enddo
      do r=1,rows
        do c=1,cols
          neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                  oworld(r-1,c)            +oworld(r+1,c)+ &
                  oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
          if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
          else if (neigh == 3) then
            world(r,c) = 1
          end if
        enddo
      enddo
      cells = 0
      do r=1,rows
        do c=1,cols
          cells = cells + world(r,c)
        enddo
      enddo
      print *, "Cells alive at generation ", g, ": ", cells
    enddo
    !$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region
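The gang-redundant vs. work-shared distinction above can be sketched in C as well (the slides use Fortran; the function name below is ours). Note that when the pragmas are ignored by a non-OpenACC compiler, the code simply runs sequentially, which OpenACC is designed to keep correct:

```c
/* Minimal sketch, not from the slides: without the inner "#pragma acc loop",
 * every gang would execute the whole loop redundantly; with it, the
 * iterations are shared among the gangs. */
void fill_parallel(double *a, int n)
{
    #pragma acc parallel copyout(a[0:n])
    {
        #pragma acc loop /* work sharing: iterations spread among gangs */
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * (double)i;
    }
}
```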

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                             X
loop vector                 X                             X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do (kept for compatibility)
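The collapse and reduction clauses above can be combined on one loop directive. A minimal C sketch (the function name and problem are ours, not from the slides); collapse(2) merges the two tightly nested loops into a single iteration space, and sum is reduced across all iterations. Compiled without OpenACC support, the pragma is ignored and the same result is computed sequentially:

```c
/* Hypothetical sketch: count the "black" squares of a rows x cols
 * checkerboard with a collapsed, reduced loop nest. */
long checkerboard_sum(int rows, int cols)
{
    long sum = 0;
    #pragma acc parallel loop collapse(2) reduction(+:sum)
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            sum += (r + c) % 2;   /* 1 on squares where r+c is odd */
    return sum;
}
```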

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x number of workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x number of workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x number of workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x number of workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
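The max and min operators of the table can be combined on one loop. A minimal C sketch (function name ours, not from the slides); each reduction variable is privatized per thread and combined at the end of the loop. With the pragma ignored, the sequential fallback computes the same result:

```c
/* Hypothetical sketch: min and max reductions over an array, matching the
 * min/max operators of the reduction table. */
void minmax(const double *a, int n, double *lo, double *hi)
{
    double mn = a[0], mx = a[0];
    #pragma acc parallel loop reduction(min:mn) reduction(max:mx)
    for (int i = 0; i < n; ++i) {
        if (a[i] < mn) mn = a[i];
        if (a[i] > mx) mx = a[i];
    }
    *lo = mn;
    *hi = mx;
}
```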

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: an example with coalescing and an example without coalescing]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

             Without Coalescing   With Coalescing
Time (ms)    439                  16

The memory coalescing version is 27 times faster.
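The same principle can be sketched in C, with one caveat (the function names and the array size below are ours): C is row-major, so the contiguous index is the last one, the opposite of Fortran. The "coalescing-friendly" loop order therefore puts j (the last index) in the innermost/vector loop. Both versions compute the same array; only the memory-access pattern differs:

```c
#define N 64

/* Hypothetical sketch: non-contiguous inner accesses (stride N doubles). */
void fill_strided(double a[N][N])
{
    #pragma acc parallel loop gang
    for (int j = 0; j < N; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < N; ++i)
            a[i][j] = 1.14;
    }
}

/* Hypothetical sketch: contiguous (coalesced) inner accesses. */
void fill_contiguous(double a[N][N])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}
```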

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0-T7 onto the elements of A for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads access elements separated by a stride. With coalescing, gang(j)/vector(i): consecutive threads access consecutive elements. gang-vector(i)/seq(j): each thread walks a row sequentially.]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
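The same corrected pattern can be sketched in C (function names ours, not from the slides): routine seq asks the compiler to build a sequential device version of the callee, so it can be called from inside the offloaded loop. Without an OpenACC compiler, the pragmas are ignored and the sequential fallback is still correct:

```c
/* Hypothetical sketch: a device-callable helper declared with routine seq. */
#pragma acc routine seq
void fill_row(int *array, int cols, int row)
{
    for (int j = 0; j < cols; ++j)
        array[row * cols + j] = 2;
}

void fill_all(int *array, int rows, int cols)
{
    #pragma acc parallel loop copyout(array[0:rows*cols])
    for (int i = 0; i < rows; ++i)
        fill_row(array, cols, i);   /* one sequential call per iteration */
}
```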

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
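The clauses above can be combined on one data region. A minimal C sketch (function and variable names ours, not from the slides): b only travels H→D (copyin), r only travels back D→H (copyout), and tmp exists only for the duration of the region (create). With the pragmas ignored, tmp is an ordinary host array and the result is the same:

```c
/* Hypothetical sketch: one data region serving two kernels.
 * b: copyin (H→D), r: copyout (D→H), tmp: create (device-only scratch). */
void scale(const double *b, double *r, double *tmp, int n, double factor)
{
    #pragma acc data copyin(b[0:n]) copyout(r[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop present(b, tmp)
        for (int i = 0; i < n; ++i)
            tmp[i] = factor * b[i];

        #pragma acc parallel loop present(tmp, r)
        for (int i = 0; i < n; ++i)
            r[i] = tmp[i] + 1.0;
    }
}
```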

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran, we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))   ! data region + parallel region
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal!

Data on device - Local data regions

Optimal? No! You still have to move the beginning of the data region before the first parallel loop, so that its transfer of a is covered by the same region.

Data on device

Naive version: two parallel regions, each with its own implicit data region. Each region transfers A (copyin), B (copyout) and C (copy), and allocates D (create).
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
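The same decoupling can be sketched in C (names ours, not from the slides): the enter data / exit data pair brackets the device lifetime of vec in one place, while the kernel that uses it (with a present clause) lives elsewhere. With the pragmas ignored, vec is an ordinary host array and the result is unchanged:

```c
#include <stdlib.h>

/* Hypothetical sketch of the enter data / exit data pattern. */
static double *vec;
static int s = 10000;

static void compute(void)
{
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;
}

double run(void)
{
    vec = malloc(s * sizeof *vec);
    #pragma acc enter data create(vec[0:s])
    compute();
    #pragma acc update self(vec[0:s])   /* bring the results back D→H */
    double v = vec[42];
    #pragma acc exit data delete(vec[0:s])
    free(vec);
    return v;
}
```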

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device; changes made on one side are invisible on the other until a transfer occurs. An update self(b) during the region copies b D→H; at exit, only d is copied back D→H (copyout).]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

[Figure: three timelines — kernel 1, a data transfer and kernel 2 executed one after the other (synchronous); kernel 1 overlapping the transfer (kernel 1 and transfer async); all three overlapping (everything async).]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: timeline where kernel 1 runs on one queue while the data transfer runs on another; kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)

bull Example of code that has a race condition


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ...

exemples/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.


Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features — OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features — OpenACC 2.7 vs OpenMP 5.0

Data regions
• declare (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)


Features — OpenACC 2.7 vs OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manage asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Introduction - Architecture CPU-GPU


Introduction - Execution model OpenACC

Execution controlled by the host

[Figure: the CPU launches kernel1, kernel2 and kernel3 in sequence on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocations are managed by the host.


Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.


Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)


Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32


Introduction - Execution model

[Figure: number of active threads in a region compiled with num_gangs(2), num_workers(5), vector_length(8): 2 active threads in gang-redundant code after ACC parallel (starts gangs); 10 inside an ACC loop worker; 80 inside a nested ACC loop vector or an ACC loop worker vector; back to 2 after each loop.]


Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)

2. Add OpenACC directives (Parallelization)

3. Optimize data transfers and loops (Optimizations)

4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]


PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI


PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/1910/x86/pvf-user-guide/index.htm#ta


PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1, 10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1, 10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1, 10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90


PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90


Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof


Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).


Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• The option "--cpu-profiling on" activates CPU profiling
• The options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)


Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
    Work sharing: Parallel loop, Reductions, Coalescing
    Routines
    Data Management
    Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.




Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause      | Work sharing (loop iterations) | Activation of new threads
loop gang   | X                              |
loop worker | X                              | X
loop vector | X                              | X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility


Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vectorlength
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vectorlength
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vectorlength
• Number of operations: 10 nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x workers x vectorlength
• Number of operations: 10 nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code that follows the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213.0 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000.0 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
do i=1,1000
   a(i) = b(i)*2
enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
do i=1,1000
   a(i) = b(i)*2
enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to obtain the final value of the variable.

Operations:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Reduction operators

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop with vector parallelism should access memory contiguously.

Examples with and without coalescing follow.

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms):        439         |       16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping of A for the three variants — gang(i)/vector(j) without coalescing, gang(j)/vector(i) with coalescing, and gang-vector(i)/seq(j). In the coalesced case, the threads T0..T7 of a worker access consecutive memory elements; in the non-coalesced case, each thread strides across the array.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
integer :: s = 10000
integer, allocatable :: array(:,:)
allocate(array(s,s))
!$ACC parallel
!$ACC loop
do i=1,s
   call fill(array(:,:), s, i)
enddo
!$ACC end parallel
print *, array(1,:10)
contains
subroutine fill(array, s, i)
   integer, intent(out) :: array(:,:)
   integer, intent(in)  :: s, i
   integer :: j
   do j=1,s
      array(i,j) = 2
   enddo
end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

Correct version, with the routine directive inside the subroutine:

program routine
integer :: s = 10000
integer, allocatable :: array(:,:)
allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
do i=1,s
   call fill(array(:,:), s, i)
enddo
!$ACC end parallel
print *, array(1,:10)
contains
subroutine fill(array, s, i)
   !$ACC routine seq
   integer, intent(out) :: array(:,:)
   integer, intent(in)  :: s, i
   integer :: j
   do j=1,s
      array(i,j) = 2
   enddo
end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) each have an associated local data region.

Global region
An implicit data region is open during the lifetime of the program. It is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already present on the device; if so, no action is taken. You may still encounter clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses, with the first and last index:

! We copy a 2D array to the GPU for matrix a
! (in Fortran we can omit the shape: copy(a))
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets, with the first index and the number of elements:

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
integer :: a(1000)
integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel region implicitly carries its own data region.)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
integer :: a(1000)
integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

do l=1,100
   !$ACC parallel copy(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (entry and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in its clauses visible on the device. You have to use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal?

Data on device - Local data regions

Optimal? No: the first parallel region still performs its own copyout of a. You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two parallel regions, each with its own implicit data region. The variables A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated at each region.

• Transfers: 8
• Allocations: 2

Data on device

Optimizing data transfers: the same two parallel regions, each still handling A (copyin), B (copyout), C (copy) and D (create) separately. Let's add a data region around them.

Data on device

With an enclosing data region, A, B and C are now transferred only at entry and exit of the data region.

• Transfers: 4
• Allocations: 1

Data on device

Inside the enclosing data region, the parallel regions only check for data presence. To make the code clearer, you can therefore use the present clause. To make sure the updates are actually done, use the update directive (update device for A, update self for C).

• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Same code with the update directive active (exemples/update_dir.f90):

!$ACC update self(a(42:42))

With update:
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of the enter data and exit data directives is that data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
integer :: s = 10000
real*8, allocatable, dimension(:) :: vec
allocate(vec(s))
!$ACC enter data create(vec(1:s))
call compute
!$ACC exit data delete(vec(1:s))
contains
subroutine compute
   integer :: i
   !$ACC parallel loop
   do i=1,s
      vec(i) = 12
   enddo
end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
integer :: rows = 1000
integer :: cols = 1000
integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of host and device values across a data region opened with copyin(a,c), create(b) and copyout(d). At entry, a and c are transferred H→D and b is allocated on the device; kernels then modify the device copies; update self(b) copies b back D→H in the middle of the region; at exit, only d is copied D→H (copyout), so later host-side writes to a are never seen by the device.]

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. Specifying different numbers inside the clause creates several execution threads.

[Figure: synchronously, kernel 1, the data transfer and kernel 2 run back to back; with async on kernel 1 and the transfer, they overlap; with everything async, all three overlap.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers identifying execution threads. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer issued on queue 2 before running.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Debugging - PGI autocompare

• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60,autocompare).
• Example: a code that has a race condition.

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs:

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar and non-scalar variables ↔ scalar and non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET; data movement is defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()

Data regions
• declare (adds data to the global data region) ↔ —
• data / end data (open and close inside the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (open and close anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (open and close anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism) ↔ —

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 13: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Introduction - Execution model OpenACC

Execution controlled by the host

Diagram: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.

In OpenACC the execution is controlled by the host (CPU). This means that kernel launches, data transfers and memory allocations are managed by the host.


Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 11 77

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vector_length threads. So the total number of threads is:

nb_threads = nb_gangs * nb_workers * vector_length

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs groups of 32 threads are formed)


Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nb_workers x vector_length) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vector_length) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32


Introduction - Execution model

Diagram: number of active threads per construct, for num_gangs(2), num_workers(5), vector_length(8). With "acc parallel" only the gang leaders execute, redundantly (2 active threads); adding "acc loop worker" activates the workers of each gang (10 threads); "acc loop vector" or "acc loop worker vector" activates the full thread-grid (80 threads); "acc kernels num_gangs(2) num_workers(5) vector_length(8)" behaves likewise.


Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

Diagram: iterative porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops.


PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI


PGI Compiler - -ta options

Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
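Putting these flags together, a typical compile line for a V100 node could look like the sketch below (file names are illustrative; the exact compiler setup depends on the machine):

```shell
# Fortran source with OpenACC directives, targeting a V100 (cc70),
# with compiler feedback enabled
pgf90 -acc -ta=tesla:cc70 -Minfo=accel -o my_prog my_prog.f90
```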


PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90


PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90


Analysis - Code profiling

Here are 3 tools available to profile your code

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof


Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vector_length x nb_workers]).


Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
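Typical invocations combining the options above might look like this sketch (binary and file names are illustrative):

```shell
# GPU-only profile (default behavior), then a run with CPU profiling
# and an nvvp-readable output file
nvprof ./gol
nvprof --cpu-profiling on -o gol.prof -f ./gol
```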


Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs
      Serial
      Kernels/Teams
      Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran / OpenACC

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ / OpenACC

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s


Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang!"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility


Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous access to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Figure: example with «coalescing» vs. example with «no coalescing».


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without Coalescing   With Coalescing
Time (ms)   43.9                 1.6

The memory coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

Diagram: mapping of threads T0-T7 to the elements of A for the three variants — gang(i)/vector(j) (without coalescing), gang(j)/vector(i) (with coalescing), and gang-vector(i)/seq(j).





Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H Host D Device variable scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/array_shape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/array_shape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language's convention (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region is enclosed in an implicit data region.


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No! You have to move the beginning of the data region before the first loop.


Data on device

Naive version: each of the two parallel regions carries its own implicit data region, so the variables A (copyin), B (copyout), C (copy) and D (create) are handled at each region.
• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: the same two parallel regions, each still with its own implicit data region handling A (copyin), B (copyout), C (copy) and D (create). Let's add a data region.


Data on device

Optimize data transfers: an enclosing data region now surrounds both parallel regions. A, B and C are transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers: inside the enclosing data region, the clauses on the parallel regions only check for data presence (A, B, C and D are present). To make the code clearer you can use the present clause. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.



!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

Time line (host vs device) for a region opened with copyin(a,c) create(b) copyout(d):
• Entering the region: a and c are copied H→D; b and d are allocated on the device without any transfer.
• While kernels modify the device copies, the host values remain unchanged.
• update self(b): the host copy of b is refreshed D→H.
• Exiting the region: only d is copied back D→H (copyout); a, b and c are not.


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

Synchronous execution: Kernel 1, the data transfer and Kernel 2 run one after the other. With async on Kernel 1 and the transfer, they overlap. With async on everything, Kernel 1, the data transfer and Kernel 2 all overlap.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features — OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading:
• PARALLEL (GR/WS/VS mode — each gang executes instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions:
• clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features — OpenACC 2.7 ↔ OpenMP 5.0

Data regions:
• declare (add data to the global data region) ↔ —
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)


Features — OpenACC 2.7 ↔ OpenMP 5.0

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism) ↔ —

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts


• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.


Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vector_length threads. So the total number of threads is:

nb_threads = nb_gangs * nb_workers * vector_length

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVIDIA GPUs groups of 32 threads are formed)


Introduction - Execution model

NVIDIA P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nb_workers x vector_length) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vector_length) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32


Introduction - Execution model

Active threads for !$ACC kernels num_gangs(2) num_workers(5) vector_length(8):
• parallel (starts gangs): 2 active threads (gang-redundant mode)
• loop worker: 10 active threads
• loop vector: 80 active threads
• loop worker vector: 80 active threads


Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

The cycle: pick a kernel, analyze, parallelize loops, optimize data, optimize loops.


PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVIDIA in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI


PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVIDIA hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta


PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90


PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90


Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof


Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector_length x nb_workers]).


Analysis - Code profiling: nvprof

NVIDIA provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)


Analysis - Graphical profiler: nvvp

NVIDIA also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
    Serial
    Kernels/Teams
    Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello! I am a gang."
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                             X
loop vector                 X                             X


Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility


Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)         43.9                 1.6

The memory coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: element-to-thread mapping for the three variants. Without coalescing: gang(i)/vector(j), each thread strides through memory. With coalescing: gang(j)/vector(i), consecutive threads touch consecutive elements; gang-vector(i)/seq(j) also yields contiguous accesses.]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. It gives the information to the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. It gives the information to the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading:
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime:
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran:
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++:
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its associated data region]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two parallel regions (each with its own implicit data region) working on arrays A, B, C and D.

[Figure: each parallel region separately transfers A (copyin), B (copyout), C (copy) and allocates D (create)]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: the two parallel regions (with their implicit data regions) still transfer A (copyin), B (copyout), C (copy) and allocate D (create) independently.

Let's add a data region.


Data on device

Optimize data transfers: an enclosing data region now carries the clauses copyin(A), copyout(B), copy(C) and create(D) for both parallel regions.

A, B and C are now transferred at entry and exit of the data region:
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers: the enclosing data region keeps copyin(A), copyout(B), copy(C) and create(D); the parallel regions now mark the arrays as present, with an update device for A and an update self for C where refreshed values are needed.

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device:
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data / exit data

Important notes:
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Timeline figure: host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are transferred H→D while b and d are only allocated on the device. An update self(b) copies b D→H in the middle of the region. On exit of the data region, only d is copied back D→H (copyout); device-side changes to a, b and c are not propagated to the host.]


Data on device - Important notes

- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
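For instance, inside a data region, host code cannot see device results until an update is issued (a sketch; array a and size n are illustrative):

```fortran
!$ACC data create(a(1:n))
!$ACC parallel loop
do i = 1, n
  a(i) = i
enddo
!$ACC update self(a(1:n))   ! without this, the host copy of a is stale
print *, a(n)
!$ACC end data
```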

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
- computation and data transfers
- kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three run concurrently.]

! b is initialized on host
do i = 1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i = 1, s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
- The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 1 runs on execution thread 1 while the data transfer runs on thread 2; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i = 1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i = 1, s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
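The standalone wait directive blocks the host until the listed execution threads complete, e.g. (a sketch; arrays a and c are illustrative):

```fortran
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC parallel loop async(2)
do i = 1, s
  c(i) = 11
enddo
!$ACC wait(1, 2)   ! the host blocks until both execution threads complete
```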

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77


Debugging - PGI autocompare

- Automatic detection of result differences when comparing GPU and CPU executions.
- How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

- To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare).
- Example of code that has a race condition:
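A typical pattern triggering such a race (a hypothetical sketch, not the original example) is a gang-level loop reading values written by another gang-level loop in the same parallel region, since gangs are never synchronized:

```fortran
!$ACC parallel
!$ACC loop gang
do i = 1, n
  a(i) = i
enddo
!$ACC loop gang
do i = 2, n
  b(i) = a(i-1)   ! may read a(i-1) before the first loop has written it
enddo
!$ACC end parallel
```

With autocompare, the GPU and CPU results are compared after each kernel and execution stops as soon as they differ.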

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g = 1, generations
  cells = 0
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      ...

exemples/opti/gol_opti1.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs:

do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/opti/gol_opti2.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 9.5 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize the data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c = 1, cols
    do r = 1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77


| Features | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
| GPU offloading | PARALLEL | GR/WS/VS mode: each gang executes instructions redundantly | TEAMS | creates teams of threads that execute redundantly |
| | SERIAL | only 1 thread executes the code | TARGET | GPU offloading, initial thread only |
| | KERNELS | offloading + automatic parallelization | X | does not exist in OpenMP |
| Implicit transfer H→D or D→H | | scalar variables, non-scalar variables | | scalar variables, non-scalar variables |
| Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL/SERIAL/KERNELS | | clauses on TARGET | data movement defined from the host perspective |
| | copyin() | copy H→D when entering the construct | map(to:) | |
| | copyout() | copy D→H when leaving the construct | map(from:) | |
| | copy() | copy H→D when entering and D→H when leaving | map(tofrom:) | |
| | create() | created on device, no transfers | map(alloc:) | |
| | present(), delete() | | | |

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

| Features | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
| Data region | declare() | add data to the global data region | | |
| | data / end data | inside the same programming unit | TARGET DATA MAP(to:/from:) / END TARGET DATA | inside the same programming unit |
| | enter data / exit data | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code |
| | update self() | update D→H | UPDATE FROM | copy D→H inside a data region |
| | update device() | update H→D | UPDATE TO | copy H→D inside a data region |
| Loop parallelization | kernels | offloading + automatic parallelization of suitable loops, sync at the end of loops | X | does not exist in OpenMP |
| | loop gang/worker/vector | share iterations among gangs, workers and vectors | distribute | share iterations among TEAMS |

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

| Features | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
| routines | routine [seq/vector/worker/gang] | compile a device routine with the specified level of parallelism | | |
| asynchronism | async()/wait | execute kernels/transfers asynchronously, explicit sync | nowait/depend | manages asynchronism and task dependencies |
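As a sketch of the OpenMP side, nowait and depend express the same kind of overlap (a hypothetical example, not taken from the course's files):

```fortran
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO NOWAIT DEPEND(OUT:a)
do i = 1, n
  a(i) = 42
enddo
!$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
! This second kernel only starts once a has been produced
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO NOWAIT DEPEND(IN:a)
do i = 1, n
  b(i) = a(i) + 1
enddo
!$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
!$OMP TASKWAIT   ! host waits for the deferred target tasks
```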

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
  !$acc loop vector
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
  !$OMP PARALLEL DO SIMD
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC:

!$acc parallel
!$acc loop gang
do k = 1, nx
  !$acc loop worker
  do j = 1, nx
    !$acc loop vector
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
  !$OMP PARALLEL DO
  do j = 1, nx
    !$OMP SIMD
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77


Contacts

Contacts

- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

Page 15: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
- No synchronization is possible between gangs.
- The compiler can decide to synchronize the threads of a gang (all or part of them).
- The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
- The number of gangs is limited to 2^31 - 1.
- The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024.
- Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
- The size of a worker (vectorlength) should be a multiple of 32.
- PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77

Introduction - Execution model

[Diagram: active threads with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). An !$ACC parallel region starts 2 gangs: 2 active threads (gang-redundant mode). An !$ACC loop worker raises the count to 10 active threads (2 gangs x 5 workers); a nested !$ACC loop vector raises it to 80 (2 x 5 x 8); after each loop the count drops back to 2. A single !$ACC loop worker vector goes directly from 2 to 80 active threads.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Diagram: iterative porting cycle - pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information:
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers

Compilers:
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90, pgfortran

Activate OpenACC:
- -acc: activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools:
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta:

Compute capability:
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
- pinned: the memory location on the host is pinned; it might improve data transfers
- managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/1910/x86/pvf-user-guide/index.htm#ta
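Putting the options together, a typical compilation line for this setup might look as follows (the source file name is illustrative):

```shell
# Compile a Fortran OpenACC code for a V100 GPU and report what the compiler did
pgfortran -acc -ta=tesla:cc70 -Minfo=accel prog.f90 -o prog
```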

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i = 1, 10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum = 0
  !$ACC parallel loop
  do i = 1, 10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i = 1, 10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i = 1, 10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i = 1, 10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code

PGI_ACC_TIME:
- Command line tool
- Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof:
- Command line tool
- Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler:
- Graphical interfaces for nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vectorlength x nbworkers]).
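For instance (the executable name is illustrative):

```shell
export PGI_ACC_TIME=1   # enable the built-in profile
./gol_opti2             # the profile is printed when the program exits
```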

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 21 77

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities: 53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
- Only the GPU is profiled by default.
- Option "--cpu-profiling on" activates CPU profiling.
- Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
- To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).
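A typical invocation combining the options above might be (executable and output names are illustrative):

```shell
export PGI_ACC_TIME=0                        # nvprof and PGI_ACC_TIME are incompatible
nvprof --cpu-profiling on -o gol.prof ./gol_opti2
```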

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 22 77

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 23 77

Analysis - Graphical profiler NVVP

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 24 77


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 26 77

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial:
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels:
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel:
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-redundant
- Worker-Single
- Vector-Single
The programmer has to share the work manually.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i = 1, n
  do j = 1, n
    ...
  enddo
  do j = 1, n
    ...
  enddo
enddo
!$ACC end serial

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77

Offloaded execution - Directive serial

!$ACC serial
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i = 1, n
  do j = 1, n
    ...
  enddo
  do j = 1, n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
- The parameters of the parallel regions are independent (gangs, workers, vector size).
- Loop nests are executed consecutively.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77

Execution offloading - Kernels

!$ACC kernels
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 1.951 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive have their iterations spread among gangs.

Data: default behavior
- Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2            ! Gang-redundant
!$ACC loop       ! Work sharing
do i = 1, n
  ...
enddo
!$ACC end parallel

Important notes:
- The number of gangs, workers and the vector size are constant inside the region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77

Execution offloading - parallel directive

!$ACC parallel
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses:
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

| Clause      | Work sharing (loop iterations) | Activation of new threads |
| loop gang   | X                              |                           |
| loop worker | X                              | X                         |
| loop vector | X                              | X                         |

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77

Work sharing - Loops

Clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do, kept for compatibility.
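For instance, collapse and private can be combined on one loop directive (a sketch; the array names are illustrative):

```fortran
! collapse merges the two loops into a single nx*ny iteration space,
! and each thread gets its own private copy of tmp
!$ACC parallel loop collapse(2) private(tmp)
do j = 1, ny
  do i = 1, nx
    tmp = 2.0_8 * b(i, j)
    a(i, j) = tmp
  enddo
enddo
```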

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77

Work sharing - Loops

!$ACC parallel
do g = 1, generations
  !$ACC loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial:

!$acc serial
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

- Active threads: 1
- Number of operations: nx

Execution GR/WS/VS:

!$acc parallel num_gangs(10)
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GP/WS/VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GP/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

- Iterations are shared among the active workers of each gang
- Active threads: 10 x nbworkers
- Number of operations: nx

Execution GP/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

- Iterations are shared among the threads of the single active worker of each gang
- Active threads: 10 x vectorlength
- Number of operations: nx

Execution GP/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x nbworkers x vectorlength
- Number of operations: nx

Execution GR/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 x nbworkers
- Number of operations: 10 x nx

Execution GR/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 x vectorlength
- Number of operations: 10 x nx

Execution GR/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

- Each gang is assigned all the iterations and its grid of threads distributes the work
- Active threads: 10 x nbworkers x vectorlength
- Number of operations: 10 x nx

Work sharing - Loops

Reminder There is no thread synchronization at gang level especially at the end of a loopdirective There is a risk of race condition Such risk does not exist for loop with worker andorvector parallelism since the threads of a gang wait until the end of the iterations they execute tostart a new portion of the code after the loopInside a parallel region if several parallel loops are present and there are some dependenciesbetween them then you must not use gang parallelism

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemplesloop_pb_syncf90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemplesloop_pb_sync_corrf90

• Result: somme = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41/77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example with kernels directive

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in Table 1 is executed to get the final value of the variable.

Operations

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Table 1: Reduction operators

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          | Without coalescing | With coalescing
Time (ms) | 439                | 16

The memory-coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figure: memory-access pattern of threads T0–T7 on the array A.
 Without coalescing — gang(i), vector(j): each thread strides through memory, so accesses cannot be merged.
 With coalescing — gang(j), vector(i): consecutive threads access consecutive memory locations, so accesses are merged.
 gang-vector(i), seq(j): also gives coalesced accesses.]

Pierre-François Lavallée, Thibaut Véry 47/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 (included) to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;

// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered to be the default of the language (C/C++: 0, Fortran: 1).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77


Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version:
[Figure: two successive parallel regions, each with its own implicit data region; the arrays A (copyin), B (copyout), C (copy) and D (create) are handled again at each region.]

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers:
[Figure: the same two parallel regions; the copyin/copyout/copy/create clauses are still repeated on each region.]

Let's add a data region.

Data on device

Optimize data transfers:
[Figure: a data region now encloses both parallel regions; A (copyin), B (copyout) and C (copy) are transferred at entry and exit of the data region, and D is created there.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers:
[Figure: inside the data region, the parallel regions now use the present clause for A, B, C and D; update device and update self directives perform the H→D and D→H updates where needed.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))   ! directive commented out in this version
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers to the device can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d).
 At entry, a and c are copied to the device and b and d are allocated there.
 The host and device copies then evolve independently; update self(b) copies the device value of b back to the host during the region; at exit of the region, only d is copied back to the host (copyout).]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: execution time lines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. With Kernel 1 and the transfer async: they overlap. With everything async: Kernel 1, the data transfer and Kernel 2 all run concurrently.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Figure: Kernel 1 runs on queue 1 while the data transfer runs on queue 2; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition:

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL — GR,WS,VS mode: each gang executes instructions redundantly ↔ TEAMS — creates teams of threads, which execute redundantly
• SERIAL — only 1 thread executes the code ↔ TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization ↔ X — does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables (both models)

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create() — created on device, no transfers ↔ map(alloc:)
• present(), delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() — adds data to the global data region
• data / end data — inside the same programming unit ↔ TARGET DATA MAP(to:/from:/...) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H ↔ UPDATE FROM — copy D→H inside a data region
• update device() — update H→D ↔ UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ X — does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features | OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang] — compiles a device routine with the specified level of parallelism
• asynchronism: async()/wait — executes kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend — manages asynchronism and task dependencies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex: OpenACC 2.7/OpenMP 5.0 translation table
                      • Contacts
Page 16: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector
• The number of gangs is limited to 2^31 − 1.
• The thread-grid size (i.e. nb_workers × vector_length) is limited to 1024.
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
• The size of a worker (vector_length) should be a multiple of 32.
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77

Introduction - Execution model

[Figure: active threads for each construct, with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). A parallel construct alone starts 2 gangs (2 active threads, gang-redundant); a loop worker directive activates 2 × 5 = 10 threads; loop vector and loop worker vector activate the full 2 × 5 × 8 = 80 threads.]

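The thread counts above can be reproduced with a minimal sketch (an illustration, assuming a PGI/NVHPC OpenACC compiler; the array size nx is arbitrary):

```fortran
program active_threads
  implicit none
  integer, parameter :: nx = 80
  integer :: a(nx), i
  ! 2 gangs x 5 workers x vector length 8 = up to 80 active threads
  !$ACC parallel num_gangs(2) num_workers(5) vector_length(8)
  !$ACC loop gang worker vector
  do i = 1, nx
     a(i) = i   ! iterations spread over the whole grid of threads
  end do
  !$ACC end parallel
  print *, a(nx)
end program active_threads
```

With only !$ACC loop worker, just the 2 × 5 = 10 worker threads would be active; with no loop directive at all, the gangs would execute the loop redundantly.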

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: iterative porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of.

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel.

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

(exemples/reduction.f90)

PGI Compiler - Compiler information

Information available with -Minfo=accel.

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vector_length × nb_workers]).

Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   53.51%  2.70662s    500  5.4132ms     448ns  68.519ms  [CUDA memcpy HtoD]
                   36.72%  1.85742s    300  6.1914ms     384ns  92.907ms  [CUDA memcpy DtoH]
                    6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                    0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler NVVP

[Screenshot of the NVVP graphical interface]

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
    Work sharing: Parallel loop, Reductions, Coalescing
    Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.

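For comparison, OpenMP GPU offloading uses the same comment-directive mechanism; a minimal sketch of the Fortran syntax (the full OpenACC/OpenMP correspondence is given in the annex translation table):

```fortran
program omp_offload
  implicit none
  integer :: i, a(1000)
  ! OpenMP 4.5+/5.0 offload: target + teams + work sharing
  !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO MAP(from:a)
  do i = 1, 1000
     a(i) = 2*i
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
  print *, a(1000)
end program omp_offload
```

Compiled without OpenMP offload support enabled, these directives are simply ignored as comments.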

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option -acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang."
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                             X
loop vector                 X                             X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.

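The collapse and private clauses can be illustrated with a small sketch (illustrative only, not one of the course's example files):

```fortran
program collapse_demo
  implicit none
  integer, parameter :: n = 100
  integer :: i, j, tmp
  integer :: b(n,n)
  ! collapse(2) merges the two tightly nested loops into a single
  ! iteration space of n*n; tmp is privatized for each thread
  !$ACC parallel loop collapse(2) private(tmp)
  do j = 1, n
     do i = 1, n
        tmp = i + j
        b(i,j) = tmp
     end do
  end do
  print *, b(n,n)
end program collapse_demo
```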

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)

• Active threads: 1
• Number of operations: nx

Execution GR-WS-VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP-WS-VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP-WP-VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)

• Iterations are shared among the active workers of each gang
• Active threads: 10 × nb_workers
• Number of operations: nx

Execution GP-WS-VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector_length
• Number of operations: nx

Execution GP-WP-VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × nb_workers × vector_length
• Number of operations: nx

Execution GR-WP-VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × nb_workers
• Number of operations: 10 × nx

Execution GR-WS-VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector_length
• Number of operations: 10 × nx

Execution GR-WP-VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × nb_workers × vector_length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described below is executed to get the final value of the variable.

Reduction operators

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

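A minimal sketch of a max reduction from the table (illustrative, not one of the course's example files):

```fortran
program reduc_max
  implicit none
  integer, parameter :: n = 1000
  integer :: i, a(n), amax
  do i = 1, n
     a(i) = mod(37*i, 1000)
  end do
  amax = 0
  ! each thread keeps a private maximum; the partial results are
  ! combined with max at the end of the loop
  !$ACC parallel loop reduction(max:amax)
  do i = 1, n
     amax = max(amax, a(i))
  end do
  print *, amax
end program reduc_max
```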

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: access patterns with and without coalescing]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0–T7 onto array elements for the three variants. Without coalescing — gang(i)/vector(j): each thread walks along a row, so consecutive threads access memory locations far apart. With coalescing — gang(j)/vector(i): consecutive threads access consecutive elements of a column. gang-vector(i)/seq(j): each thread sequentially walks its own row.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine_wrong.f90)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine.f90)

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

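The global region's unstructured directives can be sketched as follows (a minimal illustration; the array name and sizes are placeholders):

```fortran
program unstructured_data
  implicit none
  integer, allocatable :: a(:)
  integer :: i
  allocate(a(1000))
  ! allocate a on the device for the rest of the program
  !$ACC enter data create(a)
  !$ACC parallel loop present(a)
  do i = 1, 1000
     a(i) = i
  end do
  ! copy the result back to the host and free the device memory
  !$ACC exit data copyout(a)
  print *, a(1000)
end program unstructured_data
```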

Data on device - Data clauses

H: Host, D: Device, var: variable, scalar or array

Data movement
• copyin(var): the variable is copied H→D; the memory is allocated when entering the region.
• copyout(var): the variable is copied D→H; the memory is allocated when entering the region.
• copy(var): copyin + copyout.

No data movement
• create(var): the memory is allocated when entering the region.
• present(var): the variable is already on the device.
• delete(var): frees the memory allocated on the device for this variable.

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

(exemples/arrayshape.f90)

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

(exemples/arrayshape.c)

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

(exemples/parallel_data.f90)

[Diagram: the parallel region is enclosed in an implicit data region]

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
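The corrected placement suggested above can be sketched in C (illustrative function name; without an OpenACC compiler the pragmas are ignored and the code runs sequentially with the same result). The data region opens before the first kernel, so the array is allocated on the device once and copied back once.

```c
/* Open the data region before the first kernel: one allocation,
 * one copyout at the end, no per-iteration transfers. */
void init_and_iterate(int *a, int n, int iters) {
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i;
        for (int l = 0; l < iters; ++l) {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] = a[i] + 1;
        }
    }
}
```
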

Data on device

Naive version: each of the two successive parallel regions has its own implicit data region, so the arrays A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of both regions.
• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: the same two parallel regions (each with its implicit data region) still handle A (copyin), B (copyout), C (copy) and D (create) separately.

Let's add a data region!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: a data region now encloses the two parallel regions, and the clauses copyin, copyout, copy and create are moved onto it.

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: inside the data region, the data clauses of the parallel regions only check for data presence (A, B, C and D are already on the device). To make the code clearer, you can therefore use the present clause on the inner regions. To make sure the updates are actually done, use the update directive (update device for a host-side change, update self for a device-side result).
• Transfers: 6
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update:
the a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update:
the a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes:
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at a different place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
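The constructor/destructor idea mentioned above can be sketched in C by tying enter data / exit data to the host allocation's lifetime. The helper names (vec_create, vec_fill, vec_destroy) are hypothetical; without an OpenACC compiler, the pragmas are ignored and the code still works on the host.

```c
#include <stdlib.h>

/* Device allocation follows the host allocation's lifetime. */
double *vec_create(int n) {
    double *v = malloc(n * sizeof(double));
    #pragma acc enter data create(v[0:n])
    return v;
}

void vec_fill(double *v, int n, double value) {
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = value;
    /* Make the device result visible on the host. */
    #pragma acc update self(v[0:n])
}

void vec_destroy(double *v, int n) {
    #pragma acc exit data delete(v[0:n])
    free(v);
}
```
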

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is declared. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Time line for a data region with copyin(a,c) create(b) copyout(d):
• At entry, a and c are copied H→D; b and d are allocated on the device without any transfer.
• Kernels running on the device then modify b and d; the host copies are not affected.
• update self(b) copies the device value of b back to the host (D→H).
• At exit of the data region, only d is copied back to the host (copyout); changes made on the host in the meantime do not reach the device without update device.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
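The two directions of update can be combined in one routine, sketched here in C (illustrative function name; with a non-OpenACC compiler the pragmas are ignored and the result is the same).

```c
/* update self copies device -> host; update device copies host -> device.
 * Both are used inside an open data region, outside any compute region. */
int sum_with_host_bias(int *a, int n, int bias) {
    int total = 0;
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i;
        /* Make the device results visible on the host... */
        #pragma acc update self(a[0:n])
        for (int i = 0; i < n; ++i)   /* host-side modification */
            a[i] += bias;
        /* ...then push the host modification back to the device. */
        #pragma acc update device(a[0:n])
        #pragma acc parallel loop reduction(+:total)
        for (int i = 0; i < n; ++i)
            total += a[i];
    }
    return total;
}
```
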

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

Time lines:
• Synchronous: Kernel 1, then the data transfer, then Kernel 2 run one after the other.
• Kernel 1 and transfer async: Kernel 1 overlaps with the data transfer; Kernel 2 runs afterwards.
• Everything is async: Kernel 1, the data transfer and Kernel 2 all overlap.

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

Time line: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer before starting.

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
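The same async/wait pattern can be written in C (sketch with an illustrative function name; it assumes b is already present on the device, e.g. via an enclosing data region, and a standalone wait at the end synchronizes all queues). Compiled without OpenACC support, the pragmas are ignored and everything runs sequentially with identical results.

```c
/* Queues 1-3 run independent work; the last kernel waits on queue 2
 * because it reads b, which the async update on queue 2 produces. */
void async_demo(int *a, int *b, int *c, int n) {
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i)
        a[i] = 42;
    #pragma acc update device(b[0:n]) async(2)
    #pragma acc parallel loop async(3)
    for (int i = 0; i < n; ++i)
        c[i] = 11;
    /* The update of b on queue 2 must complete before b is read. */
    #pragma acc parallel loop wait(2)
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 11;
    #pragma acc wait
}
```
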

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features            | OpenACC 2.7           | Notes                                        | OpenMP 5.0  | Notes
GPU offloading      | PARALLEL              | GR/WS/VS mode - each gang executes the instructions redundantly | TEAMS  | create teams of threads, execute redundantly
                    | SERIAL                | only 1 thread executes the code              | TARGET      | GPU offloading, initial thread only
                    | KERNELS               | offloading + automatic parallelization       | X           | does not exist in OpenMP
Implicit transfer H→D or D→H | scalar variables, non-scalar variables |                     | scalar variables, non-scalar variables |
Explicit transfer H→D or D→H inside explicit parallel regions | clauses of PARALLEL/SERIAL/KERNELS |  | clauses of TARGET | data movement defined from the host perspective
                    | copyin()              | copy H→D when entering the construct         | map(to:)    |
                    | copyout()             | copy D→H when leaving the construct          | map(from:)  |
                    | copy()                | copy H→D when entering and D→H when leaving  | map(tofrom:)|
                    | create()              | created on device, no transfers              | map(alloc:) |
                    | present() / delete()  |                                              |             |

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
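The clause correspondence in the table can be shown on one loop, offloaded once with the OpenACC copy() clause and once with the OpenMP map(tofrom:) clause (sketch with illustrative function names; with a non-offloading compiler both pragmas are ignored and the loops run identically on the host).

```c
/* Same computation, two directive dialects: copy() <-> map(tofrom:). */
void scale_acc(double *a, int n) {
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0;
}

void scale_omp(double *a, int n) {
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0;
}
```
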

Features             | OpenACC 2.7             | Notes                                    | OpenMP 5.0 | Notes
Data region          | declare()               | add data to the global data region       |            |
                     | data / end data         | inside the same programming unit         | TARGET DATA MAP(tofrom:) / END TARGET DATA | inside the same programming unit
                     | enter data / exit data  | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code
                     | update self()           | update D→H                               | UPDATE FROM | copy D→H inside a data region
                     | update device()         | update H→D                               | UPDATE TO   | copy H→D inside a data region
Loop parallelization | kernels                 | offloading + automatic parallelization of suitable loops, sync at the end of loops | X | does not exist in OpenMP
                     | loop gang/worker/vector | share iterations among gangs, workers and vectors | distribute | share iterations among TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features      | OpenACC 2.7                      | Notes                                                        | OpenMP 5.0     | Notes
Routines      | routine [seq|vector|worker|gang] | compile a device routine with the specified level of parallelism |            |
Asynchronism  | async()/wait                     | execute kernels/transfers asynchronously, with explicit sync | nowait/depend  | manages asynchronism and task dependencies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1D_OMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2D_OMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3D_OMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 17: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Execution model

Example with num_gangs(2), num_workers(5), vector_length(8):
• !$ACC parallel (starts gangs): 2 threads are active (gang-redundant mode, one per gang).
• !$ACC loop worker: the workers are activated, 2 x 5 = 10 active threads.
• !$ACC loop vector: the vector lanes are activated, 2 x 5 x 8 = 80 active threads.
• !$ACC loop worker vector: workers and vector lanes are activated at once, going directly from 2 to 80 active threads.
After each loop, execution returns to 2 active threads (gang-redundant mode).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)

2. Add OpenACC directives (Parallelization)

3. Optimize data transfers and loops (Optimizations)

4. Repeat steps 1 to 3 until everything is on the GPU

(Cycle: pick a kernel → analyze → parallelize loops → optimize data → optimize loops)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77

PGI compiler

During this training course, we will use the PGI compilers for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of.

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta:

Compute capability:
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
!$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
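The same initialization-plus-reduction pattern inside a data region can be sketched in C (illustrative function name; without an OpenACC compiler the pragmas are ignored and the loops run sequentially with the same result). Without the reduction clause, the accumulation into sum would be a race on the device.

```c
/* a stays on the device between the two kernels (data create);
 * only the reduced scalar comes back to the host. */
int sum_first(int n) {
    int a[10000];
    int sum = 0;
    #pragma acc data create(a[0:10000])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i + 1;
        #pragma acc parallel loop reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum = sum + a[i];
    }
    return sum;   /* n*(n+1)/2 */
}
```
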

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and the vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler:
• Graphical interface for nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling PGI_ACC_TIME

For codes generated with the PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector_length x nb_workers]).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 21 77

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f if you want to overwrite a file)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 22 77

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or if you want to open a profile generated with nvprof

$ nvvp <options> prog.prof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 23 77

Analysis - Graphical profiler NVVP

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 24 77

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
    Serial
    Kernels/Teams
    Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 25 77

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>
{
...
}

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation, the examples are in Fortran.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 26 77

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial:
The code runs single-threaded (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels:
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel:
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo

   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels: In the compiler we trust!

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo

   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions (gangs, workers, vector size) are independent for each loop nest.
• Loop nests are executed consecutively.
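The "one kernel per loop nest, executed consecutively" behavior can be sketched in C (illustrative function name; compiled without OpenACC support, the pragma is ignored and the loops run sequentially, which is also the observable ordering on the device).

```c
/* Inside kernels, each loop nest becomes its own kernel; the second
 * kernel runs after the first, so reading a[] in it is safe. */
void kernels_demo(int *a, int *b, int n) {
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i)
            a[i] = i;
        for (int i = 0; i < n; ++i)   /* second kernel, after the first */
            b[i] = 2 * a[i];
    }
}
```
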

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notesbull The number of gangs workers and the vector size are constant inside the region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

                Work sharing (loop iterations)   Activation of new threads
loop gang                    X
loop worker                  X                              X
loop vector                  X                              X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77

Work sharing - Loops

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 10._8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 10._8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions: example

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with coalescing; example without coalescing]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
!$acc loop vector
   do j=1, nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
!$acc loop vector
   do i=1, nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
!$acc loop seq
   do j=1, nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)          439                 16

The memory coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mappings for the three variants. Without coalescing: gang(i), vector(j); with coalescing: gang(j), vector(i); gang-vector(i), seq(j)]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens the data region
!$ACC loop                          ! parallel region
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1, 100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
!$ACC parallel
!$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

Optimal? No: the data region still begins after the first parallel region, which keeps its own copyout. You have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version
[Diagram: two successive parallel regions, each with its own implicit data region. The arrays A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated again for each region.]
• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Diagram: the same two parallel regions, each still carrying its own copyin/copyout/copy/create clauses.]

Let's add a data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Diagram: a data region now encloses both parallel regions and carries the copyin (A), copyout (B), copy (C) and create (D) clauses.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Diagram: the enclosing data region keeps the copyin (A), copyout (B), copy (C) and create (D) clauses; both parallel regions use present for A, B, C and D, with an update device for A and an update self for C between the regions.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update (the update directive is commented out)
The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c), create(b) and copyout(d). At entry, a and c are copied H→D while b and d are allocated on the device. During the region, an update self(b) copies the device value of b back to the host. At exit of the region, only d is copied back D→H: device-side modifications of a, b and c are discarded, and host-side modifications made during the region are never seen by the device unless update device is used.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: time lines. Fully synchronous: kernel 1, then data transfer, then kernel 2. Kernel 1 and transfer async: the transfer overlaps kernel 1. Everything async: kernel 1, the transfer and kernel 2 overlap.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: time line where kernel 2 waits for the transfer]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel loop
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• PARALLEL — GR/WS/VS mode; each gang executes instructions redundantly | TEAMS — creates teams of threads that execute redundantly
• SERIAL — only 1 thread executes the code | TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization | X — does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions:
• clauses of PARALLEL/SERIAL/KERNELS | clauses of TARGET — data movement defined from the host perspective
• copyin() — copy H→D when entering the construct | map(to:)
• copyout() — copy D→H when leaving the construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Data region:
• declare() — add data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM — copy D→H inside a data region
• update device() — update H→D | UPDATE TO — copy H→D inside a data region

Loop parallelization:
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | X — does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

• routines: routine [seq|vector|worker|gang] — compiles a device routine with the specified level of parallelism
• asynchronism: async()/wait — executes kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies
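As a complement to the table above, here is a minimal Fortran sketch of OpenACC asynchronism; the array names, bounds and queue identifiers are illustrative, not taken from the course examples:

```fortran
! Two independent kernels placed on different async queues;
! the host continues immediately and waits for both at the end.
!$ACC parallel loop async(1)
do i = 1, n
   a(i) = 2.0_8 * i
end do
!$ACC parallel loop async(2)
do i = 1, n
   b(i) = 3.0_8 * i
end do
!$ACC wait   ! explicit synchronization on all queues
```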

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
!$OMP PARALLEL DO SIMD
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k = 1, nx
!$acc loop worker
   do j = 1, nx
!$acc loop vector
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
!$OMP PARALLEL DO
   do j = 1, nx
!$OMP SIMD
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 18: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

(Figure: iterative cycle — Pick a kernel → Analyze → Parallelize loops → Optimize data → Optimize loops)

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler performs implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
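A typical compile line combining the flags above might look as follows; this is a sketch, and the source and executable names are placeholders:

```shell
# Compile an OpenACC Fortran code for a V100, with compiler feedback.
pgfortran -acc -ta=tesla:cc70 -Minfo=accel -o prog prog.f90
```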

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
!$ACC parallel loop
  do i = 1, 10000
     a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum = 0
!$ACC parallel loop
  do i = 1, 10000
     a(i) = i
  enddo
!$ACC parallel loop reduction(+:sum)
  do i = 1, 10000
     sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i = 1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i = 1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector length × nb workers]).
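In practice, enabling this profiling is just a matter of setting the environment before the run; this is a sketch, and the executable name is a placeholder:

```shell
export PGI_ACC_TIME=1   # print kernel/transfer statistics at program exit
./prog
export PGI_ACC_TIME=0   # disable before using nvprof (the two are incompatible)
```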

Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give, respectively, the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

(Screenshot of the NVVP interface)

1 Introduction

2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels/Teams, Parallel
      Work sharing: Parallel loop, Reductions, Coalescing
      Routines
      Data Management
      Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):
#pragma acc directive <clauses>
{ ... }

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation, the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i = 1, n
   do j = 1, n
      ...
   enddo
   do j = 1, n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i = 1, n
   do j = 1, n
      ...
   enddo
   do j = 1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i = 1, 1000
     a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang     X                              |
loop worker   X                              | X
loop vector   X                              | X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility
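A minimal sketch combining these clauses; the loop bounds, the arrays and the scratch variable tmp are illustrative, not taken from the course examples:

```fortran
! collapse(2) merges both loops into one iteration space; tmp is
! private to each iteration; total is reduced across all threads.
!$ACC parallel loop collapse(2) private(tmp) reduction(+:total)
do j = 1, ny
   do i = 1, nx
      tmp = b(i,j) + c(i,j)
      a(i,j) = 2.0_8 * tmp
      total = total + a(i,j)
   end do
end do
```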

Work sharing - Loops

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial:

!$acc serial
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS:

!$acc parallel num_gangs(10)
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the single worker of each gang
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i = 1, 1000
     a(i) = b(i) * 2
  enddo
!$ACC end kernels

! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i = 1, 1000
     a(i) = b(i) * 2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Figures: an example with coalescing and an example without coalescing)

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i = 1, nx
!$acc loop vector
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i = 1, nx
!$acc loop seq
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms)  439                | 16

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

(Figure, shown step by step in the original slides: the mapping of threads T0–T7 to the elements of array A for the three variants. Without coalescing — gang(i)/vector(j) — consecutive threads of a worker access elements that are far apart in memory, so the accesses cannot be merged. With coalescing — gang(j)/vector(i), or gang vector(i) with seq(j) — consecutive threads access contiguous elements and the accesses are merged.)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer :: i
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i = 1, s
     call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j = 1, s
       array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
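The C spelling of the same rule can be sketched as follows (function names invented for this example; compiled without OpenACC the pragmas are ignored and the code runs on the host):

```c
/* A function called from an offloaded loop must itself carry
 * "#pragma acc routine" with its level of parallelism, so that the
 * compiler also generates a device version of it.  Here seq: each
 * call runs sequentially inside one thread of the loop. */
#pragma acc routine seq
static void fill_row(int *row, int n, int value)
{
    for (int j = 0; j < n; ++j)
        row[j] = value;
}

void fill_matrix(int *a, int n)
{
    #pragma acc parallel loop copyout(a[0:n*n])
    for (int i = 0; i < n; ++i)
        fill_row(&a[i * n], n, 2);
}
```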

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
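As a sketch of how these clauses combine on one loop (an illustrative function, not one of the course examples): b is only read on the device, c is only written, and a is both read and overwritten.

```c
/* b is only read on the device      -> copyin  (H->D only)
 * c is only written on the device   -> copyout (D->H only)
 * a is read and then overwritten    -> copy    (H->D and D->H) */
void combine(double *a, const double *b, double *c, int n)
{
    #pragma acc parallel loop copy(a[0:n]) copyin(b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] = a[i] + b[i];   /* needs the old host value of a */
        c[i] = 2.0 * b[i];    /* never read: no H->D transfer needed */
    }
}
```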

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel construct opens a parallel region together with its implicit data region]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Optimize data transfers

[Figure: four arrays A, B, C and D used by two successive parallel regions; A appears in copyin, B in copyout, C in copy and D in create clauses. Shown in three steps:]

• Naive version: each parallel region carries its own implicit data region, so every clause acts for each region. Transfers: 8, allocations: 2.
• With an enclosing data region: A, B and C are now transferred at entry and exit of the data region only. Transfers: 4, allocations: 1.
• Inside the data region the clauses only check for data presence, so to make the code clearer you can use the present clause; to make sure the updates are done, use the update directive (update device, update self). Transfers: 6, allocations: 1.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
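The same unstructured lifetime can be sketched in C (function names invented for this example; compiled without OpenACC the pragmas are ignored and the code runs on the host):

```c
#include <stdlib.h>

/* The device allocation is opened in one function and closed in
 * another: this is what enter data / exit data allow, and why they
 * fit C++ constructors and destructors well. */
double *vec_create(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])   /* device allocation, no copy */
    return v;
}

void vec_compute(double *v, int n)
{
    /* present: the data region was opened elsewhere (vec_create) */
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = 1.2;
}

void vec_destroy(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])    /* free the device copy */
    free(v);
}
```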

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: timeline of the host and device copies of a, b, c and d across a data region opened with copyin(a,c), create(b) and copyout(d). The host and device values diverge while kernels run on the device; update self(b) copies b D→H during the region; d is copied back D→H only when the data region ends]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait.
In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three timelines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. "Kernel 1 and transfer async": Kernel 1 overlaps the transfer. "Everything is async": Kernel 1, the transfer and Kernel 2 all overlap]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: timeline where Kernel 2 waits for the data transfer to complete before starting]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
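The same pattern can be sketched in C (an illustrative function, not one of the course examples; it assumes a, b and c were made present by an enclosing data region, and built without OpenACC the pragmas are ignored so everything simply runs in order):

```c
/* Three independent operations get their own queues (async(1..3));
 * the last loop reuses b, so it must wait for queue 2.  The final
 * wait joins all remaining queues before returning. */
void pipeline(int *a, int *b, int *c, int n)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) a[i] = 42;

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop async(3)
    for (int i = 0; i < n; ++i) c[i] = 11;

    #pragma acc parallel loop wait(2)      /* needs the fresh b */
    for (int i = 0; i < n; ++i) b[i] = b[i] + 1;

    #pragma acc wait                       /* join queues 1 and 3 */
}
```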

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
!$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )              +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )              +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )              +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
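For reference, one generation of the same update can be sketched in C. This mirrors the loop nest of the slides but is not the exact course example (row-major indexing, with a halo of dead cells around the grid):

```c
/* One generation on a (rows+2) x (cols+2) grid stored row-major;
 * row 0, row rows+1, column 0 and column cols+1 are a dead halo.
 * world must enter as a copy of oworld: cells with exactly 2 live
 * neighbours keep their previous state because neither branch runs. */
void gol_step(const int *oworld, int *world, int rows, int cols)
{
    int w = cols + 2;                  /* pitch of one row */
    #pragma acc parallel loop copyin(oworld[0:(rows + 2) * w]) \
                              copy(world[0:(rows + 2) * w])
    for (int r = 1; r <= rows; ++r) {
        #pragma acc loop vector
        for (int c = 1; c <= cols; ++c) {
            int neigh =
                oworld[(r-1)*w + c-1] + oworld[r*w + c-1] + oworld[(r+1)*w + c-1]
              + oworld[(r-1)*w + c  ]                     + oworld[(r+1)*w + c  ]
              + oworld[(r-1)*w + c+1] + oworld[r*w + c+1] + oworld[(r+1)*w + c+1];
            if (oworld[r*w + c] == 1 && (neigh < 2 || neigh > 3))
                world[r*w + c] = 0;
            else if (neigh == 3)
                world[r*w + c] = 1;
        }
    }
}
```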

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• PARALLEL (OpenACC): GR-WS-VS mode, each gang executes the instructions redundantly ↔ TEAMS (OpenMP): creates teams of threads that execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code ↔ TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)

Data movement clauses
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): memory allocated on the device, no transfers ↔ map(alloc:)
• present(), delete(): no OpenMP equivalent listed

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions
• declare(): add data to the global data region (no OpenMP counterpart listed)
• data / end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
• update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, synchronization at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait: execute kernels and transfers asynchronously, with explicit synchronization ↔ nowait/depend: manage asynchronism and task dependencies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 19: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
!$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum = 0
!$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
!$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
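The same reduction pattern can be sketched in C (illustrative function name; without the reduction clause the partial sums of the gangs would race, with it they are combined at the end of the loop):

```c
/* Sum of 1..n offloaded with a reduction: each gang accumulates a
 * private partial sum, and the partial sums are combined when the
 * loop finishes. */
long sum_up_to(int n)
{
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```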

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vector_length x nb_workers]).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 21 77

Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 22 77

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 23 77

Analysis - Graphical profiler NVVP

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 24 77

1 Introduction

2 ParallelismOffloaded Execution

Add directivesCompute constructsSerialKernelsTeamsParallel

Work sharingParallel loopReductionsCoalescing

RoutinesRoutines

Data ManagementAsynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 25 77

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between Fortran and C/C++.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 26 77
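A minimal C illustration of that syntax (a saxpy loop invented for this example; without OpenACC support the pragma is simply ignored and the loop runs on the host):

```c
/* One compute construct with data clauses, written with the
 * C/C++ "#pragma acc" spelling. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```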

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU.
Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
bull Size: 1000x1000
bull Generations: 100
bull Elapsed time: 123.640 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
bull Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
bull Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
bull The parameters of the parallel regions (gangs, workers, vector size) are independent.
bull Loop nests are executed consecutively.


Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
bull Size: 1000x1000
bull Generations: 100
bull Execution time: 1.951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
bull Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
bull Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1, n
   ...
enddo
!$ACC end parallel

Important notes:
bull The number of gangs, workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
bull Size: 1000x1000
bull Generations: 100
bull Execution time: 120.467 s (noautopar)
bull Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
bull num_gangs: the number of gangs
bull num_workers: the number of workers
bull vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
bull These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
bull The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
bull Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

             | Work sharing (loop iterations) | Activation of new threads
loop gang    |               X                |
loop worker  |               X                |            X
loop vector  |               X                |            X


Work sharing - Loops

Clauses
bull Merge tightly nested loops: collapse(loops)
bull Tell the compiler that iterations are independent: independent (useful for kernels)
bull Privatize variables: private(variable-list)
bull Reduction: reduction(operation:variable-list)

Notes
bull You might see the directive do for compatibility.


Work sharing - Loops

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

Execution: serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

bull Active threads: 1
bull Number of operations: nx

Execution: GR/WS/VS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

bull Redundant execution by gang leaders
bull Active threads: 10
bull Number of operations: 10 x nx

Execution: GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

bull Each gang executes a different block of iterations
bull Active threads: 10
bull Number of operations: nx


Execution: GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

bull Iterations are shared among the active workers of each gang
bull Active threads: 10 x workers
bull Number of operations: nx

Execution: GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

bull Iterations are shared among the threads of the worker of all gangs
bull Active threads: 10 x vector length
bull Number of operations: nx

Execution: GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

bull Iterations are shared on the grid of threads of all gangs
bull Active threads: 10 x workers x vector length
bull Number of operations: nx

Execution: GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

bull Each gang is assigned all the iterations, which are shared among the workers
bull Active threads: 10 x workers
bull Number of operations: 10 x nx

Execution: GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

bull Each gang is assigned all the iterations, which are shared on the threads of the active worker
bull Active threads: 10 x vector length
bull Number of operations: 10 x nx

Execution: GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

bull Each gang is assigned all iterations and the grid of threads distributes the work
bull Active threads: 10 x workers x vector length
bull Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

bull Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

bull Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions: the variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions: example

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
bull Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
bull This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
bull For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with «coalescing» vs example with «no coalescing»]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
!$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
!$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
!$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          | Without coalescing | With coalescing
Time (ms) |        43.9        |       1.6

The memory coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three variants. Without coalescing, gang(i)/vector(j): the threads T0..T7 of a worker access elements of a row, which are strided in memory. With coalescing, gang(j)/vector(i): threads T0..T7 access consecutive elements of a column, contiguous in memory. gang-vector(i)/seq(j): each thread sweeps its row sequentially.]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime: a data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
bull copyin: the variable is copied H→D; the memory is allocated when entering the region
bull copyout: the variable is copied D→H; the memory is allocated when entering the region
bull copy: copyin + copyout

No data movement
bull create: the memory is allocated when entering the region
bull present: the variable is already on the device
bull delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
bull no_create
bull deviceptr
bull attach
bull detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
bull In Fortran, the last index of an assumed-size dummy array must be specified.
bull In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
bull The shape must be specified when using a slice.
bull If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its own data region.]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1, 100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
bull Compute region reached 100 times
bull Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
!$ACC parallel
!$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
bull Compute region reached 100 times
bull Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Optimal? No: you have to move the beginning of the data region before the first loop, so that a is transferred only once.


Data on device

[Figure: naive version. Two parallel regions, each with its own implicit data region, both using the arrays A (copyin), B (copyout), C (copy) and D (create); each region transfers/allocates its data independently.]

bull Transfers: 8
bull Allocations: 2


Data on device

Optimize data transfers

[Figure: the same two parallel regions, each still carrying A (copyin), B (copyout), C (copy) and D (create).]

Let's add a data region.


Data on device

Optimize data transfers

[Figure: a data region now encloses both parallel regions; A (copyin), B (copyout), C (copy) and D (create) are handled by the enclosing data region.]

A, B and C are now transferred at entry and exit of the data region.
bull Transfers: 4
bull Allocation: 1


Data on device

Optimize data transfers

[Figure: the enclosing data region keeps A (copyin), B (copyout), C (copy) and D (create); the inner parallel regions use present for all four arrays, with update device for A and update self for C where fresh values are needed.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

bull Transfers: 6
bull Allocation: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: time line of a data region with copyin(a,c), create(b), copyout(d). The host starts with a=10, b=-4, c=42, d=0. On entry, a and c are copied to the device while b and d are only allocated there. Kernels update the device copies (a=10, b=26, c=42, d=158); update self(b) brings b=26 back to the host, leaving the host's a, c and d untouched. Later device work (a=42, b=42) stays invisible to the host. On exit, only d is copied out: the host ends with a=10, b=26, c=42, d=158.]


Data on device - Important notes

bull Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
bull Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
bull computation and data transfers
bull kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs Kernel 1, the data transfer and Kernel 2 back to back; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or the wait directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: time line where Kernel 2 waits for the transfer issued on execution thread 2 before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
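A minimal sketch of the standalone wait directive (thread numbers and array names are illustrative): after launching work on several execution threads, a bare wait synchronizes with them before the results are combined:

```fortran
program wait_sketch
   implicit none
   integer :: i
   real(8) :: a(1000), b(1000)
   !$ACC data create(a, b)
   !$ACC parallel loop async(1)
   do i = 1, 1000
      a(i) = 1.0_8
   enddo
   !$ACC parallel loop async(2)
   do i = 1, 1000
      b(i) = 2.0_8
   enddo
   !$ACC wait(1, 2)   ! block until both execution threads have completed
   !$ACC parallel loop
   do i = 1, 1000
      a(i) = a(i) + b(i)
   enddo
   !$ACC end data
end program wait_sketch
```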



Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
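The slide's race-condition example is not reproduced in this transcript; as an illustrative sketch (a made-up program, not the original), a gang-level sum written without a reduction clause races on the shared variable, which autocompare would flag when the GPU result diverges from the CPU one:

```fortran
program race_sketch
   implicit none
   integer :: i
   real(8) :: a(1000000), s
   a(:) = 1.0_8
   s = 0.0_8
   ! BUG (on purpose): s is updated by many threads without a
   ! reduction(+:s) clause, so the GPU result is nondeterministic
   ! and will generally differ from the CPU result.
   !$ACC parallel loop copy(s)
   do i = 1, 1000000
      s = s + a(i)
   enddo
   print *, "sum =", s
end program race_sketch
```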



Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.



Features: OpenACC 2.7 / OpenMP 5.0 translation table (1/3)

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses of PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features: OpenACC 2.7 / OpenMP 5.0 translation table (2/3)

Data regions
• declare (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)


Features: OpenACC 2.7 / OpenMP 5.0 translation table (3/3)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manage asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


PGI Compiler - -ta options

Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVIDIA hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta


PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90


PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90


Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof


Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vector length x nb workers]).


Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)


Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler: NVVP

[Screenshot of the NVVP graphical interface]



Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang."
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                             X
loop vector                X                             X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility

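As an illustrative sketch of the collapse clause (the array name and sizes are made up), two tightly nested loops are merged into a single iteration space before the work is shared:

```fortran
program collapse_sketch
   implicit none
   integer :: i, j
   real(8) :: a(512, 512)
   ! collapse(2) merges the two loops into one 512*512 iteration
   ! space, giving the runtime more parallelism to distribute.
   !$ACC parallel loop collapse(2)
   do j = 1, 512
      do i = 1, 512
         a(i, j) = real(i + j, 8)
      enddo
   enddo
end program collapse_sketch
```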

Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 * nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 * workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 * vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 * workers * vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 * workers
• Number of operations: 10 * nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 * vector length
• Number of operations: 10 * nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and its grid of threads distributes the work
• Active threads: 10 * workers * vector length
• Number of operations: 10 * nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 978452.13 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 1000000.00 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to get the final value of the variable.

Reduction operators

Operation  Effect       Language(s)
+          Sum          Fortran, C(++)
*          Product      Fortran, C(++)
max        Maximum      Fortran, C(++)
min        Minimum      Fortran, C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions: example

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the
  use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory
  location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to
  memory.

Example with «Coalescing» / Example with «no Coalescing»

Pierre-François Lavallée, Thibaut Véry 45 / 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without Coalescing   With Coalescing
Time (ms)   43.9                 1.6

The memory coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46 / 77

Parallelism - Memory accesses and coalescing

[Figure: memory access patterns of threads T0-T7 for the three variants.
Without coalescing: gang(i), vector(j), each thread walks a non-contiguous stride.
With coalescing: gang(j), vector(i), neighbouring threads touch neighbouring locations.
gang-vector(i), seq(j): each thread walks its own row sequentially.]

Pierre-François Lavallée, Thibaut Véry 47 / 77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a
routine. This tells the compiler that a device version of the function/subroutine has to be
generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker,
vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a
routine. This tells the compiler that a device version of the function/subroutine has to be
generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker,
vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data
regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this
region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive
inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available
during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement
• copyin: The variable is copied H→D; the memory is allocated when entering the region.
• copyout: The variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: The memory is allocated when entering the region.
• present: The variable is already on the device.
• delete: Frees the memory allocated on the device for this variable.

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50 / 77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran
and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the
number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables
necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel construct opens both a data region and a parallel region.)

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables
necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside
the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53 / 77


Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device

Naive version

[Figure: two successive parallel regions (each with its own implicit data region) use the arrays
A (copyin), B (copyout), C (copy) and D (create) independently; every region performs its own
transfers and allocation]

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Figure: the two parallel regions (each with its own implicit data region) still transfer A (copyin),
B (copyout), C (copy) and create D separately]

Let's add a data region!

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Figure: a data region now encloses the two parallel regions; the clauses copyin(A), copyout(B),
copy(C) and create(D) are moved onto it]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Figure: inside the enclosing data region, the parallel regions use present clauses for A, B, C and D;
update device and update self directives perform the transfers that are still needed]

The clauses check for data presence, so to make the code clearer you can use the present clause.
To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible
to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible
to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible
to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the
device at another place than where they are used. It is especially useful when using C++
constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region
where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: time line of the host and device copies of the variables a, b, c, d across a data region
opened with copyin(a,c), create(b), copyout(d). The values of a and c are transferred to the device
at entry; b and d are allocated but undefined on the device; update self(b) copies the device value
of b back to the host during the region; d is copied back to the host at the exit of the region.]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these
  transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened
  with data and it encloses the parallel regions, you should use update to avoid unexpected
  behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e.
one after the other. The accelerator is able to manage several execution threads running
concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of
these directives: parallel, kernels, serial, enter data, exit data, update and wait.
In all cases the argument of async is optional.
It is possible to specify a number inside the clause to create several execution threads.

[Figure: execution time lines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after
the other. With Kernel 1 and the transfer async: they overlap. With everything async: all three
overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of
others. Then you have to add a wait(execution thread number) clause to the directive that needs
to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread
  numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer on
execution thread 2 before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution
  is stopped.

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation
  (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life.
Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all
gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features: OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading
• OpenACC PARALLEL: GR/WS/VS mode, each gang executes instructions redundantly.
  OpenMP TEAMS: create teams of threads, execute redundantly.
• OpenACC SERIAL: only 1 thread executes the code.
  OpenMP TARGET: GPU offloading, initial thread only.
• OpenACC KERNELS: offloading + automatic parallelization. Does not exist in OpenMP.

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables.
• OpenMP: scalar variables, non-scalar variables.

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS.
  OpenMP: clauses on TARGET; data movement is defined from the host perspective.
• OpenACC copyin(): copy H→D when entering the construct. OpenMP: map(to:)
• OpenACC copyout(): copy D→H when leaving the construct. OpenMP: map(from:)
• OpenACC copy(): copy H→D when entering and D→H when leaving. OpenMP: map(tofrom:)
• OpenACC create(): created on device, no transfers. OpenMP: map(alloc:)
• OpenACC present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features: OpenACC 2.7 / OpenMP 5.0 translation table (continued)

Data region
• OpenACC declare(): add data to the global data region.
• OpenACC data / end data (inside the same programming unit).
  OpenMP: TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit).
• OpenACC enter data / exit data (opening and closing anywhere in the code).
  OpenMP: TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code).
• OpenACC update self(): update D→H. OpenMP UPDATE FROM: copy D→H inside a data region.
• OpenACC update device(): update H→D. OpenMP UPDATE TO: copy H→D inside a data region.

Loop parallelization
• OpenACC kernels: offloading + automatic parallelization of suitable loops, sync at the end of
  the loops. Does not exist in OpenMP.
• OpenACC loop gang/worker/vector: share iterations among gangs, workers and vectors.
  OpenMP distribute: share iterations among TEAMS.

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features: OpenACC 2.7 / OpenMP 5.0 translation table (continued)

• Routines: OpenACC routine [seq|vector|worker|gang] compiles a device routine with the
  specified level of parallelism.
• Asynchronism: OpenACC async()/wait executes kernels/transfers asynchronously with
  explicit synchronization. OpenMP nowait/depend manages asynchronism and task
  dependencies.

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 21: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum = 0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

Pierre-François Lavallée, Thibaut Véry 18 / 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Pierre-François Lavallée, Thibaut Véry 19 / 77

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp, pgprof, NSight Graphics Profiler
• Graphical interface for nvprof

Pierre-François Lavallée, Thibaut Véry 20 / 77

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the
code by setting the following environment variable: PGI_ACC_TIME=1.
The grid is the number of gangs. The block is the size of one gang ([vector length x nb workers]).

Pierre-François Lavallée, Thibaut Véry 21 / 77

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before
using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time       Calls  Avg        Min        Max        Name
GPU activities: 53.51%   2.70662s   500    5.4132ms   448ns      68.519ms   [CUDA memcpy HtoD]
                36.72%   1.85742s   300    6.1914ms   384ns      9.2907ms   [CUDA memcpy DtoH]
                6.34%    320.52ms   100    3.2052ms   3.1372ms   3.6247ms   gol_39_gpu
                2.62%    132.39ms   100    1.3239ms   1.2727ms   1.3659ms   gol_33_gpu
                0.81%    40.968ms   100    409.68us   401.22us   417.25us   gol_52_gpu
                0.01%    3.5927ms   100    35.920us   33.920us   40.640us   gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and
  "--metrics dram_write_throughput" give respectively the number of operations, the memory
  read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to
  overwrite a file).

Pierre-François Lavallée, Thibaut Véry 22 / 77

Analysis - Graphical profiler nvvpNvidia also provides a GUI for nvprof nvvpIt gives the time spent in the kernels and data transfers for all GPU regionsCaution PGI_ACC_TIME and nvprof are incompatible Ensure that PGI_ACC_TIME=0 beforeusing the tool

$ nvvp ltoptionsgt executable lt--args arguments of executablegt

or if you want to open a profile generated with nvprof

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels, Teams, Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>
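As a minimal sketch of the C/C++ syntax (the `saxpy` function and its arguments are made up for illustration, not taken from the course examples): with an OpenACC compiler (e.g. `pgcc -acc`) the loop is offloaded, while a regular compiler simply ignores the unknown pragma and runs the loop on the host.

```c
#include <assert.h>

/* Hypothetical example: offload a simple loop with the kernels construct.
   Without an OpenACC compiler the pragma is ignored and the loop runs
   sequentially on the host, producing the same result. */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```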

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.


Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-redundant
- Worker-Single
- Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: Default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s


Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
- Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
- The parameters of the generated parallel regions are independent (gangs, workers, vector size)
- Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 1.951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among the gangs.

Data: default behavior
- Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
- The number of gangs, the number of workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option -acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC&         vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
- The optimal number of gangs is highly dependent on the architecture; use num_gangs with care


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang                   X                |
loop worker                 X                |            X
loop vector                 X                |            X


Work sharing - Loops

Clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do instead of loop, kept for compatibility
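The collapse and private clauses can be sketched in C as follows (the `init` function is a made-up illustration, not one of the course's `exemples/` files). collapse(2) merges the two tightly nested loops into a single iteration space, and `tmp` is privatized so that every iteration has its own copy; a plain compiler ignores the pragma and runs the loops sequentially with the same result.

```c
#include <assert.h>

/* Hypothetical sketch of collapse(2) and private(tmp): the two loops form
   one iteration space of n*m elements, each with a private tmp. */
void init(int n, int m, int a[n][m]) {
    int tmp;
    #pragma acc parallel loop collapse(2) private(tmp)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            tmp = i * m + j;   /* each iteration works on its own tmp */
            a[i][j] = tmp;
        }
}
```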


Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

- Active threads: 1
- Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

- Redundant execution by the gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

- Iterations are shared among the active workers of each gang
- Active threads: 10 x #workers
- Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

- Iterations are shared among the vector threads of the active worker of all gangs
- Active threads: 10 x vector length
- Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x #workers x vector length
- Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 x #workers
- Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 x vector length
- Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

- Each gang is assigned all the iterations, and its grid of threads distributes the work
- Active threads: 10 x #workers x vector length
- Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

- Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

- Result: somme = 100000000 (right value)


Parallelism - Merging kernels/parallel directives with loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into kernels loop or parallel loop. The clauses available for this combined construct are those of both constructs.

For example with the kernels directive:

program fused
integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
do i=1,1000
   a(i) = b(i)*2
enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
do i=1,1000
   a(i) = b(i)*2
enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to the private copies to get the final value of the variable.

Reduction operators:

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
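The semantics above can be sketched with a sum reduction in C (`sum_array` is a made-up illustration): each thread accumulates into a private copy of `total`, and the partial sums are combined with `+` at the end of the loop. A sequential build, which ignores the pragma, produces the same result.

```c
#include <assert.h>

/* Hypothetical sketch of reduction(+:total): private partial sums are
   combined at the end of the loop. */
long sum_array(int n, const int *v) {
    long total = 0;
    #pragma acc parallel loop reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += v[i];
    return total;
}
```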


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
- Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: access pattern with coalescing vs access pattern without coalescing]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.
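The same trade-off can be sketched in C (the `fill` function and size `NX` are hypothetical). Note that C arrays are row-major, so the contiguous index is the LAST one: for coalesced accesses the vector loop should run over j, the mirror image of the Fortran examples above, where the first index is contiguous.

```c
#include <assert.h>

#define NX 64

/* Hypothetical sketch: in C, consecutive vector threads should touch
   a[i][j], a[i][j+1], ... which are contiguous in memory. */
void fill(double a[NX][NX]) {
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector   /* j varies fastest over contiguous memory */
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```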


Parallelism - Memory accesses and coalescing

[Figure: step-by-step thread-to-element mapping for the three versions. Without coalescing (gang(i), vector(j)), at each step the threads T0..T7 of a worker access memory locations that are far apart. With coalescing (gang(j), vector(i)), at each step the threads T0..T7 access contiguous memory locations. The gang-vector(i), seq(j) version also yields contiguous accesses across the gangs.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading:
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime:
A data region is created when a procedure (function or subroutine) is called and is available during its lifetime. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clauses in which they appear.


Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement:
- copyin: the variable is copied H→D; the memory is allocated when entering the region
- copyout: the variable is copied D→H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement:
- create: the memory is allocated when entering the region
- present: the variable is already on the device
- delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
You may see clauses prefixed with present_or_ or p_, kept for OpenACC 2.0 compatibility.

Other clauses:
- no_create
- deviceptr
- attach
- detach
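These clauses can be combined as in the following C sketch (the `axpy1` function is illustrative, not from the course): `x` is only read on the device (copyin), `y` is transferred both ways (copy), and `tmp` exists only on the device (create). Without an OpenACC compiler the pragmas are ignored and everything stays on the host, with the same numerical result.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of copyin / copy / create on a data region. */
void axpy1(int n, const double *x, double *y) {
    double *tmp = malloc(n * sizeof *tmp);
    #pragma acc data copyin(x[0:n]) copy(y[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0 * x[i];   /* tmp is never transferred to the host */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] += tmp[i];
    }
    free(tmp);
}
```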


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and the last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions:
- In Fortran, the last index of an assumed-size dummy array must be specified
- In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
- The shape must be specified when using a slice
- If the first index is omitted, it is the default of the language (C/C++: 0, Fortran: 1)
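The C restriction can be sketched as follows (`fill_slice` is a made-up name): since a pointer argument carries no size information, the `[first:count]` shape must be given explicitly in the data clause, here restricting the transfer to a slice of the array.

```c
/* Hypothetical sketch: only elements 10..19 (start 10, count 10) of the
   dynamically sized array are moved by the copy clause. A non-OpenACC
   compiler ignores the pragma and runs the loop on the host. */
void fill_slice(int n, int *a) {
    #pragma acc parallel loop copy(a[10:10])
    for (int i = 10; i < 20 && i < n; ++i)
        a[i] = 42;
}
```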


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens both the data region and the parallel region
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
- Compute region reached 100 times
- Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
- Compute region reached 100 times
- Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? Not quite: you should move the beginning of the data region before the first loop.


Data on device

Naive version: two successive parallel regions, each with its own implicit data region; on each region, A is copyin, B is copyout, C is copy and D is create.
- Transfers: 8
- Allocations: 2

Data on device

Optimize data transfers: the same two parallel regions with the same clauses (A copyin, B copyout, C copy, D create). Let's add a data region enclosing them.

Data on device

With an enclosing data region carrying the clauses (A copyin, B copyout, C copy, D create), A, B and C are now transferred only at entry and exit of the data region.
- Transfers: 4
- Allocations: 1

Data on device

The data clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done, use the update directive (update device after the host writes, update self before the host reads).
- Transfers: 6
- Allocations: 1

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
The variables included in a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.


Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
The variables included in a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update:
The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device:
The variables included in a device clause are updated H→D.


Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77
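The same unstructured data lifetime can be sketched in C. This is a minimal illustration, not from the course: the function name and sizes are mine, and a compiler without OpenACC support simply ignores the pragmas and runs the loop on the host, giving the same result.

```c
#include <stdlib.h>

/* Unstructured data lifetime around a device computation.
 * fill_vec and its parameters are illustrative, not from the course. */
double fill_vec(int s)
{
    double *vec = malloc(s * sizeof *vec);
    #pragma acc enter data create(vec[0:s])   /* allocate on the device */

    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;

    /* copyout brings the device values back before the device copy dies;
     * delete (as on the slide) would discard them instead. */
    #pragma acc exit data copyout(vec[0:s])

    double last = vec[s - 1];
    free(vec);
    return last;
}
```

Using exit data copyout here (instead of the slide's delete) is a deliberate variation so that the host can check the computed values.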

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d. Opening the region with copyin(a,c) create(b) copyout(d) transfers a and c H→D and allocates b and d on the device; update self(b) copies b D→H while kernels keep running; leaving the region copies d D→H (copyout(d)). Writes made on one side inside the region are not seen by the other side until an update or the region end.]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses those parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77
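The update pair can also be written in C. The sketch below is mine (array size and tested index are illustrative): with OpenACC active, the host copy of a[41] would stay stale without update self; with the pragmas ignored, the same value is read directly.

```c
/* Sketch of update self / update device inside a data region.
 * The array, its size and the tested index are illustrative. */
double update_self_demo(void)
{
    static double a[1000];
    double seen;

    #pragma acc data copyout(a)
    {
        #pragma acc parallel loop
        for (int i = 0; i < 1000; ++i)
            a[i] = i + 1.0;

        /* D->H for one element: without it, the host copy of a[41]
         * is stale until the end of the data region. */
        #pragma acc update self(a[41:1])
        seen = a[41];

        a[41] = -1.0;                       /* host-side write...   */
        #pragma acc update device(a[41:1])  /* ...pushed H->D       */
    }
    return seen;
}
```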

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines. Synchronous: Kernel 1, then the data transfer, then Kernel 2. "Kernel 1 and transfer async": the transfer overlaps Kernel 1. "Everything is async": Kernel 1, the transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers identifying execution threads. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77
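A C sketch of the same queue pattern, with assumptions flagged: the function name, sizes and the enter data/exit data framing are mine (the slide relies on data being present already). When a non-OpenACC compiler ignores the pragmas, the statements simply run in program order and produce the same values.

```c
/* async queues 1 and 2, with the second kernel waiting on queue 2 only.
 * Function name and sizes are illustrative, not from the course. */
double async_wait_demo(int s, double *a, double *b)
{
    for (int i = 0; i < s; ++i)                 /* b initialized on host */
        b[i] = i;

    #pragma acc enter data create(a[0:s], b[0:s])

    #pragma acc parallel loop async(1)          /* queue 1: independent kernel */
    for (int i = 0; i < s; ++i)
        a[i] = 42.0;

    #pragma acc update device(b[0:s]) async(2)  /* queue 2: H->D transfer */

    #pragma acc parallel loop wait(2)           /* needs b: wait for queue 2 */
    for (int i = 0; i < s; ++i)
        b[i] = b[i] + 11.0;

    #pragma acc wait                            /* drain every queue */
    #pragma acc exit data copyout(b[0:s]) delete(a[0:s])
    return b[s - 1];
}
```

Note that wait(2) lets this kernel overlap with the queue-1 kernel: it only orders it after the transfer it actually depends on.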

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (OpenACC): GR/WS/VS mode - each gang executes instructions redundantly ↔ TEAMS (OpenMP): create teams of threads, which execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code ↔ TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• Both models: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• Clauses on PARALLEL/SERIAL/KERNELS (OpenACC) ↔ clauses on TARGET (OpenMP); data movement is defined from the host perspective
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on device, no transfers ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

Data region
• declare() (OpenACC): add data to the global data region
• data / end data (OpenACC): inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (OpenMP): inside the same programming unit
• enter data / exit data (OpenACC): opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA (OpenMP): opening and closing anywhere in the code
• update self() (OpenACC): update D→H ↔ UPDATE FROM (OpenMP): copy D→H inside a data region
• update device() (OpenACC): update H→D ↔ UPDATE TO (OpenMP): copy H→D inside a data region

Loop parallelization
• kernels (OpenACC): offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector (OpenACC): share iterations among gangs, workers and vectors ↔ distribute (OpenMP): share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (OpenACC): compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait (OpenACC): execute kernels/transfers asynchronously, with explicit sync ↔ nowait/depend (OpenMP): manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

Page 22: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Pierre-François Lavallée, Thibaut Véry 19 / 77

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Pierre-François Lavallée, Thibaut Véry 20 / 77

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vector length x nb workers]).

Pierre-François Lavallée, Thibaut Véry 21 / 77

Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Pierre-François Lavallée, Thibaut Véry 22 / 77

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Pierre-François Lavallée, Thibaut Véry 23 / 77

Analysis - Graphical profiler: NVVP

Pierre-François Lavallée, Thibaut Véry 24 / 77

1 Introduction

2 ParallelismOffloaded Execution

Add directivesCompute constructsSerialKernelsTeamsParallel

Work sharingParallel loopReductionsCoalescing

RoutinesRoutines

Data ManagementAsynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 25 / 77

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.

Pierre-François Lavallée, Thibaut Véry 26 / 77

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.

Pierre-François Lavallée, Thibaut Véry 27 / 77

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Pierre-François Lavallée, Thibaut Véry 28 / 77

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Pierre-François Lavallée, Thibaut Véry 29 / 77

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Pierre-François Lavallée, Thibaut Véry 30 / 77

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Pierre-François Lavallée, Thibaut Véry 31 / 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.

Pierre-François Lavallée, Thibaut Véry 32 / 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-François Lavallée, Thibaut Véry 33 / 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-François Lavallée, Thibaut Véry 34 / 77

Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X

Pierre-François Lavallée, Thibaut Véry 35 / 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.

Pierre-François Lavallée, Thibaut Véry 36 / 77
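A small C sketch of two of these clauses (mine, not from the course; the array shape and values are illustrative): collapse(2) fuses both loop levels into one iteration space, and private(tmp) prevents a race on the scratch scalar.

```c
/* collapse(2) fuses the two loop levels into a single iteration space;
 * private(tmp) gives each iteration its own copy of the scratch scalar.
 * Without OpenACC support the pragma is ignored and the loops run
 * sequentially on the host with the same result. */
double collapse_demo(void)
{
    enum { N = 64 };
    static double a[N][N];
    double tmp;

    #pragma acc parallel loop collapse(2) private(tmp) copyout(a)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            tmp = i * (double)N + j;   /* would race without private() */
            a[i][j] = tmp;
        }
    return a[N - 1][N - 1];
}
```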

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-François Lavallée, Thibaut Véry 37 / 77

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-François Lavallée, Thibaut Véry 38 / 77

Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x number of workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the single worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x number of workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x number of workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x number of workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41 / 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both merged constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-François Lavallée, Thibaut Véry 42 / 77

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to get the final value of the variable.

Reduction operators

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-François Lavallée, Thibaut Véry 43 / 77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged ("coalesced"). This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs example without coalescing.]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          Without coalescing   With coalescing
Tps (ms)  439                  16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mappings for the three variants over array A. With gang(i)/vector(j), threads T0-T7 of a vector access elements that are not contiguous in memory (no coalescing). With gang(j)/vector(i), threads T0-T7 access contiguous elements (coalescing). With gang-vector(i)/seq(j), each thread walks sequentially over j while consecutive threads access contiguous elements in i.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l
!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel region is enclosed in its own data region.)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device - Optimizing transfers

[Figure: two successive parallel regions (each with its own implicit data region) using variables A, B, C and D with the clauses copyin, copyout, copy and create.]

• Naive version: each parallel region applies its own clauses, so every variable is handled twice. Transfers: 8, allocations: 2.
• Let's add a data region enclosing both parallel regions: A, B and C are now transferred at entry and exit of the data region only. Transfers: 4, allocations: 1.
• The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions; to make sure the updates are done where needed, use the update directive (update device, update self). Transfers: 6, allocations: 1.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
! $ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host right after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
Variables listed in device are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is that data transfers to the device can be managed at another place than where the data is used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: host/device time line for a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device. While the device computes, update self(b) copies the device value of b back to the host. At the end of the region, d is copied D→H; later host-side changes (for example to a) are not reflected on the device, and device-side values of a, b, c are discarded.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses with kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with "Kernel 1 and transfer async" they overlap; with everything async, all three run concurrently.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 runs on its own queue; Kernel 2 waits for the data transfer to complete before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• [Figure: example of code that has a race condition, caught by autocompare.]


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      ! (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Feature-by-feature comparison: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables (in both models)

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Data region
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP() / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=11.4_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=11.4_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=11.4_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k)=11.4_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k)=11.4_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr


Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Make sure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of the executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution: Add directives; Compute constructs (Serial, Kernels, Teams, Parallel)
  Work sharing: Parallel loop; Reductions; Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate the OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between a Fortran and a C/C++ code.

Fortran/OpenACC

!$ACC directive <clauses>
...
!$ACC end directive

C/C++/OpenACC

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs with one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s


Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region, not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions (gangs, workers, vector size) are independent
• The loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, which is kept for compatibility


Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• The iterations are shared among the active workers of each gang
• Active threads: 10 x number of workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• The iterations are shared among the threads of the single active worker of each gang
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• The iterations are shared on the whole grid of threads of all gangs
• Active threads: 10 x number of workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x number of workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x number of workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)


Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to get the final value of the variable.

Reduction operators

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: an example with coalescing and an example without coalescing]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three strategies. Without coalescing: gang(i), vector(j). With coalescing: gang(j), vector(i), or gang-vector(i), seq(j).]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
The offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with the lifetime of a programming unit
A data region is created when a procedure (function or subroutine) is called; it is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken.
You may see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it takes the default of the language (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

The compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region is enclosed in an implicit data region.


Data on device - Parallel regions

The compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure; by doing so you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No: you have to move the beginning of the data region before the first loop.


Data on device

Naive version

[Diagram: two successive parallel regions, each with its own implicit data region; both carry copyin(A), copyout(B), copy(C) and create(D).]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers

[Diagram: the same two parallel regions, still with copyin(A), copyout(B), copy(C) and create(D).]

Let's add a data region.


Data on device

Optimize data transfers

[Diagram: an enclosing data region carries copyin(A), copyout(B), copy(C) and create(D) around the two parallel regions.]

A, B and C are now transferred only at the entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers

[Diagram: the enclosing data region keeps the transfer clauses; the inner parallel regions use present for A, B, C and D, with an update device for A and an update self for C.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The array a is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The array a is updated on the host by the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at another place than where the data is used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
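The same unstructured lifetime can be sketched in C (the deck's examples are Fortran; the function names here are ours, not from the slides). Without an OpenACC compiler the pragmas are ignored and everything simply runs on the host, which is why the sketch is also valid serial C.

```c
#include <stdlib.h>

/* Hypothetical C analogue of exemples/enter_data.f90: the device copy of
 * vec is created by enter data and freed by exit data, away from the code
 * that uses it (handy in C++ constructors/destructors). */
double *alloc_on_device(int n) {
    double *vec = malloc((size_t)n * sizeof *vec);
    #pragma acc enter data create(vec[0:n])   /* device allocation, no transfer */
    return vec;
}

void compute(double *vec, int n) {
    #pragma acc parallel loop present(vec[0:n])
    for (int i = 0; i < n; ++i)
        vec[i] = 1.2;
    #pragma acc update self(vec[0:n])         /* results needed on the host */
}

void free_on_device(double *vec, int n) {
    #pragma acc exit data delete(vec[0:n])    /* drop the device copy */
    free(vec);
}
```
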

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: timeline of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are only allocated on the device; the kernels then update the device copies (b=2.6, d=1.58 in the example); an update self(b) during the region copies b D→H while d stays unchanged on the host; at exit, only d is copied D→H (copyout) and the other device copies are discarded.]

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
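A minimal C sketch of both update directions (our own example, not from the slides). With the pragmas compiled away, host and device views coincide, so the serial result matches the synchronized OpenACC one.

```c
#define N 1000

/* Sketch: inside a data region, update self(...) copies D->H and
 * update device(...) copies H->D. */
double a[N];

void fill_and_sync(void) {
    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = 42.0;
        /* Make the device values visible on the host before the region ends. */
        #pragma acc update self(a[0:N])
        /* ...host-side work on a..., then push the host change back: */
        a[0] = 1.0;
        #pragma acc update device(a[0:N])
    }
}
```
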

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: without async, Kernel 1, the data transfer and Kernel 2 run one after the other; with async on Kernel 1 and the transfer, they overlap; with async everywhere, all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1.
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1.
enddo

exemples/async_wait_slides.f90
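The async/wait pattern above can be sketched in C (our own example; queue numbers are illustrative). Compiled without OpenACC, the statements just execute in order, which is exactly the synchronized result the wait clause guarantees.

```c
#define N 1000

/* Sketch: the kernel on queue 1 and the H->D update on queue 2 may overlap;
 * the second loop must wait for queue 2 because it reads b. */
double a[N], b[N];

void async_pipeline(void) {
    for (int i = 0; i < N; ++i)        /* b initialized on the host */
        b[i] = (double)i;
    #pragma acc parallel loop async(1)
    for (int i = 0; i < N; ++i)
        a[i] = 42.0;
    #pragma acc update device(b[0:N]) async(2)
    #pragma acc parallel loop wait(2)  /* needs the transfer to be done */
    for (int i = 0; i < N; ++i)
        b[i] = b[i] + 1.0;
    #pragma acc wait                   /* join every queue before returning */
}
```
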


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of use: code that has a race condition.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
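The key idea — hoisting the data region outside the generation loop so arrays cross the PCIe bus once instead of every kernel launch — can be sketched in C (our own cut-down example; the update rule is elided and the sizes are ours). Serially, the pragmas compile away and the count is unchanged.

```c
#define ROWS 64
#define COLS 64

/* Sketch: one data region wraps all generations, so world/oworld are
 * transferred once; each parallel loop then reuses the device copies. */
int world[ROWS][COLS], oworld[ROWS][COLS];

int run(int generations) {
    int cells = 0;
    #pragma acc data copy(world, oworld)
    {
        for (int g = 0; g < generations; ++g) {
            cells = 0;
            #pragma acc parallel loop
            for (int r = 0; r < ROWS; ++r)
                for (int c = 0; c < COLS; ++c)
                    oworld[r][c] = world[r][c];
            /* (Game-of-Life update rule elided in this sketch.) */
            #pragma acc parallel loop reduction(+:cells)
            for (int r = 0; r < ROWS; ++r)
                for (int c = 0; c < COLS; ++c)
                    cells += world[r][c];
        }
    }
    return cells;
}
```
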

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes instructions redundantly) = OpenMP TEAMS (creates teams of threads, executes redundantly)
• OpenACC SERIAL (only 1 thread executes the code) = OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) = does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables = OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses of PARALLEL/SERIAL/KERNELS = OpenMP: clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) = map(to:)
• copyout() (copy D→H when leaving the construct) = map(from:)
• copy() (copy H→D when entering and D→H when leaving) = map(tofrom:)
• create() (created on device, no transfers) = map(alloc:)
• present(), delete()

Data region
• OpenACC declare() (add data to the global data region)
• OpenACC data / end data (inside the same programming unit) = OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) = OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) = OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) = OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) = does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) = OpenMP distribute (share iterations among TEAMS)

Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously, explicit sync) = OpenMP nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
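The same OpenACC/OpenMP correspondence holds in C (a sketch with our own function names; built without offload support, both pragmas are ignored and the loops run identically on the host).

```c
/* The 1D loop of the table written with both models. */
void fill_acc(int n, double *a) {
    #pragma acc parallel loop gang worker vector copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}

void fill_omp(int n, double *a) {
    #pragma omp target teams distribute parallel for simd map(from:a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}
```
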


Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr

Page 24: Introduction to OpenACC and OpenMP GPU - IDRIS

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector length x nb workers]).

Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• The option "--cpu-profiling on" activates CPU profiling.
• The options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give, respectively, the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

[Screenshot: the NVVP timeline view]


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:
!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:
#pragma acc directive <clauses>
...

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation, the examples are in Fortran.
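For the C/C++ form, the pragma applies to the statement or block that follows it. A minimal sketch (names are ours; compiled without OpenACC support the pragma is ignored and the loop runs sequentially on the host, which is why the function remains plain valid C):

```c
/* Offload (or, serially, just run) a scaling loop. */
void scale(int n, const double *b, double *a) {
    #pragma acc kernels loop copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}
```
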

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions met in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
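The same tuning clauses can be written in C (a sketch; the clause values are illustrative, not a recommendation). With the pragma compiled away, the loop is plain serial C; with OpenACC, the runtime would launch 10 gangs of 1 worker x 128 vector lanes.

```c
/* Sketch of explicit launch-configuration clauses on a parallel loop. */
void init(int n, int *a) {
    #pragma acc parallel loop num_gangs(10) num_workers(1) vector_length(128)
    for (int i = 0; i < n; ++i)
        a[i] = i;
}
```
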

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.

Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators

Operation  Effect       Language(s)
+          Sum          Fortran, C(++)
*          Product      Fortran, C(++)
max        Maximum      Fortran, C(++)
min        Minimum      Fortran, C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
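A short C sketch of two of these operators combined on one loop (our own example). Each thread would get private copies of s and m, merged with + and max at the end of the loop; a serial build (pragma ignored) computes the same values.

```c
/* Sum and maximum of an array via reduction clauses. */
void sum_and_max(int n, const double *a, double *sum, double *mx) {
    double s = 0.0, m = a[0];
    #pragma acc parallel loop reduction(+:s) reduction(max:m)
    for (int i = 0; i < n; ++i) {
        s += a[i];
        if (a[i] > m) m = a[i];
    }
    *sum = s;
    *mx = m;
}
```
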

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)        43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: memory-access patterns of threads T0–T7 across the array]
• Without coalescing: gang(i), vector(j)
• With coalescing: gang(j), vector(i), or gang vector(i) with seq(j)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is open during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 (included) back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region has an associated (implicit) data region.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Optimal? No! You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions (each with its own implicit data region), with the clauses copyin(A), copyout(B), copy(C) and create(D) applied to each region.
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the same two parallel regions (each with its own implicit data region), each still carrying copyin(A), copyout(B), copy(C) and create(D). Let's add an enclosing data region.

Data on device

Optimize data transfers: the clauses copyin(A), copyout(B), copy(C) and create(D) are moved to the enclosing data region. A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers: inside the data region the clauses only check for data presence, so to make the code clearer you can use the present clause in the parallel regions. To make sure the updates are done, use the update directive (update device before a region that must see host changes, update self after a region whose results the host needs).
• Transfers: 6
• Allocations: 1

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not up to date on the host before the end of the data region.

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is up to date on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes: one benefit of using the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d)]
• Entering the region: a and c are copied H→D; b and d are allocated on the device without transfer
• The device updates its copies (here b=26, d=158) while the host copies are unchanged
• update self(b) copies the device value of b back to the host during the region
• Host-side modifications made inside the region are not visible on the device
• Exiting the region: d is copied D→H (copyout)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with async on Kernel 1 and the transfer they overlap; with everything async all three overlap]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add wait(execution-thread number) to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the data transfer to complete before starting]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode, each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization): does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0

Data region
• declare(): add data to the global data region
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(...) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of the loops): does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait (execute kernels/transfers asynchronously, explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • Kernels/Teams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 2.7/OpenMP 5.0 translation table
                      • Contacts

Analysis - Code profiling: nvprof
Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities   53.51%  2.70662s    500  5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%  1.85742s    300  6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp, specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Figure: NVVP screenshot]

1 Introduction

2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs
      Serial
      Kernels/Teams
      Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called computeconstruct)

serial

The code runs one singlethread (1 gang with 1 worker ofsize 1) Only one kernel is gen-erated

kernels

The compiler analyzes the code and decides where parallelism is generated.
One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed only by one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: Default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (acc=noautopar / acc=autopar)
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77

Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop).
It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example with kernels directive

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region an operation among those described in the table below is executed to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)

Operation   Effect        Language(s)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7 reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «Coalescing» / Example with «no Coalescing»

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without Coalescing   With Coalescing
Tps (ms)        43.9                 1.6

The memory coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads to array elements for the three variants — without coalescing: gang(i), vector(j); with coalescing: gang(j), vector(i) and gang-vector(i), seq(j)]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device; the clauses apply to a variable, scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop, so that its transfer is covered as well.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version: two successive parallel regions (each with its implicit data region) transfer the arrays A, B, C and D themselves — each region performs its own copyin, copyout, copy and create.
• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: with the two parallel regions (and their data regions) still carrying their own copyin, copyout, copy and create clauses for A, B, C and D, nothing improves yet. Let's add a data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: with an enclosing data region carrying the copyin, copyout, copy and create clauses, A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: inside the enclosing data region, the parallel regions now use present clauses, and explicit transfers are requested with update device (H→D) and update self (D→H). The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Time line figure: a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D while b and d are allocated on the device. During the region, update self(b) copies the device value of b back to the host. On exit of the data region, only d is copied back D→H (copyout); other host and device values stay out of sync unless updated.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: without async, Kernel 1, the data transfer and Kernel 2 run one after the other. With "Kernel 1 and transfer async", Kernel 1 overlaps with the data transfer. With everything async, Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Diagram: Kernel 1 runs on execution thread 1 while the data transfer runs on thread 2; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.


Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features | OpenACC 2.7 | OpenMP 5.0
GPU offloading | PARALLEL: GR/WS/VS mode, each gang executes the instructions redundantly | TEAMS: create teams of threads, execute redundantly
 | SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only
 | KERNELS: offloading + automatic parallelization | does not exist in OpenMP
Implicit transfer H→D or D→H | scalar variables, non-scalar variables | scalar variables, non-scalar variables
Explicit transfer H→D or D→H inside explicit parallel regions | clauses of PARALLEL/SERIAL/KERNELS | clauses of TARGET: data movement defined from the host perspective
 | copyin(): copy H→D when entering the construct | map(to:)
 | copyout(): copy D→H when leaving the construct | map(from:)
 | copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
 | create(): created on the device, no transfers | map(alloc:)
 | present(), delete() |

Features | OpenACC 2.7 | OpenMP 5.0
Data region | declare: add data to the global data region |
 | data / end data: inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
 | enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
 | update self(): update D→H | UPDATE FROM(): copy D→H inside a data region
 | update device(): update H→D | UPDATE TO(): copy H→D inside a data region
Loop parallelization | kernels: offloading + automatic parallelization of suitable loops, sync at the end of the loops | does not exist in OpenMP
 | loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS

Features | OpenACC 2.7 | OpenMP 5.0
Routines | routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism |
Asynchronism | async()/wait: execute kernels/transfers asynchronously, with explicit sync | nowait/depend: manage asynchronism and task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Analysis - Graphical profiler nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
  Work sharing: Parallel loop, Reductions, Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.


Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region that are not specified in a data clause (present, copyin, copyout, etc.) or in a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123640 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region that are not specified in a data clause (present, copyin, copyout, etc.) or in a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive have their iterations spread among the gangs.

Data: default behavior
• Arrays present inside the region that are not specified in a data clause (present, copyin, copyout, etc.) or in a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2  ! Gang-redundant
!$ACC loop  ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, the number of workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect the dependencies

            | Work sharing (loop iterations) | Activation of new threads
loop gang   | X                              |
loop worker | X                              | X
loop vector | X                              | X


Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable list)
• Reduction: reduction(operation:variable list)

Notes:
• You might encounter the directive do, kept for compatibility.


Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into kernels loop or parallel loop. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is applied to obtain the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions: the variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged (coalesced). This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: an example with coalescing and an example without coalescing.]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          | Without coalescing | With coalescing
Time (ms) | 439                | 16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Diagrams: thread-to-element mapping for the three variants above. Without coalescing: gang(i), vector(j). With coalescing: gang(j), vector(i), and gang-vector(i), seq(j).]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

With coalescing

gang(j)vector(i)T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

T1

T1

X

X

T2

T2

X

X

T3

T3

X

X

T4

T4

X

X

T5

T5

X

X

T6

T6

X

X

T7

T7

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

X

X

T1

T1

X

X

X

X

T2

T2

X

X

X

X

T3

T3

X

X

X

X

T4

T4

X

X

X

X

T5

T5

X

X

X

X

T6

T6

X

X

X

X

T7

T7

X

X

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
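The examples above are Fortran, where arrays are column-major. In C/C++ arrays are row-major, so the contiguous index is the last one and the index choice is reversed. A hedged sketch (array and function names are hypothetical, not from the course; without OpenACC support the pragmas are ignored and the loop runs on the host):

```cpp
#include <cstddef>

constexpr int NX = 64;

// In C/C++ the LAST index is contiguous, so coalesced accesses are
// obtained when the vector loop runs over j here -- the opposite
// index choice from the Fortran examples above.
void fill_coalesced(double (&a)[NX][NX]) {
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j) {
            a[i][j] = 1.14;  // neighbouring j values sit at neighbouring addresses
        }
    }
}
```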

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

Correct version, with the routine directive inside the subroutine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
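A C/C++ analogue of the Fortran example could look like this sketch (names hypothetical). The acc routine seq pragma on the callee asks the compiler to also generate a sequential device version of it:

```cpp
// The callee is compiled for the device too; "seq" declares that it
// contains no parallelism of its own.
#pragma acc routine seq
void fill_row(int *array, int n, int i) {
    for (int j = 0; j < n; ++j)
        array[i * n + j] = 2;  // row i of a flattened n x n array
}

void fill_all(int *array, int n) {
    #pragma acc parallel loop copyout(array[0:n*n])
    for (int i = 0; i < n; ++i)
        fill_row(array, n, i);  // one device-side call per iteration
}
```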

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p, for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
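As an illustration of the movement clauses in C/C++ (a sketch, not from the course examples; names hypothetical — when OpenACC is not enabled the pragmas are ignored and the code runs on the host):

```cpp
// x is only read on the device (copyin), y is only produced (copyout),
// and tmp is device-only scratch storage (create: allocation, no transfer).
void axpy1(const double *x, double *y, double a, int n) {
    double *tmp = new double[n];
    #pragma acc data copyin(x[0:n]) copyout(y[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop  // x, y, tmp are already present here
        for (int i = 0; i < n; ++i)
            tmp[i] = a * x[i];
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = tmp[i] + 1.0;
    }
    delete[] tmp;
}
```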

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered to be the default of the language (C/C++: 0, Fortran: 1)
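For instance, with a dynamically allocated C/C++ array the compiler cannot deduce the extent, so the element count must be written out (a minimal sketch, names hypothetical):

```cpp
// p points to heap storage: its shape is unknown to the compiler, so
// p[0:n] (first index 0, n elements) is mandatory in the data clause.
void zero_fill(double *p, int n) {
    #pragma acc parallel loop copyout(p[0:n])
    for (int i = 0; i < n; ++i)
        p[i] = 0.0;
}
```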

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l
  !$ACC parallel copyout(a(1:1000))  ! opens both the data region and the parallel region
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data on device - Parallel regions

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: each of the two parallel regions carries its own implicit data region, so the clauses (copyin for A, copyout for B, copy for C, create for D) are applied at each region.
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the same clauses (copyin, copyout, copy, create) are repeated on both parallel regions. Let's add a data region.

Data on device

With a data region enclosing both parallel regions, A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device at entry, update self at exit).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
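A minimal C/C++ sketch of the device clause (hypothetical names, not a course example): after the host modifies b inside the data region, update device pushes the new values H→D before the next kernel reads them.

```cpp
// On a real device run, omitting the update would make the kernel sum
// the stale values copied in when the region opened; sequentially the
// pragmas are simply ignored.
double sum_with_refresh(double *b, int n) {
    double s = 0.0;
    #pragma acc data copyin(b[0:n]) copy(s)
    {
        for (int i = 0; i < n; ++i)  // host-side modification
            b[i] += 1.0;
        #pragma acc update device(b[0:n])  // H -> D
        #pragma acc parallel loop reduction(+:s)
        for (int i = 0; i < n; ++i)
            s += b[i];
    }
    return s;
}
```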

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
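The C++ constructor/destructor use case mentioned above could look like this sketch (class and member names are hypothetical): the device allocation follows the object's lifetime.

```cpp
struct DeviceBuffer {
    double *data;
    int n;
    explicit DeviceBuffer(int size) : data(new double[size]), n(size) {
        #pragma acc enter data create(data[0:n])  // allocate on the device
    }
    ~DeviceBuffer() {
        #pragma acc exit data delete(data[0:n])   // free on the device
        delete[] data;
    }
    void fill(double v) {
        #pragma acc parallel loop present(data[0:n])
        for (int i = 0; i < n; ++i)
            data[i] = v;
        #pragma acc update self(data[0:n])        // D -> H for the host copy
    }
};
```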

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b and d are allocated on the device; kernels then modify the device copies; update self(b) copies b D→H in the middle of the region; at exit of the data region, only d is copied D→H, while the device-side changes to a, b and c are not.]
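The time line can be sketched in code (a hedged C/C++ illustration with invented values, not the slide's program):

```cpp
// copyin(a, c): H->D at entry; create(b): device allocation only;
// copyout(d): D->H at exit. update self(b) pulls b back mid-region.
void timeline(double &a, double &b, double &c, double &d) {
    #pragma acc data copyin(a, c) create(b) copyout(d)
    {
        #pragma acc serial
        {
            b = a + c;   // b exists only on the device until updated
            d = a * 2.0;
        }
        #pragma acc update self(b)  // D -> H: the host now sees b
    }
    // here d holds the device result; a and c were never copied back
}
```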

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer on execution thread 2 before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example: code that has a race condition (shown on the slide)
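The slide's code is only a screenshot; as a hypothetical stand-in, here is the kind of loop-carried dependence autocompare would flag. Run in parallel on the GPU, the updates to total race, so the GPU and CPU results diverge and execution stops:

```cpp
// WRONG on purpose: "total" is carried across iterations, so the loop
// is not parallel as written (it would need reduction(+:total)).
double racy_sum(const double *x, int n) {
    double total = 0.0;
    #pragma acc parallel loop copyin(x[0:n])
    for (int i = 0; i < n; ++i)
        total = total + x[i];  // data race on the device
    return total;
}
```

Executed sequentially (pragmas ignored), the function still returns the correct sum, which is exactly why only a GPU/CPU comparison reveals the bug.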

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• OpenACC PARALLEL (GR, WS, VS mode: each gang executes instructions redundantly)  |  OpenMP TEAMS (creates teams of threads, which execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code)  |  OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization)  |  no OpenMP equivalent

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables  |  OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (clauses of PARALLEL/SERIAL/KERNELS in OpenACC, of TARGET in OpenMP; data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct)  |  map(to:)
• copyout() (copy D→H when leaving the construct)  |  map(from:)
• copy() (copy H→D when entering and D→H when leaving)  |  map(tofrom:)
• create() (created on device, no transfers)  |  map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions
• declare (add data to the global data region)
• data ... end data (inside the same programming unit)  |  TARGET DATA MAP(tofrom:) ... END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code)  |  TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H)  |  UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D)  |  UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops)  |  no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors)  |  distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync)  |  nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts
Page 27: Introduction to OpenACC and OpenMP GPU - IDRIS

Analysis - Graphical profiler NVVP

[Screenshot of the NVVP graphical interface]

1 Introduction

2 Parallelism
  Offloaded Execution: Add directives, Compute constructs (Serial, Kernels/Teams, Parallel)
  Work sharing: Parallel loop, Reductions, Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (with compiler option acc=noautopar):
• Size: 1000×1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC&         vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  – gang: the iterations of the subsequent loop are distributed block-wise among the gangs (going from GR to GP)
  – worker: within the gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  – vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  – seq: the iterations are executed sequentially on the accelerator
  – auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect the dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.


Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to those private copies to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
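The table lists the C(++) operators but the course only shows a Fortran reduction; here is a minimal C++ sketch (not from the course) combining two of them on one loop. Without an OpenACC compiler the pragma is ignored and the loop runs sequentially, which produces the same result as the offloaded version.

```cpp
#include <cassert>

// Sketch: sum (+) and max reductions from the table above applied to
// an integer array. Each gang/worker/vector thread gets a private
// copy of `sum` and `mx`; the final values are combined at the end
// of the loop. With the pragma ignored, the loop is just sequential.
int sum_and_max(const int *v, int n, int *max_out) {
    int sum = 0;
    int mx  = v[0];
    #pragma acc parallel loop reduction(+:sum) reduction(max:mx)
    for (int i = 0; i < n; ++i) {
        sum += v[i];
        if (v[i] > mx) mx = v[i];
    }
    *max_out = mx;
    return sum;
}
```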


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.
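The next slide shows the Fortran variants; as a complement, here is a C++ sketch (an assumption, not course material) of the same idea. C arrays are row-major, the opposite of Fortran, so in C the inner (vector) loop must run over the last index for consecutive threads to touch consecutive addresses. With the pragmas ignored both variants fill the array identically; on a GPU only the first one coalesces.

```cpp
#include <cassert>

const int N = 64;

// Coalesced in C: the vector loop runs over j, the last index, so
// thread j and thread j+1 touch adjacent addresses a[i][j], a[i][j+1].
void fill_coalesced(double a[N][N]) {
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}

// Not coalesced in C: the vector loop runs over i, the first index,
// so consecutive threads are separated by a stride of N doubles.
void fill_uncoalesced(double a[N][N]) {
    #pragma acc parallel loop gang
    for (int j = 0; j < N; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < N; ++i)
            a[i][j] = 1.14;
    }
}
```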

Example with «coalescing» — Example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)          439                 16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: memory-access diagrams for the three variants above. Without coalescing, gang(i)/vector(j): consecutive threads T0...T7 access memory locations that are far apart, so the accesses cannot be merged. With coalescing, gang(j)/vector(i): consecutive threads access consecutive memory locations and the accesses are merged. gang-vector(i)/seq(j): each thread walks sequentially through its own row of the array.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector). Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector). Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You might see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
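The clauses above can be combined on one construct; here is a minimal C++ sketch (an assumption, not course material) of the three most common ones on a saxpy-style loop. Compiled without OpenACC the pragma is ignored and the sequential result is identical.

```cpp
#include <cassert>

// Sketch: x is only read on the device (copyin, H->D only), y is read
// and written (copy, H->D then D->H). A scratch array would instead
// use create (allocation only, no transfer). With the pragma ignored,
// the loop simply runs on the host.
void saxpy(int n, double alpha, const double *x, double *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```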


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


No! You have to move the beginning of the data region before the first loop.

Data on device

[Figure: two parallel regions, each with its own implicit data region, using four arrays: A (copyin), B (copyout), C (copy) and D (create).]

Naive version: each parallel region transfers its own data.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Inside the data region the clauses of the parallel regions only check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive (update device after the copyin, update self before the copyout).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC&          copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC&          copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
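No example is shown for this direction on the slide, so here is a C++ sketch (an assumption, not course material): the host changes a value inside a data region, and update device pushes it H→D before the next kernel reads it. Compiled without OpenACC the pragmas are ignored and the result is the same.

```cpp
#include <cassert>

// Sketch: `factor` is copied in once; after the first kernel the host
// changes it, so an `update device` is needed for the second kernel
// to see the new value. Run sequentially, the code computes a[i] * 2
// then * 10 for every element.
void scale_twice(int n, double *a) {
    double factor = 2.0;
    #pragma acc data copy(a[0:n]) copyin(factor)
    {
        #pragma acc parallel loop present(a, factor)
        for (int i = 0; i < n; ++i) a[i] *= factor;

        factor = 10.0;                     // changed on the host only
        #pragma acc update device(factor)  // push the new value H->D
        #pragma acc parallel loop present(a, factor)
        for (int i = 0; i < n; ++i) a[i] *= factor;
    }
}
```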


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data is used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
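The C++ constructor/destructor use case mentioned in the notes can be sketched as follows (hypothetical class, not from the course): enter data in the constructor and exit data in the destructor tie the device allocation to the object's lifetime. Without an OpenACC compiler the pragmas are ignored and everything runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical RAII wrapper: the constructor mirrors the host vector
// on the device with `enter data create`, the destructor releases it
// with `exit data delete`, so the device copy lives exactly as long
// as the object.
class DeviceVector {
public:
    explicit DeviceVector(std::size_t n) : data_(n, 0.0) {
        double *p = data_.data();
        (void)p;  // only used by the pragma when OpenACC is enabled
        #pragma acc enter data create(p[0:n])
    }
    ~DeviceVector() {
        double *p = data_.data();
        std::size_t n = data_.size();
        (void)p; (void)n;
        #pragma acc exit data delete(p[0:n])
    }
    void fill(double v) {
        double *p = data_.data();
        std::size_t n = data_.size();
        #pragma acc parallel loop present(p[0:n])
        for (std::size_t i = 0; i < n; ++i) p[i] = v;
    }
    double front() const { return data_.front(); }
private:
    std::vector<double> data_;
};
```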


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). On the host: a=10, b=-4, c=42, d=0. At entry, a and c are copied H→D while b and d are only allocated on the device. The device computes b=26 and d=158; update self(b) brings b=26 back to the host during the region; at exit of the region, copyout(d) transfers d=158 to the host.]

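The time line above can be written as code. Here is a minimal C++ sketch (an assumption, not the course's example) with the same clauses and values; compiled without OpenACC the pragmas are ignored, so the host sees every value, while on a GPU only b (via update self) and d (via copyout) would come back.

```cpp
#include <cassert>

// Sketch of the time line: copyin(a,c) create(b) copyout(d), an
// `update self(b)` in the middle, and the copyout of d at region
// exit. The numeric values follow the figure.
void timeline(double *a, double *b, double *c, double *d) {
    *a = 10.0; *b = -4.0; *c = 42.0; *d = 0.0;  // host initialization
    #pragma acc data copyin(a[0:1], c[0:1]) create(b[0:1]) copyout(d[0:1])
    {
        #pragma acc serial present(a, b, c, d)
        {
            *b = 26.0;             // computed on the device
            *d = 158.0;
        }
        #pragma acc update self(b[0:1])  // b: D->H during the region
    }   // region exit: d is copied out D->H
}
```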

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)

! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines showing Kernel 1, a data transfer and Kernel 2. Fully synchronous: they run one after the other. Kernel 1 and the transfer async: they overlap. Everything async: all three may overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
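The same async/wait pattern can be sketched in C++ (an assumption, not course material): queue 2 uploads b while queue 1 fills a, and the last loop waits for queue 2 before reading b. Compiled without OpenACC the pragmas are ignored, execution is simply sequential and the result is identical.

```cpp
#include <cassert>

// Sketch: two asynchronous execution threads (queues 1 and 2) and a
// wait(2) on the kernel that depends on the upload of b. The update
// self at the end brings the results back for checking.
void async_wait(int n, double *a, double *b) {
    for (int i = 0; i < n; ++i) b[i] = i;   // b initialized on host
    #pragma acc data create(a[0:n]) copyin(b[0:n])
    {
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i) a[i] = 42.0;
        #pragma acc update device(b[0:n]) async(2)
        #pragma acc parallel loop wait(2)   // needs the upload of b
        for (int i = 0; i < n; ++i) b[i] = b[i] + 11.0;
        #pragma acc update self(a[0:n], b[0:n])
    }
}
```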



Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation option (-ta=tesla:cc60:autocompare)
• Example shown: a code that has a race condition



Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info:
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemplesoptigol_opti2f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts


• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
  • Parallelism
    • Offloaded Execution
      • Add directives
      • Compute constructs
      • Serial
      • Kernels/Teams
      • Parallel
    • Work sharing
      • Parallel loop
      • Reductions
      • Coalescing
    • Routines
    • Data Management
    • Asynchronism
  • GPU Debugging
  • Annex: Optimization example
  • Annex: OpenACC 2.7 / OpenMP 5.0 translation table
  • Contacts

1 Introduction

2 Parallelism
  Offloaded Execution

    Add directives, Compute constructs (Serial, Kernels/Teams, Parallel)

  Work sharing (Parallel loop, Reductions, Coalescing)

  Routines

  Data Management, Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.


Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.

• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123640 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.

• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.

• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                            X
loop vector                 X                            X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility
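A small illustration of how these clauses combine (an assumed C example, not from the slides; with a plain C compiler the pragma is ignored and the function still computes the same value):

```c
/* Sums a 0/1 checkerboard pattern over an n-by-n domain.
   collapse(2) merges both loops into a single iteration space,
   tmp is privatized so each thread has its own copy, and the
   partial sums are combined with a + reduction at the end. */
double checkerboard_sum(int n) {
    double total = 0.0;
    double tmp;
    #pragma acc parallel loop collapse(2) private(tmp) reduction(+:total)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            tmp = (i + j) % 2;   /* 1 on "black" squares, 0 on "white" */
            total += tmp;
        }
    }
    return total;
}
```

For an even n the result is n*n/2, whether the loop runs offloaded or serially.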


Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 * nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 * workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 * vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 * workers * vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 * workers
• Number of operations: 10 * nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 * vector length
• Number of operations: 10 * nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and its grid of threads distributes the work
• Active threads: 10 * workers * vector length
• Number of operations: 10 * nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels

   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Operations

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «Coalescing»    Example with «no Coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without Coalescing   With Coalescing
Time (ms)         439                  16

The memory coalescing version is 27 times faster.
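The same idea transposes to C, where the rightmost index is the contiguous one (the mirror of the Fortran listings above). A sketch with an assumed array size:

```c
#define NX 64
static double A[NX][NX];

/* Coalesced-friendly order for C: the vector loop runs over j, so the
   threads of a worker touch A[i][j], A[i][j+1], ... which sit next to
   each other in memory. With a plain C compiler the pragmas are ignored
   and the loops run serially. */
void init_coalesced(void) {
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; i++) {
        #pragma acc loop vector
        for (int j = 0; j < NX; j++)
            A[i][j] = 1.14;
    }
}
```

Swapping the two loops (vector over i) would make each thread stride by NX doubles and lose the coalescing.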


Parallelism - Memory accesses and coalescing

[Diagram slides: thread-to-element access patterns for the three variants. Without coalescing, gang(i)/vector(j): the threads T0..T7 of a worker access elements that are far apart in memory. With coalescing, gang(j)/vector(i): the threads T0..T7 access contiguous elements. gang-vector(i)/seq(j): each thread walks its own row sequentially.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Diagram: the parallel region is enclosed in its own (implicit) data region.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

No! You have to move the beginning of the data region to before the first loop, so that the copyout of the first parallel region is also covered by it.

Data on device - Optimizing transfers (diagram summary)

Naive version: two successive parallel regions, each with its own implicit data region. The four variables are handled twice: A (copyin), B (copyout), C (copy) and D (create) in each region.
• Transfers: 8
• Allocations: 2

Let's add a data region enclosing both parallel regions: A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1

Since the clauses check for data presence, to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
!!$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line (diagram summary)

The figure follows four variables through a region opened with copyin(a,c) create(b) copyout(d):
• Data transfer to the device: at entry, a and c are copied H→D; b and d are only allocated on the device.
• While kernels run, the host and device copies of a variable evolve independently.
• Update data D→H: update self(b) copies the device value of b back to the host.
• Exit data region: d is copied D→H (copyout); the device copies of a, b and c are discarded, so device-side changes to them are lost unless updated beforehand.

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

Timeline (diagram summary): executed synchronously, Kernel 1, the data transfer and Kernel 2 follow one another; with "Kernel 1 and transfer async", Kernel 1 and the data transfer overlap; with everything async, Kernel 1, the transfer and Kernel 2 all run concurrently.

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

Timeline (diagram summary): Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete.

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example: code that has a race condition.


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• Both models handle scalar variables and non-scalar variables implicitly.

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses of PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0

Data region
• OpenACC declare() (adds data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0

Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously, explicit sync) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 29: Introduction to OpenACC and OpenMP GPU - IDRIS

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC)
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions met in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang                  X                 |
loop worker                X                 |            X
loop vector                X                 |            X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do instead of loop, kept for compatibility

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR-WS-VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 nx

Execution GP-WS-VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP-WP-VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 * workers
• Number of operations: nx

Execution GP-WS-VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 * vector length
• Number of operations: nx

Execution GP-WP-VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 * workers * vector length
• Number of operations: nx

Execution GR-WP-VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 * workers
• Number of operations: 10 nx

Execution GR-WS-VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 * vector length
• Number of operations: 10 nx

Execution GR-WP-VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and its grid of threads distributes the work
• Active threads: 10 * workers * vector length
• Number of operations: 10 nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code that follows the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to obtain the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
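As a sketch, the same reduction pattern can be written in C. With an OpenACC compiler (e.g. PGI/NVHPC) the partial sums of the private copies are combined on the device; with any other compiler the unknown pragma is ignored and the loop simply runs sequentially, with the same result:

```c
#include <assert.h>

/* Sum of squares with an OpenACC "+" reduction: each thread keeps a
   private copy of `sum`, and the partial sums are combined when the
   loop finishes. Without OpenACC support the pragma is ignored and
   the loop runs sequentially on the host. */
long sum_of_squares(int n)
{
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += (long)i * i;
    return sum;
}
```

The result is identical whether or not the loop is offloaded, which is precisely what the reduction clause guarantees.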

Parallelism - Reductions example

!$ACC parallel
do g=1, generations
!$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged; this optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

[Figure: example with «coalescing» vs example with «no coalescing»]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure, shown in several animation steps: access patterns of threads T0–T7 on a 2D array. Without coalescing — gang(i), vector(j) — consecutive threads of a worker touch non-contiguous memory locations. With coalescing — gang(j), vector(i) — consecutive threads touch contiguous locations; the gang-vector(i), seq(j) variant also yields contiguous accesses.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and it is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
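A local (structured) data region can be sketched in C as follows; with an OpenACC compiler the array lives on the device across both kernels, while with any other compiler the pragmas are ignored and the code runs sequentially on the host with the same result:

```c
#include <assert.h>

#define N 1000

/* A structured data region: `a` is allocated on the device once,
   used by two offloaded loops, then copied back a single time at
   the closing brace (copyout). Ignored pragmas leave two plain
   host loops. */
void fill_and_increment(int *a)
{
    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = i;

        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] += 1;
    }
}
```

Keeping both loops inside one data region avoids a round trip of `a` between the two kernels.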

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check if the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
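A minimal sketch of copyin versus copy on an axpy-style kernel: the input vector is only read on the device (copyin), while the output vector is both read and written (copy). With the pragma ignored by a non-OpenACC compiler, the loop runs on the host with identical results:

```c
#include <assert.h>

#define N 4

/* y = alpha*x + y: `x` needs only an H->D transfer (copyin),
   `y` needs H->D at entry and D->H at exit (copy). */
void axpy(double alpha, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

Choosing the weakest sufficient clause (copyin instead of copy for read-only data) halves the traffic for that variable.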

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
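The C/C++ restriction above can be illustrated on a dynamically allocated array, where the `[start:count]` shape is mandatory because the compiler cannot deduce the length from the pointer. As before, a non-OpenACC compiler ignores the pragmas and runs the loops on the host:

```c
#include <assert.h>
#include <stdlib.h>

/* `v` is heap-allocated, so the transfer length must be given
   explicitly: v[0:n] means "n elements starting at index 0". */
double sum_init(int n)
{
    double *v = malloc(n * sizeof *v);
    double s = 0.0;

    #pragma acc parallel loop copyout(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = 1.0;

    #pragma acc parallel loop reduction(+:s) copyin(v[0:n])
    for (int i = 0; i < n; ++i)
        s += v[i];

    free(v);
    return s;
}
```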

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region implicitly opens a data region.]

Data on device - Parallel regions

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No: you have to move the beginning of the data region to before the first loop.

Data on device

[Figure, shown in several steps: two parallel regions, each with its implicit data region, using variables A (copyin), B (copyout), C (copy) and D (create).]

Naive version — each parallel region handles its own transfers:
• Transfers: 8
• Allocations: 2

Let's add a data region enclosing both parallel regions. A, B and C are then transferred only at entry and exit of the data region:
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device, update self):
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90), the a array is not up to date on the host before the end of the data region. With it, a(42) holds the device value on the host right after the update self directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
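Both update directions can be sketched together in C. This is a hypothetical round trip, assuming the arrays are made present by the enclosing data region; with a non-OpenACC compiler the pragmas are ignored, execution stays on the host, and the assertions below hold either way:

```c
#include <assert.h>

#define N 8

/* Inside the data region, `update self` refreshes the host copy of
   `a` (D -> H), and `update device` refreshes the device copy of `b`
   (H -> D) after a host-side modification. */
void refresh(int *a, int *b)
{
    #pragma acc data copy(a[0:N], b[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = 2 * b[i];

        #pragma acc update self(a[0:N])   /* D -> H */

        for (int i = 0; i < N; ++i)       /* host-side change */
            b[i] += 1;

        #pragma acc update device(b[0:N]) /* H -> D */
    }
}
```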

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is that the data transfers to the device can be managed at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: timeline of host and device copies for a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device without transfer. The device then computes new values. update self(b) copies the device value of b back to the host. When the region is exited, d is copied D→H (copyout) and the device copies are released; the other host variables keep their pre-region values unless explicitly updated.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs "Kernel 1 / Data transfer / Kernel 2" back to back; making the first kernel and the transfer async overlaps them; making everything async overlaps all three.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer issued on execution thread 2.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
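The same async/wait pattern can be sketched in C. This mirrors the Fortran fragment above and assumes `b` is made present by an enclosing data region when a real OpenACC compiler is used; with the pragmas ignored, execution is sequential and the result is identical:

```c
#include <assert.h>

#define N 100

/* The kernel on queue 1 and the update on queue 2 may run
   concurrently; the last loop waits for queue 2 before reading `b`,
   and the final `wait` drains all queues. */
void pipeline(int *a, int *b)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < N; ++i)
        a[i] = 42;

    #pragma acc update device(b[0:N]) async(2)

    #pragma acc parallel loop wait(2)
    for (int i = 0; i < N; ++i)
        b[i] = b[i] + 1;

    #pragma acc wait
}
```

Only the dependency that actually exists (kernel 3 on the transfer of `b`) is synchronized; the first kernel stays free to overlap.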


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the target options at compilation (-ta=tesla:cc60,autocompare).

[Screenshot: example run on a code that has a race condition.]


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
!$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel loop
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading
  OpenACC: PARALLEL — GR-WS-VS mode; each gang executes the instructions redundantly
  OpenMP:  TEAMS — creates teams of threads that execute redundantly
  OpenACC: SERIAL — only 1 thread executes the code
  OpenMP:  TARGET — GPU offloading, initial thread only
  OpenACC: KERNELS — offloading + automatic parallelization
  OpenMP:  — (does not exist in OpenMP)

Implicit transfer H→D or D→H
  OpenACC: scalar variables, non-scalar variables
  OpenMP:  scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
  OpenACC: clauses of PARALLEL/SERIAL/KERNELS
  OpenMP:  clauses of TARGET (data movement defined from the host perspective)

  copyin()  — copy H→D when entering the construct             → map(to:)
  copyout() — copy D→H when leaving the construct              → map(from:)
  copy()    — copy H→D when entering and D→H when leaving      → map(tofrom:)
  create()  — created on device, no transfers                  → map(alloc:)
  present(), delete()

Data region
  OpenACC: declare() — add data to the global data region
  OpenACC: data / end data — inside the same programming unit
  OpenMP:  TARGET DATA MAP(to:/from:) / END TARGET DATA — inside the same programming unit
  OpenACC: enter data / exit data — opening and closing anywhere in the code
  OpenMP:  TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
  OpenACC: update self() — update D→H
  OpenMP:  UPDATE FROM — copy D→H inside a data region
  OpenACC: update device() — update H→D
  OpenMP:  UPDATE TO — copy H→D inside a data region

Loop parallelization
  OpenACC: kernels — offloading + automatic parallelization of suitable loops, sync at the end of the loops
  OpenMP:  — (does not exist in OpenMP)
  OpenACC: loop gang/worker/vector — share iterations among gangs, workers and vectors
  OpenMP:  distribute — share iterations among TEAMS

Routines
  OpenACC: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism

Asynchronism
  OpenACC: async()/wait — execute kernels/transfers asynchronously, with explicit synchronization
  OpenMP:  nowait/depend — manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 30: Introduction to OpenACC and OpenMP GPU - IDRIS

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.
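The three compute constructs can be compared on a toy loop. Below is a minimal C sketch (function names are hypothetical, not from the course); with a compiler that lacks OpenACC support the pragmas are simply ignored and the code runs sequentially on the host, producing the same results.

```c
#include <stddef.h>

/* serial: one gang, one worker, vector length 1 -- a single device thread */
void init_serial(double *a, size_t n) {
    #pragma acc serial copyout(a[0:n])
    {
        for (size_t i = 0; i < n; ++i)
            a[i] = 1.0;
    }
}

/* kernels: the compiler analyzes the region and builds one kernel per loop */
void init_kernels(double *a, double *b, size_t n) {
    #pragma acc kernels copyout(a[0:n], b[0:n])
    {
        for (size_t i = 0; i < n; ++i) a[i] = 2.0 * i;
        for (size_t i = 0; i < n; ++i) b[i] = a[i] + 1.0;
    }
}

/* parallel: the programmer asserts parallelism; loop shares the iterations */
void init_parallel(double *a, size_t n) {
    #pragma acc parallel loop copyout(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = 3.0 * i;
}
```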

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: Default behavior
• Arrays present in the serial region not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

$ACC s e r i a ldo g=1 generat ions

do r =1 rowsdo c=1 co ls

35 oworld ( r c ) = wor ld ( r c )enddo

enddodo r =1 rows

do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

45 else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

enddo50 c e l l s = 0

do r =1 rowsdo c=1 co ls

c e l l s = c e l l s + wor ld ( r c )enddo

55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end s e r i a l

exemplesgol_serialf90

Testbull Size 1000x1000bull Generations 100bull Elapsed time 123640 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

$ACC kerne lsdo g=1 generat ions

do r =1 rowsdo c=1 co ls

35 oworld ( r c ) = wor ld ( r c )enddo

enddodo r =1 rows

do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

45 else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

enddo50 c e l l s = 0

do r =1 rowsdo c=1 co ls

c e l l s = c e l l s + wor ld ( r c )enddo

55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end kerne ls

exemplesgol_kernelsf90

Testbull Size 1000x1000bull Generations 100bull Execution time 1951 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region
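The gang-redundant vs. work-sharing distinction can be sketched in C (hypothetical example, not from the course): the assignment to f sits in the parallel region outside any loop directive, so every gang executes it redundantly, while the loop iterations are split among gangs. Compiled without OpenACC, the pragmas are ignored and the result is unchanged.

```c
#include <stddef.h>

/* Multiply every element by a factor computed inside the parallel region. */
void scale(double *a, size_t n) {
    #pragma acc parallel copy(a[0:n])
    {
        double f = 2.0;             /* gang-redundant: every gang sets f   */
        #pragma acc loop            /* work sharing: iterations are shared */
        for (size_t i = 0; i < n; ++i)
            a[i] *= f;
    }
}
```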

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive written do for compatibility
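The collapse, private and reduction clauses can be combined on one loop directive. A hedged C sketch (hypothetical function, not from the course; sequential when the pragmas are ignored): collapse(2) merges the two tightly nested loops into a single iteration space, tmp is privatized per thread, and total is combined with + at the end.

```c
#include <stddef.h>

/* Fill a row-major rows x cols matrix with its linear index and sum it. */
double fill_and_sum(double *a, size_t rows, size_t cols) {
    double total = 0.0;
    double tmp;
    #pragma acc parallel loop collapse(2) private(tmp) reduction(+:total) \
            copyout(a[0:rows*cols])
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c) {
            tmp = (double)(r * cols + c);   /* private scratch per thread */
            a[r * cols + c] = tmp;
            total += tmp;
        }
    return total;
}
```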

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
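Several reductions can be attached to the same loop. A hedged C sketch (hypothetical function, not from the course): each thread gets private copies of s and m, and the partial results are combined with + and max when the loop completes. Run sequentially (pragmas ignored), the result is the same.

```c
#include <stddef.h>

/* Compute the sum and the maximum of a vector in a single pass. */
void sum_max(const double *a, size_t n, double *sum, double *maxval) {
    double s = 0.0, m = a[0];
    #pragma acc parallel loop copyin(a[0:n]) reduction(+:s) reduction(max:m)
    for (size_t i = 0; i < n; ++i) {
        s += a[i];
        if (a[i] > m) m = a[i];
    }
    *sum = s;
    *maxval = m;
}
```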

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i + 1 reaches memory location n + 1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: access pattern with coalescing vs. without coalescing]
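The same principle in C (hypothetical example, not from the course): C is row-major, so the innermost index j walks contiguous memory, and putting vector parallelism on the j loop lets consecutive threads of a worker touch consecutive addresses so the accesses coalesce. Without an OpenACC compiler the pragmas are ignored and the fill still runs correctly.

```c
#define N 64

/* Coalesced fill: gang over rows, vector over the contiguous column index. */
void fill_coalesced(double a[N][N]) {
    #pragma acc parallel loop gang copyout(a[0:N][0:N])
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector   /* contiguous: thread t writes a[i][t] */
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}
```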

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)         43.9                1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping (threads T0..T7) for the three variants. Without coalescing — gang(i)/vector(j): consecutive threads of a worker access memory locations that are far apart. With coalescing — gang(j)/vector(i): consecutive threads access contiguous memory locations (in Fortran the first index is contiguous). gang-vector(i)/seq(j): each thread walks its own row sequentially, so accesses of consecutive threads stay contiguous.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
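The C equivalent of the Fortran example uses `#pragma acc routine` on the callee. A hedged sketch (hypothetical names, not from the course): `routine seq` states that the callee contains no parallelism of its own, so each device thread runs it sequentially. With the pragmas ignored, this is an ordinary sequential program.

```c
#include <stddef.h>

/* Compiled for the device too; each thread executes it sequentially. */
#pragma acc routine seq
double squared(double x) { return x * x; }

void square_all(double *a, size_t n) {
    #pragma acc parallel loop copy(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = squared(a[i]);
}
```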

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
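The movement clauses can be combined on one data region. A hedged C sketch (hypothetical function, not from the course): b is only read on the device (copyin), c is only written back (copyout), and tmp lives on the device alone (create), so tmp never crosses the host/device boundary. Run sequentially, the host arrays are used directly and the result is identical.

```c
#include <stddef.h>

/* c[i] = 2*b[i] + 1, staged through a device-only scratch array tmp. */
void stage(const double *b, double *c, double *tmp, size_t n) {
    #pragma acc data copyin(b[0:n]) copyout(c[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            tmp[i] = 2.0 * b[i];        /* written on the device only */
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            c[i] = tmp[i] + 1.0;
    }
}
```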

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is taken to be the default of the language (C/C++: 0, Fortran: 1)
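The dynamic-allocation case and slicing can be sketched together (hypothetical function, not from the course): the compiler cannot see the size of a malloc'd array, so the data clause must give [first:count] explicitly, for the whole vector and then for a slice. With the pragmas ignored, the fills run sequentially on the host array.

```c
#include <stdlib.h>
#include <stddef.h>

/* Allocate an n-element vector, zero it, then set elements 10..14 to 42. */
double *make_filled(size_t n) {           /* requires n >= 15 */
    double *v = malloc(n * sizeof *v);
    #pragma acc parallel loop copyout(v[0:n])   /* whole array          */
    for (size_t i = 0; i < n; ++i)
        v[i] = 0.0;
    #pragma acc parallel loop copy(v[10:5])     /* slice: 5 elems at 10 */
    for (size_t i = 10; i < 15; ++i)
        v[i] = 42.0;
    return v;
}
```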

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region implicitly opens a data region with the same extent.]

Data on device - Parallel regions

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
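Hoisting the data region out of the iteration loop, so the array crosses the host/device boundary once instead of once per iteration, looks like this in C (hypothetical function, not from the course; sequential when the pragmas are ignored):

```c
#include <stddef.h>

/* iters relaxation sweeps; a is transferred once for the whole loop. */
void relax(double *a, size_t n, int iters) {
    #pragma acc data copy(a[0:n])
    {
        for (int it = 0; it < iters; ++it) {
            #pragma acc parallel loop   /* a already present: no transfer */
            for (size_t i = 0; i < n; ++i)
                a[i] += 1.0;
        }
    }
}
```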

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region, using copyin(A), copyout(B), copy(C) and create(D) in each region.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions, carrying copyin(A), copyout(B), copy(C) and create(D). A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done between the regions, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90, where it is commented out), the a array is not updated on the host before the end of the data region. With update, the host copy of a is updated as soon as the directive executes.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
function   | the function
subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Timeline diagram, host vs device: at entry to the data region, copyin(a,c) and create(b) allocate/transfer the variables to the device and copyout(d) allocates d; during the region the device updates the values; update self(b) copies b back D→H; at exit, copyout(d) returns d to the host]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three run concurrently]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
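As a hedged sketch (array and bound names are illustrative), wait can appear either as a clause on a directive or as a standalone directive:

```fortran
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
! Clause form: this kernel starts only after queue 1 has completed
!$ACC parallel loop async(2) wait(1)
do i=1,s
   a(i) = a(i) + 1
enddo
! Directive form: the host blocks until queues 1 and 2 have completed
!$ACC wait(1,2)
```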

[Diagram: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer to complete before starting]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60:autocompare)

• Example of code that has a race condition:
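As an illustrative sketch (file and binary names are assumptions), the feature is enabled entirely at compile time with the PGI compiler:

```shell
# Hypothetical compile line: run each kernel on both CPU and GPU and compare results
pgfortran -ta=tesla:cc60:autocompare -o gol_autocompare gol.f90
```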


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs


Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features — OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ X (does not exist in OpenMP)

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions — clauses of PARALLEL/SERIAL/KERNELS ↔ clause of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features — OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops; sync at the end of loops) ↔ X (does not exist in OpenMP)
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)


Features — OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1), num_workers(1), vector_length(1).

Data: Default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels


Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (acc=noautopar)
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

            | Work sharing (loop iterations) | Activation of new threads
loop gang   | X                              |
loop worker | X                              | X
loop vector | X                              | X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility

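The clauses above can be combined on a single loop directive; a minimal sketch, with illustrative names:

```fortran
!$ACC parallel
!$ACC loop gang worker vector collapse(2) private(tmp) reduction(+:total)
do j=1,n
   do i=1,n
      tmp = a(i,j)*2          ! tmp is private to each thread
      total = total + tmp     ! partial sums are combined at the end of the loop
   enddo
enddo
!$ACC end parallel
```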

Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Operations

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
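A minimal sketch of a scalar reduction (names illustrative): each thread accumulates into a private copy of total, and the + operation combines the private copies at the end of the loop:

```fortran
total = 0.0_8
!$ACC parallel loop reduction(+:total)
do i=1,n
   total = total + a(i)
enddo
print *, "sum = ", total
```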


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» — Example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

Time (ms): without coalescing 439, with coalescing 16.

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Diagram, shown over four animation slides: memory-access patterns of threads T0–T7 for the three loop mappings. With gang(i)/vector(j) (no coalescing), consecutive threads of a worker access non-contiguous elements of A; with gang(j)/vector(i) (coalescing), consecutive threads access consecutive elements; with gang-vector(i)/seq(j), at each step of the sequential j loop consecutive threads access consecutive elements (coalesced)]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program r o u t i n ei n t e g e r s = 10000

3 in teger a l l o c a t a b l e a r ray ( )a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l$ACC loopdo i =1 s

8 c a l l f i l l ( a r ray ( ) s i )enddo$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta ins

13 subrou t ine f i l l ( array s i )i n teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

18 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutine_wrongf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it must be declared as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken.
You might see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50 / 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region implicitly opens a data region around it]

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device

Naive version

[Diagram: two successive parallel regions, each with its own implicit data region; arrays A (copyin), B (copyout), C (copy) and D (create) appear in the clauses of both regions]

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Diagram: the same two parallel regions, each still carrying its own copyin, copyout, copy and create clauses for A, B, C and D]

Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Diagram: the two parallel regions are now enclosed in a data region carrying the copyin, copyout, copy and create clauses for A, B, C and D]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Diagram: inside the data region, the parallel regions now use present clauses for A, B, C and D; update device (for A) and update self (for C) refresh the values where needed]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Diagram: host and device copies of a, b, c, d along the execution of a data region]
• Entering the region with copyin(a,c) create(b) copyout(d): a and c are copied H→D; b and d are allocated on the device without transfer
• Kernels modify the values on the device
• update self(b): the device value of b is copied D→H
• Exiting the region: d is copied D→H (copyout); the device memory is freed without further transfer for a, b and c

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer on execution thread 2 before starting]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

GPU offloading
• PARALLEL — GR/WS/VS mode; each gang executes instructions redundantly ↔ TEAMS — creates teams of threads, executed redundantly
• SERIAL — only 1 thread executes the code ↔ TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables (on both sides)

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET — data movement defined from the host perspective
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create() — created on device, no transfers ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

Data region
• declare() — add data to the global data region
• data / end data — inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H ↔ UPDATE FROM — copy D→H inside a data region
• update device() — update H→D ↔ UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit sync ↔ nowait/depend — manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • Kernels/Teams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 2.7/OpenMP 5.0 translation table
                      • Contacts
Page 32: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123640 s

Pierre-François Lavallée, Thibaut Véry 29 / 77

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Pierre-François Lavallée, Thibaut Véry 30 / 77

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1951 s

Pierre-François Lavallée, Thibaut Véry 31 / 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions met in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i = 1, n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Pierre-François Lavallée, Thibaut Véry 32 / 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

acc=noautopar
Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-François Lavallée, Thibaut Véry 33 / 77

Execution offloading - parallel directive

acc=autopar activated / acc=noautopar
Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-François Lavallée, Thibaut Véry 33 / 77

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-François Lavallée, Thibaut Véry 34 / 77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X

Pierre-François Lavallée, Thibaut Véry 35 / 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility


Work sharing - Loops

!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait for the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213.0 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000.0 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is applied to the private copies to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are also possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

Example with «coalescing» / Example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
  !$acc loop vector
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
  !$acc loop seq
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Tps (ms)   439                  16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mappings for the three variants. Without coalescing — gang(i)/vector(j): the threads T0..T7 of a worker access elements that are strided in memory. With coalescing — gang(j)/vector(i): the threads access contiguous elements. gang-vector(i)/seq(j): each thread walks its own row sequentially]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened for the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called, and is available for the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken.
You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
  do j=1, 1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
  do j=100, 199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its implicit data region]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables in its clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables in its clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.


Data on device

Naive version
[Figure: two successive parallel regions, each with its own implicit data region, acting on the variables A (copyin), B (copyout), C (copy) and D (create)]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers
[Figure: the same two parallel regions and their copyin/copyout/copy/create clauses, before enclosing them in a common data region]

Let's add a data region!


Data on device

Optimize data transfers
[Figure: a data region now encloses both parallel regions; the copyin, copyout, copy and create clauses are moved to the data region]

A, B and C are now transferred at entry and exit of the data region:
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers
[Figure: inside the data region the parallel regions now use the present clause for A, B, C and D; update device refreshes A on the device and update self refreshes C on the host]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo
do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo
do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1, s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: time line of a data region created with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D while b and d are only allocated on the device; the host copies keep their old values while the device computes new ones. During the region, update self(b) copies the device value of b back to the host. On exit, copyout(d) copies d D→H and the device values of a, b and c are discarded]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. Specifying different numbers creates several execution threads.

[Figure: synchronous execution runs kernel 1, the data transfer and kernel 2 one after the other; with "kernel 1 and transfer async" the first kernel overlaps the transfer; with everything async all three overlap]

! b is initialized on host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or wait directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: kernel 1 and the data transfer run asynchronously; kernel 2 waits for the transfer before starting]

! b is initialized on host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features / OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
(OpenACC: clauses of PARALLEL/SERIAL/KERNELS; OpenMP: clauses of TARGET — data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features / OpenACC 2.7 / OpenMP 5.0 translation table (continued)

Data region
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async() / wait (execute kernels/transfers asynchronously, explicit synchronization) → nowait / depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 114_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 114_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 114_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 114_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 114_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Execution offloading - Kernels: "In the compiler we trust"

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2         ! Gang-redundant
!$ACC loop    ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (acc=noautopar)
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, where the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing       Activation of
              (loop iterations)  new threads
loop gang          X
loop worker        X                  X
loop vector        X                  X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code that follows the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: somme = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
!  The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is applied to get the final value of the variable.

Reduction operators

Operation  Effect       Language(s)
+          Sum          Fortran, C(++)
*          Product      Fortran, C(++)
max        Maximum      Fortran, C(++)
min        Minimum      Fortran, C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop with vector parallelism should have contiguous access to memory.

[Figure: example with «coalescing» vs example without «coalescing»]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)         43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: access patterns of threads T0–T7 for the three mappings: without coalescing, gang(i)/vector(j); with coalescing, gang(j)/vector(i); and gang-vector(i)/seq(j). In the coalesced mappings, consecutive threads touch consecutive array elements at each step, so their accesses can be merged; in the non-coalesced mapping, each thread strides through memory.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called and is available during its lifetime. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region with its associated data region]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device - Optimizing data transfers

Naive version: two consecutive parallel regions, each with its own implicit data region, using variables A (copyin), B (copyout), C (copy) and D (create); every variable is transferred or allocated at each region.
• Transfers: 8
• Allocations: 2

Let's add a data region enclosing both parallel regions: A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for the copyin variable, update self for the copy variable).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self:", a(42)
!$ACC update self(a(42:42))
print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90, where the update line is commented out), the a array is not initialized on the host before the end of the data region. With it, the a array is up to date on the host right after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions enter data exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i = 1, s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Timeline figure, host vs. device: a data region is opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b is allocated on the device; the kernels then modify the device copies while the host values stay unchanged. An update self(b) copies b D→H in the middle of the region. When the region ends, only d is copied back D→H (copyout).]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Timeline figure: in the synchronous case, Kernel 1, the data transfer and Kernel 2 run one after the other. With "Kernel 1 and transfer async", the kernel and the transfer overlap. When everything is async, Kernel 1, the data transfer and Kernel 2 all run concurrently.]

! b is initialized on host
do i = 1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i = 1, s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, you have to add a wait(execution-thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers identifying execution threads. For example, wait(1,2) waits until execution threads 1 and 2 have completed.

[Timeline figure: Kernel 1 and the data transfer run on their own execution threads; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i = 1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i = 1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)

• Example of code that has a race condition



Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g = 1, generations
   cells = 0
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g = 1, generations
   cells = 0
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel loop
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
   cells = 0
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c = 1, cols
      do r = 1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.



Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL — GR/WS/VS mode, each gang executes the instructions redundantly | TEAMS — creates teams of threads which execute redundantly
• SERIAL — only 1 thread executes the code | TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization | does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
(OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct | map(to:)
• copyout() — copy D→H when leaving the construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()


Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() — add data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM — copy D→H inside a data region
• update device() — update H→D | UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS


Features | OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
!$OMP PARALLEL DO SIMD
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k = 1, nx
!$acc loop worker
   do j = 1, nx
!$acc loop vector
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
!$OMP PARALLEL DO
   do j = 1, nx
!$OMP SIMD
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 34: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Execution offloading - Kernels

!$ACC kernels
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i = 1, 1000
     a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                             X
loop vector                X                             X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.


Work sharing - Loops

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i = 1, 1000
     a(i) = b(i)*2
  enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i = 1, 1000
     a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is applied to get the final value of the variable.

Operations

Operation  Effect        Language(s)
+          Sum           Fortran/C(++)
*          Product       Fortran/C(++)
max        Maximum       Fortran/C(++)
min        Minimum       Fortran/C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

Example with «coalescing» / Example without «coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i = 1, nx
!$acc loop vector
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i = 1, nx
!$acc loop seq
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure, shown step by step over four slides: thread-to-memory mapping diagrams. Without coalescing — gang(i)/vector(j) — consecutive threads T0..T7 touch memory locations that are far apart, so their accesses cannot be merged. With coalescing — gang(j)/vector(i) — consecutive threads touch consecutive memory locations and the accesses are merged. The gang-vector(i)/seq(j) variant also yields contiguous accesses.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i = 1, s
     call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j = 1, s
       array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i = 1, s
     call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j = 1, s
       array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already present on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i = 1, 1000
  do j = 1, 1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i = 1, 1000
  do j = 100, 199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
  for (int j = 99; j < 199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l = 1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i = 1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i = 1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l = 1, 100
  !$ACC parallel
  !$ACC loop
  do i = 1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No: you have to move the beginning of the data region to before the first loop.


Data on device

Naive version: each of the two parallel regions carries its own implicit data region, so the variables A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated for each region.
• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: with two separate parallel regions, A (copyin), B (copyout), C (copy) and D (create) are still handled twice. Let's add a data region.


Data on device

Optimize data transfers: with an enclosing data region, A, B and C are now transferred at entry and exit of the data region only, and D is allocated once.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers: inside the enclosing data region, the parallel regions see A, B, C and D as present, and the copies are refreshed with update device / update self where needed. The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1


Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo

do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.



!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo

do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i = 1, s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone | Scope
module | the program
function | the function
subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D while b and d are allocated on the device without transfer; during the region, update self(b) copies b D→H; on exit of the data region, only d is copied back D→H, and the host keeps its own values for the other variables.


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

Figure: without async, Kernel 1, the data transfer and Kernel 2 execute one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, the two kernels and the transfer can all run concurrently.

! b is initialized on the host
do i = 1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i = 1, s
  c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete.

! b is initialized on the host
do i = 1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i = 1, s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.


Debugging - PGI autocompare

• To activate this feature, just add the autocompare sub-option at compilation (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition
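A minimal compile-and-run sketch (the executable and source file names are placeholders; the sub-option is the one named on the slide):

```shell
# Build with OpenACC offloading plus the autocompare sub-option (PGI compiler):
pgfortran -ta=tesla:cc60,autocompare -o gol gol.f90

# Run as usual: each kernel executes on both CPU and GPU, and the program
# stops as soon as the two sets of results differ.
./gol
```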


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g = 1, generations
  cells = 0
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loop iterations are distributed among gangs.


Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c = 1, cols
    do r = 1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL — GR/WS/VS mode, each gang executes instructions redundantly | TEAMS — creates teams of threads that execute redundantly
• SERIAL — only 1 thread executes the code | TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization | does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses on PARALLEL/SERIAL/KERNELS | clauses on TARGET (data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct | map(to:)
• copyout() — copy D→H when leaving the construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()


Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() — add data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM: — copy D→H inside a data region
• update device() — update H→D | UPDATE TO: — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS


Features | OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
  !$acc loop vector
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
  !$OMP PARALLEL DO SIMD
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k = 1, nx
  !$acc loop worker
  do j = 1, nx
    !$acc loop vector
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
  !$OMP PARALLEL DO
  do j = 1, nx
    !$OMP SIMD
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only loop nests with a loop directive have their iterations spread among gangs.

Data default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2           ! Gang-redundant
!$ACC loop      ! Work sharing
do i = 1, n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, the number of workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

acc=noautopar

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.




Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size; the number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer can set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang."
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared among those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared among those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause | Work sharing (loop iterations) | Activation of new threads
loop gang | X |
loop worker | X | X
loop vector | X | X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(<number of loops>)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(<variable list>)
• Reduction: reduction(<operation>:<variable list>)

Notes
• You might encounter the do directive, kept for compatibility.


Work sharing - Loops

!$ACC parallel
do g = 1, generations
  !$ACC loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i = 1, nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i = nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i = 1, nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i = nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels

! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction clause is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to obtain the final value of the variable.

Reduction operators

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)         439                 16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Diagram (access patterns of threads T0..T7 on the 2D array A):]
• Without coalescing, gang(i)/vector(j): the vector threads run over j, the slowest-varying index in Fortran, so consecutive threads access non-contiguous memory locations.
• With coalescing, gang(j)/vector(i): the vector threads run over i, the contiguous index, so consecutive threads access consecutive memory locations.
• gang-vector(i)/seq(j): each thread sweeps its row sequentially.

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called, and lives for the duration of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device; if so, no action is taken.
You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region is enclosed in its implicit data region]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No: you have to move the beginning of the data region before the first loop.


Data on device

Naive version

[Diagram: two parallel regions, each with its own implicit data region; at each region, A is copyin, B is copyout, C is copy and D is create]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers

[Diagram: the same two parallel regions, still with A copyin, B copyout, C copy and D create at each region]

Let's add a data region.


Data on device

Optimize data transfers

[Diagram: both parallel regions are now enclosed in a single data region, which handles A (copyin), B (copyout), C (copy) and D (create) once]

A, B and C are now transferred at the entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers

[Diagram: inside the data region the parallel regions use present for A, B, C and D; update device refreshes A and update self refreshes C where needed]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that data transfers can be managed in a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Diagram (time line of the host and device copies of a, b, c, d): entering the region, copyin(a,c) create(b) copyout(d) transfers a and c H→D and allocates b and d on the device; after the kernels have run, update self(b) copies b D→H; exiting the region, copyout(d) copies d D→H]


Data on device - Important notes

• Data transfers between the host and the device are costly; it is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device:

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; it is possible to specify a number inside the clause to create several execution threads.

[Diagram: with no async, kernel 1, the data transfer and kernel 2 execute one after the other; with async on kernel 1 and the transfer, the two overlap; with async on everything, kernel 1, the transfer and kernel 2 all overlap]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Diagram: kernel 1 and the data transfer run concurrently; kernel 2 waits for the transfer to complete]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition:


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads, which execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables and non-scalar variables (both models)

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of the loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share the iterations among gangs, workers and vectors) ↔ distribute (share the iterations among TEAMS)


Features | OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)
• asynchronism: async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

                 Work sharing (loop iterations)   Activation of new threads
  loop gang                    X
  loop worker                  X                             X
  loop vector                  X                             X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.

Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs. example without coalescing]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure, shown over several animation steps: memory access patterns of threads T0–T7 for the three loop mappings. Without coalescing — gang(i)/vector(j): consecutive threads access memory locations that are a full column apart. With coalescing — gang(j)/vector(i): consecutive threads access consecutive memory locations. gang-vector(i)/seq(j): each thread walks sequentially through its own row.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to that of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))   ! data region + parallel region
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

[Diagram, shown over several animation steps: two parallel regions (each with its implicit data region) using variables A, B, C, D through the clauses copyin, copyout, copy and create.]

• Naive version: each parallel region transfers its own data. Transfers: 8, allocations: 2.
• Let's add a data region enclosing both parallel regions: A, B and C are now transferred at entry and exit of the data region. Transfers: 4, allocations: 1.
• The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device, update self). Transfers: 6, allocations: 1.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90), the a array is not up to date on the host before the end of the data region. With it, a(42) is up to date on the host right after the update self directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of the enter data and exit data directives is to manage the data transfers at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are allocated on the device. Kernels then modify the device copies; update self(b) copies b back D→H during the region. At exit, d is copied D→H, and device-side changes to a and b are discarded unless an update was done.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. Specifying different numbers inside the clause creates several execution threads.

[Diagram: synchronous execution runs kernel 1, the data transfer and kernel 2 one after the other; with async, kernel 1 and the transfer overlap; when everything is async, all three run concurrently.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Diagram: kernel 2 waits for the data transfer before starting]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.
• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60:autocompare).
• Example: a code that has a race condition.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses with notes)

GPU offloading:
• PARALLEL (GR/WS/VS mode; each gang executes instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET; data movement is defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features: OpenACC 2.7 ↔ OpenMP 5.0 (continued)

Data region:
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops; sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

Features: OpenACC 2.7 ↔ OpenMP 5.0 (continued)

Routines:
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism:
• async()/wait (executes kernels/transfers asynchronously; explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77


Execution offloading - parallel directive

acc=autopar activated / acc=noautopar

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-François Lavallée, Thibaut Véry 33/77

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Pierre-François Lavallée, Thibaut Véry 34/77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang                  X                 |
loop worker                X                 |             X
loop vector                X                 |             X

Pierre-François Lavallée, Thibaut Véry 35/77

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility

Pierre-François Lavallée, Thibaut Véry 36/77

Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )+             oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-François Lavallée, Thibaut Véry 37/77

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-François Lavallée, Thibaut Véry 38/77

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41/77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-François Lavallée, Thibaut Véry 42/77

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to obtain the final value of the variable.

Reduction operators:

Operation  Effect       Language(s)
+          Sum          Fortran, C(++)
*          Product      Fortran, C(++)
max        Maximum      Fortran, C(++)
min        Minimum      Fortran, C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-François Lavallée, Thibaut Véry 43/77

Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )+             oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of memory bandwidth
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
• For loop nests, the loop which has vector parallelism should have contiguous access to memory

[Figure: example with «coalescing» vs example with «no coalescing»]

Pierre-François Lavallée, Thibaut Véry 45/77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms)         439         |       16

The memory-coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46/77

Parallelism - Memory accesses and coalescing

[Figure, shown step by step over several slides: assignment of threads T0–T7 to array elements for the three variants: gang(i)/vector(j) (without coalescing), gang(j)/vector(i) (with coalescing), and gang-vector(i)/seq(j).]

Pierre-François Lavallée, Thibaut Véry 47/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50/77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region enclosed in its implicit data region]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device - Optimizing transfers

[Figure, shown step by step over several slides: two successive parallel regions, each with its implicit data region, using arrays A, B, C and D through copyin, copyout, copy and create clauses.]

• Naive version: each parallel region transfers its own data — transfers: 8, allocations: 2
• Let's add a data region: A, B and C are now transferred at entry and exit of the data region — transfers: 4, allocations: 1
• The clauses check for data presence, so to make the code clearer you can use the present clause; to make sure the updates are done, use the update directive (update device, update self) — transfers: 6, allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is that data transfers to the device can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
module     | the program
function   | the function
subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Figure: timeline of host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). The values of a and c are transferred H→D and b is allocated at entry; update self(b) copies b D→H during the region; d is copied back D→H at exit.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

bull Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
bull Since data clauses can also appear on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
bull computation and data transfers;
bull kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify different numbers inside the clause to create several execution threads.

(Figure: in the synchronous case, kernel 1, the data transfer and kernel 2 run one after the other; with kernel 1 and the transfer async, they overlap; with everything async, all three overlap.)

! b is initialized on host
      do i=1,s
         b(i) = i
      enddo
!$ACC parallel loop async(1)
      do i=1,s
         a(i) = 42
      enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
      do i=1,s
         c(i) = 11
      enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution-thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
bull The argument of wait is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Figure: kernel 1 runs on its own queue while the data transfer proceeds; kernel 2 waits for the transfer before starting.)

! b is initialized on host
      do i=1,s
         b(i) = i
      enddo
!$ACC parallel loop async(1)
      do i=1,s
         a(i) = 42
      enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
      do i=1,s
         b(i) = b(i) + 11
      enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU execution.
bull How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.

Debugging - PGI autocompare

bull To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
bull Example of code that has a race condition:

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

      do g=1,generations
         cells = 0
!$ACC parallel
         do r=1,rows
            do c=1,cols
               oworld(r,c) = world(r,c)
            enddo
         enddo
!$ACC end parallel
!$ACC parallel
         do r=1,rows
            do c=1,cols
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
!$ACC end parallel
!$ACC parallel
         do r=1,rows
            do c=1,cols
               ...

exemples/opti/gol_opti1.f90

Info:
bull Size: 10000 x 5000
bull Generations: 100
bull Serial time: 26 s
bull Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life
Let's share the work among the active gangs.

      do g=1,generations
         cells = 0
!$ACC parallel loop
         do r=1,rows
            do c=1,cols
               oworld(r,c) = world(r,c)
            enddo
         enddo
!$ACC parallel loop
         do r=1,rows
            do c=1,cols
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
!$ACC parallel loop reduction(+:cells)
         do r=1,rows
            do c=1,cols
               cells = cells + world(r,c)
            enddo
         enddo
!$ACC end parallel
         print *, "Cells alive at generation ", g, cells
      enddo

exemples/opti/gol_opti2.f90

Info:
bull Size: 10000 x 5000
bull Generations: 100
bull Serial time: 26 s
bull Time: 9.5 s

The loops are now distributed among the gangs.

Optimization - Game of Life
Let's optimize the data transfers.

      cells = 0
!$ACC data copy(oworld, world)
      do g=1,generations
         cells = 0
!$ACC parallel loop
         do c=1,cols
            do r=1,rows
               oworld(r,c) = world(r,c)
            enddo
         enddo
!$ACC parallel loop
         do c=1,cols
            do r=1,rows
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
!$ACC parallel loop reduction(+:cells)
         do c=1,cols
            do r=1,rows
               cells = cells + world(r,c)
            enddo
         enddo
         print *, "Cells alive at generation ", g, cells
      enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
bull Size: 10000 x 5000
bull Generations: 100
bull Serial time: 26 s
bull Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
bull PARALLEL — GR, WS, VS mode; each gang executes the instructions redundantly ↔ TEAMS — creates teams of threads that execute redundantly
bull SERIAL — only 1 thread executes the code ↔ TARGET — GPU offloading, initial thread only
bull KERNELS — offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
bull scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions — clauses on PARALLEL/SERIAL/KERNELS ↔ clause on TARGET (data movement defined from the host perspective):
bull copyin() — copy H→D when entering the construct ↔ map(to:)
bull copyout() — copy D→H when leaving the construct ↔ map(from:)
bull copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
bull create() — created on the device, no transfers ↔ map(alloc:)
bull present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions:
bull declare() — add data to the global data region
bull data / end data — inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
bull enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
bull update self() — update D→H ↔ UPDATE FROM() — copy D→H inside a data region
bull update device() — update H→D ↔ UPDATE TO() — copy H→D inside a data region

Loop parallelization:
bull kernels — offloading + automatic parallelization of suitable loops, synchronization at the end of the loops ↔ does not exist in OpenMP
bull loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

bull Routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
bull Asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend — manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

Contacts

bull Pierre-François Lavallée — pierre-francois.lavallee@idris.fr
bull Thibaut Véry — thibaut.very@idris.fr
bull Rémi Lacroix — remi.lacroix@idris.fr

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
  • Parallelism
    • Offloaded Execution
      • Add directives
      • Compute constructs
      • Serial
      • Kernels
      • Teams
      • Parallel
    • Work sharing
      • Parallel loop
      • Reductions
      • Coalescing
    • Routines
    • Data Management
    • Asynchronism
  • GPU Debugging
  • Annex: Optimization example
  • Annex: OpenACC 2.7 / OpenMP 5.0 translation table
  • Contacts

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
bull num_gangs: the number of gangs
bull num_workers: the number of workers
bull vector_length: the vector size

program param
      integer :: a(1000)
      integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
      print *, "Hello, I am a gang"
      do i=1,1000
         a(i) = i
      enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
bull These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
bull The optimal number of gangs is highly dependent on the architecture: use num_gangs with care.

Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses:
bull Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among the gangs (going from GR to GP)
  - worker: within the gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect the dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang                   X                |
loop worker                 X                |            X
loop vector                 X                |            X

Work sharing - Loops

Clauses:
bull Merge tightly nested loops: collapse(number of loops)
bull Tell the compiler that the iterations are independent: independent (useful for kernels)
bull Privatize variables: private(variable-list)
bull Reduction: reduction(operation:variable-list)

Notes:
bull You might see the directive do instead of loop, kept for compatibility.

Work sharing - Loops

!$ACC parallel
      do g=1,generations
!$ACC loop
         do r=1,rows
            do c=1,cols
               oworld(r,c) = world(r,c)
            enddo
         enddo
         do r=1,rows
            do c=1,cols
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
         cells = 0
         do r=1,rows
            do c=1,cols
               cells = cells + world(r,c)
            enddo
         enddo
         print *, "Cells alive at generation ", g, cells
      enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

bull Active threads: 1
bull Number of operations: nx

Execution GR, WS, VS:

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

bull Redundant execution by the gang leaders
bull Active threads: 10
bull Number of operations: 10 x nx

Execution GP, WS, VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

bull Each gang executes a different block of iterations
bull Active threads: 10
bull Number of operations: nx

Execution GP, WP, VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

bull Iterations are shared among the active workers of each gang
bull Active threads: 10 x workers
bull Number of operations: nx

Execution GP, WS, VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

bull Iterations are shared among the threads of the single worker of each gang
bull Active threads: 10 x vector length
bull Number of operations: nx

Execution GP, WP, VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

bull Iterations are shared on the whole grid of threads of all gangs
bull Active threads: 10 x workers x vector length
bull Number of operations: nx

Execution GR, WP, VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

bull Each gang is assigned all the iterations, which are shared among its workers
bull Active threads: 10 x workers
bull Number of operations: 10 x nx

Execution GR, WS, VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

bull Each gang is assigned all the iterations, which are shared among the threads of the active worker
bull Active threads: 10 x vector length
bull Number of operations: 10 x nx

Execution GR, WP, VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

bull Each gang is assigned all the iterations, and its grid of threads distributes the work
bull Active threads: 10 x workers x vector length
bull Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code located after the loop. Inside a parallel region, if several parallel loops are present with dependencies between them, you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

bull Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

bull Result: somme = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
      integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
      do i=1,1000
         a(i) = b(i)*2
      enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
      do i=1,1000
         a(i) = b(i)*2
      enddo
end program fused

exemples/fused.f90

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to obtain the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
      do g=1,generations
!$ACC loop
         do r=1,rows
            do c=1,cols
               oworld(r,c) = world(r,c)
            enddo
         enddo
         do r=1,rows
            do c=1,cols
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
         cells = 0
!$ACC loop reduction(+:cells)
         do r=1,rows
            do c=1,cols
               cells = cells + world(r,c)
            enddo
         enddo
         print *, "Cells alive at generation ", g, cells
      enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
bull Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
bull This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
bull In loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Figure: example with coalescing vs. example without coalescing.)

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms)        43.9         |      1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

(Figure, built up step by step over several slides: the thread-to-element mapping for the three variants. Without coalescing — gang(i), vector(j) — the threads T0..T7 of a worker access elements separated by a stride of nx, so their accesses cannot be merged. With coalescing — gang(j), vector(i), or gang vector(i) with seq(j) — the threads access consecutive elements and the accesses are merged.)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
      integer :: s = 10000
      integer, allocatable :: array(:,:)
      allocate(array(s,s))
!$ACC parallel
!$ACC loop
      do i=1,s
         call fill(array(:,:), s, i)
      enddo
!$ACC end parallel
      print *, array(1,1:10)
contains
      subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
      end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Correct version:

program routine
      integer :: s = 10000
      integer, allocatable :: array(:,:)
      allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
      do i=1,s
         call fill(array(:,:), s, i)
      enddo
!$ACC end parallel
      print *, array(1,1:10)
contains
      subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
      end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading: offloading regions have a local data region associated with them:
bull serial
bull parallel
bull kernels

Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and it is available for the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array.

Data movement:
bull copyin: the variable is copied H→D; the memory is allocated when entering the region
bull copyout: the variable is copied D→H; the memory is allocated when entering the region
bull copy: copyin + copyout

No data movement:
bull create: the memory is allocated when entering the region
bull present: the variable is already on the device
bull delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses:
bull no_create
bull deviceptr
bull attach
bull detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array to the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
      do i=1,1000
         do j=1,1000
            a(i,j) = 0
         enddo
      enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
      do i=1,1000
         do j=100,199
            a(i,j) = 42
         enddo
      enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element
// and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions:
bull In Fortran, the last index of an assumed-size dummy array must be specified.
bull In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
bull The shape must be specified when using a slice.
bull If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
      integer :: a(1000)
      integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = i
      enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

(Figure: the parallel region is enclosed in its implicit data region.)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
      integer :: a(1000)
      integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = i
      enddo
!$ACC end parallel

      do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
         do i=1,1000
            a(i) = a(i) + 1
         enddo
!$ACC end parallel
      enddo
end program para

exemples/parallel_data_multi.f90

Notes:
bull Compute region reached 100 times
bull Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
  do l=1,100
!$ACC parallel
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
!$ACC end data

(exemples/parallel_data_single.f90)

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two parallel regions (each with its own implicit data region) use four arrays, A (copyin), B (copyout), C (copy) and D (create); every transfer and allocation is repeated for each region.
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the same two parallel regions, each still carrying its own clauses for A (copyin), B (copyout), C (copy) and D (create). Let's add a data region.

Data on device

With an enclosing data region, A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1

Data on device

Inside the data region, the data clauses on the parallel regions only check for data presence; so, to make the code clearer, you can use the present clause instead. To make sure the updates are actually done, use the update directive (update device after entry, update self before exit).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo

  do j=1,42
    call random_number(test)
    rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
!  print *, "before update self: ", a(42)
!  !$ACC update self(a(42:42))
!  print *, "after update self: ", a(42)
!$ACC serial
  a(42) = 42
!$ACC end serial
  print *, "before end data: ", a(42)
!$ACC end data
  print *, "after end data: ", a(42)

(exemples/update_dir_comment.f90)

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo

  do j=1,42
    call random_number(test)
    rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
  print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
  print *, "after update self: ", a(42)
!$ACC serial
  a(42) = 42
!$ACC end serial
  print *, "before end data: ", a(42)
!$ACC end data
  print *, "after end data: ", a(42)

(exemples/update_dir.f90)

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions enter data exit data

Important notes: one benefit of the enter data and exit data directives is that the data transfers to the device can be managed at a different place than where the data are used. This is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

(exemples/enter_data.f90)
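A C analogue of the pattern above can be sketched as follows; the function names are illustrative, and without an OpenACC compiler the pragmas are simply ignored, so the code runs on the host and the logic can still be checked.

```c
#include <assert.h>
#include <stdlib.h>

/* enter data / exit data let device allocation happen away from the
   compute code, e.g. in an init routine (hypothetical sketch). */
static double *vec;
static int s = 10000;

static void compute(void) {
    /* vec is already on the device thanks to the enter data directive */
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;
}

double enterdata_demo(void) {
    vec = malloc(s * sizeof *vec);
    #pragma acc enter data create(vec[0:s])   /* allocate on device */
    compute();
    #pragma acc update self(vec[0:s])         /* bring results back D->H */
    #pragma acc exit data delete(vec[0:s])    /* free device memory */
    double v = vec[s - 1];
    free(vec);
    return v;
}
```

The update self before exit data is needed because exit data delete frees the device copy without transferring it back.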

Data on device - Global data regions declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

(exemples/declare.f90)

Data on device - Time line

(Diagram: for a region opened with copyin(a,c) create(b) copyout(d), a and c are copied H→D at entry, b is only allocated on the device, and d is copied D→H at exit. In the middle of the region, an update self(b) refreshes the host copy of b with the device value; host-side changes made afterwards are not visible on the device.)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

(exemples/update.f90)
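The same idea can be sketched in C; the helper name is hypothetical, and with a non-OpenACC compiler the pragmas are inert, so the function runs on the host and still produces the expected value.

```c
#include <assert.h>

/* A data region with explicit updates in both directions (sketch). */
int last_element_after_updates(int n) {
    int a[1000];
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i;
        /* Refresh the host copy of the last element (D -> H). */
        #pragma acc update self(a[n-1:1])
        /* Change it on the host, then push it back (H -> D). */
        a[n - 1] = 42;
        #pragma acc update device(a[n-1:1])
    }
    return a[n - 1];
}
```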

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Diagram: synchronous execution runs kernel 1, the data transfer and kernel 2 one after the other; with kernel 1 and the transfer async, they overlap; with everything async, all three run concurrently.)

! b is initialized on host
  do i=1,s
    b(i) = i
  enddo

!$ACC parallel loop async(1)
  do i=1,s
    a(i) = 42
  enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
  do i=1,s
    c(i) = 11
  enddo

(exemples/async_slides.f90)

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Diagram: kernel 2 waits for the transfer on execution thread 2 before starting.)

! b is initialized on host
  do i=1,s
    b(i) = i
  enddo

!$ACC parallel loop async(1)
  do i=1,s
    a(i) = 42
  enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
  do i=1,s
    b(i) = b(i) + 11
  enddo

(exemples/async_wait_slides.f90)
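A C sketch of the async/wait pattern; the function name is illustrative, and since a non-OpenACC compiler ignores the pragmas, execution is then simply sequential, which is exactly the result the asynchronous version must also produce.

```c
#include <assert.h>

/* Two independent kernels on streams 1 and 3, a transfer on stream 2,
   and a kernel that waits for the transfer before using b (sketch). */
int async_wait_demo(int n) {
    int a[100], b[100], c[100];
    for (int i = 0; i < n; ++i) b[i] = i;      /* b initialized on host */

    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) a[i] = 42;

    #pragma acc update device(b[0:n]) async(2) /* H -> D on stream 2 */

    #pragma acc parallel loop async(3)
    for (int i = 0; i < n; ++i) c[i] = 11;

    /* This kernel must see the updated b: wait for stream 2. */
    #pragma acc parallel loop wait(2)
    for (int i = 0; i < n; ++i) b[i] = b[i] + 11;

    #pragma acc wait  /* join all streams before using results on host */
    return a[0] + b[0] + c[0];                 /* 42 + 11 + 11 */
}
```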


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example: code containing a race condition.


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

(exemples/opti/gol_opti1.f90)

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel
  print *, "Cells alive at generation ", g, ": ", cells
enddo

(exemples/opti/gol_opti2.f90)

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

(exemples/opti/gol_opti3.f90)

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions:
• clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Data region:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop2D.f90)

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1DOMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2DOMP.f90)

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

(exemples/loop3D.f90)

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3DOMP.f90)


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among the gangs (going from GR to GP)
  - worker: within the gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect the dependencies

               Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                               X
loop vector                X                               X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(<number of loops>)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(<variable list>)
• Reduction: reduction(<operation>:<variable list>)

Note:
• You might see the directive do for compatibility.
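The clauses above can be sketched together on one C loop nest; the function name is illustrative, and without an OpenACC compiler the pragmas are ignored, so the code runs serially and still yields the same result.

```c
#include <assert.h>

/* collapse(2) merges the r and c loops into one iteration space,
   tmp is privatized per iteration, and total is reduced with +. */
int clause_demo(int rows, int cols) {
    int total = 0;
    int tmp;
    #pragma acc parallel loop collapse(2) private(tmp) reduction(+:total)
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            tmp = r * cols + c;     /* private scratch per iteration */
            total += tmp % 2;       /* count odd linear indices */
        }
    return total;
}
```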

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)

• Result: somme = 100000000 (right value)

Parallelism - Merging the kernels/parallel and loop directives

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels

! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions: the variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). The OpenACC 2.7 specification allows reductions on arrays, but the implementation is lacking in PGI (for the moment).
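Two of the operators from the table can be sketched in C; the function names are illustrative, and with the pragmas ignored by a non-OpenACC compiler, the loops run serially and produce the same reduced values.

```c
#include <assert.h>

/* A + reduction: each thread accumulates a private copy of s,
   and the copies are summed at the end of the region. */
int sum_1_to_n(int n) {
    int s = 0;
    #pragma acc parallel loop reduction(+:s)
    for (int i = 1; i <= n; ++i)
        s += i;
    return s;
}

/* A max reduction over an array. */
int max_of(const int *a, int n) {
    int m = a[0];
    #pragma acc parallel loop reduction(max:m)
    for (int i = 1; i < n; ++i)
        if (a[i] > m) m = a[i];
    return m;
}
```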

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Diagrams: example with coalescing / example without coalescing.)

Parallelism - Memory accesses and coalescing

Without coalescing:

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

With coalescing:

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

             Without coalescing   With coalescing
Time (ms)    43.9                 1.6

The memory-coalescing version is 27 times faster.
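Note that C arrays are row-major, so the contiguous index is the last one, the opposite of the Fortran examples above. A hedged C sketch (function name illustrative; pragmas inert without an OpenACC compiler):

```c
#include <assert.h>
#define NX 64

/* In C, a[i][j] is contiguous in j. So, unlike in Fortran, the vector
   loop should be the j loop for coalesced accesses: consecutive
   threads then touch a[i][j], a[i][j+1], ... */
double fill_coalesced(void) {
    static double a[NX][NX];
    #pragma acc parallel loop gang copyout(a[0:NX][0:NX])
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
    return a[NX - 1][NX - 1];
}
```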

Parallelism - Memory accesses and coalescing

(Diagrams: the three mappings applied to the 2D array, showing which element each thread T0..T7 accesses at each step. With gang(i)/vector(j), consecutive threads of a worker access elements of the same row, which are not contiguous in Fortran: no coalescing. With gang(j)/vector(i), consecutive threads access consecutive elements of a column: the accesses are coalesced. With gang-vector(i)/seq(j), each thread walks a row sequentially while consecutive threads stay on consecutive rows: at each step of the seq loop the accesses are coalesced.)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine_wrong.f90)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine.f90)
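The same pattern can be sketched in C; the names are illustrative, and a compiler without OpenACC support simply ignores the routine pragma and runs the loop on the host.

```c
#include <assert.h>

/* fill_row is declared "acc routine seq" so that an OpenACC compiler
   generates a device version callable from the offloaded loop. */
#pragma acc routine seq
static void fill_row(int *row, int n, int value) {
    for (int j = 0; j < n; ++j)
        row[j] = value;
}

int routine_demo(int n) {
    static int array[100][100];
    #pragma acc parallel loop copyout(array[0:n][0:n])
    for (int i = 0; i < n; ++i)
        fill_row(array[i], n, 2);   /* device routine called per row */
    return array[n - 1][n - 1];
}
```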

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
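As an illustration of the data-movement clauses, here is a minimal C++ sketch (the function name and sizes are hypothetical, not from the slides): the input array is only read on the device, so copyin is enough, while the result is only written, so copyout avoids a useless H→D transfer. Without an OpenACC compiler the pragma is ignored and the loop simply runs on the host.

```cpp
#include <vector>

// Hypothetical helper: "pin" is read-only on the device (copyin),
// "pout" is write-only (copyout). Serially the pragma is ignored.
std::vector<double> scale(const std::vector<double>& in, double factor) {
    int n = static_cast<int>(in.size());
    std::vector<double> out(n);
    const double* pin = in.data();
    double* pout = out.data();
    #pragma acc parallel loop copyin(pin[0:n]) copyout(pout[0:n])
    for (int i = 0; i < n; ++i)
        pout[i] = factor * pin[i];
    return out;
}
```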


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)
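The C/C++ restriction above can be sketched as follows (hypothetical function, not from the slides): for memory reached through a plain pointer the compiler cannot deduce the extent, so the clause must give the first element and the count, a[0:n]. Serially (pragma ignored) the code behaves identically.

```cpp
// Sketch: a dynamically allocated array needs an explicit shape a[0:n]
// in the data clause, because its size is unknown to the compiler.
double fill_and_sum(int n) {
    double* a = new double[n];
    double s = 0.0;
    #pragma acc parallel loop copy(a[0:n]) reduction(+:s)
    for (int i = 0; i < n; ++i) {
        a[i] = 1.0;   // initialize on the device
        s += a[i];    // reduced into s
    }
    delete[] a;
    return s;
}
```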


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region implicitly opens a data region.


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No! You have to move the beginning of the data region before the first loop.


Data on device

Naive version: each of the two parallel regions opens its own implicit data region and carries its own copyin/copyout/copy/create clauses for the variables A, B, C and D.
• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: both parallel regions (each with its implicit data region) still carry their own copyin/copyout/copy/create clauses for A, B, C and D. Let's add a data region.


Data on device

Optimize data transfers: an enclosing data region now carries the copyin/copyout/copy/create clauses. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers: inside the enclosing data region, the parallel regions use present clauses, and update device / update self directives keep the host and device copies consistent. The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))   ! the update is commented out in this version
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
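This slide gives no listing, so here is a hedged C++ sketch of update device (function name hypothetical): the host copy of factor changes inside the data region, and the directive pushes the new value H→D before the next kernel uses it. Compiled without OpenACC the pragmas are ignored and the result is the same.

```cpp
#include <vector>

// Sketch: "factor" is modified on the host inside the data region;
// "update device" makes the device copy consistent before the second kernel.
std::vector<double> scaled_twice(int n) {
    std::vector<double> a(n);
    double factor = 1.0;
    double* pa = a.data();
    #pragma acc data copyout(pa[0:n]) copyin(factor)
    {
        #pragma acc parallel loop present(pa[0:n], factor)
        for (int i = 0; i < n; ++i) pa[i] = factor;

        factor = 2.0;                      // changed on the host only
        #pragma acc update device(factor)  // H -> D

        #pragma acc parallel loop present(pa[0:n], factor)
        for (int i = 0; i < n; ++i) pa[i] *= factor;
    }
    return a;
}
```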


Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
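For the C++ constructor/destructor case mentioned in the notes, a hypothetical sketch (not from the slides) could look like this: the device allocation lives exactly as long as the object, and the kernels elsewhere only assume the data is present. Without an OpenACC compiler the pragmas are ignored and everything runs on the host.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical class: "enter data" in the constructor, "exit data" in the
// destructor, so device residency follows the object's lifetime.
class DeviceVector {
public:
    explicit DeviceVector(std::size_t n) : data_(n) {
        double* p = data_.data();
        std::size_t len = data_.size();
        #pragma acc enter data create(p[0:len])
        (void)p; (void)len;  // silence unused warnings when pragmas are ignored
    }
    ~DeviceVector() {
        double* p = data_.data();
        std::size_t len = data_.size();
        #pragma acc exit data delete(p[0:len])
        (void)p; (void)len;
    }
    void fill(double v) {
        double* p = data_.data();
        std::size_t len = data_.size();
        #pragma acc parallel loop present(p[0:len])
        for (std::size_t i = 0; i < len; ++i)
            p[i] = v;
    }
    double front() const { return data_.front(); }
private:
    std::vector<double> data_;
};
```

Note that with a real OpenACC compiler, fill() writes the device copy only; an update self would be needed before reading the data on the host.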


Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Scope
• Module: the program
• Function: the function
• Subroutine: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: timeline of the host and device copies of a, b, c and d. At entry of a data region opened with copyin(a,c) create(b) copyout(d), a and c are transferred H→D while b and d are only allocated. Kernels then modify the device copies; update self(b) copies b D→H during the region; when the region ends, only d is copied back D→H.]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created, and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel execution, if the kernels are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: without async, Kernel 1, the data transfer and Kernel 2 execute one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, all three run concurrently.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or wait directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the transfer of b run asynchronously; Kernel 2 waits for the completion of the transfer (execution thread 2) before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90



Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
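The original slide showed a screenshot, so here is a hedged reconstruction of the kind of bug autocompare catches (function name hypothetical): the loop carries a dependency on the previous element, so the parallel GPU run diverges from the sequential CPU run and autocompare stops the execution. Compiled serially (pragma ignored), it simply computes prefix sums.

```cpp
#include <vector>

// Race condition: iteration i reads p[i-1], which another iteration writes.
// Correct sequentially, wrong (and non-deterministic) when parallelized.
std::vector<int> prefix_buggy(int n) {
    std::vector<int> a(n, 1);
    int* p = a.data();
    #pragma acc parallel loop copy(p[0:n])
    for (int i = 1; i < n; ++i)
        p[i] = p[i] + p[i-1];
    return a;
}
```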



Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info:
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.



Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• PARALLEL (OpenACC): GRWSVS mode, each team executes instructions redundantly ↔ TEAMS (OpenMP): creates teams of threads that execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code ↔ TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables and non-scalar variables, in both models

Explicit transfers H→D or D→H inside explicit parallel regions (data movement defined from the host perspective):
• clauses of PARALLEL/SERIAL/KERNELS (OpenACC) ↔ clauses of TARGET (OpenMP)
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on device, no transfers ↔ map(alloc:)
• present(), delete()


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

Data region:
• declare() (OpenACC): add data to the global data region
• data / end data (OpenACC): inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (OpenMP): inside the same programming unit
• enter data / exit data (OpenACC): opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA (OpenMP): opening and closing anywhere in the code
• update self() (OpenACC): update D→H ↔ UPDATE FROM (OpenMP): copy D→H inside a data region
• update device() (OpenACC): update H→D ↔ UPDATE TO (OpenMP): copy H→D inside a data region

Loop parallelization:
• kernels (OpenACC): offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector (OpenACC): share iterations among gangs, workers and vectors ↔ distribute (OpenMP): share iterations among TEAMS


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

• routines: routine [seq|vector|worker|gang] (OpenACC): compile a device routine with the specified level of parallelism
• asynchronism: async()/wait (OpenACC): execute kernels/transfers asynchronously, with explicit sync ↔ nowait/depend (OpenMP): manages asynchronism and task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr



Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility


Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: There is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in the table below is executed to get the final value of the variable.

Reduction operators:

+ : Sum (Fortran, C/C++)
* : Product (Fortran, C/C++)
max : Maximum (Fortran, C/C++)
min : Minimum (Fortran, C/C++)
& : Bitwise and (C/C++)
| : Bitwise or (C/C++)
&& : Logical and (C/C++)
|| : Logical or (C/C++)
iand : Bitwise and (Fortran)
ior : Bitwise or (Fortran)
ieor : Bitwise xor (Fortran)
.and. : Logical and (Fortran)
.or. : Logical or (Fortran)

Restrictions: The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
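The mechanism above can be sketched in a few lines of C++ (function name hypothetical): each thread gets a private copy of total, and the partial results are combined with the + operator at the end of the loop. Without an OpenACC compiler the pragma is ignored and the loop runs serially, with the same result.

```cpp
// Minimal reduction sketch: "total" is privatized per thread and the
// partial sums are combined with "+" at the end of the parallel loop.
double sum_squares(int n) {
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total)
    for (int i = 1; i <= n; ++i)
        total += static_cast<double>(i) * i;
    return total;
}
```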


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged; this optimizes the use of memory bandwidth
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
• For loop nests, the loop which has vector parallelism should have contiguous access to memory

Examples with and without coalescing:


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

Time without coalescing: 43.9 ms; with coalescing: 1.6 ms.
The memory coalescing version is 27 times faster!


Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0..T7 onto the elements of A for the three configurations. With gang(i)/vector(j), the consecutive threads of a vector access non-contiguous elements: no coalescing. With gang(j)/vector(i), or with gang vector(i)/seq(j), consecutive threads access contiguous elements: coalescing.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H: Host, D: Device, ⋆: variable, scalar or array

Data movement
• copyin(⋆): the variable is copied H→D; the memory is allocated when entering the region
• copyout(⋆): the variable is copied D→H; the memory is allocated when entering the region
• copy(⋆): copyin + copyout

No data movement
• create(⋆): the memory is allocated when entering the region
• present(⋆): the variable is already on the device
• delete(⋆): frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50 / 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array to the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it takes the default of the language (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[The parallel construct opens both a data region and a parallel region]

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device

Naive version
[Figure: two successive parallel regions, each with its own implicit data region; the variables A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated at each region]
• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers
[Figure: the same two parallel regions; each still moves A (copyin), B (copyout), C (copy) and allocates D (create)]

Let's add a data region!

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers
[Figure: an enclosing data region now holds A (copyin), B (copyout), C (copy) and D (create); the two parallel regions inside it trigger no transfers]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers
[Figure: the data region keeps copyin(A), copyout(B), copy(C), create(D); the parallel regions use present(...) and the needed refreshes are done with update device (A) and update self (C)]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where they are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: host/device timeline for a region opened with copyin(a,c) create(b) copyout(d): a and c are copied H→D at entry and b and d are allocated; an update self(b) during the region copies b D→H; at exit, copyout(d) copies d D→H while later host-side changes to a are not reflected on the device]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution (Kernel 1, then data transfer, then Kernel 2) vs. "Kernel 1 and transfer async" vs. "everything is async"]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features: OpenACC 2.7 | OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) | OpenMP TEAMS (create teams of threads, which execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) | OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) | OpenMP: does not exist

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables | OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS | OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) | map(to:)
• copyout() (copy D→H when leaving the construct) | map(from:)
• copy() (copy H→D when entering and D→H when leaving) | map(tofrom:)
• create() (created on device, no transfers) | map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features: OpenACC 2.7 | OpenMP 5.0

Data region
• OpenACC declare() (add data to the global data region)
• OpenACC data / end data (inside the same programming unit) | OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) | OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) | OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) | OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) | OpenMP: does not exist
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) | OpenMP distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features: OpenACC 2.7 | OpenMP 5.0

Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously, with explicit sync) | OpenMP nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 41: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Work sharing - Loops

$ACC p a r a l l e ldo g=1 generat ions$ACC loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddodo r =1 rows

40 do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) then45 world ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

50 enddoc e l l s = 0do r =1 rows

do c=1 co lsc e l l s = c e l l s + wor ld ( r c )

55 enddoenddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end p a r a l l e l

exemplesgol_parallel_loopf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial

$acc s e r i a ldo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end s e r i a l

exemplesloop_serialf90

bull Active threads 1bull Number of operations nx

Execution GRWSVS

10 $acc p a r a l l e l num_gangs (10)do i =1 nx

A( i )=B( i )lowast iend do $acc end p a r a l l e l

exemplesloop_GRWSVSf90

bull Redundant execution bygang leaders

bull Active threads 10bull Number of operations 10

nx

Execution GPWSVS

10 $acc p a r a l l e l num_gangs (10) $acc loop gangdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GPWSVSf90

bull Each gang executes adifferent block ofiterations

bull Active threads 10bull Number of operations nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GPWPVS

10 $acc p a r a l l e l num_gangs (10) $acc loop gang workerdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GPWPVSf90

bull Iterations are sharedamong the active workersof each gang

bull Active threads 10 workers

bull Number of operations nx

Execution GPWSVP

10 $acc p a r a l l e l num_gangs (10) $acc loop gang vec to rdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GPWSVPf90

bull Active threads 10 workers

bull Iterations are sharedamong the threads of theworker of all gangs

bull Active threads 10 vectorlength

bull Number of operations nx

Execution GPWPVP

10 $acc p a r a l l e l num_gangs (10) $acc loop gang worker vec to rdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GPWPVPf90

bull Iterations are shared onthe grid of threads of allgangs

bull Active threads 10 workers vectorlength

bull Number of operations nx

Execution GRWPVS

10 $acc p a r a l l e l num_gangs (10) $acc loop workerdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GRWPVSf90

bull Each gang is assigned allthe iterations which areshared among theworkers

bull Active threads 10 workers

bull Number of operations 10 nx

Execution GRWSVP

10 $acc p a r a l l e l num_gangs (10) $acc loop vec to rdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GRWSVPf90

bull Each gang is assigned allthe iterations which areshared on the threads ofthe active worker

bull Active threads 10 vectorlength

bull Number of operations 10 nx

Execution GRWPVP

10 $acc p a r a l l e l num_gangs (10) $acc loop worker vec to rdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GRWPVPf90

bull Each gang is assigned alliterations and the grid ofthreads distributes thework

bull Active threads 10 workers vectorlength

bull Number of operations 10 nx

Work sharing - Loops

Reminder There is no thread synchronization at gang level especially at the end of a loopdirective There is a risk of race condition Such risk does not exist for loop with worker andorvector parallelism since the threads of a gang wait until the end of the iterations they execute tostart a new portion of the code after the loopInside a parallel region if several parallel loops are present and there are some dependenciesbetween them then you must not use gang parallelism

10 $acc p a r a l l e l $acc loop gangdo i =1 nx

A( i )=10 _8end do

15 $acc loop gang reduc t ion ( + somme)do i =nx1minus1

somme=somme+A( i )end do $acc end p a r a l l e l

exemplesloop_pb_syncf90

bull Result sum = 97845213 (Wrong value)

10 $acc p a r a l l e l $acc loop worker vec to rdo i =1 nx

A( i )=10 _8end do

15 $acc loop worker vec to r reduc t ion ( + somme)do i =nx1minus1

somme=somme+A( i )end do $acc end p a r a l l e l

exemplesloop_pb_sync_corrf90

bull Result sum = 100000000 (Right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallelregion just before a loop In this case itis possible to merge both directivesinto a kernels loop or parallel loopThe clauses available for this constructare those of both constructs

For example with kernels directive

program fusedi n t e g e r a (1000) b (1000) i$ACC kerne ls ltclauses gt$ACC loop ltclauses gt

5 do i =1 1000a ( i ) = b ( i )lowast2

enddo$ACC end kerne ls

The cons t ruc t above i s equ iva len t to10 $ACC kerne ls loop ltkerne ls and loop clauses gt

do i =1 1000a ( i ) = b ( i )lowast2

enddoend program fused

exemplesfusedf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations listed below is applied to obtain the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
• The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
• In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
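As an illustration (this example is ours, not from the original slides; variable names are hypothetical), two of the Fortran-only operators from the table can be combined on one loop:

```fortran
program reduction_ops
   implicit none
   integer :: i, mask
   logical :: all_pos
   integer :: v(8) = [1, 3, 5, 7, 2, 4, 6, 8]

   mask    = 0
   all_pos = .true.
   ! Fortran-only reduction operators: ior (bitwise or) and .and. (logical and)
   !$ACC parallel loop reduction(ior:mask) reduction(.and.:all_pos)
   do i = 1, 8
      mask    = ior(mask, v(i))
      all_pos = all_pos .and. (v(i) > 0)
   end do
   print *, mask, all_pos   ! mask = 15, all_pos = T
end program reduction_ops
```

Each thread works on a private copy of mask and all_pos; the runtime combines the copies with the named operator at the end of the loop.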

Pierre-François Lavallée, Thibaut Véry 43/77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with coalescing / example without coalescing]

Pierre-François Lavallée, Thibaut Véry 45/77

Parallelism - Memory accesses and coalescing

$acc p a r a l l e l10 $acc loop gang

do i =1 nx $acc loop vec to rdo j =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop_nocoalescingf90

$acc p a r a l l e l20 $acc loop gang

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_825 end do

end do $acc end p a r a l l e l

exemplesloop_coalescingf90

$acc p a r a l l e l30 $acc loop gang vec to r

do i =1 nx $acc loop seqdo j =1 nx

A( i j )=114_835 end do

end do $acc end p a r a l l e l

exemplesloop_coalescingf90

Without Coalescing With CoalescingTps (ms) 439 16

The memory coalescing version is 27 times faster

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figures: mapping of threads T0-T7 onto the elements of A for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads of a worker access memory locations separated by a stride, so their accesses cannot be merged. With coalescing, gang(j)/vector(i) and gang-vector(i)/seq(j): consecutive threads access consecutive memory locations and the accesses are merged.]

Pierre-François Lavallée, Thibaut Véry 47/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer :: i
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer :: i
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading: offloading regions are associated with a local data region (serial, parallel, kernels).

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77

Data on device - Data clauses

H: host, D: device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
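A minimal sketch combining these clauses (the array names b, c and tmp are illustrative, not from the slides):

```fortran
! b is only read on the device, c is only produced there,
! and tmp never needs to exist on the host
!$ACC data copyin(b) copyout(c) create(tmp)
!$ACC parallel loop
do i = 1, n
   tmp(i) = 2*b(i)
end do
!$ACC parallel loop
do i = 1, n
   c(i) = tmp(i) + 1   ! c is copied D->H when the region ends
end do
!$ACC end data
```

Only b travels H→D and only c travels D→H; tmp lives exclusively in device memory.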

Pierre-François Lavallée, Thibaut Véry 50/77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens both a data region and a parallel region
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53/77

Optimal? No: you have to move the beginning of the data region before the first loop, so that the first parallel region also finds the data already on the device.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device

[Figure: naive version. Two successive parallel regions (each with its own implicit data region), each using A (copyin), B (copyout), C (copy) and D (create).]

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: the same two parallel regions, each still carrying A (copyin), B (copyout), C (copy) and D (create). Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: with an enclosing data region carrying copyin(A), copyout(B), copy(C) and create(D), A, B and C are now transferred only at entry and exit of the data region.

• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause on the inner parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.
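A sketch of the symmetric case, reusing the variables of the update self example (our example, assuming rng belongs to an enclosing data region):

```fortran
call random_number(test)
rng = floor(test*100)
!$ACC update device(rng)   ! push the new host value H->D
!$ACC parallel loop
do i = 1, 1000
   a(i) = a(i) + rng
enddo
```

Without the update device, the kernel would keep using the stale device copy of rng.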

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place in the code than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
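The note about C++ constructors and destructors can be sketched as an RAII wrapper (the class and member names are hypothetical, not from the slides); with a non-OpenACC compiler the pragmas are simply ignored:

```cpp
#include <cstddef>

// Device lifetime tied to object lifetime: enter data in the
// constructor, exit data in the destructor (RAII).
class DeviceVector {
public:
    explicit DeviceVector(std::size_t n) : p_(new double[n]), n_(n) {
        #pragma acc enter data create(p_[0:n_])
    }
    ~DeviceVector() {
        #pragma acc exit data delete(p_[0:n_])
        delete[] p_;
    }
    double* data() { return p_; }
    std::size_t size() const { return n_; }
private:
    double* p_;
    std::size_t n_;
};
```

Any kernel launched between construction and destruction can rely on the device allocation being present.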

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Figure: time line of host and device values across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device; kernels then modify the values on the device; update self(b) copies b D→H during the region; at exit, d is copied D→H while the device copies of a, b and c are discarded.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Figure: in the synchronous case, Kernel 1, the data transfer and Kernel 2 run one after the other; with "Kernel 1 and transfer async", Kernel 1 overlaps the transfer; with everything async, all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait() or of the wait clause is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer before starting, while Kernel 1 runs asynchronously.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
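The standalone form mentioned above can be sketched like this (our example, not from the slides):

```fortran
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC parallel loop async(2)
do i = 1, s
   c(i) = 11
enddo
! Standalone directive: block here until execution threads 1 and 2 complete
!$ACC wait(1, 2)
print *, a(1), c(1)
```

The host continues past both async launches and only synchronizes at the wait directive.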

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition:

[Figure: screenshot of the code and of the autocompare output reporting the diverging values]

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs:

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
• OpenACC PARALLEL (GRWSVS mode: each gang executes instructions redundantly) — OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) — OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) — does not exist in OpenMP

Implicit transfer H→D or D→H:
• OpenACC: scalar variables, non-scalar variables — OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses of PARALLEL/SERIAL/KERNELS; OpenMP: clauses of TARGET; data movement is defined from the host perspective):
• copyin() (copy H→D when entering the construct) — map(to:)
• copyout() (copy D→H when leaving the construct) — map(from:)
• copy() (copy H→D when entering and D→H when leaving) — map(tofrom:)
• create() (created on device, no transfers) — map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) — TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) — TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) — UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) — UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) — does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) — distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) — nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

Page 42: Introduction to OpenACC and OpenMP GPU - IDRIS

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GRWSVS:

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-François Lavallée, Thibaut Véry 38/77

Execution GPWPVS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations, and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code located after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213. (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000. (right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77
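The merged form also exists in the C binding of OpenACC. A minimal sketch (the `scale` function name is ours; a compiler without OpenACC support simply ignores the pragma and runs the loop serially on the host, producing the same result):

```c
#include <stddef.h>

/* Merged "kernels loop" construct: directive and loop in one place,
   with the data clauses of both constructs available.  Without an
   OpenACC compiler the pragma is ignored and the loop runs serially. */
void scale(const int *b, int *a, size_t n)
{
    #pragma acc kernels loop copyin(b[0:n]) copyout(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] * 2;
}
```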

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations listed below is executed to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77
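The privatize-then-combine semantics can be sketched in C (a hypothetical `sum_reduction` helper; with the pragma ignored by a non-OpenACC compiler, the loop runs serially and yields the same value):

```c
#include <stddef.h>

/* "+" reduction: each thread gets a private copy of `sum`, and the
   partial sums are combined at the end of the loop.  The serial
   fallback (pragma ignored) produces the same result. */
double sum_reduction(const double *a, size_t n)
{
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```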

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44 / 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

Example with «coalescing» / Example with «no coalescing»

Pierre-François Lavallée, Thibaut Véry 45 / 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=11.4_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=11.4_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=11.4_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77
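The same rule transposed to C, where arrays are row-major, so the contiguous index is the last one (the opposite of Fortran). A hedged sketch with a hypothetical `fill_coalesced` function; without an OpenACC compiler the pragmas are ignored and the loops run serially:

```c
#define NX 16

/* Coalescing rule in C (row-major): the vector loop runs over the last
   index, so consecutive threads touch consecutive memory locations
   a[i][j], a[i][j+1], ...  In Fortran (column-major) the contiguous
   index is the first one, as in loop_coalescing.f90. */
void fill_coalesced(double a[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector   /* contiguous accesses within a row */
        for (int j = 0; j < NX; ++j)
            a[i][j] = 11.4;
    }
}
```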

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0–T7 onto the elements of A for the three versions. Without coalescing, gang(i)/vector(j): consecutive threads touch elements one row apart. With coalescing, gang(j)/vector(i) and gang-vector(i)/seq(j): consecutive threads touch consecutive elements.]

Pierre-François Lavallée, Thibaut Véry 47 / 77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
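The routine directive works the same way in C. A minimal sketch (the function names `fill_row`/`fill_all` are ours; with the pragmas ignored by a non-OpenACC compiler, everything runs serially with the same result):

```c
/* The routine directive asks for a device version of fill_row with
   sequential parallelism, so each iteration of the offloaded loop
   can call it. */
#pragma acc routine seq
void fill_row(int *array, int s, int i)
{
    for (int j = 0; j < s; ++j)
        array[i * s + j] = 2;
}

void fill_all(int *array, int s)
{
    #pragma acc parallel loop copyout(array[0:s*s])
    for (int i = 0; i < s; ++i)
        fill_row(array, s, i);
}
```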

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H: host, D: device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
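The three movement behaviours can be combined on one compute construct. A hedged sketch (the `axpy` function and its scratch array are ours, not from the course; with the pragmas ignored, the code runs serially on the host arrays):

```c
/* x is only read (copyin), y is read and written back (copy), and
   tmp is device-only scratch (create: allocated, never transferred). */
void axpy(int n, float alpha, const float *x, float *y, float *tmp)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n]) create(tmp[0:n])
    for (int i = 0; i < n; ++i) {
        tmp[i] = alpha * x[i];
        y[i] = y[i] + tmp[i];
    }
}
```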

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array to the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[The parallel construct opens both a data region and a parallel region.]

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you would have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
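The same placement in C, as a minimal sketch (the `iterate` function is ours; when the pragmas are ignored, the `data` braces are just an ordinary block and the loops run serially with the same result):

```c
/* One data region around the whole iteration loop: a[] crosses the
   host-device link twice in total, instead of twice per parallel
   region (2*iters times). */
void iterate(int *a, int n, int iters)
{
    #pragma acc data copy(a[0:n])
    {
        for (int l = 0; l < iters; ++l) {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] = a[i] + 1;
        }
    }
}
```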

Data on device

Naive version: two successive parallel regions, each with its own implicit data region; the variables A (copyin), B (copyout), C (copy) and D (create) appear with the same clauses in both regions.

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: the two parallel regions keep the same clauses on A (copyin), B (copyout), C (copy) and D (create). Let's add a data region enclosing them.

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: with the clauses moved to the enclosing data region, A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.

• Transfers: 4
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Inside the data region, the clauses on the parallel regions only check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are actually done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers to the device can be managed at a different place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
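The same split between allocation site and compute site, sketched in C. This is our own illustrative variant (it uses copyout rather than the delete of the Fortran example, so the result comes back to the host); with the pragmas ignored, the code works directly on the host array:

```c
/* Allocation on the device happens in run(), while the kernel in
   compute() only asserts presence; the result is copied back when
   the global data region is closed. */
static void compute(double *vec, int s)
{
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 1.2;
}

void run(double *vec, int s)
{
    #pragma acc enter data create(vec[0:s])
    compute(vec, s);
    #pragma acc exit data copyout(vec[0:s])
}
```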

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device; kernels then modify the device copies; update self(b) copies b back D→H during the region; at exit, d is copied D→H, while changes made afterwards on the host are not reflected on the device and vice versa.]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default only one execution thread is created, and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, Kernel 2 overlaps as well.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
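The wait pattern above in C, as a hedged sketch (the function name is ours, and on a real device a and b would have to live in an enclosing data region for the update to be legal; with the pragmas ignored, everything runs serially in program order, which is also a valid, if unoverlapped, schedule):

```c
/* The last loop consumes b, so it must wait for the asynchronous
   transfer queued on execution thread 2; the loop on thread 1 can
   overlap with that transfer. */
void async_wait_demo(double *a, double *b, int s)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < s; ++i)
        a[i] = 42.0;

    #pragma acc update device(b[0:s]) async(2)

    #pragma acc parallel loop wait(2)
    for (int i = 0; i < s; ++i)
        b[i] = b[i] + 1.0;
}
```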

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Feature                         OpenACC 2.7                               OpenMP 5.0
GPU offloading                  PARALLEL (GRWSVS mode, each gang          TEAMS (create teams of threads,
                                executes instructions redundantly)        execute redundantly)
                                SERIAL (only 1 thread executes            TARGET (GPU offloading, initial
                                the code)                                 thread only)
                                KERNELS (offloading + automatic           does not exist in OpenMP
                                parallelization)
Implicit transfer H→D or D→H    scalar variables, non-scalar              scalar variables, non-scalar
                                variables                                 variables
Explicit transfer H→D or D→H    clauses on PARALLEL / SERIAL /            clauses on TARGET (data movement
inside explicit parallel        KERNELS                                   defined from the host perspective)
regions                         copyin() (copy H→D when entering          map(to:)
                                the construct)
                                copyout() (copy D→H when leaving          map(from:)
                                the construct)
                                copy() (copy H→D when entering,           map(tofrom:)
                                D→H when leaving)
                                create() (created on device, no           map(alloc:)
                                transfers)
                                present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Feature                         OpenACC 2.7                               OpenMP 5.0
Data region                     declare() (add data to the global
                                data region)
                                data / end data (inside the same          TARGET DATA MAP(to:/from:) /
                                programming unit)                         END TARGET DATA (inside the same
                                                                          programming unit)
                                enter data / exit data (opening           TARGET ENTER DATA / TARGET EXIT
                                and closing anywhere in the code)         DATA (opening and closing anywhere
                                                                          in the code)
                                update self() (update D→H)                UPDATE FROM (copy D→H inside a
                                                                          data region)
                                update device() (update H→D)              UPDATE TO (copy H→D inside a
                                                                          data region)
Loop parallelization            kernels (offloading + automatic           does not exist in OpenMP
                                parallelization of suitable loops,
                                sync at the end of loops)
                                loop gang/worker/vector (share            distribute (share iterations
                                iterations among gangs, workers           among TEAMS)
                                and vectors)

Pierre-François Lavallée, Thibaut Véry 72 / 77

Feature                         OpenACC 2.7                               OpenMP 5.0
Routines                        routine [seq|vector|worker|gang]
                                (compile a device routine with the
                                specified level of parallelism)
Asynchronism                    async()/wait (execute kernels and         nowait/depend (manage asynchronism
                                transfers asynchronously, with            and task dependencies)
                                explicit sync)

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=11.4_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=11.4_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=11.4_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=11.4_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=11.4_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=11.4_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

Page 43: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41 / 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «Coalescing» / Example with «no Coalescing»

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0-T7 to the elements of A for the three variants: gang(i)/vector(j) (without coalescing), gang(j)/vector(i) (with coalescing) and gang-vector(i)/seq(j). Only the coalesced version makes consecutive threads access consecutive memory locations.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p (for OpenACC 2.0 compatibility).

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[The parallel construct opens both a parallel region and its associated data region.]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region, use the variables A (copyin), B (copyout), C (copy) and D (create). Each region performs its own transfers and allocations.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's enclose both parallel regions in a single data region carrying the copyin/copyout/copy/create clauses. A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Since the data clauses check for data presence, you can use the present clause on the inner regions to make the code clearer. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a region opened with copyin(a,c) create(b) copyout(d). The device values evolve independently of the host until an update self(b) refreshes b on the host; d is copied back D→H only when the region exits.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async they overlap; with everything async, both kernels and the transfer overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60,autocompare)
• Example: a code that has a race condition

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 / OpenMP 5.0

GPU offloading
• PARALLEL (GR-WS-VS mode) ↔ TEAMS (creates teams of threads; each team executes the instructions redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 / OpenMP 5.0

Data regions
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 / OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, explicit synchronization) ↔ nowait/depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 44: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers.
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker.
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and the grid of threads distributes the work.
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
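The gang, worker and vector sizes used above can be bounded explicitly with the num_gangs, num_workers and vector_length clauses. A minimal C sketch (hypothetical example, not from the course files; compiled without an OpenACC compiler the pragma is ignored and the loop runs serially):

```c
#include <stddef.h>

/* Hypothetical sketch: bound all three parallelism levels explicitly.
   Without OpenACC support the pragma is ignored, so the loop runs
   serially and the result is easy to check on the host. */
void scale(const double *b, double *a, size_t nx)
{
    #pragma acc parallel loop gang worker vector \
                num_gangs(10) num_workers(4) vector_length(32)
    for (size_t i = 0; i < nx; ++i)
        a[i] = b[i] * (double)(i + 1);   /* same work as A(i) = B(i)*i */
}
```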

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41/77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
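The same merging exists in C with a combined pragma; a hypothetical sketch (not one of the course files; serially the pragma is ignored):

```c
/* Hypothetical C analogue of fused.f90: "kernels" and "loop" merged
   into a single combined directive. Compiled without OpenACC, the
   pragma is ignored and the loop simply runs on the host. */
void double_array(const int *b, int *a, int n)
{
    #pragma acc kernels loop
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * 2;
}
```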


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
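As an illustration, a dot product in C with a sum reduction (hypothetical sketch, not a course file; compiled without OpenACC the pragma is ignored and the plain serial sum is computed):

```c
/* Each lane gets a private copy of s; the partial sums are combined
   with "+" when the loop ends. Without OpenACC the pragma is ignored
   and the loop runs serially with the same result. */
double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    #pragma acc parallel loop reduction(+:s)
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
```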


Parallelism - Reductions example

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / example with «no coalescing»:


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
!$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
!$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
!$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.
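In C the situation is mirrored: arrays are row-major, so coalesced access requires the vector loop to run over the last index. A hypothetical sketch (not a course file; serially the pragmas are ignored):

```c
#define NX 64

/* Row-major C: consecutive j values are adjacent in memory, so putting
   vector parallelism on the inner j loop gives coalesced accesses --
   the mirror image of the column-major Fortran example above. */
void fill_coalesced(double a[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```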


Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0–T7 onto the elements of A for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads access elements that are far apart in memory. With coalescing, gang(j)/vector(i): consecutive threads access consecutive elements. With gang-vector(i)/seq(j): each thread walks one row of A sequentially.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
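The C equivalent uses `#pragma acc routine`; a hypothetical sketch, assuming a flattened 1D layout instead of the Fortran 2D array (serially the pragmas are ignored):

```c
/* The routine pragma asks the compiler for a device version of
   fill_row; "seq" means each call runs sequentially on one thread.
   Without OpenACC the pragmas are ignored. */
#pragma acc routine seq
static void fill_row(int *row, int n, int value)
{
    for (int j = 0; j < n; ++j)
        row[j] = value;
}

void fill_all(int *array, int s)
{
    #pragma acc parallel loop copyout(array[0:s*s])
    for (int i = 0; i < s; ++i)
        fill_row(&array[(long)i * s], s, 2);
}
```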


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and it is available for the lifetime of the procedure. To make data visible, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: host; D: device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
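A small C illustration of choosing clauses from the access pattern (hypothetical sketch, not a course file; serially the pragma is ignored):

```c
/* x is only read on the device: copyin. y is read and then written
   back to the host: copy. Without OpenACC the pragma is ignored and
   the loop runs serially. */
void saxpy(int n, double alpha, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```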


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you give the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1).
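The C restriction above can be sketched as follows (hypothetical example; for a malloc'ed array the compiler cannot deduce the extent, so the element count must appear in the clause; serially the pragma is ignored):

```c
#include <stdlib.h>

/* v is dynamically allocated, so the shape v[0:n] is mandatory in the
   data clause. Without OpenACC the pragma is ignored. */
double *init_vector(int n)
{
    double *v = malloc((size_t)n * sizeof *v);
    if (!v) return NULL;
    #pragma acc parallel loop copyout(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = 0.5 * i;
    return v;
}
```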


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in an implicit data region.]


Data on device - Parallel regions

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1, 100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
!$ACC parallel
!$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No: you still have to move the beginning of the data region before the first loop.


Data on device - Optimizing transfers

[Figure: two successive parallel regions, each using arrays A (copyin), B (copyout), C (copy) and D (create).]

• Naive version, with data clauses repeated on each parallel region: 8 transfers, 2 allocations.
• With an enclosing data region, A, B and C are transferred once, at entry and exit of the data region: 4 transfers, 1 allocation.
• The clauses check for data presence, so to make the code clearer you can use the present clause inside the region. To make sure the updates are done, use the update directive (update device, update self): 6 transfers, 1 allocation.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))   (directive commented out in this version)
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

With update: the same code, with the update directive active.

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

exemples/update_dir.f90

The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of the enter data and exit data directives is that data transfers can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1, s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
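A C sketch of the same pattern (hypothetical names; in real C++ code, setup/teardown would typically live in a constructor and a destructor; serially the pragmas are ignored):

```c
#include <stdlib.h>

static double *vec;   /* device copy managed by enter/exit data */
static int    s;

void setup(int n)     /* allocate on the host, then on the device */
{
    s   = n;
    vec = malloc((size_t)s * sizeof *vec);
    #pragma acc enter data create(vec[0:s])
}

void compute(void)    /* data already present: no transfer here */
{
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 1.2;
}

void teardown(void)   /* free the device copy, then the host copy */
{
    #pragma acc exit data delete(vec[0:s])
    free(vec);
}
```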


Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of the host and device copies for a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device. During execution the host and device copies diverge until an update self(b) copies b D→H. When the region ends, only d is copied back D→H (copyout); later modifications of the device copies are lost.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
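In C the directive looks the same; a hypothetical sketch (compiled without OpenACC, the pragmas are no-ops and the host copy is authoritative):

```c
static double a[100];

/* Fill a on the device, then make the host copy current with
   update self before the data region ends. Serially the pragmas
   are ignored and the loop just runs on the host. */
void offload_and_sync(void)
{
    #pragma acc data copyout(a)
    {
        #pragma acc parallel loop
        for (int i = 0; i < 100; ++i)
            a[i] = 1.0;
        /* bring the device copy back to the host inside the region */
        #pragma acc update self(a)
    }
}
```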


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
   c(i) = 1
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
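The same dependency expressed in C (hypothetical sketch, not a course file; serially the pragmas are ignored and the statements simply execute in order):

```c
/* Stream 1: independent kernel on a. Stream 2: transfer of b. The
   second kernel reads b, so it waits for stream 2. The final wait
   joins every execution thread before returning. */
void pipeline(double *a, double *b, int n)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i)
        a[i] = 42.0;

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop wait(2) async(3)
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 1.0;

    #pragma acc wait
}
```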



Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example use case: a code that has a race condition.


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
      ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading:
• OpenACC PARALLEL (GRWSVS mode; each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads, which execute redundantly).
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only).
• OpenACC KERNELS (offloading + automatic parallelization): does not exist in OpenMP.

Implicit transfers H→D or D→H: scalar and non-scalar variables, in both models.

Explicit transfers H→D or D→H inside explicit parallel regions: clauses of PARALLEL/SERIAL/KERNELS in OpenACC, clauses of TARGET in OpenMP (data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()

Data region:
• OpenACC declare() (add data to the global data region).
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit).
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code).
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region).
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region).

Loop parallelization:
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops): does not exist in OpenMP.
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS).

Routines:
• OpenACC routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism.

Asynchronism:
• OpenACC async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies).

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 45: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41 / 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
!  The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations listed below is applied to obtain the final value of the variable.

Reduction operators:

Operation  Effect       Language(s)
+          Sum          Fortran, C/C++
*          Product      Fortran, C/C++
max        Maximum      Fortran, C/C++
min        Minimum      Fortran, C/C++
&          Bitwise and  C/C++
|          Bitwise or   C/C++
&&         Logical and  C/C++
||         Logical or   C/C++
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» — example with «no coalescing»

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)  439                  16

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: element-by-element access patterns of threads T0..T7 on array A for the three variants above — gang(i)/vector(j) (without coalescing), and gang(j)/vector(i) and gang-vector(i)/seq(j) (with coalescing). Only in the coalesced mappings do consecutive threads touch consecutive memory locations.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array, s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array, s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU (matrix a).
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 (included) back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy array a, giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198 (included)
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the lower bound of the language (C/C++: 0, Fortran: 1).
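For instance, assuming a is declared as a(1000,1000) (an illustrative assumption, not one of the deck's examples), the two clauses below name the same slice; the second relies on the Fortran default bounds for the first dimension:

```fortran
! assuming: integer :: a(1000,1000)
!$ACC parallel loop copy(a(1:1000,100:199))   ! bounds fully specified
do i = 1, 1000
   do j = 100, 199
      a(i,j) = 42
   enddo
enddo

!$ACC parallel loop copy(a(:,100:199))        ! first dimension defaults to 1:1000
do i = 1, 1000
   do j = 100, 199
      a(i,j) = 42
   enddo
enddo
```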


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel construct opens both a parallel region and its associated data region.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Data on device - Local data regions

(Same code as on the previous slide.)

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version
[Diagram: two successive parallel regions (each with its own data region); at each region the variables A, B, C and D are handled by copyin, copyout, copy and create clauses respectively.]
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers
[Diagram: the same two parallel regions, each still carrying its own copyin, copyout, copy and create clauses.]
Let's add a data region.

Data on device

Optimize data transfers
[Diagram: a data region now encloses both parallel regions and carries the copyin, copyout, copy and create clauses.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers
[Diagram: inside the data region the parallel regions use present clauses; update device and update self are used where the host and device copies must be synchronized.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
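The pattern described above can be sketched as follows (a minimal sketch, not the slides' exact code; array names and loop bodies are illustrative):

```fortran
!$ACC data copyin(a) copy(c) create(b)
!$ACC parallel loop present(a, b, c)
do i = 1, n
   b(i) = a(i) + c(i)     ! data already on the device: no transfer here
enddo
a = a * 2                 ! host-side modification of a
!$ACC update device(a)    ! resynchronize the device copy H->D
!$ACC parallel loop present(a, b, c)
do i = 1, n
   c(i) = a(i) + b(i)
enddo
!$ACC update self(c)      ! fetch c D->H while still inside the region
!$ACC end data
```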

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
!!$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
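No code example accompanies this slide in the transcript; a minimal sketch of the device clause, mirroring the update self example (variable names are illustrative):

```fortran
!$ACC data create(b(1:n))
b = 1                     ! b is (re)initialized on the host
!$ACC update device(b)    ! push the new host values H->D
!$ACC parallel loop
do i = 1, n
   b(i) = b(i) + 1        ! the device works on the freshly updated copy
enddo
!$ACC end data
```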

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that data transfers to the device can be managed at a different place from where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Diagram: evolution of a, b, c, d on the host and on the device across a data region opened with copyin(a,c), create(b), copyout(d). a and c are copied H→D at entry and b is allocated without transfer; update self(b) copies b D→H during the region; d is copied D→H (copyout) at exit. Modifications made on one side in between are not visible on the other until an update or the end of the region.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number makes it possible to create several execution threads.

[Diagram: three time lines for Kernel 1, a data transfer and Kernel 2 — fully synchronous; Kernel 1 and the transfer asynchronous; everything asynchronous.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the data transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition. [Screenshot not reproduced in the transcript.]

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Feature                    | OpenACC 2.7 (directives/clauses, notes)              | OpenMP 5.0 (directives/clauses, notes)
GPU offloading             | PARALLEL: GR/WS/VS mode; each team executes          | TEAMS: create teams of threads; execute
                           | instructions redundantly                             | redundantly
                           | SERIAL: only 1 thread executes the code              | TARGET: GPU offloading, initial thread only
                           | KERNELS: offloading + automatic parallelization      | X: does not exist in OpenMP
Implicit transfer          | scalar variables, non-scalar variables               | scalar variables, non-scalar variables
H→D or D→H                 |                                                      |
Explicit transfer H→D or   | clauses on PARALLEL/SERIAL/KERNELS                   | clauses on TARGET; data movement defined
D→H inside explicit        |                                                      | from the host perspective
parallel regions           |                                                      |
                           | copyin(): copy H→D when entering the construct       | map(to:)
                           | copyout(): copy D→H when leaving the construct       | map(from:)
                           | copy(): copy H→D when entering and D→H when leaving  | map(tofrom:)
                           | create(): created on device, no transfers            | map(alloc:)
                           | present(), delete()                                  |

Feature                    | OpenACC 2.7 (directives/clauses, notes)              | OpenMP 5.0 (directives/clauses, notes)
Data region                | declare(): add data to the global data region        |
                           | data / end data: inside the same programming unit    | TARGET DATA MAP(tofrom:) / END TARGET DATA:
                           |                                                      | inside the same programming unit
                           | enter data / exit data: opening and closing          | TARGET ENTER DATA / TARGET EXIT DATA:
                           | anywhere in the code                                 | opening and closing anywhere in the code
                           | update self(): update D→H                            | UPDATE FROM(): copy D→H inside a data region
                           | update device(): update H→D                          | UPDATE TO(): copy H→D inside a data region
Loop parallelization       | kernels: offloading + automatic parallelization of   | X: does not exist in OpenMP
                           | suitable loops; sync at the end of loops             |
                           | loop gang/worker/vector: share iterations among      | distribute: share iterations among TEAMS
                           | gangs, workers and vectors                           |

Feature                    | OpenACC 2.7 (directives/clauses, notes)              | OpenMP 5.0 (directives/clauses, notes)
Routines                   | routine [seq|vector|worker|gang]: compile a device   |
                           | routine with the specified level of parallelism      |
Asynchronism               | async()/wait: execute kernels/transfers              | nowait/depend: manages asynchronism and
                           | asynchronously; explicit sync                        | task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

    !$acc parallel
    !$acc loop gang worker vector
    do i=1,nx
       A(i)=1.14_8*i
    end do
    !$acc end parallel

    exemples/loop1D.f90

    !$acc parallel
    !$acc loop gang worker
    do j=1,nx
       !$acc loop vector
       do i=1,nx
          A(i,j)=1.14_8
       end do
    end do
    !$acc end parallel

    exemples/loop2D.f90

OpenMP:

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i=1,nx
       A(i)=1.14_8*i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

    exemples/loop1DOMP.f90

    !$OMP TARGET TEAMS DISTRIBUTE
    do j=1,nx
       !$OMP PARALLEL DO SIMD
       do i=1,nx
          A(i,j)=1.14_8
       end do
       !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

    exemples/loop2DOMP.f90

OpenACC:

    !$acc parallel
    !$acc loop gang
    do k=1,nx
       !$acc loop worker
       do j=1,nx
          !$acc loop vector
          do i=1,nx
             A(i,j,k)=1.14_8
          end do
       end do
    end do
    !$acc end parallel

    exemples/loop3D.f90

OpenMP:

    !$OMP TARGET TEAMS DISTRIBUTE
    do k=1,nx
       !$OMP PARALLEL DO
       do j=1,nx
          !$OMP SIMD
          do i=1,nx
             A(i,j,k)=1.14_8
          end do
          !$OMP END SIMD
       end do
       !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

    exemples/loop3DOMP.f90

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop directive. The clauses available for the combined construct are those of both constructs.

For example, with the kernels directive:

    program fused
       integer :: a(1000), b(1000), i
       !$ACC kernels <clauses>
       !$ACC loop <clauses>
       do i=1,1000
          a(i) = b(i)*2
       enddo
       !$ACC end kernels
       ! The construct above is equivalent to:
       !$ACC kernels loop <kernels and loop clauses>
       do i=1,1000
          a(i) = b(i)*2
       enddo
    end program fused

    exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to obtain the final value of the variable.

Reduction operators:

| Operation | Effect | Language(s) |
|---|---|---|
| + | Sum | Fortran/C(++) |
| * | Product | Fortran/C(++) |
| max | Maximum | Fortran/C(++) |
| min | Minimum | Fortran/C(++) |
| & | Bitwise and | C(++) |
| \| | Bitwise or | C(++) |
| && | Logical and | C(++) |
| \|\| | Logical or | C(++) |
| iand | Bitwise and | Fortran |
| ior | Bitwise or | Fortran |
| ieor | Bitwise xor | Fortran |
| .and. | Logical and | Fortran |
| .or. | Logical or | Fortran |

Restrictions: the variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). The OpenACC 2.7 specification allows reductions on arrays, but the implementation is still missing in PGI (for the moment).

Parallelism - Reductions example

    !$ACC parallel
    do g=1,generations
       !$ACC loop
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )              +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             end if
          enddo
       enddo
       cells = 0
       !$ACC loop reduction(+:cells)
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation ", g, cells
    enddo
    !$ACC end parallel

    exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged (coalesced). This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should access memory contiguously.

[Figure: memory access patterns with and without coalescing.]

Parallelism - Memory accesses and coalescing

    !$acc parallel
    !$acc loop gang
    do i=1,nx
       !$acc loop vector
       do j=1,nx
          A(i,j)=1.14_8
       end do
    end do
    !$acc end parallel

    exemples/loop_nocoalescing.f90

    !$acc parallel
    !$acc loop gang
    do j=1,nx
       !$acc loop vector
       do i=1,nx
          A(i,j)=1.14_8
       end do
    end do
    !$acc end parallel

    exemples/loop_coalescing.f90

    !$acc parallel
    !$acc loop gang vector
    do i=1,nx
       !$acc loop seq
       do j=1,nx
          A(i,j)=1.14_8
       end do
    end do
    !$acc end parallel

    exemples/loop_coalescing.f90

| | Without coalescing | With coalescing |
|---|---|---|
| Time (ms) | 43.9 | 1.6 |

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure, built up over several slides: thread-to-memory mapping for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads T0..T7 of a vector access memory locations that are far apart, so their accesses cannot be merged. With coalescing, gang(j)/vector(i): consecutive threads T0..T7 access consecutive memory locations, which are merged into a few transactions. gang-vector(i)/seq(j): each thread walks its own row sequentially.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without `!$ACC routine`:

    program routine
       integer :: s = 10000
       integer, allocatable :: array(:,:)
       integer :: i
       allocate(array(s,s))
       !$ACC parallel
       !$ACC loop
       do i=1,s
          call fill(array(:,:), s, i)
       enddo
       !$ACC end parallel
       print *, array(1,10)
    contains
       subroutine fill(array, s, i)
          integer, intent(out) :: array(:,:)
          integer, intent(in)  :: s, i
          integer :: j
          do j=1,s
             array(i,j) = 2
          enddo
       end subroutine fill
    end program routine

    exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

    program routine
       integer :: s = 10000
       integer, allocatable :: array(:,:)
       integer :: i
       allocate(array(s,s))
       !$ACC parallel copyout(array)
       !$ACC loop
       do i=1,s
          call fill(array(:,:), s, i)
       enddo
       !$ACC end parallel
       print *, array(1,10)
    contains
       subroutine fill(array, s, i)
       !$ACC routine seq
          integer, intent(out) :: array(:,:)
          integer, intent(in)  :: s, i
          integer :: j
          do j=1,s
             array(i,j) = 2
          enddo
       end subroutine fill
    end program routine

    exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

• Computation offloading: offloaded regions (serial, parallel, kernels) are associated with a local data region.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Global region: an implicit data region is open during the lifetime of the program. It is managed with the enter data and exit data directives.
• Data region associated with a programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called and lives for the duration of that procedure. To make data visible there, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

(H: Host, D: Device; variable: scalar or array.)

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check whether the variable is already present on the device; if so, no action is taken. You may still see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you give the first and last index.

    ! We copy a 2d array on the GPU for matrix a.
    ! In Fortran we can omit the shape: copy(a)
    !$ACC parallel loop copy(a(1:1000,1:1000))
    do i=1,1000
       do j=1,1000
          a(i,j) = 0
       enddo
    enddo
    ! We copy columns 100 to 199 included
    !$ACC parallel loop copy(a(:,100:199))
    do i=1,1000
       do j=100,199
          a(i,j) = 42
       enddo
    enddo

    exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you give the first index and the number of elements.

    // We copy the array a by giving the first element and the size of the array
    #pragma acc parallel loop copy(a[0:1000][0:1000])
    for (int i=0; i<1000; ++i)
        for (int j=0; j<1000; ++j)
            a[i][j] = 0;
    // Copy columns 99 to 198
    #pragma acc parallel loop copy(a[0:1000][99:100])
    for (int i=0; i<1000; ++i)
        for (int j=99; j<199; ++j)
            a[i][j] = 42;

    exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last dimension of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

    program para
       integer :: a(1000)
       integer :: i, l

       !$ACC parallel copyout(a(1:1000))
       !$ACC loop
       do i=1,1000
          a(i) = i
       enddo
       !$ACC end parallel
    end program para

    exemples/parallel_data.f90

(The parallel region implicitly opens a data region around it.)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

    program para
       integer :: a(1000)
       integer :: i, l

       !$ACC parallel copyout(a(1:1000))
       !$ACC loop
       do i=1,1000
          a(i) = i
       enddo
       !$ACC end parallel

       do l=1,100
          !$ACC parallel copy(a(1:1000))
          !$ACC loop
          do i=1,1000
             a(i) = a(i) + 1
          enddo
          !$ACC end parallel
       enddo
    end program para

    exemples/parallel_data_multi.f90

Notes:
• The compute region is reached 100 times.
• The data region is reached 100 times (entry and exit, so potentially 200 data transfers).

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in its clauses visible on the device. Use the data directive:

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1,1000
       a(i) = i
    enddo
    !$ACC end parallel

    !$ACC data copy(a(1:1000))
    do l=1,100
       !$ACC parallel
       !$ACC loop
       do i=1,1000
          a(i) = a(i) + 1
       enddo
       !$ACC end parallel
    enddo
    !$ACC end data

    exemples/parallel_data_single.f90

Notes:
• The compute region is reached 100 times.
• The data region is reached 1 time (entry and exit, so potentially 2 data transfers).

Optimal? No: you have to move the beginning of the data region before the first loop as well.

Data on device - Counting transfers

[Figure, built up over several slides: two successive parallel regions, each using arrays A (copyin), B (copyout), C (copy) and D (create).]

• Naive version: each parallel region carries its own data clauses, so A, B and C are moved by both regions and D is allocated twice. Transfers: 8, allocations: 2.
• Let's add an enclosing data region: A, B and C are now transferred only at entry and exit of the data region, and D is allocated once. Transfers: 4, allocations: 1.
• The data clauses check for presence, so the inner clauses become no-ops; to make the code clearer you can replace them with present clauses. To make sure the values are refreshed between the two regions, use the update directive (update device for A, update self for C). Transfers: 6, allocations: 1.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1,1000
       a(i) = 0
    enddo
    do j=1,42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1,1000
          a(i) = a(i) + rng
       enddo
    enddo
    ! print *, "before update self: ", a(42)
    ! !$ACC update self(a(42:42))
    ! print *, "after update self: ", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data: ", a(42)
    !$ACC end data
    print *, "after end data: ", a(42)

    exemples/update_dir_comment.f90

Without update: the a array is not up to date on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1,1000
       a(i) = 0
    enddo
    do j=1,42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1,1000
          a(i) = a(i) + rng
       enddo
    enddo
    print *, "before update self: ", a(42)
    !$ACC update self(a(42:42))
    print *, "after update self: ", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data: ", a(42)
    !$ACC end data
    print *, "after end data: ", a(42)

    exemples/update_dir.f90

With update: the a array is up to date on the host right after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes: one benefit of the enter data and exit data directives is that data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

    program enterdata
       integer :: s = 10000
       real*8, allocatable, dimension(:) :: vec
       allocate(vec(s))
       !$ACC enter data create(vec(1:s))
       call compute
       !$ACC exit data delete(vec(1:s))
    contains
       subroutine compute
          integer :: i
          !$ACC parallel loop
          do i=1,s
             vec(i) = 1.2
          enddo
       end subroutine
    end program enterdata

    exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where the directive appears. For example:

| Zone | Scope |
|---|---|
| module | the program |
| function | the function |
| subroutine | the subroutine |

    module declare
       integer :: rows = 1000
       integer :: cols = 1000
       integer :: generations = 100
       !$ACC declare copyin(rows, cols, generations)
    end module

    exemples/declare.f90

Data on device - Time line

[Figure: host/device time line for a data region opened with copyin(a,c) create(b) copyout(d). Entering the region copies a and c H→D and allocates b and d on the device; kernels then modify the device copies; an update self(b) makes the device value of b visible on the host in the middle of the region; d is copied D→H (copyout) only when the region ends.]

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since data clauses can also appear on kernels and parallel constructs, if an enclosing data region is opened with data, use update to avoid unexpected behaviors. The same applies to data regions opened with enter data.

The update directive is used to update data either on the host or on the device:

    ! Update the value of a located on host with the value on device
    !$ACC update self(a)
    ! Update the value of b located on device with the value on host
    !$ACC update device(b)

    exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernels, if they are independent.

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Figure: three time lines. Synchronous: kernel 1, then the data transfer, then kernel 2. Kernel 1 and the transfer async: they overlap. Everything async: kernel 1, the transfer and kernel 2 all overlap.]

    ! b is initialized on host
    do i=1,s
       b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
       a(i) = 42
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop async(3)
    do i=1,s
       c(i) = 1
    enddo

    exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution-thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important note: the argument of wait is a list of integers identifying execution threads. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer running on execution thread 2.]

    ! b is initialized on host
    do i=1,s
       b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
       a(i) = 42
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop wait(2)
    do i=1,s
       b(i) = b(i) + 1
    enddo

    exemples/async_wait_slides.f90

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Debugging - PGI autocompare

• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the target options at compilation (e.g. -ta=tesla:cc60,autocompare).
• Example of code that has a race condition: [screenshot omitted]

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the bare parallel directive:

    do g=1,generations
       cells = 0
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )              +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             end if
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             ...

    exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased because the code is executed sequentially and redundantly by all the gangs.

Optimization - Game of Life

Let's share the work among the active gangs:

    do g=1,generations
       cells = 0
       !$ACC parallel loop
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )              +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             end if
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation ", g, cells
    enddo

    exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

The loop iterations are now distributed among the gangs.

Optimization - Game of Life

Let's optimize the data transfers:

    cells = 0
    !$ACC data copy(oworld, world)
    do g=1,generations
       cells = 0
       !$ACC parallel loop
       do c=1,cols
          do r=1,rows
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do c=1,cols
          do r=1,rows
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )              +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             end if
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do c=1,cols
          do r=1,rows
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation ", g, cells
    enddo
    !$ACC end data

    exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts


  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 47: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in table 1 to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Table 1: Reduction operators

Restrictions
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
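The table above can be exercised with a small, hedged C sketch (my own illustration, not from the slides; the function name is invented). The reduction clause gives each parallel element a private copy of `cells` and combines the partial results with `+` at the end of the loop; compiled without OpenACC support, the pragma is simply ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* Count live cells with a "+" reduction: each gang/worker/vector element
 * works on a private copy of "cells"; the partial sums are combined when
 * the loop ends. */
long count_alive(const int *world, int n)
{
    long cells = 0;
    #pragma acc parallel loop reduction(+:cells)
    for (int i = 0; i < n; ++i)
        cells += world[i];
    return cells;
}
```

With a PGI/NVHPC compiler this would be built with -acc; any other C compiler ignores the pragma and still produces the correct serial result.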

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions example

!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example without coalescing (figures)
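To translate the principle into C (a hedged sketch of mine, not from the slides): C arrays are row-major, so the contiguous index is the last one, the opposite of the Fortran examples that follow, where the first index is contiguous. The vector loop should therefore run over j here.

```c
#include <assert.h>

#define NX 64

/* Coalesced version: the threads of a vector handle consecutive j values,
 * i.e. consecutive addresses a[i][j], a[i][j+1], ... in row-major C. */
void init_coalesced(double a[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```

Swapping the two loops (vector over i) would make neighboring threads touch addresses NX doubles apart, which is the non-coalesced pattern the slide measures.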

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
  !$acc loop vector
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
  !$acc loop seq
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

(Figure: the memory-access pattern of threads T0-T7 on array A for the three mappings. Without coalescing, gang(i)/vector(j): consecutive threads access elements a whole row apart. With coalescing, gang(j)/vector(i): consecutive threads access consecutive elements. gang-vector(i)/seq(j): each thread walks its own row sequentially.)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
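A hedged C sketch of mine combining two of the clauses above (the function name is invented): b is only read on the device, so copyin suffices, and a is only produced there, so copyout avoids a useless H→D transfer. Without an OpenACC compiler the pragma is ignored and the loop runs on the host.

```c
#include <assert.h>

/* copyin(b): allocate b on the device and copy H->D on entry.
 * copyout(a): allocate a on the device and copy D->H on exit. */
void add_offset(double *a, const double *b, int n)
{
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + 1.0;
}
```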

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and the last index.

! We copy a 2D array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
  do j=1, 1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
  do j=100, 199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered to be the default of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel construct implicitly opens a data region of the same extent.)

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region handling A (copyin), B (copyout), C (copy) and D (create) separately.
• Transfers: 8
• Allocations: 2

Optimizing data transfers: let's add an enclosing data region carrying the same clauses. A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Since the clauses check for data presence, to make the code clearer you can use the present clause on the inner parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo

do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90, the same code with the update line commented out), the a array is not initialized on the host before the end of the data region. With it, the a array is initialized on the host right after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.
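A hedged C sketch (mine, not the slides'): b is created on the device but initialized on the host inside the data region, so the device copy must be refreshed with update device before a kernel reads it. With the pragmas ignored, everything runs on the host and the result is unchanged.

```c
#include <assert.h>

/* create(b): device-only allocation, no transfer.
 * copyout(a): the result comes back D->H when the region ends. */
void compute_with_host_init(double *a, double *b, int n)
{
    #pragma acc data create(b[0:n]) copyout(a[0:n])
    {
        for (int i = 0; i < n; ++i)        /* host-side initialization */
            b[i] = 2.0 * i;
        #pragma acc update device(b[0:n])  /* refresh the device copy H->D */
        #pragma acc parallel loop present(a[0:n], b[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + 1.0;
    }
}
```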

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1, s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

(Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied to the device and b and d are allocated there. Kernels then modify the device copies while the host values stay stale; update self(b) copies b back D→H in the middle of the region; when the region ends, only d is copied out D→H.)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Figure: three schedules of kernel 1, the data transfer and kernel 2: fully synchronous; kernel 1 and the transfer asynchronous; everything asynchronous, with increasing overlap.)

! b is initialized on the host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
  c(i) = 1
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or wait directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Figure: kernel 2 waits for the transfer to complete.)

! b is initialized on the host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:
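The slide's own code is not reproduced in this transcript, so here is a hedged C illustration (mine, with an invented function name) of the kind of defect autocompare catches: every iteration writes the same scalar, so on the GPU the final value depends on which thread wrote last, while the serial CPU run deterministically keeps v[n-1]. The two executions diverge and autocompare stops the program.

```c
#include <assert.h>

/* Race condition: all iterations write "last" concurrently on the GPU.
 * Compiled without OpenACC the loop is serial and returns v[n-1]. */
double last_written(const double *v, int n)
{
    double last = 0.0;
    #pragma acc parallel loop copy(last)
    for (int i = 0; i < n; ++i)
        last = v[i];
    return last;
}
```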


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      ! ... (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.


OpenACC 2.7 ↔ OpenMP 5.0 translation table

GPU offloading
• PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions: clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()

Data regions
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(...) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
  A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
  A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
  !$OMP PARALLEL DO SIMD
  do i=1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
  !$acc loop worker
  do j=1, nx
    !$acc loop vector
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
  !$OMP PARALLEL DO
  do j=1, nx
    !$OMP SIMD
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 48: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Reductions example

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)   + oworld(r+1,c) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77

Parallelism - Memory accesses and coalescing

Principles
- Contiguous accesses to memory by the threads of a worker can be merged ("coalesced"). This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs. example without coalescing]

Pierre-François Lavallée, Thibaut Véry 45/77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)        43.9                1.6

The memory coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46/77

Parallelism - Memory accesses and coalescing

[Diagram: threads T0 to T7 accessing elements of a 2D array under three mappings.
Without coalescing: gang(i) / vector(j), the threads of a vector access non-contiguous memory locations.
With coalescing: gang(j) / vector(i), or gang-vector(i) / seq(j), the threads of a vector access contiguous memory locations.]

Pierre-François Lavallée, Thibaut Véry 47/77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
- serial
- parallel
- kernels

Global region
An implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
- copyin: the variable is copied H→D; the memory is allocated when entering the region.
- copyout: the variable is copied D→H; the memory is allocated when entering the region.
- copy: copyin + copyout.

No data movement
- create: the memory is allocated when entering the region.
- present: the variable is already on the device.
- delete: frees the memory allocated on the device for this variable.

By default, the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Pierre-François Lavallée, Thibaut Véry 50/77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions
- In Fortran, the last index of an assumed-size dummy array must be specified.
- In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
- The shape must be specified when using a slice.
- If the first index is omitted, it defaults to the language's base index (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel construct opens both a data region and a parallel region.]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
- Compute region reached 100 times.
- Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
   !$ACC parallel
   !$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
- Compute region reached 100 times.
- Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device

Naive version
[Diagram: two successive parallel regions, each with its own implicit data region; the variables A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of each region.]
- Transfers: 8
- Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers
[Diagram: the same two parallel regions; A (copyin), B (copyout), C (copy) and D (create) are still handled separately by each region.]

Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers
[Diagram: a data region now encloses both parallel regions; A (copyin), B (copyout), C (copy) and D (create) are handled once, by the enclosing data region.]

A, B and C are now transferred at the entry and exit of the data region.
- Transfers: 4
- Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers
[Diagram: with the enclosing data region, the parallel regions use present for A, B, C and D; update device and update self refresh A and C where needed.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
- Transfers: 6
- Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1, s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Timeline diagram: host and device copies of the variables a, b, c and d during a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D while b and d are allocated on the device; an update self(b) copies b back D→H in the middle of the region; when the region ends, d is copied D→H.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)

! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
- computation and data transfers;
- kernel and kernel, if they are independent.

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer, then Kernel 2 one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on the host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution-thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
- The argument of wait (clause or directive) is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer before starting.]

! b is initialized on the host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

- Automatic detection of result differences when comparing GPU and CPU execution.
- How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

- To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
- Example of code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)   + oworld(r+1,c) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ! ... (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
- Size: 10000 x 5000
- Generations: 100
- Serial time: 26 s
- Time: 34 min 27 s

The time to solution has increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)   + oworld(r+1,c) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info
- Size: 10000 x 5000
- Generations: 100
- Serial time: 26 s
- Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)   + oworld(r+1,c) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
- Size: 10000 x 5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Feature: GPU offloading
  OpenACC PARALLEL: GR/WS/VS mode; each gang executes the instructions redundantly.
  OpenMP TEAMS: creates teams of threads that execute redundantly.

  OpenACC SERIAL: only 1 thread executes the code.
  OpenMP TARGET: GPU offloading, initial thread only.

  OpenACC KERNELS: offloading + automatic parallelization.
  OpenMP: does not exist in OpenMP.

Feature: Implicit transfer H→D or D→H
  OpenACC: scalar variables, non-scalar variables.
  OpenMP: scalar variables, non-scalar variables.

Feature: Explicit transfer H→D or D→H inside explicit parallel regions
  OpenACC: clauses on PARALLEL/SERIAL/KERNELS.
  OpenMP: clauses on TARGET; data movement is defined from the host perspective.

  OpenACC copyin()  (copy H→D when entering the construct)    <-> OpenMP map(to:)
  OpenACC copyout() (copy D→H when leaving the construct)     <-> OpenMP map(from:)
  OpenACC copy()    (copy H→D at entry and D→H at exit)       <-> OpenMP map(tofrom:)
  OpenACC create()  (created on the device, no transfers)     <-> OpenMP map(alloc:)
  OpenACC present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Feature: Data region
  OpenACC declare(): adds data to the global data region.

  OpenACC data / end data (inside the same programming unit)
  <-> OpenMP TARGET DATA MAP(to/from:) / END TARGET DATA (inside the same programming unit)

  OpenACC enter data / exit data (opening and closing anywhere in the code)
  <-> OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)

  OpenACC update self() (update D→H)
  <-> OpenMP UPDATE FROM (copy D→H inside a data region)

  OpenACC update device() (update H→D)
  <-> OpenMP UPDATE TO (copy H→D inside a data region)

Feature: Loop parallelization
  OpenACC kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops.
  OpenMP: does not exist in OpenMP.

  OpenACC loop gang/worker/vector: share iterations among gangs, workers and vectors.
  OpenMP distribute: share iterations among TEAMS.

Pierre-François Lavallée, Thibaut Véry 72/77

Feature: Routines
  OpenACC routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism.

Feature: Asynchronism
  OpenACC async()/wait: execute kernels and transfers asynchronously, with explicit synchronization.
  OpenMP nowait/depend: manages asynchronism and task dependencies.

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 49: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Memory accesses and coalescing

Principlesbull Contiguous access to memory by the threads of a worker can be merged This optimizes the

use of memory bandwidthbull This happens if thread i reaches memory location n and thread i + 1 reaches memory

location n + 1 and so onbull For loop nests the loop which has vector parallelism should have contiguous access to

memory

Example with laquoCoalescingraquo Example with laquonoCoalescingraquo

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

$acc p a r a l l e l10 $acc loop gang

do i =1 nx $acc loop vec to rdo j =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop_nocoalescingf90

$acc p a r a l l e l20 $acc loop gang

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_825 end do

end do $acc end p a r a l l e l

exemplesloop_coalescingf90

$acc p a r a l l e l30 $acc loop gang vec to r

do i =1 nx $acc loop seqdo j =1 nx

A( i j )=114_835 end do

end do $acc end p a r a l l e l

exemplesloop_coalescingf90

Without Coalescing With CoalescingTps (ms) 439 16

The memory coalescing version is 27 times faster

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

With coalescing

gang(j)vector(i)T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

T1

T1

X

X

T2

T2

X

X

T3

T3

X

X

T4

T4

X

X

T5

T5

X

X

T6

T6

X

X

T7

T7

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

X

X

T1

T1

X

X

X

X

T2

T2

X

X

X

X

T3

T3

X

X

X

X

T4

T4

X

X

X

X

T5

T5

X

X

X

X

T6

T6

X

X

X

X

T7

T7

X

X

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array, s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array, s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
  for (int j = 99; j < 199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

[Figure: two successive parallel regions (each with its own implicit data region) operating on four arrays A, B, C and D with the clauses copyin, copyout, copy and create.]

Naive version: each parallel region performs its own transfers and allocations.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device, update self).
• Transfers: 6
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where they are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Scope
Module: the program
Function: the function
Subroutine: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d): on entry, a and c are copied H→D while b and d are only allocated on the device; the kernels then modify the device copies while the host copies keep their old values; update self(b) copies b D→H mid-region; on exit of the data region only d is copied back D→H.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: time lines for Kernel 1, the data transfer and Kernel 2 — first executed synchronously, then with Kernel 1 and the transfer async, then with everything async.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of a code that has a race condition:

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• PARALLEL (GRWSVS mode - each gang executes instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

Time (ms): without coalescing 439, with coalescing 16.

The memory coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

With coalescing

gang(j)vector(i)T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

T1

T1

X

X

T2

T2

X

X

T3

T3

X

X

T4

T4

X

X

T5

T5

X

X

T6

T6

X

X

T7

T7

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

X

X

T1

T1

X

X

X

X

T2

T2

X

X

X

X

T3

T3

X

X

X

X

T4

T4

X

X

X

X

T5

T5

X

X

X

X

T6

T6

X

X

X

X

T7

T7

X

X

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Without ACC routine

program r o u t i n ei n t e g e r s = 10000

3 in teger a l l o c a t a b l e a r ray ( )a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l$ACC loopdo i =1 s

8 c a l l f i l l ( a r ray ( ) s i )enddo$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta ins

13 subrou t ine f i l l ( array s i )i n teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

18 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutine_wrongf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Correct version

program r o u t i n ei n t e g e r s = 10000in teger a l l o c a t a b l e a r ray ( )

4 a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l copyout ( a r ray )$ACC loopdo i =1 s

c a l l f i l l ( a r ray ( ) s i )9 enddo

$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta inssubrou t ine f i l l ( array s i )

14 $ACC r o u t i n e seqin teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

19 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutinef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices by opening different kinds of dataregions

Computation offloadingOffloading regions are associated with a localdata regionbull serialbull parallelbull kernels

Global regionAn implicit data region is opened during thelifetime of the program The management of thisregion is done with enter data and exit datadirectives

Local regionsTo open a data region inside a programming unit(function subroutine) use data directive inside acode block

Data region associated toprogramming unit lifetimeA data region is created when a procedure iscalled (function or subroutine) It is availableduring the lifetime of the procedure To makedata visible use declare directive

NotesThe actions taken for the data inside these regions depend on the clause in which they appear

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H Host D Device variable scalar or array

Data movementbull copyin The variable is copied HrarrD the

memory is allocated when entering theregion

bull copyout The variable is copied DrarrH thememory is allocated when entering theregion

bull copy copyin + copyout

No data movementbull create The memory is allocated when

entering the regionbull present The variable is already on the

devicebull delete Frees the memory allocated on the

device for this variable

By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility

Other clausesbull no_createbull deviceptr

bull attachbull detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language's default (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51 / 77
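A minimal C sketch of the dynamic-allocation rule above (the function name is illustrative): the compiler cannot deduce the size of a malloc'd array, so the element count must appear in the clause.

```c
#include <stdlib.h>

/* Fill a dynamically allocated array on the device; the shape v[0:n]
   (first index 0, n elements) is mandatory because v is just a pointer. */
double *iota(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc parallel loop copyout(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = (double)i;
    return v;
}
```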

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens both the data region and the parallel region
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device - Local data regions

(Same code as on the previous slide.)

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop, so that one data region covers every kernel.

Pierre-François Lavallée, Thibaut Véry 53 / 77
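The recommended layout, one data region opened before the first kernel, can be sketched in C (hypothetical function and sizes; with the pragmas ignored it runs serially and produces the same values):

```c
/* One data region keeps `a` resident on the device across every kernel:
   allocation at entry, a single copy D->H at exit. */
void iterate(int *a, int n, int gens)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop present(a)
        for (int i = 0; i < n; ++i)
            a[i] = i;                 /* initialization kernel */

        for (int g = 0; g < gens; ++g) {
            #pragma acc parallel loop present(a)
            for (int i = 0; i < n; ++i)
                a[i] += 1;            /* 100x: no transfer per iteration */
        }
    }
}
```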

Data on device

Naive version: two successive compute regions, each with its own implicit data region, use the same four variables A (copyin), B (copyout), C (copy) and D (create). Every clause acts again at each region:
• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: the same two compute regions still perform every transfer twice (copyin, copyout, copy and create on each region). Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: with an enclosing data region carrying copyin(A), copyout(B), copy(C) and create(D), the variables A, B and C are now transferred at entry and exit of the data region and the compute regions find them already present:
• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause on the compute regions. To make sure the updates are done, use the update directive (update device before a kernel needs a host-side change, update self when the host needs device results):
• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77
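Both update directions can be sketched in one C function (hypothetical example; serially the pragmas are no-ops and the host array is the only copy, which yields the same final values as the GPU run):

```c
/* update self refreshes the host copy mid-region; update device pushes a
   host-side modification back before the region's copyout takes effect. */
void demo(double *a, int n)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = 1.0;

        #pragma acc update self(a[0:n])    /* host now sees the ones */
        a[0] = 42.0;                       /* host-side modification */
        #pragma acc update device(a[0:n])  /* push it back H->D      */
    }
}
```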

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77
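The same create/compute/delete split can be sketched in C, which is how the constructor/destructor idea mentioned above looks outside C++ (function names are illustrative):

```c
#include <stdlib.h>

/* Device allocation happens where the array is created, release where it
   is freed, far away from the kernels that use it. */
double *vec_create(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])
    return v;
}

void vec_fill(double *v, int n, double x)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = x;
}

void vec_destroy(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])
    free(v);
}
```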

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: a host/device time line for a data region opened with copyin(a,c) create(b) copyout(d). Entering the region copies a and c H→D and allocates b and d on the device; kernels then modify the device copies while the host copies go stale; an update self(b) refreshes b on the host in the middle of the region; d is copied D→H only when the region ends (copyout).]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines: fully synchronous (Kernel 1, then the data transfer, then Kernel 2), "Kernel 1 and transfer async" (transfer overlapped with Kernel 1), and "everything async" (kernels and transfer all overlapped).]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add wait(execution thread number) to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 runs concurrently with the data transfer; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77
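A C sketch of the same queue/wait pattern (hypothetical function; the data region is added so the update has something present on the device; a compiler that ignores the pragmas runs it sequentially, which is also a valid non-overlapped execution):

```c
/* Queue 1: kernel on a. Queue 2: push b H->D. Queue 3: a kernel that
   reads b, so it must wait(2). The final wait joins all queues. */
void pipeline(double *a, double *b, int n)
{
    #pragma acc data copyout(a[0:n]) copy(b[0:n])
    {
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] = 42.0;

        #pragma acc update device(b[0:n]) async(2)

        #pragma acc parallel loop wait(2) async(3)
        for (int i = 0; i < n; ++i)
            b[i] += 11.0;

        #pragma acc wait   /* join all execution threads */
    }
}
```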

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition

Pierre-François Lavallée, Thibaut Véry 65 / 77
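The slide's race-condition listing did not survive extraction, so here is a minimal hypothetical kernel with the kind of race autocompare flags: every iteration may write the same scalar, so the GPU result depends on gang scheduling while the CPU result is deterministic.

```c
/* Find the last index holding a nonzero value. Serially this is well
   defined; offloaded, many gangs race on `last`, so GPU and CPU runs can
   disagree, which is exactly what autocompare detects. */
int last_index(const int *v, int n)
{
    int last = -1;
    #pragma acc parallel loop copyin(v[0:n]) copy(last)
    for (int i = 0; i < n; ++i) {
        if (v[i] != 0)
            last = i;   /* race: concurrent unsynchronized writes */
    }
    return last;
}
```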

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
      ! (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
(clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET; data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

Data region
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async() / wait (execute kernels/transfers asynchronously, explicit synchronization) ↔ nowait / depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77


Parallelism - Memory accesses and coalescing

Memory accesses are coalesced when consecutive threads of a vector access contiguous memory words: the accesses of the group are then served by a single memory transaction. For a Fortran array a(i,j), whose first index i is contiguous in memory:

Without coalescing
• gang(i) / vector(j): consecutive threads T0..T7 run along j, so each thread touches words that are a whole column apart; most of every memory transaction is wasted.

With coalescing
• gang(j) / vector(i): consecutive threads run along i and touch contiguous words; each memory transaction is fully used.
• gang-vector(i) / seq(j): the threads are distributed over i and each thread iterates sequentially over j; the accesses within a transaction are again contiguous.

[The original slides illustrated this with successive builds of a diagram: threads T0-T7 mapped onto the array, with crosses marking the memory words fetched but not used.]

Pierre-François Lavallée, Thibaut Véry 47 / 77
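A C analogue of the coalesced mapping above (hypothetical function; note that C is row-major, so the contiguous index is the LAST one and the vector loop should therefore run over j):

```c
#define N 64

/* gang over rows, vector over the contiguous column index j:
   consecutive threads touch consecutive elements of a[i][...]. */
void init(double a[N][N])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14 * (i + j);
    }
}
```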

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Data on device - Opening a data region

There are several ways of making data visible on devices by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H Host D Device variable scalar or array

Data movementbull copyin The variable is copied HrarrD the

memory is allocated when entering theregion

bull copyout The variable is copied DrarrH thememory is allocated when entering theregion

bull copy copyin + copyout

No data movementbull create The memory is allocated when

entering the regionbull present The variable is already on the

devicebull delete Frees the memory allocated on the

device for this variable

By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility

Other clausesbull no_createbull deviceptr

bull attachbull detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++

FortranThe array shape is to be specified inparentheses You must specify the first and lastindex

5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000

do j =1 100010 a ( i j ) = 0

enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )

15do i =1 1000do j =100199

a ( i j ) = 42enddo

enddo

exemplesarrayshapef90

CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements

5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )

f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0

Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )

f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42

exemplesarrayshapec

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified

Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0

Fortran1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)i n t e g e r i l

5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = i10 enddo

$ACC end p a r a l l e lend program para

exemplesparallel_dataf90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
module     | the program
function   | the function
subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Diagram: host and device states over the lifetime of a data region opened with copyin(a,c) create(b) copyout(d)]
• Entering the region, the host holds a=10, b=-4, c=42, d=0; the device receives a=10 and c=42 (copyin), while b and d are allocated without transfer (create, copyout).
• The device computes b=26 and d=158; update self(b) copies b=26 back to the host, where d is still 0.
• Exiting the region, only d=158 is copied back to the host (copyout); the later device values (a=42, b=42) are discarded.


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: three time lines — fully synchronous (Kernel 1, data transfer and Kernel 2 run one after the other); Kernel 1 and transfer async (Kernel 1 overlaps the data transfer); everything async (Kernel 1, the data transfer and Kernel 2 all overlap)]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Diagram: Kernel 1, data transfer and Kernel 2 time lines — Kernel 2 waits for the transfer]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target flag (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition:

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ! ... (listing truncated on the slide)

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses with notes)

GPU offloading:
• PARALLEL (GR/WS/VS mode, each gang executes instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data region:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async() / wait (execute kernels/transfers asynchronously, explicit sync) ↔ nowait / depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for a 2D Fortran array]
• Without coalescing: gang(i), vector(j) — consecutive vector threads access memory with a stride
• With coalescing: gang(j), vector(i) — consecutive vector threads (T0..T7) access contiguous elements
• gang-vector(i), seq(j)


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. It gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. It gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading: offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime: a data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered to be the default of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region carries its own implicit data region.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.


do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 53: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Memory accesses and coalescing

[Diagram, shown in three builds: access patterns of threads T0–T7 over a 2D array. "Without coalescing", with gang(i)/vector(j) or gang-vector(i)/seq(j), consecutive threads access elements of the same row, i.e. strided memory locations. "With coalescing", with gang(j)/vector(i), consecutive threads access consecutive elements of a column, i.e. contiguous memory locations (in Fortran the first index is contiguous), which allows the hardware to coalesce the memory accesses.]

Pierre-François Lavallée, Thibaut Véry 47/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77
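The same mechanism can be sketched in C (an illustrative sketch, not taken from the slides; `fill_row` and `fill_matrix` are made-up names). With a compiler that does not understand OpenACC the pragmas are ignored and the code simply runs sequentially:

```c
#include <assert.h>

/* Device routine: its parallelism level (here seq) must be declared
 * so the compiler generates a device version of the function. */
#pragma acc routine seq
void fill_row(int *row, int n, int value)
{
    for (int j = 0; j < n; ++j)
        row[j] = value;
}

/* Each iteration of the offloaded loop calls the device routine. */
void fill_matrix(int n, int a[n][n])
{
    #pragma acc parallel loop copyout(a[0:n][0:n])
    for (int i = 0; i < n; ++i)
        fill_row(a[i], n, 2);
}
```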

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened during the lifetime of the program; it is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure; to make data visible there, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77
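The difference between a structured local region and the unstructured global (enter data / exit data) form can be sketched in C (illustrative; the function names are invented for the example):

```c
#include <assert.h>

/* Structured region: the data's device lifetime is the enclosing block. */
void structured(double *v, int n)
{
    #pragma acc data copy(v[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            v[i] = 2.0 * i;
    }
}

/* Unstructured region: lifetime controlled explicitly, possibly from
 * different functions (useful with C++ constructors/destructors). */
void open_region(double *v, int n)
{
    #pragma acc enter data create(v[0:n])
}

void close_region(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])
}
```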

Data on device - Data clauses

H: Host, D: Device; variables may be scalars or arrays.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may also see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach

Pierre-François Lavallée, Thibaut Véry 50/77
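A minimal C sketch of the three kinds of clauses (the `scale` function is hypothetical; with a non-OpenACC compiler the pragma is ignored and the loop runs on the host):

```c
#include <assert.h>

/* b is only read on the device (copyin); c is only produced there
 * (copyout); scratch is allocated on the device but never
 * transferred in either direction (create). */
void scale(const float *b, float *c, float *scratch, int n, float alpha)
{
    #pragma acc parallel loop copyin(b[0:n]) copyout(c[0:n]) create(scratch[0:n])
    for (int i = 0; i < n; ++i) {
        scratch[i] = alpha * b[i];   /* device-only intermediate */
        c[i]       = scratch[i];
    }
}
```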

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables needed by their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[In the slide, braces mark the parallel construct as opening both a parallel region and its associated data region.]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables needed by their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77
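That fix can be sketched in C (illustrative `iterate` function, not from the slides): the data region is opened before the first kernel, so the initialization loop and all 100 update kernels share a single device residency and the only transfer is the final copyout.

```c
#include <assert.h>

/* Hoisted data region: a stays resident on the device across the
 * initialization kernel and all update kernels, so host<->device
 * traffic happens only at the region boundaries. */
void iterate(int *a, int n, int steps)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i;

        for (int l = 0; l < steps; ++l) {
            #pragma acc parallel loop present(a[0:n])
            for (int i = 0; i < n; ++i)
                a[i] += 1;
        }
    }
}
```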

Data on device

[Diagram, shown in four builds: two successive parallel regions (each with its implicit data region) working on four arrays A, B, C, D with the clauses copyin(A), copyout(B), copy(C), create(D).]

• Naive version: each parallel region performs its own transfers and allocations. Transfers: 8, allocations: 2.
• Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region. Transfers: 4, allocation: 1.
• Since the clauses check for data presence, you can use the present clause on the inner regions to make the code clearer. To make sure the updates are done, use the update directive (update device at the first region, update self at the second). Transfers: 6, allocation: 1.

Pierre-François Lavallée, Thibaut Véry 54/77
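A C sketch of that final layout (the `pipeline` function is hypothetical): the outer data region owns the transfers, the inner kernels assert presence with present, and update is used only where the host really needs fresh values.

```c
#include <assert.h>

/* Outer data region owns all transfers; inner kernels only assert
 * presence. update self refreshes the host copy mid-region. */
void pipeline(double *a, double *b, int n)
{
    #pragma acc data copyin(a[0:n]) copyout(b[0:n])
    {
        #pragma acc parallel loop present(a[0:n], b[0:n])
        for (int i = 0; i < n; ++i)
            b[i] = 2.0 * a[i];

        /* The host wants an intermediate look at b: D -> H. */
        #pragma acc update self(b[0:n])

        #pragma acc parallel loop present(b[0:n])
        for (int i = 0; i < n; ++i)
            b[i] += 1.0;
    }
}
```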

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in the device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77
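A sketch of update device in C (the `refresh` function is illustrative, using the unstructured enter data / exit data form): the host initializes b while the region is open, then pushes the values H→D before the kernel uses them.

```c
#include <assert.h>

/* The host writes b inside an open data region; update device
 * pushes the new host values H -> D before the kernel reads them. */
void refresh(int *b, int n)
{
    #pragma acc enter data create(b[0:n])

    for (int i = 0; i < n; ++i)   /* host-side initialization */
        b[i] = i;

    #pragma acc update device(b[0:n])   /* H -> D */

    #pragma acc parallel loop present(b[0:n])
    for (int i = 0; i < n; ++i)
        b[i] += 10;

    #pragma acc exit data copyout(b[0:n])
}
```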

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of the enter data and exit data directives is that the data transfers to the device can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: module → scope: the program
Zone: function → scope: the function
Zone: subroutine → scope: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Diagram: host and device time lines for a region opened with copyin(a,c) create(b) and closed with copyout(d). Host values before entry: a=10, b=-4, c=42, d=0. On entry, a and c are copied H→D, while b and d are allocated on the device without being initialized. The device computes b=26 and d=15.8. After update self(b), the host sees b=26 while its d is still 0. Later host-side writes (a=42, b=42) never reach the device. On exiting the region, copyout(d) brings d=15.8 back to the host.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number inside the clause makes it possible to create several execution threads.

[Diagram: synchronous execution (Kernel 1, then the data transfer, then Kernel 2) versus "Kernel 1 and transfer async" and "Everything is async", where the kernels and the transfer overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition: [shown as a code screenshot in the original slides]

Pierre-François Lavallée, Thibaut Véry 65/77
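The racy code itself is a screenshot in the original slides. As an illustrative stand-in (not the deck's actual example), a sum accumulated without a reduction clause shows exactly the kind of race autocompare is meant to catch:

```c
#include <assert.h>

/* Race: every iteration accumulates into the same scalar. On the
 * GPU the gangs/vectors update sum concurrently, so the result can
 * differ from the sequential CPU run, which autocompare detects.
 * The fix is to add a reduction(+:sum) clause to the loop. */
int racy_sum(const int *v, int n)
{
    int sum = 0;
    #pragma acc parallel loop /* missing: reduction(+:sum) */
    for (int i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}
```

Compiled without OpenACC, the pragma is ignored and the loop runs sequentially, giving the correct reference result.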

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs:

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ X (does not exist in OpenMP)

Implicit transfer H→D or D→H:
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions:
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 54: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Memory accesses and coalescing

[Figure: access patterns of threads T0..T7 on a 2D Fortran array. Without coalescing (gang(i)/vector(j)), consecutive threads of a vector access elements that are far apart in memory. With coalescing (gang(j)/vector(i), or gang-vector(i)/seq(j)), consecutive threads access contiguous elements of a column, so the accesses of a whole vector are served by a single memory transaction.]

Pierre-François Lavallée, Thibaut Véry 47/77
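The idea in the figure can be sketched in C (illustrative code, not from the slides): with a row-major C array, mapping the vector level to the innermost, contiguous index makes consecutive threads touch adjacent memory. Without an OpenACC compiler the pragmas are simply ignored and the loops run sequentially on the host.

```c
/* Coalesced accesses: the vector loop runs over j, the contiguous index of
 * a row-major C array, so the threads of a vector touch adjacent elements. */
void fill_coalesced(double *a, int n)
{
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j)
            a[i * n + j] = 1.14 * i;   /* a[i][j], contiguous in j */
    }
}
```

Note that the mapping is transposed with respect to the Fortran slides: in Fortran the first index is the contiguous one, in C it is the last.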


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel
    !$ACC loop
    do i=1,s
        call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
contains
    subroutine fill(array, s, i)
        integer, intent(out) :: array(:,:)
        integer, intent(in)  :: s, i
        integer :: j
        do j=1,s
            array(i,j) = 2
        enddo
    end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel copyout(array)
    !$ACC loop
    do i=1,s
        call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
contains
    subroutine fill(array, s, i)
        !$ACC routine seq
        integer, intent(out) :: array(:,:)
        integer, intent(in)  :: s, i
        integer :: j
        do j=1,s
            array(i,j) = 2
        enddo
    end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime: a data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77
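A minimal C sketch of the "local region" case above (function and variable names are illustrative, not from the slides); compiled without OpenACC, the pragmas are ignored and the loop runs on the host:

```c
/* Local data region: v[0:n] is copied to the device when entering the
 * braces and copied back to the host when leaving them; the kernel inside
 * only asserts its presence. */
void scale(double *v, int n)
{
    #pragma acc data copy(v[0:n])
    {
        #pragma acc parallel loop present(v[0:n])
        for (int i = 0; i < n; ++i)
            v[i] *= 2.0;
    }
}
```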

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p, for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Pierre-François Lavallée, Thibaut Véry 50/77
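A small C sketch of these clauses (illustrative, not from the slides): x and y are only read by the kernel, so copyin is enough, and the scalar result is handled with a reduction rather than a copy clause.

```c
/* copyin: x and y are copied H->D on entry and freed on exit, with no
 * copy back; the scalar s is combined across threads by the reduction. */
double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    #pragma acc parallel loop copyin(x[0:n], y[0:n]) reduction(+:s)
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
```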

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
    do j=1,1000
        a(i,j) = 0
    enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
    do j=100,199
        a(i,j) = 42
    enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it is taken as the language default (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1,1000
        a(i) = i
    enddo
    !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its own implicit data region.]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1,1000
        a(i) = i
    enddo
    !$ACC end parallel

    do l=1,100
        !$ACC parallel copy(a(1:1000))
        !$ACC loop
        do i=1,1000
            a(i) = a(i) + 1
        enddo
        !$ACC end parallel
    enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
    a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
    !$ACC parallel
    !$ACC loop
    do i=1,1000
        a(i) = a(i) + 1
    enddo
    !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal!

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device - Local data regions

(Same code as the previous slide.)

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop, otherwise a is still transferred once more by the copyout of the first parallel region.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device

[Diagram: two successive parallel regions (each with its implicit data region) using four arrays: A (copyin), B (copyout), C (copy), D (create).]

Naive version: each parallel region transfers its own data.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region only.
• Transfers: 4
• Allocations: 1

Since the clauses check for data presence, you can use the present clause inside the parallel regions to make the code clearer. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
    a(i) = 0
enddo
do j=1,42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1,1000
        a(i) = a(i) + rng
    enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not up to date on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
    a(i) = 0
enddo
do j=1,42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1,1000
        a(i) = a(i) + rng
    enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is up to date on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77
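A C sketch of update device (illustrative, not from the slides): a scalar modified on the host inside a data region has to be pushed H→D before each kernel that reads it.

```c
/* The scalar off lives on the device for the whole data region; each time
 * the host changes it, update device pushes the new value H->D before the
 * kernel runs. Without OpenACC the pragmas are ignored and the result is
 * the same, computed sequentially. */
void add_offsets(double *a, int n, int iters)
{
    double off = 0.0;
    #pragma acc data copy(a[0:n]) copyin(off)
    {
        for (int k = 0; k < iters; ++k) {
            off = (double)k;                /* modified on the host... */
            #pragma acc update device(off)  /* ...so push it H->D */
            #pragma acc parallel loop present(a[0:n], off)
            for (int i = 0; i < n; ++i)
                a[i] += off;
        }
    }
}
```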

Data on device - Global data regions: enter data/exit data

Important notes: one benefit of the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
    integer :: s = 10000
    real*8, allocatable, dimension(:) :: vec
    allocate(vec(s))
    !$ACC enter data create(vec(1:s))
    call compute
    !$ACC exit data delete(vec(1:s))
contains
    subroutine compute
        integer :: i
        !$ACC parallel loop
        do i=1,s
            vec(i) = 1.2
        enddo
    end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77
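The same pattern can be sketched in C (function names are illustrative): enter data/exit data let the device mapping live in the allocation and deallocation routines, away from the compute code, which only asserts presence.

```c
#include <stdlib.h>

/* Unstructured data lifetime: the device copy is created with the host
 * allocation and deleted with the host free. */
double *make_vec(int n)
{
    double *vec = malloc(n * sizeof *vec);
    #pragma acc enter data create(vec[0:n])
    return vec;
}

void free_vec(double *vec, int n)
{
    #pragma acc exit data delete(vec[0:n])
    free(vec);
}

void fill_vec(double *vec, int n)
{
    #pragma acc parallel loop present(vec[0:n])
    for (int i = 0; i < n; ++i)
        vec[i] = 1.2;
}
```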

Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
    integer :: rows = 1000
    integer :: cols = 1000
    integer :: generations = 100
    !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Diagram: time line of host and device copies for a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b is allocated on the device. Kernels then modify the device copies while the host copies keep their old values. An update self(b) copies the current device value of b back to the host. When the region is left, only d is copied D→H (copyout); the other host copies are left unchanged.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Diagram: synchronously, kernel 1, the data transfer and kernel 2 run one after the other; with async on kernel 1 and the transfer, those two overlap; with async on all three, kernel 1, the transfer and kernel 2 all run concurrently.]

! b is initialized on host
do i=1,s
    b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
    a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
    c(i) = 1
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Diagram: kernel 2 waits for the transfer on execution thread 2 before starting.]

! b is initialized on host
do i=1,s
    b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
    a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
    b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77
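The same async/wait pattern in C (illustrative, not from the slides): two kernels on different execution threads, where the second depends on the first. Without an OpenACC compiler the pragmas are ignored and everything runs sequentially, which gives the same result.

```c
/* Stream 1 fills a; stream 2 must see the finished a, so it carries
 * wait(1). The final bare wait synchronizes the host before b is copied
 * back at the end of the data region. */
void pipeline(double *a, double *b, int n)
{
    #pragma acc data create(a[0:n]) copyout(b[0:n])
    {
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] = 42.0;
        #pragma acc parallel loop async(2) wait(1)
        for (int i = 0; i < n; ++i)
            b[i] = a[i] + 1.0;
        #pragma acc wait
    }
}
```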

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• Example: code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65/77
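The slide's actual example is not reproduced in this transcript; here is an illustrative C kernel with the kind of race that autocompare is designed to catch. On the GPU the final value depends on which gang writes last, while the sequential CPU run is deterministic, so the two executions disagree.

```c
/* Every iteration writes the same element out[0]: a write race on the GPU.
 * Run sequentially (pragmas ignored without OpenACC), the last iteration
 * simply wins. */
void racy(const double *in, double *out, int n)
{
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:1])
    for (int i = 0; i < n; ++i)
        out[0] = in[i];          /* all gangs race on the same element */
}
```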


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
    cells = 0
    !$ACC parallel
    do r=1,rows
        do c=1,cols
            oworld(r,c) = world(r,c)
        enddo
    enddo
    !$ACC end parallel
    !$ACC parallel
    do r=1,rows
        do c=1,cols
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c)                   + oworld(r+1,c)   + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
            else if (neigh == 3) then
                world(r,c) = 1
            end if
        enddo
    enddo
    !$ACC end parallel
    !$ACC parallel
    do r=1,rows
        do c=1,cols
            ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
    cells = 0
    !$ACC parallel loop
    do r=1,rows
        do c=1,cols
            oworld(r,c) = world(r,c)
        enddo
    enddo
    !$ACC parallel loop
    do r=1,rows
        do c=1,cols
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c)                   + oworld(r+1,c)   + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
            else if (neigh == 3) then
                world(r,c) = 1
            end if
        enddo
    enddo
    !$ACC parallel loop reduction(+:cells)
    do r=1,rows
        do c=1,cols
            cells = cells + world(r,c)
        enddo
    enddo
    !$ACC end parallel
    print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
    cells = 0
    !$ACC parallel loop
    do c=1,cols
        do r=1,rows
            oworld(r,c) = world(r,c)
        enddo
    enddo
    !$ACC parallel loop
    do c=1,cols
        do r=1,rows
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c)                   + oworld(r+1,c)   + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
            else if (neigh == 3) then
                world(r,c) = 1
            end if
        enddo
    enddo
    !$ACC parallel loop reduction(+:cells)
    do c=1,cols
        do r=1,rows
            cells = cells + world(r,c)
        enddo
    enddo
    print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77


Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

GPU offloading:
• PARALLEL: GRWSVS mode, each team executes the instructions redundantly ↔ TEAMS: create teams of threads, execute redundantly
• SERIAL: only 1 thread executes the code ↔ TARGET: GPU offloading, initial thread only
• KERNELS: offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfer H→D or D→H: scalar variables, non-scalar variables (both models).

Explicit transfer H→D or D→H inside explicit parallel regions: clauses on PARALLEL/SERIAL/KERNELS (OpenACC) ↔ clauses on TARGET (OpenMP); data movement is defined from the host perspective.
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on device, no transfers ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

Data region:
• declare(): add data to the global data region
• data/end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:)/END TARGET DATA: inside the same programming unit
• enter data/exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA/TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
• update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization:
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72/77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

Routines:
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism:
• async()/wait: execute kernels/transfers asynchronously, with explicit sync ↔ nowait/depend: manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
    A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
    !$acc loop vector
    do i=1,nx
        A(i,j) = 1.14_8
    end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
    A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
    !$OMP PARALLEL DO SIMD
    do i=1,nx
        A(i,j) = 1.14_8
    end do
    !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
    !$acc loop worker
    do j=1,nx
        !$acc loop vector
        do i=1,nx
            A(i,j,k) = 1.14_8
        end do
    end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
    !$OMP PARALLEL DO
    do j=1,nx
        !$OMP SIMD
        do i=1,nx
            A(i,j,k) = 1.14_8
        end do
        !$OMP END SIMD
    end do
    !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
    • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 55: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

X

X

T1

T1

X

X

X

X

T2

T2

X

X

X

X

T3

T3

X

X

X

X

T4

T4

X

X

X

X

T5

T5

X

X

X

X

T6

T6

X

X

X

X

T7

T7

X

X

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to that of the language (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the data region opened by the parallel construct encloses the parallel region itself]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Data on device - Local data regions

Optimal? No: you have to move the beginning of the data region before the first loop, so that the initialization loop is covered by it as well.

Data on device

Naive version

[Figure: two successive parallel regions, each with its own implicit data region using copyin(A), copyout(B), copy(C) and create(D), so every array is transferred or allocated twice]

• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers

[Figure: the two parallel regions still carry copyin(A), copyout(B), copy(C) and create(D) independently]

Let's add a data region!

Data on device

Optimize data transfers

[Figure: a data region now encloses the two parallel regions and takes over the clauses copyin(A), copyout(B), copy(C) and create(D)]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Data on device

Optimize data transfers

[Figure: inside the data region, the two parallel regions now only assert presence of A, B, C and D with present clauses; explicit update device and update self directives synchronize the copies where the host touches the data in between]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocation: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. This is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: host/device time line for a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D while b and d are only allocated on the device; kernels then update the values on the device while the host copies keep their old values; update self(b) copies b back D→H in the middle of the region; on exit of the region, only d is copied back to the host (copyout(d)).]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: execution time lines. Fully synchronous: Kernel 1, then the data transfer, then Kernel 2. "Kernel 1 and transfer async": Kernel 1 and the transfer overlap. "Everything is async": Kernel 1, the transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 (directives/clauses, notes) ↔ OpenMP 5.0 (directives/clauses, notes)

GPU offloading
• PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ X (does not exist in OpenMP)

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 (directives/clauses, notes) ↔ OpenMP 5.0 (directives/clauses, notes)

Data region
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ X (does not exist in OpenMP)
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 (directives/clauses, notes) ↔ OpenMP 5.0 (directives/clauses, notes)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async() / wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait / depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 56: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Without ACC routine

program r o u t i n ei n t e g e r s = 10000

3 in teger a l l o c a t a b l e a r ray ( )a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l$ACC loopdo i =1 s

8 c a l l f i l l ( a r ray ( ) s i )enddo$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta ins

13 subrou t ine f i l l ( array s i )i n teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

18 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutine_wrongf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Correct version

program r o u t i n ei n t e g e r s = 10000in teger a l l o c a t a b l e a r ray ( )

4 a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l copyout ( a r ray )$ACC loopdo i =1 s

c a l l f i l l ( a r ray ( ) s i )9 enddo

$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta inssubrou t ine f i l l ( array s i )

14 $ACC r o u t i n e seqin teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

19 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutinef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices by opening different kinds of dataregions

Computation offloadingOffloading regions are associated with a localdata regionbull serialbull parallelbull kernels

Global regionAn implicit data region is opened during thelifetime of the program The management of thisregion is done with enter data and exit datadirectives

Local regionsTo open a data region inside a programming unit(function subroutine) use data directive inside acode block

Data region associated toprogramming unit lifetimeA data region is created when a procedure iscalled (function or subroutine) It is availableduring the lifetime of the procedure To makedata visible use declare directive

NotesThe actions taken for the data inside these regions depend on the clause in which they appear

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H Host D Device variable scalar or array

Data movementbull copyin The variable is copied HrarrD the

memory is allocated when entering theregion

bull copyout The variable is copied DrarrH thememory is allocated when entering theregion

bull copy copyin + copyout

No data movementbull create The memory is allocated when

entering the regionbull present The variable is already on the

devicebull delete Frees the memory allocated on the

device for this variable

By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility

Other clausesbull no_createbull deviceptr

bull attachbull detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++

FortranThe array shape is to be specified inparentheses You must specify the first and lastindex

5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000

do j =1 100010 a ( i j ) = 0

enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )

15do i =1 1000do j =100199

a ( i j ) = 42enddo

enddo

exemplesarrayshapef90

CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements

5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )

f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0

Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )

f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42

exemplesarrayshapec

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified

Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0

Fortran1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)i n t e g e r i l

5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = i10 enddo

$ACC end p a r a l l e lend program para

exemplesparallel_dataf90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No — you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region. In each region, A is copyin, B is copyout, C is copy and D is create.
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the two parallel regions still carry the same clauses (A copyin, B copyout, C copy, D create). Let's add a data region enclosing both.

Data on device

Optimize data transfers: a data region now encloses both parallel regions and carries the clauses (A copyin, B copyout, C copy, D create). A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers: inside the data region, the data clauses on the parallel regions behave as presence checks, so to make the code clearer you can use the present clause. To make sure the updates are actually done, use the update directive (update device for A before the first region, update self for C after the second).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self:", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self:", a(42)
!$ACC update self(a(42:42))
print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Timeline diagram: the host initializes a=10, b=-4, c=42, d=0. Entering the data region with copyin(a,c), create(b), copyout(d), a and c are copied H→D while b and d are merely allocated on the device. The device computes b=26 and d=158; update self(b) copies only b back D→H (host d is still 0). At exit of the data region, only d is copied back to the host (copyout(d)).]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional; specifying a number inside the clause makes it possible to create several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer and Kernel 2 back to back; with async on Kernel 1 and the transfer, they overlap; with everything async, all three run concurrently.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 1, the data transfer and Kernel 2 run asynchronously, but Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel

  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo

  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo

  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo

  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo

  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 → OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes instructions redundantly) → TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) → TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) → no OpenMP equivalent

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables → scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (clauses on PARALLEL/SERIAL/KERNELS → clauses on TARGET; data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) → map(to:)
• copyout() (copy D→H when leaving the construct) → map(from:)
• copy() (copy H→D when entering and D→H when leaving) → map(tofrom:)
• create() (allocated on device, no transfers) → map(alloc:)
• present(), delete()

Features: OpenACC 2.7 → OpenMP 5.0

Data regions
• declare (adds data to the global data region)
• data / end data (inside the same programming unit) → TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) → TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) → UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) → UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) → no OpenMP equivalent
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) → distribute (shares iterations among TEAMS)

Features: OpenACC 2.7 → OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism
• async() / wait (execute kernels and transfers asynchronously, explicit synchronization) → nowait / depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language's base index (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)i n t e g e r i l

5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = i10 enddo

$ACC end p a r a l l e lend program para

exemplesparallel_dataf90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Timeline diagram: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

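The slide's pattern can be sketched in C with hypothetical names of ours (under a non-OpenACC compiler the pragmas are ignored and the operations simply run in order, so the result is the same as the synchronized GPU execution):

```c
/* Queue 1 fills a, queue 2 uploads b, and the kernel that reads b
   waits on queue 2 before running. */
void async_wait_demo(float *a, float *b, int n) {
    #pragma acc enter data create(a[0:n]) copyin(b[0:n])

    #pragma acc parallel loop async(1)      /* independent of b */
    for (int i = 0; i < n; ++i)
        a[i] = 42.0f;

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop wait(2)       /* needs the fresh b */
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 11.0f;

    #pragma acc wait                        /* drain all queues */
    #pragma acc exit data copyout(a[0:n], b[0:n])
}
```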


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


• To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

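The reduction used in the cell-count loop can be sketched in C (our own helper, not one of the course files; under a non-OpenACC compiler the pragma is ignored and the loop runs sequentially with the same result):

```c
/* Count live cells with a gang-parallel reduction.
   'world' is a flattened rows x cols grid of 0/1 values. */
int count_alive(const int *world, int rows, int cols) {
    int cells = 0;
    #pragma acc parallel loop reduction(+:cells) copyin(world[0:rows*cols])
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            cells += world[r * cols + c];
    return cells;
}
```

Each gang accumulates a private copy of cells; the runtime combines them with + at the end of the region.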

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

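The same transfer optimization applies to any iterative kernel: open one data region around the whole iteration loop instead of paying the implicit transfers of each compute construct. A C sketch with hypothetical names (the pragmas are ignored by a non-OpenACC compiler, so the numerical result is unchanged):

```c
/* One data region around all iterations: 'a' is transferred twice in
   total instead of twice per iteration; 'tmp' never leaves the device. */
void smooth(float *a, float *tmp, int n, int iters) {
    #pragma acc data copy(a[0:n]) create(tmp[0:n])
    for (int it = 0; it < iters; ++it) {
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; ++i)
            tmp[i] = 0.5f * (a[i - 1] + a[i + 1]);
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; ++i)
            a[i] = tmp[i];
    }
}
```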


OpenACC 2.7 ↔ OpenMP 5.0 translation table

GPU offloading
• OpenACC PARALLEL: GR/WS/VS mode, each gang executes the instructions redundantly ↔ OpenMP TEAMS: creates teams of threads that execute redundantly
• OpenACC SERIAL: only 1 thread executes the code ↔ OpenMP TARGET: GPU offloading, initial thread only
• OpenACC KERNELS: offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (data movement defined from the host perspective)
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on device, no transfers ↔ map(alloc:)
• present(), delete()

OpenACC 2.7 ↔ OpenMP 5.0 translation table (continued)

Data region
• declare(): add data to the global data region
• data / end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
• update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS

OpenACC 2.7 ↔ OpenMP 5.0 translation table (continued)

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend: manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
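In C, the same three-level nest can also be mapped in a single directive with the collapse clause (a sketch of ours, not one of the course files; collapse(3) merges the three loops into one iteration space before distributing it):

```c
/* Fill a flattened nx*nx*nx array; collapse(3) exposes all
   nx^3 iterations to the scheduler at once. */
void fill3d(double *A, int nx) {
    #pragma acc parallel loop collapse(3)
    for (int k = 0; k < nx; ++k)
        for (int j = 0; j < nx; ++j)
            for (int i = 0; i < nx; ++i)
                A[(k * nx + j) * nx + i] = 1.14;
}
```

Collapsing is an alternative to the explicit gang/worker/vector mapping shown above when the loop bounds are rectangular.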


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
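A minimal C illustration of the movement clauses (a hypothetical helper of ours; with a non-OpenACC compiler the pragma is ignored and the loop runs on the host with the same result):

```c
/* x is only read by the kernel (copyin);
   y is only written by it (copyout). */
void add_one(const float *x, float *y, int n) {
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = x[i] + 1.0f;
}
```

Choosing copyin/copyout over copy avoids one useless transfer in each direction.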


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)
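The C/C++ restriction can be illustrated with a small sketch of ours (hypothetical function name; without an OpenACC compiler the pragmas are ignored and the code runs sequentially):

```c
#include <stdlib.h>

/* For a malloc'ed array the compiler cannot know the size:
   the explicit shape [0:n] is mandatory in the data clauses. */
double fill_and_sum(int n) {
    double *a = malloc(n * sizeof *a);
    double s = 0.0;

    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = (double)i;

    #pragma acc parallel loop reduction(+:s) copyin(a[0:n])
    for (int i = 0; i < n; ++i)
        s += a[i];

    free(a);
    return s;
}
```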


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel construct opens a parallel region enclosed in an implicit data region.]


program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



No! You have to move the beginning of the data region before the first loop.


Data on device

[Diagram: naive version. Two successive parallel regions, each with its own implicit data region; the variables A (copyin), B (copyout), C (copy) and D (create) are handled again by each region.]
• Transfers: 8
• Allocations: 2

Data on device

[Diagram: to optimize the data transfers, the same two parallel regions with clauses copyin (A), copyout (B), copy (C) and create (D).] Let's add a data region enclosing both parallel regions.

Data on device

[Diagram: with the enclosing data region, the clauses are moved to the data directive.] A, B and C are now transferred at entry and exit of the data region only.
• Transfers: 4
• Allocations: 1

Data on device

[Diagram: inside the data region the clauses of the parallel regions only find the data already present.] The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

With update (exemples/update_dir.f90), the same code with the update directive active:

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

The a array is initialized on the host right after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
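A C sketch of the same unstructured lifetime, with hypothetical names of ours (this is the pattern C++ constructors/destructors would use; a non-OpenACC compiler ignores the pragmas and runs everything on the host):

```c
#include <stdlib.h>

/* Allocation site and use site are decoupled. */
static double *vec;
static int s;

void vec_create(int n) {
    s = n;
    vec = malloc(n * sizeof *vec);
    #pragma acc enter data create(vec[0:s])   /* device-side allocation */
}

void vec_compute(void) {
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;
}

void vec_destroy(void) {
    #pragma acc update self(vec[0:s])         /* fetch results before delete */
    #pragma acc exit data delete(vec[0:s])    /* caller frees the host array */
}
```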


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Timeline diagram: the host initializes a=0, b=0, c=0, d=0 and then sets a=10, b=-4, c=42, d=0. On entry of a region opened with copyin(a,c) create(b) copyout(d), a and c are copied H→D while b and d are only allocated on the device. During the region, update self(b) copies the device value of b (26) back to the host. On exit, only d (158) is copied back D→H: host changes made after entry (e.g. a=42) are never seen by the device, and device results other than b and d never reach the host.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Page 59: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may also see clauses prefixed with present_or_ (or p_), kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

(exemples/arrayshape.f90)

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

(exemples/arrayshape.c)

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

(exemples/parallel_data.f90)

[Figure: the parallel construct opens a parallel region and, around it, an associated data region.]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

(exemples/parallel_data_multi.f90)

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

(exemples/parallel_data_single.f90)

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions (each with its own associated data region) use the arrays A (copyin), B (copyout), C (copy) and D (create).
• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: let's add a data region enclosing both parallel regions. A (copyin), B (copyout) and C (copy) are now transferred only at entry and exit of the data region, and D (create) is allocated once.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers: the data clauses check for data presence, so to make the code clearer you can use the present clause on the inner parallel regions. To make sure the host and device copies are synchronized where needed, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

(exemples/update_dir_comment.f90)

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

(exemples/update_dir.f90)

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables listed in a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

(exemples/enter_data.f90)

Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

(exemples/declare.f90)

Data on device - Time line

[Figure: time line of host and device copies across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are only allocated on the device (undefined content). Device kernels then compute b=26 and d=158 while the host copies keep their old values; an update self(b) copies b back D→H. When the region ends, only d is copied back (copyout); the other device values are discarded.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

(exemples/update.f90)

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run back to back; with async on Kernel 1 and the transfer, those two overlap; with async everywhere, all three operations can overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

(exemples/async_slides.f90)

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

(exemples/async_wait_slides.f90)

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of use: a code that has a race condition


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

(exemples/opti/gol_opti1.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

(exemples/opti/gol_opti2.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

(exemples/opti/gol_opti3.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features — OpenACC 2.7 / OpenMP 5.0

GPU offloading
• PARALLEL (OpenACC): gang/worker/vector mode; each gang executes the instructions redundantly — TEAMS (OpenMP): creates teams of threads that execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code — TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization — no OpenMP equivalent

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables — OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS — OpenMP: clauses on TARGET; data movement is defined from the host perspective
• copyin(): copy H→D when entering the construct — map(to:)
• copyout(): copy D→H when leaving the construct — map(from:)
• copy(): copy H→D when entering and D→H when leaving — map(tofrom:)
• create(): created on the device, no transfers — map(alloc:)
• present(), delete()

Features — OpenACC 2.7 / OpenMP 5.0

Data regions
• declare: adds data to the global data region
• data / end data: inside the same programming unit — TARGET DATA MAP(...) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code — TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H — UPDATE FROM(): copy D→H inside a data region
• update device(): update H→D — UPDATE TO(): copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops — no OpenMP equivalent
• loop gang/worker/vector: share iterations among gangs, workers and vectors — distribute: share iterations among TEAMS

Features — OpenACC 2.7 / OpenMP 5.0

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait: execute kernels/transfers asynchronously, with explicit synchronization — nowait/depend: manage asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop2D.f90)

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1DOMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2DOMP.f90)

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

(exemples/loop3D.f90)

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3DOMP.f90)


Contacts

• Pierre-François Lavallée : pierre-francois.lavallee@idris.fr
• Thibaut Véry : thibaut.very@idris.fr
• Rémi Lacroix : remi.lacroix@idris.fr

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 60: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++

FortranThe array shape is to be specified inparentheses You must specify the first and lastindex

5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000

do j =1 100010 a ( i j ) = 0

enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )

15do i =1 1000do j =100199

a ( i j ) = 42enddo

enddo

exemplesarrayshapef90

CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements

5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )

f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0

Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )

f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42

exemplesarrayshapec

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified

Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0

Fortran1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)i n t e g e r i l

5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = i10 enddo

$ACC end p a r a l l e lend program para

exemplesparallel_dataf90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Time-line figure: the host initializes a=10, b=-4, c=42, d=0. A data region opens with copyin(a,c) create(b) copyout(d); the device kernels then produce b=26 and d=158. update self(b) copies b D→H, so the host sees a=10, b=26, c=42 while d is still 0 there. When the data region ends, copyout(d) copies d=158 back to the host.]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. The argument of async is optional in all cases. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: three time lines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. "Kernel 1 and transfer async": Kernel 1 overlaps with the data transfer, then Kernel 2 runs. "Everything is async": Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on the host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of the wait clause (or of the wait directive) is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Diagram: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on the host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)

• Example of code that has a race condition:


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• PARALLEL (GR/WS/VS mode, each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

Data region:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

Routines:
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manage asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
  !$OMP PARALLEL DO SIMD
  do i=1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
  !$acc loop worker
  do j=1, nx
    !$acc loop vector
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
  !$OMP PARALLEL DO
  do j=1, nx
    !$OMP SIMD
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex Optimization example
• Annex OpenACC 2.7/OpenMP 5.0 translation table
• Contacts

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1)



Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Optimal? No! You have to move the beginning of the data region to before the first loop.


Data on device

Naive version: two successive parallel regions, each with its own implicit data region. The arrays A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of each parallel region.

• Transfers: 8
• Allocations: 2


Data on device

Optimize the data transfers: the same two parallel regions, each still with its implicit data region handling A (copyin), B (copyout), C (copy) and D (create). Let's add a data region.


Data on device

Optimize the data transfers: a data region now encloses the two parallel regions. A, B and C are transferred at the entry and exit of the data region.

• Transfers: 4
• Allocation: 1


Data on device

Optimize the data transfers: inside the enclosing data region, the data clauses of the parallel regions only check for data presence. To make the code clearer you can use the present clause, and to make sure the updates are done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocation: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo
do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!!$ACC update self(a(42:42))   ! update directive disabled in this version
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region containing the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

(The slide highlights that the parallel region implicitly opens a data region around it.)

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region containing the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77
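The fully optimal layout suggested above can be sketched as follows, transcribed here to C for a quick host-side check (the function name and the C translation are ours, not from the course; when compiled without OpenACC support, the pragmas are simply ignored and the loops run sequentially on the host with the same results):

```c
#define N 1000

/* Sketch: the data region opens BEFORE the first kernel, so 'a' is
   allocated on the device once and copied back once at the closing
   brace of the region ("end data" in the Fortran version). */
void init_and_iterate(int a[N])
{
    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = i + 1;              /* first kernel: a(i) = i */

        for (int l = 0; l < 100; ++l) {
            #pragma acc parallel loop  /* 'a' is already present: no copy */
            for (int i = 0; i < N; ++i)
                a[i] += 1;
        }
    } /* single D->H transfer of 'a' happens here */
}
```

Compared with exemples/parallel_data_single.f90, the only change is that the first parallel region now sits inside the data region, which removes its separate copyout transfer.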

Data on device - Optimizing transfers

(The original slides show a diagram of four arrays A, B, C and D used by two successive parallel regions, each array with its transfer clause: copyin, copyout, copy and create respectively.)

• Naive version: each parallel region carries its own data clauses, so A, B, C and D are handled once per region: 8 transfers, 2 allocations.

• Let's add a data region enclosing the two parallel regions: A, B and C are now transferred at entry and exit of the data region only: 4 transfers, 1 allocation.

• The data clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are actually done, use the update directive (update device before the second region, update self after it): 6 transfers, 1 allocation.

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC&              copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC&              copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77
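The effect of update self can be condensed into a small C sketch (our own illustration, not from the course; the random step is replaced by a fixed increment so the result is checkable, and with the pragmas ignored host and device memory coincide, so the function still returns the computed value):

```c
#define N 1000

/* Sketch: inside a data region the host copy of 'a' is stale until an
   explicit "update self" brings the device values back. */
int fill_and_read(int a[N])
{
    int seen;
    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = 0;

        for (int j = 0; j < 42; ++j) {
            int rng = 2;                  /* stand-in for the random step */
            #pragma acc parallel loop copyin(rng)
            for (int i = 0; i < N; ++i)
                a[i] += rng;
        }

        /* bring the device value back before reading it on the host */
        #pragma acc update self(a[42:1])
        seen = a[42];                     /* 42 iterations of +2 */
    }
    return seen;
}
```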

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in the device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of the enter data and exit data directives is that the data transfers to the device can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77
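A C transcription of the same pattern (our own sketch: the names are ours, and we use "exit data copyout" instead of "delete" so the host can check the result; with the pragmas ignored, compute() writes host memory directly and the data flow is unchanged):

```c
#include <stdlib.h>

#define S 10000

/* Sketch mirroring exemples/enter_data.f90: the device allocation is
   managed away from the compute routine that uses the data. */
static double *vec;

static void compute(void)
{
    #pragma acc parallel loop present(vec[0:S])
    for (int i = 0; i < S; ++i)
        vec[i] = 1.2;
}

double run_enter_data(void)
{
    vec = malloc(S * sizeof *vec);
    #pragma acc enter data create(vec[0:S])   /* device-side allocation */
    compute();
    #pragma acc exit data copyout(vec[0:S])   /* release + final D->H copy */
    double last = vec[S - 1];
    free(vec);
    return last;
}
```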

Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Lifetime
module      the program
function    the function
subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

(The original slide is a host/device timeline for a region opened with copyin(a,c) create(b) copyout(d).)

• At entry of the data region, a and c are copied H→D, while b and d are only allocated on the device: their host values are not transferred.
• Kernels then update b and d on the device; the host copies keep their old values.
• update self(b) copies the device value of b back to the host while the region is still open.
• At the end of the data region, d is copied D→H; a, b and c are simply released on the device.

Pierre-François Lavallée, Thibaut Véry 59 / 77
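The timeline can be reproduced with a minimal C sketch (the numeric values are invented for the demo, not taken from the slide; with the pragmas ignored, everything runs on the host but the intended data flow is the same):

```c
/* Sketch of the timeline: a and c are copied in, b exists only on the
   device until "update self", d is copied out at region exit. */
double timeline_demo(void)
{
    double a = 1.0, b = 0.0, c = 4.2, d = 0.0;
    #pragma acc data copyin(a, c) create(b) copyout(d)
    {
        #pragma acc serial
        {
            b = a + c;        /* device-side value of b */
            d = a + b + c;    /* device-side value of d */
        }
        #pragma acc update self(b)   /* host now sees the device's b */
    } /* d is copied D->H here; a, b and c are simply released */
    return b + d;
}
```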

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers inside the clause creates several execution threads.

(The slide diagram compares three schedules of "Kernel 1 / Data transfer / Kernel 2": fully synchronous; Kernel 1 overlapped with the transfer; everything asynchronous.)

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

(Slide diagram: Kernel 2 waits for the transfer to complete.)

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77
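The async/wait pattern above can be transcribed to C for a host-side check (our own sketch; the queue numbers are those of the slide, and with the pragmas ignored the code runs sequentially and produces the same values):

```c
#define S 1000

/* Sketch mirroring exemples/async_wait_slides.f90: the last kernel reads
   'b', so it must wait on queue 2 (the asynchronous update of b), while
   the kernel filling 'a' on queue 1 can proceed independently. */
void async_wait_demo(int a[S], int b[S])
{
    for (int i = 0; i < S; ++i)   /* b is initialized on the host */
        b[i] = i + 1;

    #pragma acc parallel loop async(1)
    for (int i = 0; i < S; ++i)
        a[i] = 42;

    #pragma acc update device(b[0:S]) async(2)

    #pragma acc parallel loop wait(2)   /* needs the transfer of b */
    for (int i = 0; i < S; ++i)
        b[i] = b[i] + 1;

    #pragma acc wait   /* synchronize the remaining queues */
}
```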

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the target accelerator flags at compilation (-ta=tesla:cc60:autocompare).

• Example of code that has a race condition (shown as a screenshot on the original slide).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )             +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ... (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )             +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )             +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
• OpenACC PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H: scalar and non-scalar variables (both standards).

Explicit transfers H→D or D→H inside explicit parallel regions: clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Data regions:
• declare (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72 / 77

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async() / wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait / depend (manage asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73 / 77
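The first row of the table can be made concrete with the same loop written twice in C (our own transcription of the slide's Fortran loop1D examples; either compiler ignores the other standard's pragmas, and without offload support both variants run as plain sequential loops):

```c
#define N 1000

/* OpenACC version: one combined construct sharing iterations over
   gangs, workers and vector lanes. */
void fill_acc(double A[N])
{
    #pragma acc parallel loop gang worker vector
    for (int i = 0; i < N; ++i)
        A[i] = 1.14 * (i + 1);
}

/* OpenMP 5.0 counterpart from the table: TARGET offloads, TEAMS
   DISTRIBUTE PARALLEL FOR SIMD shares the iterations. */
void fill_omp(double A[N])
{
    #pragma omp target teams distribute parallel for simd map(from: A[0:N])
    for (int i = 0; i < N; ++i)
        A[i] = 1.14 * (i + 1);
}
```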

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 63: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life.
Our starting point is the version using the parallel directive.

    do g=1, generations
       cells = 0
       !$ACC parallel
       do r=1, rows
          do c=1, cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1, rows
          do c=1, cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                     oworld(r-1,c  )              +oworld(r+1,c  )+&
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1, rows
          do c=1, cols
             ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
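The pitfall above can be shown compactly in C syntax (function names are ours, not from the course examples): under a bare parallel construct every gang executes the whole loop redundantly, while adding loop distributes the iterations.

```c
/* Sketch: bare "parallel" runs the loop redundantly on every gang
 * (correct result, no speedup); "parallel loop" shares iterations. */
void copy_grid_redundant(double *dst, const double *src, int n)
{
    #pragma acc parallel copyin(src[0:n]) copyout(dst[0:n])
    {
        /* gang-redundant execution: each gang does all n iterations */
        for (int i = 0; i < n; i++)
            dst[i] = src[i];
    }
}

void copy_grid_shared(double *dst, const double *src, int n)
{
    /* iterations distributed among gangs/workers/vectors */
    #pragma acc parallel loop copyin(src[0:n]) copyout(dst[0:n])
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}
```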

Optimization - Game of Life
Let's share the work among the active gangs.

    do g=1, generations
       cells = 0
       !$ACC parallel loop
       do r=1, rows
          do c=1, cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do r=1, rows
          do c=1, cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                     oworld(r-1,c  )              +oworld(r+1,c  )+&
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do r=1, rows
          do c=1, cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life
Let's optimize data transfers.

    cells = 0
    !$ACC data copy(oworld, world)
    do g=1, generations
       cells = 0
       !$ACC parallel loop
       do c=1, cols
          do r=1, rows
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do c=1, cols
          do r=1, rows
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                     oworld(r-1,c  )              +oworld(r+1,c  )+&
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do c=1, cols
          do r=1, rows
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo
    !$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
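The serial core of the update rule above can be sketched in C (our transliteration of the loops in exemples/optigol_opti3.f90, with an indexing helper of our own): the grid carries a one-cell halo of zeros so the neighbour sum needs no boundary tests.

```c
/* Row-major indexing into a (rows+2) x (cols+2) grid with a halo. */
#define IDX(r, c, cols) ((r) * ((cols) + 2) + (c))

/* One generation: copy world into oworld, apply the rule, count cells. */
int life_step(int *world, int *oworld, int rows, int cols)
{
    int cells = 0;
    for (int r = 1; r <= rows; r++)
        for (int c = 1; c <= cols; c++)
            oworld[IDX(r, c, cols)] = world[IDX(r, c, cols)];

    for (int r = 1; r <= rows; r++)
        for (int c = 1; c <= cols; c++) {
            int neigh = oworld[IDX(r-1,c-1,cols)] + oworld[IDX(r,c-1,cols)]
                      + oworld[IDX(r+1,c-1,cols)] + oworld[IDX(r-1,c,cols)]
                      + oworld[IDX(r+1,c,cols)]   + oworld[IDX(r-1,c+1,cols)]
                      + oworld[IDX(r,c+1,cols)]   + oworld[IDX(r+1,c+1,cols)];
            if (oworld[IDX(r,c,cols)] == 1 && (neigh < 2 || neigh > 3))
                world[IDX(r,c,cols)] = 0;
            else if (neigh == 3)
                world[IDX(r,c,cols)] = 1;
        }

    for (int r = 1; r <= rows; r++)
        for (int c = 1; c <= cols; c++)
            cells += world[IDX(r,c,cols)];
    return cells;
}
```

A horizontal blinker on a 3x3 grid turns vertical after one step, which makes a quick correctness check.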


Features: OpenACC 2.7 → OpenMP 5.0

GPU offloading:
• PARALLEL [GR/WS/VS mode: each gang executes the instructions redundantly] → TEAMS [creates teams of threads that execute redundantly]
• SERIAL [only 1 thread executes the code] → TARGET [GPU offloading, initial thread only]
• KERNELS [offloading + automatic parallelization] → does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables → scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions:
• clauses on PARALLEL/SERIAL/KERNELS → clauses on TARGET [data movement defined from the host perspective]
• copyin() [copy H→D when entering the construct] → map(to:)
• copyout() [copy D→H when leaving the construct] → map(from:)
• copy() [copy H→D when entering and D→H when leaving] → map(tofrom:)
• create() [created on the device, no transfers] → map(alloc:)
• present(), delete()
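The clause pairs in the table can be seen side by side on one kernel. This is our own minimal sketch in C; the OpenMP variant is shown in a comment, since only one model would be active in a real build.

```c
/* The same explicit H->D / D->H movement written with an OpenACC clause
 * pair and its OpenMP 5.0 map() equivalents from the table above. */
void scale(double *x, const double *y, int n)
{
    #pragma acc parallel loop copyin(y[0:n]) copyout(x[0:n])
    /* OpenMP 5.0 equivalent:
       #pragma omp target teams distribute parallel for \
               map(to: y[0:n]) map(from: x[0:n])
    */
    for (int i = 0; i < n; i++)
        x[i] = 2.0 * y[i];
}
```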

Features: OpenACC 2.7 → OpenMP 5.0

Data region:
• declare() [adds data to the global data region]
• data / end data [inside the same programming unit] → TARGET DATA MAP(tofrom:) / END TARGET DATA [inside the same programming unit]
• enter data / exit data [opening and closing anywhere in the code] → TARGET ENTER DATA / TARGET EXIT DATA [opening and closing anywhere in the code]
• update self() [update D→H] → UPDATE FROM [copy D→H inside a data region]
• update device() [update H→D] → UPDATE TO [copy H→D inside a data region]

Loop parallelization:
• kernels [offloading + automatic parallelization of suitable loops, sync at the end of loops] → does not exist in OpenMP
• loop gang/worker/vector [share iterations among gangs, workers and vectors] → distribute [share iterations among TEAMS]

Features: OpenACC 2.7 → OpenMP 5.0

Routines:
• routine [seq|vector|worker|gang] [compile a device routine with the specified level of parallelism]

Asynchronism:
• async()/wait [execute kernels and transfers asynchronously, with explicit sync] → nowait/depend [manage asynchronism and task dependencies]

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

    !$acc parallel
    !$acc loop gang worker vector
    do i=1, nx
       A(i) = 1.14_8*i
    end do
    !$acc end parallel

exemples/loop1D.f90

    !$acc parallel
    !$acc loop gang worker
    do j=1, nx
       !$acc loop vector
       do i=1, nx
          A(i,j) = 1.14_8
       end do
    end do
    !$acc end parallel

exemples/loop2D.f90

OpenMP:

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i=1, nx
       A(i) = 1.14_8*i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

exemples/loop1DOMP.f90

    !$OMP TARGET TEAMS DISTRIBUTE
    do j=1, nx
       !$OMP PARALLEL DO SIMD
       do i=1, nx
          A(i,j) = 1.14_8
       end do
       !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

    !$acc parallel
    !$acc loop gang
    do k=1, nx
       !$acc loop worker
       do j=1, nx
          !$acc loop vector
          do i=1, nx
             A(i,j,k) = 1.14_8
          end do
       end do
    end do
    !$acc end parallel

exemples/loop3D.f90

OpenMP:

    !$OMP TARGET TEAMS DISTRIBUTE
    do k=1, nx
       !$OMP PARALLEL DO
       do j=1, nx
          !$OMP SIMD
          do i=1, nx
             A(i,j,k) = 1.14_8
          end do
          !$OMP END SIMD
       end do
       !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1, 1000
       a(i) = i
    enddo
    !$ACC end parallel

    !$ACC data copy(a(1:1000))
    do l=1, 100
       !$ACC parallel
       !$ACC loop
       do i=1, 1000
          a(i) = a(i) + 1
       enddo
       !$ACC end parallel
    enddo
    !$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
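The suggested fix can be sketched in C (our variant of exemples/parallel_data_single.f90, with a function name of our own): opening the data region before the first kernel keeps the array on the device across all compute regions.

```c
/* Open the data region before the first kernel so 'a' lives on the
 * device for the whole sequence of compute regions. */
void fill_and_increment(int *a, int n, int iters)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop       /* first kernel now inside */
        for (int i = 0; i < n; i++)
            a[i] = i + 1;

        for (int l = 0; l < iters; l++) {
            #pragma acc parallel loop   /* no H<->D traffic per iteration */
            for (int i = 0; i < n; i++)
                a[i] += 1;
        }
    }   /* single D->H transfer here */
}
```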

Data on device

Naive version: [Diagram: two successive parallel regions, each with its own implicit data region, operate on arrays A, B, C and D through copyin, copyout, copy and create clauses; every clause acts at each region boundary.]

• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: [Diagram: the same two parallel regions (each with its implicit data region) still carry their copyin, copyout, copy and create clauses on A, B, C and D.] Let's add a data region.

Data on device

Optimize data transfers: [Diagram: an enclosing data region now carries the copyin, copyout, copy and create clauses.] A, B and C are now transferred only at entry and exit of the data region.

• Transfers: 4
• Allocation: 1

Data on device

Optimize data transfers: [Diagram: inside the enclosing data region, the clauses on the parallel regions only check for presence; update device and update self refresh the copies where needed.] The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocation: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1, 1000
       a(i) = 0
    enddo

    do j=1, 42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1, 1000
          a(i) = a(i) + rng
       enddo
    enddo
    ! print *, "before update self", a(42)
    ! !$ACC update self(a(42:42))
    ! print *, "after update self", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data", a(42)
    !$ACC end data
    print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1, 1000
       a(i) = 0
    enddo

    do j=1, 42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1, 1000
          a(i) = a(i) + rng
       enddo
    enddo
    print *, "before update self", a(42)
    !$ACC update self(a(42:42))
    print *, "after update self", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data", a(42)
    !$ACC end data
    print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
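As a small sketch of the device clause (our example, not from the course): after the host modifies an array inside a data region, the new values must be pushed H→D before the next kernel uses them.

```c
/* Host modifies 'b' inside a data region; update device refreshes the
 * device copy so the following kernel sees the new values. */
void bump_then_double(double *b, int n)
{
    #pragma acc data copy(b[0:n])
    {
        for (int i = 0; i < n; i++)        /* host-side modification */
            b[i] += 1.0;
        #pragma acc update device(b[0:n])  /* push the change H->D */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)        /* kernel uses updated values */
            b[i] *= 2.0;
    }
}
```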

Data on device - Global data regions: enter data, exit data

Important notes:
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at a different place than where the data is used. It is especially useful with C++ constructors and destructors.

    program enterdata
      integer :: s = 10000
      real*8, allocatable, dimension(:) :: vec
      allocate(vec(s))
      !$ACC enter data create(vec(1:s))
      call compute
      !$ACC exit data delete(vec(1:s))
    contains
      subroutine compute
        integer :: i
        !$ACC parallel loop
        do i=1, s
          vec(i) = 1.2
        enddo
      end subroutine
    end program enterdata

exemples/enter_data.f90
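The create/use/delete lifecycle above maps naturally onto a constructor/destructor split. Here is our own C sketch of that pattern (function names are ours, not from the course examples).

```c
#include <stdlib.h>

/* Device allocation decoupled from use, mirroring exemples/enter_data.f90. */
double *vec_create(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])   /* device allocation, no copy */
    return v;
}

void vec_fill(double *v, int n, double x)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; i++)
        v[i] = x;
    #pragma acc update self(v[0:n])         /* D->H so the host can check */
}

void vec_destroy(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])    /* free the device copy first */
    free(v);
}
```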

Data on device - Global data regions: declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone → Scope
• module → the program
• function → the function
• subroutine → the subroutine

    module declare
      integer :: rows = 1000
      integer :: cols = 1000
      integer :: generations = 100
      !$ACC declare copyin(rows, cols, generations)
    end module

exemples/declare.f90

Data on device - Time line

[Diagram: host and device variable states over time for a data region opened with copyin(a,c) create(b) copyout(d)]
• On entry, a and c are copied H→D; b and d are allocated on the device without transfer.
• Inside the region, host-side changes are not seen by the device and device-side changes are not seen by the host.
• update self(b) copies the device value of b D→H during the region.
• On exit of the region, only d (copyout) is transferred D→H.

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

    ! Update the value of a located on host with the value on device
    !$ACC update self(a)
    ! Update the value of b located on device with the value on host
    !$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: fully synchronous execution vs "Kernel 1 and transfer async" vs "everything async"]

    ! b is initialized on host
    do i=1, s
       b(i) = i
    enddo

    !$ACC parallel loop async(1)
    do i=1, s
       a(i) = 42
    enddo

    !$ACC update device(b) async(2)

    !$ACC parallel loop async(3)
    do i=1, s
       c(i) = 1.1
    enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer]

    ! b is initialized on host
    do i=1, s
       b(i) = i
    enddo

    !$ACC parallel loop async(1)
    do i=1, s
       a(i) = 42
    enddo

    !$ACC update device(b) async(2)

    !$ACC parallel loop wait(2)
    do i=1, s
       b(i) = b(i) + 1.1
    enddo

exemples/async_wait_slides.f90
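The dependency pattern above can be sketched in C (our transliteration of exemples/async_wait_slides.f90; sizes and structured data management are simplified): the kernel that increments b waits only on queue 2, where the H→D transfer of b was posted, while the kernel on queue 1 runs independently.

```c
/* Kernel on queue 1 is independent; the last kernel depends only on the
 * async transfer posted on queue 2, so it carries wait(2). */
void async_wait_pattern(double *a, double *b, int s)
{
    for (int i = 0; i < s; i++)         /* b is initialized on host */
        b[i] = i;

    #pragma acc parallel loop async(1)  /* queue 1: independent kernel */
    for (int i = 0; i < s; i++)
        a[i] = 42.0;

    #pragma acc update device(b[0:s]) async(2)  /* queue 2: H->D transfer */

    #pragma acc parallel loop wait(2)   /* waits for the transfer only */
    for (int i = 0; i < s; i++)
        b[i] = b[i] + 1.1;
}
```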



Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Timeline figure, host vs device, for a data region opened with copyin(a,c) create(b) copyout(d): at entry, a and c are copied H→D and b and d are allocated on the device with undefined content; during execution the device computes new values of b and d; update self(b) copies b back D→H while the host copy of d is still unchanged; at exit of the data region only d is copied back D→H (copyout).]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three execution timelines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. "Kernel 1 and transfer async": Kernel 1 overlaps with the data transfer. "Everything is async": Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1.1
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing the execution thread numbers. For example wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1.1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)

• Example of code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

OpenACC 2.7 / OpenMP 5.0 translation table (directives/clauses, with notes)

GPU offloading:
• PARALLEL (G/W/V/S mode; each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfers H→D or D→H: scalar variables, non-scalar variables (same in both models)

Explicit transfers H→D or D→H inside explicit parallel regions: clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

OpenACC 2.7 / OpenMP 5.0 translation table (continued)

Data regions:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

OpenACC 2.7 / OpenMP 5.0 translation table (continued)

• Routines: routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)
• Asynchronism: async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

Data on device

Naive version: [Figure: two successive parallel regions, each with its own data region; in each one the arrays A, B, C and D appear in copyin, copyout, copy and create clauses respectively.]
• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: [Figure: the same two parallel regions, each still carrying its own copyin, copyout, copy and create clauses for A, B, C and D.]
Let's add a data region!

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: [Figure: a data region now encloses both parallel regions; the copyin, copyout, copy and create clauses are hoisted to the data region.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: [Figure: with the enclosing data region, the clauses on the parallel regions become present; update device and update self are used where fresh values are needed.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0.
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0.
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 67: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device
The variables included in a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Figure: host/device timeline for a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D, b is allocated on the device without any transfer, and d is allocated for the final copyout. While the region is open, the host and device copies evolve independently; an update self(b) copies the device value of b back to the host. At exit of the data region only d is copied D→H; the other host values keep their last host-side state.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. The argument of async is optional; specifying different numbers creates several execution threads.

[Figure: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, all three may run concurrently.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42.
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1.
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: Kernel 1 runs on thread 1 while the data transfer runs on thread 2; Kernel 2 waits for the transfer on thread 2 before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42.
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1.
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65/77


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77


Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC: PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) | OpenMP: TEAMS (creates teams of threads that execute redundantly)
• OpenACC: SERIAL (only 1 thread executes the code) | OpenMP: TARGET (GPU offloading, initial thread only)
• OpenACC: KERNELS (offloading + automatic parallelization) | OpenMP: does not exist

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables | OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS | OpenMP: clauses on TARGET (data movement defined from the host perspective)
• OpenACC: copyin() (copy H→D when entering the construct) | OpenMP: map(to:)
• OpenACC: copyout() (copy D→H when leaving the construct) | OpenMP: map(from:)
• OpenACC: copy() (copy H→D when entering and D→H when leaving) | OpenMP: map(tofrom:)
• OpenACC: create() (created on device, no transfers) | OpenMP: map(alloc:)
• OpenACC: present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features: OpenACC 2.7 vs OpenMP 5.0

Data region
• OpenACC: declare() (add data to the global data region)
• OpenACC: data / end data (inside the same programming unit) | OpenMP: TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC: enter data / exit data (opening and closing anywhere in the code) | OpenMP: TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC: update self() (update D→H) | OpenMP: UPDATE FROM (copy D→H inside a data region)
• OpenACC: update device() (update H→D) | OpenMP: UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC: kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) | OpenMP: does not exist
• OpenACC: loop gang/worker/vector (share iterations among gangs, workers and vectors) | OpenMP: distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

Features: OpenACC 2.7 vs OpenMP 5.0

Routines
• OpenACC: routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC: async()/wait (execute kernels/transfers asynchronously, with explicit sync) | OpenMP: nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1D_OMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2D_OMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3D_OMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77


Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 68: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device

Optimize data transfers
[Figure: two parallel regions, each acting as a data region through its clauses, enclosed in a single data region. The clauses for the arrays A (copyin), B (copyout), C (copy) and D (create) are moved from the two parallel regions to the enclosing data region.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers
[Figure: the same two parallel regions inside a data region, now with explicit presence: the data region performs copyin(A), copyout(B), copy(C) and create(D); the parallel regions use present clauses; an update device refreshes A and an update self refreshes C where needed.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 69: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host)
Variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0.
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is, however, impossible to use update in a parallel region.

device
The variables listed in a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers to the device can be managed at a different place from where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data on the device equals the scope of the program unit where it appears. For example:

Zone          Scope
module        the program
function      the function
subroutine    the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: timeline of the host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d) (data transfer to the device), refreshed with update self(b) (D→H), and closed with the D→H copy of d (copyout) at the exit of the region.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses those parallel regions, you should use update to avoid unexpected behaviors. The same applies to data regions opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number inside the clause makes it possible to use several execution threads.

[Figure: three timelines for Kernel 1, a data transfer and Kernel 2 — fully synchronous; Kernel 1 and the transfer asynchronous; everything asynchronous.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (directive or clause) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: timeline where Kernel 2 waits for the data transfer issued on execution thread 2.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition:

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
  OpenACC: PARALLEL   (GRWSVS mode: each gang executes the instructions redundantly)
  OpenMP:  TEAMS      (creates teams of threads that execute redundantly)

  OpenACC: SERIAL     (only 1 thread executes the code)
  OpenMP:  TARGET     (GPU offloading, initial thread only)

  OpenACC: KERNELS    (offloading + automatic parallelization)
  OpenMP:  none       (does not exist in OpenMP)

Implicit transfers H→D or D→H
  OpenACC: scalar variables, non-scalar variables
  OpenMP:  scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
  OpenACC: clauses on PARALLEL/SERIAL/KERNELS
  OpenMP:  clauses on TARGET (data movement defined from the host perspective)

  copyin()    copy H→D when entering the construct        map(to:)
  copyout()   copy D→H when leaving the construct         map(from:)
  copy()      copy H→D when entering, D→H when leaving    map(tofrom:)
  create()    created on the device, no transfers         map(alloc:)
  present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions
  declare()                add data to the global data region
  data / end data          <-> TARGET DATA MAP() / END TARGET DATA       (both inside the same programming unit)
  enter data / exit data   <-> TARGET ENTER DATA / TARGET EXIT DATA      (opening and closing anywhere in the code)
  update self()            <-> UPDATE FROM                               (update/copy D→H inside a data region)
  update device()          <-> UPDATE TO                                 (update/copy H→D inside a data region)

Loop parallelization
  kernels                  offloading + automatic parallelization of suitable loops, sync at the end of loops; does not exist in OpenMP
  loop gang/worker/vector  share iterations among gangs, workers and vectors
  distribute               share iterations among TEAMS (OpenMP)

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines
  routine [seq|vector|worker|gang]   compile a device routine with the specified level of parallelism

Asynchronism
  async() / wait    execute kernels/transfers asynchronously, with explicit synchronization (OpenACC)
  nowait / depend   manage asynchronism and task dependencies (OpenMP)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 70: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 71: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

    program enterdata
       integer :: s = 10000
       real*8, allocatable, dimension(:) :: vec
       allocate(vec(s))
       !$ACC enter data create(vec(1:s))
       call compute
       !$ACC exit data delete(vec(1:s))
    contains
       subroutine compute
          integer :: i
          !$ACC parallel loop
          do i=1,s
             vec(i) = 1.2
          enddo
       end subroutine
    end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

    Zone        Scope
    Module      the program
    Function    the function
    Subroutine  the subroutine

    module declare
       integer :: rows = 1000
       integer :: cols = 1000
       integer :: generations = 100
       !$ACC declare copyin(rows, cols, generations)
    end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). Entering the region copies a and c H→D and creates b on the device; the kernels then modify the device copies while the host copies keep their old values; an update self(b) refreshes the host copy of b D→H during the region; leaving the region copies d back D→H through copyout(d).]

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses directly on kernels and parallel constructs, if a data region opened with data encloses those parallel regions you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

    ! Update the value of a located on the host with the value on the device
    !$ACC update self(a)
    ! Update the value of b located on the device with the value on the host
    !$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel/kernel execution, if the kernels are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Figure: without async, Kernel 1, the data transfer and Kernel 2 run one after the other; with "Kernel 1 and transfer async", the first kernel overlaps the transfer; with everything async, all three overlap.]

    ! b is initialized on host
    do i=1,s
       b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
       a(i) = 42.
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop async(3)
    do i=1,s
       c(i) = 11.
    enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers identifying execution threads. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer to complete before starting.]

    ! b is initialized on host
    do i=1,s
       b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
       a(i) = 42.
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop wait(2)
    do i=1,s
       b(i) = b(i) + 11.
    enddo

exemples/async_wait_slides.f90


Debugging - PGI autocompare

• Automatic detection of result differences by comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, execution is stopped.
• To activate this feature, just add autocompare to the target accelerator option at compilation (-ta=tesla:cc60,autocompare).
• Example: a code that has a race condition.


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

    do g=1,generations
       cells = 0
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                     oworld(r-1,c)                   + oworld(r+1,c)   + &
                     oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1,rows
          do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

    do g=1,generations
       cells = 0
       !$ACC parallel loop
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                     oworld(r-1,c)                   + oworld(r+1,c)   + &
                     oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       !$ACC end parallel loop
       print *, "Cells alive at generation", g, cells
    enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

    cells = 0
    !$ACC data copy(oworld, world)
    do g=1,generations
       cells = 0
       !$ACC parallel loop
       do c=1,cols
          do r=1,rows
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do c=1,cols
          do r=1,rows
             neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                     oworld(r-1,c)                   + oworld(r+1,c)   + &
                     oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do c=1,cols
          do r=1,rows
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo
    !$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 | OpenMP 5.0

GPU offloading
• OpenACC PARALLEL: GRWSVS mode, each gang executes the instructions redundantly | OpenMP TEAMS: creates teams of threads that execute redundantly
• OpenACC SERIAL: only 1 thread executes the code | OpenMP TARGET: GPU offloading, initial thread only
• OpenACC KERNELS: offloading + automatic parallelization | OpenMP: does not exist

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables | OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS | OpenMP: clauses on TARGET, data movement defined from the host perspective
• copyin(): copy H→D when entering the construct | map(to:)
• copyout(): copy D→H when leaving the construct | map(from:)
• copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
• create(): created on device, no transfers | map(alloc:)
• present(), delete()

Features: OpenACC 2.7 | OpenMP 5.0

Data region
• declare(): adds data to the global data region
• data / end data: inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H | UPDATE FROM: copy D→H inside a data region
• update device(): update H→D | UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, with a sync at the end of the loops | OpenMP: does not exist
• loop gang/worker/vector: shares iterations among gangs, workers and vectors | distribute: shares iterations among TEAMS

Features: OpenACC 2.7 | OpenMP 5.0

Routines
• routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism

Asynchronism
• async()/wait: executes kernels/transfers asynchronously, with explicit synchronization | nowait/depend: manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

    !$acc parallel
    !$acc loop gang worker vector
    do i=1,nx
       A(i) = 1.14_8*i
    end do
    !$acc end parallel

exemples/loop1D.f90

    !$acc parallel
    !$acc loop gang worker
    do j=1,nx
       !$acc loop vector
       do i=1,nx
          A(i,j) = 1.14_8
       end do
    end do
    !$acc end parallel

exemples/loop2D.f90

OpenMP

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i=1,nx
       A(i) = 1.14_8*i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

exemples/loop1DOMP.f90

    !$OMP TARGET TEAMS DISTRIBUTE
    do j=1,nx
       !$OMP PARALLEL DO SIMD
       do i=1,nx
          A(i,j) = 1.14_8
       end do
       !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

    !$acc parallel
    !$acc loop gang
    do k=1,nx
       !$acc loop worker
       do j=1,nx
          !$acc loop vector
          do i=1,nx
             A(i,j,k) = 1.14_8
          end do
       end do
    end do
    !$acc end parallel

exemples/loop3D.f90

OpenMP

    !$OMP TARGET TEAMS DISTRIBUTE
    do k=1,nx
       !$OMP PARALLEL DO
       do j=1,nx
          !$OMP SIMD
          do i=1,nx
             A(i,j,k) = 1.14_8
          end do
          !$OMP END SIMD
       end do
       !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 72: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 73: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
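The final structure, one data region around the whole generation loop so the grids stay resident on the device, can be sketched in C (the course code is Fortran; this flattened-array version is ours, it updates interior cells only and the game_of_life name is hypothetical). Without offload support the pragmas are ignored and it runs serially on the host.

```c
/* C sketch of the optimized pattern: world/oworld live on the device for
 * all generations; only the reduction result returns each iteration. */
int game_of_life(int rows, int cols, int gens,
                 unsigned char *world, unsigned char *oworld) {
    int cells = 0;
    #pragma acc data copy(world[0:rows*cols]) create(oworld[0:rows*cols])
    for (int g = 0; g < gens; g++) {
        cells = 0;
        #pragma acc parallel loop
        for (int i = 0; i < rows * cols; i++)   /* snapshot previous state */
            oworld[i] = world[i];

        #pragma acc parallel loop collapse(2)
        for (int r = 1; r < rows - 1; r++)
            for (int c = 1; c < cols - 1; c++) {
                int idx = r * cols + c;
                int neigh = oworld[idx-cols-1] + oworld[idx-1] + oworld[idx+cols-1]
                          + oworld[idx-cols]                   + oworld[idx+cols]
                          + oworld[idx-cols+1] + oworld[idx+1] + oworld[idx+cols+1];
                if (oworld[idx] == 1 && (neigh < 2 || neigh > 3))
                    world[idx] = 0;
                else if (neigh == 3)
                    world[idx] = 1;
            }

        #pragma acc parallel loop reduction(+:cells)
        for (int i = 0; i < rows * cols; i++)
            cells += world[i];
    }
    return cells;
}
```

A classic sanity check is the "blinker" oscillator: a vertical bar of three cells becomes a horizontal bar after one generation, so the population stays at 3.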



Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Feature: GPU offloading
  OpenACC PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly)  ->  OpenMP TEAMS (creates teams of threads that execute redundantly)
  OpenACC SERIAL (only 1 thread executes the code)  ->  OpenMP TARGET (GPU offloading, initial thread only)
  OpenACC KERNELS (offloading + automatic parallelization)  ->  no OpenMP equivalent

Feature: implicit transfers H→D or D→H
  OpenACC: scalar variables, non-scalar variables  ->  OpenMP: scalar variables, non-scalar variables

Feature: explicit transfers H→D or D→H inside explicit parallel regions
  OpenACC: clauses on PARALLEL/SERIAL/KERNELS  ->  OpenMP: clauses on TARGET (data movement defined from the host perspective)
  copyin() (copy H→D when entering the construct)  ->  map(to:)
  copyout() (copy D→H when leaving the construct)  ->  map(from:)
  copy() (copy H→D when entering and D→H when leaving)  ->  map(tofrom:)
  create() (allocated on device, no transfers)  ->  map(alloc:)
  present(), delete()
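The copyin/copyout/copy to map(to/from/tofrom) correspondence can be made concrete with a small SAXPY kernel written twice. This C sketch is ours (the course examples are Fortran); x[0:n] is the array-section syntax both standards use. Without offload support both pragmas are ignored and the loops run identically on the host.

```c
/* Same computation, two directive spellings: OpenACC data clauses on the
 * left, the equivalent OpenMP map() clauses on the right. */
void saxpy_acc(int n, float alpha, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

void saxpy_omp(int n, float alpha, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Note the directionality: x is only read by the kernel (copyin / map(to:)), while y is read and written (copy / map(tofrom:)).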


Feature: data regions
  OpenACC declare() (adds data to the global data region)
  OpenACC data / end data (inside the same programming unit)  ->  OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
  OpenACC enter data / exit data (opening and closing anywhere in the code)  ->  OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
  OpenACC update self() (update D→H)  ->  OpenMP UPDATE FROM (copy D→H inside a data region)
  OpenACC update device() (update H→D)  ->  OpenMP UPDATE TO (copy H→D inside a data region)

Feature: loop parallelization
  OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops)  ->  no OpenMP equivalent
  OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors)  ->  OpenMP distribute (share iterations among TEAMS)


Feature: routines
  OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Feature: asynchronism
  OpenACC async() / wait (execute kernels and transfers asynchronously, explicit sync)  ->  OpenMP nowait / depend (manage asynchronism and task dependencies)
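The "routines" row says that any function called from offloaded code must itself be compiled for the device, with a declared level of parallelism. A minimal C sketch (ours, not from the course; seq means each device thread calls the function sequentially):

```c
/* A function callable from device code must carry "acc routine" so the
 * compiler also generates a device version of it. */
#pragma acc routine seq
static float squared(float x) { return x * x; }

void squares(int n, float *a) {
    #pragma acc parallel loop   /* each iteration calls the device routine */
    for (int i = 0; i < n; i++)
        a[i] = squared((float)i);
}
```

Without the routine directive, an OpenACC compiler would reject (or fail to inline) the call inside the offloaded loop; without offload support the pragmas are ignored and the code runs on the host.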


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

(Figure: timeline of the host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D; b and d are only allocated on the device, without transfer. Kernels then modify the device copies, so host and device values diverge. update self(b) copies the device value of b back to the host inside the region. On exit, copyout(d) copies d D→H; the other device copies are released without any transfer.)


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
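A C sketch of the same idea (ours, not the course's Fortran file): inside a structured data region, update self refreshes the stale host copy after a kernel, and update device pushes a host-side modification back before the next kernel. With the pragmas ignored on a host-only build, the three steps simply run in order.

```c
/* Host/device synchronization with the update directive inside one
 * structured data region: double on device, add one on host, square
 * on device, then fetch the final values. */
void update_demo(int n, float *a) {
    #pragma acc data copyin(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            a[i] = 2.0f * a[i];

        #pragma acc update self(a[0:n])    /* host copy is stale: fetch D->H */

        for (int i = 0; i < n; i++)        /* host-side modification */
            a[i] += 1.0f;

        #pragma acc update device(a[0:n])  /* push H->D before next kernel */

        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            a[i] = a[i] * a[i];

        #pragma acc update self(a[0:n])    /* copyin has no copy-back: fetch
                                              the final values explicitly */
    }
}
```

Note the last update self: because the region was opened with copyin, leaving it would otherwise discard the device values without copying them back.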


Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernels, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Figure: three timelines. Fully synchronous: Kernel 1, then the data transfer, then Kernel 2. "Kernel 1 and transfer async": Kernel 1 and the data transfer overlap. "Everything is async": Kernel 1, the data transfer and Kernel 2 all overlap.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 75: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a, located on the host, with the value on the device
!$ACC update self(a)
! Update the value of b, located on the device, with the value on the host
!$ACC update device(b)

(exemples/update.f90)

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Timeline figure: kernel 2 waits for the data transfer to complete]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

(exemples/async_wait_slides.f90)

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )+oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ...

(exemples/opti/gol_opti1.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )+oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

(exemples/opti/gol_opti2.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )+oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

(exemples/opti/gol_opti3.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading:
- PARALLEL: GR/WS/VS mode; each gang executes instructions redundantly ↔ TEAMS: creates teams of threads that execute redundantly
- SERIAL: only 1 thread executes the code ↔ TARGET: GPU offloading, initial thread only
- KERNELS: offloading + automatic parallelization ↔ X: does not exist in OpenMP

Implicit transfers H→D or D→H:
- scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions:
- clauses on PARALLEL/SERIAL/KERNELS ↔ clause on TARGET (data movement defined from the host perspective)
- copyin(): copy H→D when entering the construct ↔ map(to:)
- copyout(): copy D→H when leaving the construct ↔ map(from:)
- copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
- create(): created on device, no transfers ↔ map(alloc:)
- present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features | OpenACC 2.7 | OpenMP 5.0

Data region:
- declare(): add data to the global data region
- data / end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
- enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
- update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
- update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization:
- kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ X: does not exist in OpenMP
- loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72/77

Features | OpenACC 2.7 | OpenMP 5.0

Routines:
- routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism:
- async() / wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait / depend: manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 77: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 78: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65/77


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life.
Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• OpenACC: PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly)
• OpenMP: TEAMS (creates teams of threads that execute redundantly)

• OpenACC: SERIAL (only 1 thread executes the code)
• OpenMP: TARGET (GPU offloading, initial thread only)

• OpenACC: KERNELS (offloading + automatic parallelization)
• OpenMP: X (does not exist in OpenMP)

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables
• OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS
• OpenMP: clauses on TARGET (data movement defined from the host perspective)

• OpenACC: copyin(), copy H→D when entering the construct; OpenMP: map(to:)
• OpenACC: copyout(), copy D→H when leaving the construct; OpenMP: map(from:)
• OpenACC: copy(), copy H→D when entering and D→H when leaving; OpenMP: map(tofrom:)
• OpenACC: create(), created on the device, no transfers; OpenMP: map(alloc:)
• OpenACC: present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Data region
• OpenACC: declare(), add data to the global data region
• OpenACC: data / end data (inside the same programming unit); OpenMP: TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC: enter data / exit data (opening and closing anywhere in the code); OpenMP: TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC: update self(), update D→H; OpenMP: UPDATE FROM, copy D→H inside a data region
• OpenACC: update device(), update H→D; OpenMP: UPDATE TO, copy H→D inside a data region

Loop parallelization
• OpenACC: kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops); OpenMP: X (does not exist in OpenMP)
• OpenACC: loop gang/worker/vector (share iterations among gangs, workers and vectors); OpenMP: distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

Routines
• OpenACC: routine [seq|vector|worker|gang], compile a device routine with the specified level of parallelism

Asynchronism
• OpenACC: async() / wait, execute kernels and transfers asynchronously, with explicit synchronization
• OpenMP: nowait / depend, manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 80: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 81: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 82: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

| Feature | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
|---|---|---|---|---|
| GPU offloading | PARALLEL | GR/WS/VS mode (gang-redundant, worker-single, vector-single): each gang executes the instructions redundantly | TEAMS | creates teams of threads; they execute redundantly |
| | SERIAL | only 1 thread executes the code | TARGET | GPU offloading, initial thread only |
| | KERNELS | offloading + automatic parallelization | X | does not exist in OpenMP |
| Implicit transfer H→D or D→H | | scalar variables, non-scalar variables | | scalar variables, non-scalar variables |
| Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL, SERIAL, KERNELS | | clauses on TARGET | data movement defined from the host perspective |
| | copyin() | copy H→D when entering the construct | map(to:) | |
| | copyout() | copy D→H when leaving the construct | map(from:) | |
| | copy() | copy H→D when entering and D→H when leaving | map(tofrom:) | |
| | create() | created on device, no transfers | map(alloc:) | |
| | present(), delete() | | | |

| Feature | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
|---|---|---|---|---|
| Data region | declare() | add data to the global data region | | |
| | data / end data | inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA | inside the same programming unit |
| | enter data / exit data | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code |
| | update self() | update D→H | UPDATE FROM | copy D→H inside a data region |
| | update device() | update H→D | UPDATE TO | copy H→D inside a data region |
| Loop parallelization | kernels | offloading + automatic parallelization of suitable loops; sync at the end of loops | X | does not exist in OpenMP |
| | loop gang/worker/vector | share iterations among gangs, workers and vectors | distribute | share iterations among TEAMS |

| Feature | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
|---|---|---|---|---|
| Routines | routine [seq / vector / worker / gang] | compile a device routine with the specified level of parallelism | | |
| Asynchronism | async() / wait | execute kernels and transfers asynchronously; explicit synchronization | nowait / depend | manage asynchronism and task dependencies |

Some examples of loop nests parallelized with OpenACC and OpenMP.

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop2D.f90)

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1DOMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2DOMP.f90)

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

(exemples/loop3D.f90)

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3DOMP.f90)


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
      • Serial
      • Kernels/Teams
      • Parallel
    • Work sharing
      • Parallel loop
      • Reductions
      • Coalescing
    • Routines
    • Data Management
    • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 84: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 85: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 86: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 87: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

| Feature | OpenACC 2.7 directive/clause | Notes | OpenMP 5.0 directive/clause | Notes |
|---|---|---|---|---|
| GPU offloading | PARALLEL | GR/WS/VS mode; each gang executes the instructions redundantly | TEAMS | creates teams of threads that execute redundantly |
| | SERIAL | only 1 thread executes the code | TARGET | GPU offloading, initial thread only |
| | KERNELS | offloading + automatic parallelization | X | does not exist in OpenMP |
| Implicit transfers H→D or D→H | | scalar variables, non-scalar variables | | scalar variables, non-scalar variables |
| Explicit transfers H→D or D→H inside explicit parallel regions | clauses of PARALLEL / SERIAL / KERNELS | | clauses of TARGET | data movement defined from the host perspective |
| | copyin() | copy H→D when entering the construct | map(to:) | |
| | copyout() | copy D→H when leaving the construct | map(from:) | |
| | copy() | copy H→D when entering and D→H when leaving | map(tofrom:) | |
| | create() | created on the device, no transfers | map(alloc:) | |
| | present() / delete() | | | |

| Feature | OpenACC 2.7 directive/clause | Notes | OpenMP 5.0 directive/clause | Notes |
|---|---|---|---|---|
| Data region | declare() | adds data to the global data region | | |
| | data / end data | inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA | inside the same programming unit |
| | enter data / exit data | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code |
| | update self() | update D→H | UPDATE FROM | copy D→H inside a data region |
| | update device() | update H→D | UPDATE TO | copy H→D inside a data region |

| Feature | OpenACC 2.7 directive/clause | Notes | OpenMP 5.0 directive/clause | Notes |
|---|---|---|---|---|
| Loop parallelization | kernels | offloading + automatic parallelization of suitable loops; synchronization at the end of the loops | X | does not exist in OpenMP |
| | loop gang/worker/vector | shares the iterations among gangs, workers and vectors | distribute | shares the iterations among TEAMS |

| Feature | OpenACC 2.7 directive/clause | Notes | OpenMP 5.0 directive/clause | Notes |
|---|---|---|---|---|
| Routines | routine [seq / vector / worker / gang] | compiles a device routine with the specified level of parallelism | | |
| Asynchronism | async() / wait | executes kernels/transfers asynchronously, with explicit synchronization | nowait / depend | manages asynchronism and task dependencies |

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC — exemples/loop1D.f90:

    !$acc parallel
    !$acc loop gang worker vector
    do i = 1, nx
       A(i) = 114_8 * i
    end do
    !$acc end parallel

OpenACC — exemples/loop2D.f90:

    !$acc parallel
    !$acc loop gang worker
    do j = 1, nx
       !$acc loop vector
       do i = 1, nx
          A(i,j) = 114_8
       end do
    end do
    !$acc end parallel

OpenMP — exemples/loop1DOMP.f90:

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i = 1, nx
       A(i) = 114_8 * i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

OpenMP — exemples/loop2DOMP.f90:

    !$OMP TARGET TEAMS DISTRIBUTE
    do j = 1, nx
       !$OMP PARALLEL DO SIMD
       do i = 1, nx
          A(i,j) = 114_8
       end do
       !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

OpenACC — exemples/loop3D.f90:

    !$acc parallel
    !$acc loop gang
    do k = 1, nx
       !$acc loop worker
       do j = 1, nx
          !$acc loop vector
          do i = 1, nx
             A(i,j,k) = 114_8
          end do
       end do
    end do
    !$acc end parallel

OpenMP — exemples/loop3DOMP.f90:

    !$OMP TARGET TEAMS DISTRIBUTE
    do k = 1, nx
       !$OMP PARALLEL DO
       do j = 1, nx
          !$OMP SIMD
          do i = 1, nx
             A(i,j,k) = 114_8
          end do
          !$OMP END SIMD
       end do
       !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77
