Introduction to OpenACC and OpenMP GPU

www.idris.fr - Institut du développement et des ressources en informatique scientifique (IDRIS)

Pierre-François Lavallée, Thibaut Véry

1 Introduction: Definitions; Short history; Architecture; Execution models; Porting strategy; PGI compiler; Profiling the code

2 Parallelism: Offloaded Execution (Add directives; Compute constructs: serial, kernels, teams, parallel; Work sharing: parallel loop, reductions, coalescing; Routines; Data management; Asynchronism)

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry (slide 2/77)


Introduction - Useful definitions

• Gang (OpenACC) / Teams (OpenMP): coarse-grain parallelism
• Worker (OpenACC): fine-grain parallelism
• Vector: group of threads executing the same instruction (SIMT)
• Thread: execution entity
• SIMT: Single Instruction Multiple Threads
• Device: accelerator on which execution can be offloaded (e.g. a GPU)
• Host: machine hosting one or more accelerators and in charge of execution control
• Kernel: piece of code that runs on an accelerator
• Execution thread: sequence of kernels to be executed on an accelerator

Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2×20 = 40 cores
• Accelerated with 4 Nvidia V100: 32×80×4 = 10240 cores

Introduction - The ways to GPU

From least to most programming effort and technical expertise required:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code; maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness; still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex; non-portable; optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers: PGI; Cray (for Cray hardware); GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers: Cray (for Cray hardware); GCC (since 7); CLANG; IBM XL; PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 V100 GPUs): the host (CPU) and the devices (GPU) are connected through PLX switches to each other and to the network, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s.]

Introduction - Architecture CPU-GPU

[Figure: comparison of CPU and GPU architectures.]

Introduction - Execution model OpenACC

[Figure: the host (CPU) successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (e.g. !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped onto the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs.
• The compiler can decide to synchronize the threads of a gang (all or part of them).
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).
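As a quick check of the formula above, here is a small C helper (the function name is ours, purely illustrative) applied to the configuration num_gangs(2), num_workers(5), vector_length(8) that appears in a later example:

```c
/* Total number of threads for a kernel launch:
   nbthreads = nbgangs * nbworkers * vectorlength */
int total_threads(int nbgangs, int nbworkers, int vectorlength) {
    return nbgangs * nbworkers * vectorlength;
}
```

With 2 gangs of 5 workers of size 8, total_threads(2, 5, 8) gives 80 threads when all three levels are partitioned.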

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1.
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024.
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
• The size of a worker (vectorlength) should be a multiple of 32.
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.

Introduction - Execution model

[Figure: number of active threads (2, 10 or 80) inside a region created with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8), depending on the directives met: !$ACC parallel (starts gangs), !$ACC loop worker, !$ACC loop vector, !$ACC loop worker vector.]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: porting cycle: Pick a kernel, Analyze, Parallelize loops, Optimize data, Optimize loops.]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler will do implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation: https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

(exemples/reduction.f90)

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities: 53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
• Only the GPU is profiled by default.
• The option "--cpu-profiling on" activates CPU profiling.
• The options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Figure: screenshot of the NVVP graphical interface.]

1 Introduction

2 Parallelism: Offloaded Execution (Add directives; Compute constructs: serial, kernels, teams, parallel; Work sharing: parallel loop, reductions, coalescing; Routines; Data management; Asynchronism)

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.
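The C/C++ syntax can be illustrated with a small sketch (the function name is ours, not from the course; with a compiler that does not know OpenACC, or without the right options, the pragma is treated as a plain comment and the loop runs sequentially on the host):

```c
/* y = a*x + y, offloaded with a merged "kernels loop" construct.
   Without OpenACC support the pragma is ignored and this is an
   ordinary sequential C loop. */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc kernels loop
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

With PGI this would be compiled with pgcc -acc -Minfo=accel, following the options presented earlier.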

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2              ! gang-redundant
!$ACC loop         ! work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important note:
• The number of gangs, of workers and the vector size are constant inside the region.

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, 'Hello! I am a gang'
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

Directive      | Work sharing (loop iterations) | Activation of new threads
loop gang      | X                              |
loop worker    | X                              | X
loop vector    | X                              | X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Note:
• You might see the directive do (kept for compatibility).
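As a sketch of the collapse clause (our own C example, not from the course; the pragma is ignored by non-OpenACC compilers and the loops then run sequentially):

```c
/* Fill an n-by-m matrix stored row-major in a flat array.
   collapse(2) merges the two tightly nested loops into a single
   iteration space of n*m iterations to distribute, which helps
   when the individual loops are too short to fill the device. */
void init_matrix(int n, int m, double *a) {
    #pragma acc parallel loop collapse(2)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j)
            a[i * m + j] = (double)(i + j);
}
```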

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)
• Active threads: 1
• Number of operations: nx

Execution GRWSVS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)
• Iterations are shared among the active workers of each gang
• Active threads: 10 x nbworkers
• Number of operations: nx

Execution GPWSVP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)
• Iterations are shared among the threads of the single worker of all gangs
• Active threads: 10 x vectorlength
• Number of operations: nx

Execution GPWPVP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x nbworkers x vectorlength
• Number of operations: nx

Execution GRWPVS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x nbworkers
• Number of operations: 10 x nx

Execution GRWSVP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)
• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vectorlength
• Number of operations: 10 x nx

Execution GRWPVP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)
• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x nbworkers x vectorlength
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code placed after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)
• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to the private copies to get the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
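The reduction.f90 example shown earlier sums an array filled with a(i) = i; the same reduction written in C reads as follows (illustrative sketch; without OpenACC support the pragma is ignored and the loop falls back to a sequential sum):

```c
/* Sum 1 + 2 + ... + n. "sum" is privatized for each thread of the
   loop and the partial results are combined with + at the end. */
long sum_first_n(long n) {
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (long i = 1; i <= n; ++i)
        sum = sum + i;
    return sum;
}
```

For n = 10000 this returns 50005000, i.e. n(n+1)/2, the value the compiler feedback examples compute.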

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, 'Cells alive at generation ', g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged; this optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: examples of access patterns with and without coalescing.]
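In C the storage order is row-major, the opposite of Fortran, so the rule flips: the loop over the last index should carry the vector parallelism. A sketch (our own example, not from the course; the pragmas are ignored without an OpenACC compiler):

```c
/* The inner (vector) loop walks the last index j, so consecutive
   threads of a worker touch consecutive addresses a[i*n + j] and
   their accesses can be coalesced. */
void fill_coalesced(int n, double *a) {
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j)
            a[i * n + j] = 11.4;
    }
}
```

Swapping the two loops (vector on i, the first index) would make each thread stride by n elements and break coalescing, as the Fortran timings below show.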

Parallelism - Memory accesses and coalescing

Without coalescing (the vector loop runs over j, the second index, so consecutive threads access non-contiguous elements of A):

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

With coalescing (the vector loop runs over i, the first index, which is contiguous in Fortran):

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

          Without coalescing | With coalescing
Tps (ms)  43.9               | 1.6

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: step-by-step memory access patterns of threads T0 to T7 for the three versions: without coalescing (gang(i), vector(j)), with coalescing (gang(j), vector(i)) and gang-vector(i) with seq(j). In the coalesced version the threads of a worker access consecutive memory locations at each step.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check if the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
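As a minimal sketch of how these clauses combine on a single compute construct (the arrays a, b, c and the bound n are hypothetical, not from the course examples):

```fortran
! Hypothetical example: a is only read, b is only written, c is device scratch.
!$ACC parallel loop copyin(a) copyout(b) create(c)
do i = 1, n
   c(i) = 2.0_8 * a(i)   ! c exists only on the device, never transferred
   b(i) = c(i) + 1.0_8   ! b is copied back D->H when the region ends
enddo
```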

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

Here the parallel region implicitly opens a data region of the same extent.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal.

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal?

Data on device - Local data regions

(Same example as on the previous slide.)

Notes
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No: you have to move the beginning of the data region before the first loop, so that the initialization kernel also runs inside it.

Data on device

Naive version: two successive parallel regions (each with its own implicit data region) both carry copyin, copyout, copy and create clauses for the arrays A, B, C and D.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing the two parallel regions, carrying the same copyin/copyout/copy/create clauses. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Since the clauses check for data presence, you can use the present clause on the inner parallel regions to make the code clearer. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
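The slide gives no code for this case; a minimal sketch (the array b and bound n are hypothetical, not from the course examples) could look like this:

```fortran
!$ACC data create(b)
b(:) = 42                 ! b is modified on the host
!$ACC update device(b)    ! push the new host values H->D
!$ACC parallel loop
do i = 1, n
   b(i) = b(i) + 1        ! the device now works on up-to-date values
enddo
!$ACC end data
```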

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Scope
• Module: the program
• Function: the function
• Subroutine: the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred to the device and b and d are allocated there. Kernels then update the device copies; update self(b) copies b back D→H inside the region; at region exit, only d is copied back to the host (copyout(d)), so the host copies of a, b and c keep their last host-side or updated values.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, kernel 1, the data transfer and kernel 2 run back to back; with kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: kernel 1 and the data transfer run concurrently; kernel 2 waits for the transfer.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition.
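As an illustration, a typical compile line with the flag quoted above might look like this (the compiler driver and source file name are assumptions, not from the slides):

```
# Build with GPU offload and autocompare enabled (flag from the slide;
# source file name is illustrative)
pgfortran -ta=tesla:cc60:autocompare -Minfo=accel race_condition.f90 -o race_condition
```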


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
   ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly).
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only).
• OpenACC KERNELS (offloading + automatic parallelization) ↔ no equivalent in OpenMP.

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables.
• OpenMP: scalar variables, non-scalar variables.

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective).

Data movement clauses
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• declare() (adds data to the global data region).
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit).
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code).
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region).
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region).

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP.
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS).

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• OpenACC routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism.

Asynchronism
• OpenACC async()/wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ OpenMP nowait/depend: manage asynchronism and task dependencies.

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2×20 = 40 cores
• Accelerated, with 4 NVIDIA V100: 32×80×4 = 10240 cores

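The core counts quoted above are simple products; as a sanity check of the slide's arithmetic:

```python
# Sanity check of the core counts quoted on the slide
cpu_cores = 2 * 20       # non-accelerated partition: 2 sockets x 20 cores
gpu_cores = 32 * 80 * 4  # 4 V100 GPUs, 80 SMs each, 32 cores per SM (slide figures)
print(cpu_cores, gpu_cores)  # -> 40 10240
```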

Introduction - The ways to GPU

From minimum to maximum programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance.
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness; still efficient.
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable; optimal performance.

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVIDIA, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers: PGI; Cray (for Cray hardware); GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers: Cray (for Cray hardware); GCC (since 7); CLANG; IBM XL; PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100). The host (CPU) is connected to the devices (GPU) through PLX switches and to the network (réseau). Bandwidths indicated on the figure: 900 GB/s (GPU memory), 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s for the various links.]

Introduction - Architecture CPU-GPU

[Figure-only slide.]

Introduction - Execution model OpenACC

[Figure: execution controlled by the host; the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocations are managed by the host.

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active, in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs groups of 32 threads are formed)
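As a quick sanity check, the formula above can be evaluated for the configuration used a few slides later (2 gangs, 5 workers, vector length 8). A minimal sketch in C; the helper name total_threads is ours, not part of OpenACC:

    #include <stdio.h>

    /* Total number of threads launched for a kernel:
       nbthreads = nbgangs * nbworkers * vectorlength */
    long total_threads(long nbgangs, long nbworkers, long vectorlength)
    {
        return nbgangs * nbworkers * vectorlength;
    }

    int main(void)
    {
        /* num_gangs(2) num_workers(5) vector_length(8) -> 80 threads */
        printf("%ld\n", total_threads(2, 5, 8));
        return 0;
    }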

Introduction - Execution model

NVidia P100: restrictions on Gang/Worker/Vector
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads at each point of a region launched with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). In gang-redundant code (!$ACC parallel) 2 threads are active; inside !$ACC loop worker, 10 threads (2 gangs x 5 workers); inside a nested !$ACC loop vector, 80 threads (2 x 5 x 8); a combined !$ACC loop worker vector also activates 80 threads.]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation. The compiler will do implicit operations that you want to be aware of.

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1.
The grid is the number of gangs. The block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   53.51%  2.70662s    500  5.4132ms     448ns  6.8519ms  [CUDA memcpy HtoD]
                   36.72%  1.85742s    300  6.1914ms     384ns  9.2907ms  [CUDA memcpy DtoH]
                    6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%  4.0968ms    100  40.968us  40.122us  41.725us  gol_52_gpu
                    0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Figure: NVVP timeline screenshot.]

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels, Teams, Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran / OpenACC:
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ / OpenACC:
#pragma acc directive <clauses>
{ ... }

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.
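To illustrate the C/C++ syntax, here is a minimal hedged sketch (the saxpy routine and array names are ours): a single combined parallel loop construct with data clauses, which are introduced later in the course. With a compiler that does not support OpenACC the pragma is simply ignored and the loop runs on the host:

    #include <stdio.h>

    #define N 8

    /* y <- a*x + y, offloaded with one combined "parallel loop" construct */
    void saxpy(int n, float a, const float *x, float *y)
    {
        #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(void)
    {
        float x[N], y[N];
        for (int i = 0; i < N; ++i) { x[i] = i; y[i] = 1.0f; }
        saxpy(N, 2.0f, x, y);
        printf("%g %g\n", y[0], y[N-1]);  /* 1 and 15 */
        return 0;
    }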

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively
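The same behavior can be sketched in C: two loop nests inside one kernels region become two kernels executed one after the other. This is a hedged illustration (function and array names are ours; without OpenACC support the pragma is ignored and the loops run on the host):

    /* Two loops inside one kernels region: each becomes its own kernel,
       launched consecutively, with parameters chosen by the compiler. */
    void fill_and_double(int n, int *a, int *b)
    {
        #pragma acc kernels copyout(a[0:n], b[0:n])
        {
            for (int i = 0; i < n; ++i)
                a[i] = i;            /* first kernel  */
            for (int i = 0; i < n; ++i)
                b[i] = 2 * a[i];     /* second kernel */
        }
    }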

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
  a = 2          ! Gang-redundant
  !$ACC loop     ! Work sharing
  do i = 1, n
    ...
  enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (compiled with acc=noautopar):
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                X
loop worker              X                               X
loop vector              X                               X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility
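A hedged C sketch combining two of these clauses: collapse merges the two nested loops into one iteration space, and reduction gives each thread a private partial sum (the function and array names are ours; without OpenACC support the pragma is ignored and the result is unchanged):

    #include <stdio.h>

    #define ROWS 4
    #define COLS 6

    /* Sum a 2D array; the two loops are collapsed into one 24-iteration
       space shared among threads, each keeping a private partial sum. */
    long sum2d(int m[ROWS][COLS])
    {
        long sum = 0;
        #pragma acc parallel loop collapse(2) reduction(+:sum)
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS; ++c)
                sum += m[r][c];
        return sum;
    }

    int main(void)
    {
        int m[ROWS][COLS];
        for (int r = 0; r < ROWS; ++r)
            for (int c = 0; c < COLS; ++c)
                m[r][c] = 1;
        printf("%ld\n", sum2d(m));  /* 24 */
        return 0;
    }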

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
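Besides the sum used in the examples, any operator from the table can be applied. A short hedged C sketch of a max reduction over an array (function name is ours; without OpenACC support the pragma is ignored):

    #include <stdio.h>

    /* Each thread keeps a private running maximum; the partial maxima
       are combined with the "max" operator at the end of the loop. */
    int array_max(int n, const int *a)
    {
        int m = a[0];
        #pragma acc parallel loop reduction(max:m)
        for (int i = 1; i < n; ++i)
            if (a[i] > m) m = a[i];
        return m;
    }

    int main(void)
    {
        int a[6] = {3, 41, 5, 26, 9, 14};
        printf("%d\n", array_max(6, a));  /* 41 */
        return 0;
    }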

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )           +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: memory access patterns with and without coalescing.]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory coalescing version is 27 times faster.
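Note that the contiguous index depends on the language: Fortran arrays are column-major (first index contiguous), while C arrays are row-major (last index contiguous). A hedged C sketch of the coalescing-friendly ordering, with the vector loop on the last index (array name is ours; without OpenACC the pragmas are ignored):

    #define NX 256

    double A[NX][NX];

    /* In C, A[i][j] and A[i][j+1] are adjacent in memory, so the vector
       loop should run over j for the accesses of a worker to be merged. */
    void fill_coalesced(void)
    {
        #pragma acc parallel loop gang
        for (int i = 0; i < NX; ++i) {
            #pragma acc loop vector
            for (int j = 0; j < NX; ++j)
                A[i][j] = 1.14;
        }
    }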

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three versions. Without coalescing: gang(i)/vector(j). With coalescing: gang(j)/vector(i) and gang-vector(i)/seq(j). In the coalesced versions, consecutive threads T0..T7 access consecutive memory locations.]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
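The same pattern in C uses #pragma acc routine on the called function. A minimal hedged sketch (names are ours; without OpenACC support the pragmas are ignored and the code runs sequentially):

    #include <stdio.h>

    #define N 4

    /* Declare that a device (sequential) version of this function
       must be generated, so it can be called from an offloaded loop. */
    #pragma acc routine seq
    int doubled(int v)
    {
        return 2 * v;
    }

    int main(void)
    {
        int a[N];
        #pragma acc parallel loop copyout(a[0:N])
        for (int i = 0; i < N; ++i)
            a[i] = doubled(i);
        printf("%d\n", a[N-1]);  /* 6 */
        return 0;
    }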

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
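A hedged C sketch of the movement clauses: the input is copied in, the result is copied out, and a scratch array is only created on the device (names are ours; without OpenACC the pragma is ignored and everything stays on the host):

    #include <stdio.h>

    #define N 6

    /* b gets a[i] squared; tmp never moves between host and device.
       Assumes n <= N. */
    void squares(int n, const int *a, int *b)
    {
        int tmp[N];
        #pragma acc parallel loop copyin(a[0:n]) copyout(b[0:n]) create(tmp[0:n])
        for (int i = 0; i < n; ++i) {
            tmp[i] = a[i];
            b[i] = tmp[i] * tmp[i];
        }
        /* tmp's device copy is freed here; after an actual offloaded
           run its host values are unspecified. */
    }

    int main(void)
    {
        int a[N] = {1, 2, 3, 4, 5, 6}, b[N];
        squares(N, a, b);
        printf("%d\n", b[5]);  /* 36 */
        return 0;
    }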

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1)
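The C/C++ restriction above can be sketched as follows (the helper name `iota` is ours). For a `malloc`'d pointer the compiler cannot know the extent, so `[first:count]` is mandatory in the clause; without OpenACC the pragma is ignored and the loop runs on the host.

```c
#include <stdlib.h>

/* Fill a dynamically allocated vector; the shape [0:n] (first index,
   element count) must be given since the size is unknown statically. */
double *iota(int n) {
    double *v = malloc(n * sizeof *v);
    #pragma acc parallel loop copyout(v[0:n])   /* shape required here */
    for (int i = 0; i < n; ++i)
        v[i] = i + 1.0;
    return v;
}
```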

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l
!$ACC parallel copyout(a(1:1000))   ! opens the data region
!$ACC loop                          ! parallel region
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Optimal? No! You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region. (Figure: variables A, B, C and D go through the copyin, copyout, copy and create clauses at each of the two regions.)

• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: with only the two parallel regions (and their implicit data regions), A, B, C and D are still moved twice. Let's add a data region.

Data on device

Optimize data transfers: with an enclosing data region, A, B and C are now transferred at entry and exit of the data region.

• Transfers: 4
• Allocation: 1

Data on device

Optimize data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocation: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
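The two update directions can be sketched together in C (the helper name `add_on_device` is ours). Without an OpenACC compiler the pragmas are ignored and the host and device "copies" are one and the same array, so the result is identical.

```c
/* Sketch: a data region with explicit updates. update device pushes a
   host-side change H→D; update self pulls the result D→H before the
   region ends. Pragmas are ignored when compiled without OpenACC. */
void add_on_device(int *b, int n, int inc) {
    #pragma acc data copyin(b[0:n])
    {
        b[0] += 100;                       /* host-side modification  */
        #pragma acc update device(b[0:n])  /* make the device see it  */

        #pragma acc parallel loop present(b[0:n])
        for (int i = 0; i < n; ++i)
            b[i] += inc;

        #pragma acc update self(b[0:n])    /* bring results back D→H */
    }
}
```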

Data on device - Global data regions: enter data/exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
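The constructor/destructor pattern the notes mention can be sketched in C with hypothetical helper names (`field_new`, `field_fill`, `field_delete`): the device allocation outlives the function that created it. Without OpenACC the pragmas are ignored and only the host allocation remains, with the same observable behaviour.

```c
#include <stdlib.h>

/* Unstructured device lifetime: allocation and use live in different
   functions, as with C++ constructors and destructors. */
double *field_new(int n) {
    double *f = malloc(n * sizeof *f);
    #pragma acc enter data create(f[0:n])  /* device copy persists after return */
    return f;
}

void field_fill(double *f, int n, double v) {
    #pragma acc parallel loop present(f[0:n])
    for (int i = 0; i < n; ++i)
        f[i] = v;
    #pragma acc update self(f[0:n])        /* results visible on the host */
}

void field_delete(double *f, int n) {
    #pragma acc exit data delete(f[0:n])   /* end of the device lifetime */
    free(f);
}
```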

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the programming unit where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

(Figure: evolution of the host and device copies of a, b, c and d along a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b and d are allocated on the device; the device kernels then compute new values; update self(b) copies b D→H inside the region; later host-side changes to a are not seen by the device; at exit of the data region only d is copied D→H.)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Figure: without async, kernel 1, the data transfer and kernel 2 run one after the other; with async on kernel 1 and the transfer, or on everything, they overlap.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Figure: kernel 2 waits for the transfer on execution thread 2 before starting.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
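The schedule on this slide can be sketched in C (the helper name `pipeline` is ours): three queues, and the kernel that consumes `b` waits only on the transfer queue. Compiled without OpenACC everything runs in program order, which is one legal (serial) schedule of the same dependencies.

```c
/* Sketch of async queues plus a targeted wait. */
void pipeline(double *a, double *b, double *c, int n) {
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) a[i] = 42.0;

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop wait(2)   /* needs the transfer, not queue 1 */
    for (int i = 0; i < n; ++i) b[i] += 11.0;

    #pragma acc parallel loop async(3)
    for (int i = 0; i < n; ++i) c[i] = 11.0;

    #pragma acc wait                    /* join all queues before returning */
}
```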


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Debugging - PGI autocompare

• To activate this feature just add autocompare to the compilation (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
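The transfer pattern of this final version can be sketched in C (the helper name `generations_loop` is ours, and the stencil body is elided): one data region wraps the whole generation loop, and the kernels inside only assert presence. Without OpenACC the pragmas are ignored and a plain host version remains.

```c
/* Sketch of the optimized structure: data region outside the time loop. */
int generations_loop(int *world, int *oworld, int ncell, int gens) {
    int cells = 0;
    #pragma acc data copy(world[0:ncell]) create(oworld[0:ncell])
    for (int g = 0; g < gens; ++g) {
        #pragma acc parallel loop present(world[0:ncell], oworld[0:ncell])
        for (int i = 0; i < ncell; ++i)
            oworld[i] = world[i];          /* save previous generation */

        /* ... the neighbour-count stencil updating world goes here ... */

        cells = 0;
        #pragma acc parallel loop reduction(+:cells) present(world[0:ncell])
        for (int i = 0; i < ncell; ++i)
            cells += world[i];             /* population count */
    }
    return cells;
}
```

With the stencil elided, the function just counts the live cells of the unchanged grid; the point is where the data region sits, not the physics.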


Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode; each gang executes instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 ↔ OpenMP 5.0

Data region
• declare (add data to the global data region)
• data ... end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) ... END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 ↔ OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:
• Non-accelerated: 2×20 = 40 cores
• Accelerated with 4 Nvidia V100: 32×80×4 = 10240 cores


Introduction - The ways to GPU

Programming effort and technical expertise increase from libraries to directives to programming languages.

Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA)
• Minimum change in the code
• Maximum performance

Directives (OpenACC, OpenMP 5.0)
• Portable, simple, low intrusiveness
• Still efficient

Programming languages (CUDA, OpenCL)
• Complete rewriting, complex
• Non-portable
• Optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

(Figure: architecture of a converged node on Jean Zay: 40 cores, 192 GB of memory, 4 GPU V100. The host (CPU) is connected to the devices (GPU) through PLX switches and to the network; the links are annotated with their bandwidths: 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s.)

Introduction - Architecture CPU-GPU

(Figure: schematic comparison of CPU and GPU architectures.)

Introduction - Execution model OpenACC

(Figure: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.)

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocations are managed by the host.

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
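The thread-count formula above can be written as a one-line helper (ours, for illustration), checked against the num_gangs(2) num_workers(5) vector_length(8) configuration used a few slides later:

```c
/* nbthreads = nbgangs * nbworkers * vectorlength */
int nbthreads(int nbgangs, int nbworkers, int vectorlength) {
    return nbgangs * nbworkers * vectorlength;
}
```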

Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 − 1
• The thread-grid size (i.e. nbworkers × vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

(Figure: number of active threads at each construct, for num_gangs(2) num_workers(5) vector_length(8): a parallel construct that starts the gangs has 2 active threads; an !$ACC loop worker activates 10; an !$ACC loop vector inside it activates 80; an !$ACC loop worker vector also activates 80; between the loops the count falls back to 2.)

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:
1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

(Figure: cycle — pick a kernel → analyze → parallelize loops → optimize loops → optimize data.)

PGI compiler

During this training course we will use PGI compilers for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
!$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
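The same fill-then-reduce pattern can be sketched in C (the helper name `fill_and_sum` is ours): a data region keeps the array on the device between the two kernels. Without an OpenACC compiler the pragmas are ignored and the pair of host loops gives the same sum.

```c
#include <stdlib.h>

/* Fill a[i] = i+1 on the device, then reduce it, without the array
   ever travelling back to the host. */
long fill_and_sum(int n) {
    int *a = malloc(n * sizeof *a);
    long sum = 0;
    #pragma acc data create(a[0:n])
    {
        #pragma acc parallel loop present(a[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = i + 1;

        #pragma acc parallel loop reduction(+:sum) present(a[0:n])
        for (int i = 0; i < n; ++i)
            sum += a[i];
    }
    free(a);
    return sum;
}
```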

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength × nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP
(screenshot of the NVVP interface)


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between Fortran and C/C++ code.

Fortran/OpenACC
  !$ACC directive <clauses>
  ...
  !$ACC end directive

C/C++/OpenACC
  #pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation, the examples are in Fortran.
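The "treated as comments" property can be seen with a minimal C/OpenACC sketch (a hypothetical helper, not one of the course's examples): compiled with a plain C compiler such as gcc, the pragma is ignored and the loop simply runs sequentially on the host, producing the same result as the offloaded version.

```c
#include <stddef.h>

/* Hypothetical helper: doubles b into a. Built without OpenACC
   support, the pragma below is treated as a comment and the loop
   runs sequentially on the host. */
void scale_by_two(const double *b, double *a, size_t n) {
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}
```

This is what makes directive-based porting incremental: the same source remains a valid sequential program.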

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
  !$ACC serial
  do i=1,n
    do j=1,n
      ...
    enddo
    do j=1,n
      ...
    enddo
  enddo
  !$ACC end serial

Offloaded execution - Directive serial

  !$ACC serial
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s
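The update rule used throughout these Game of Life listings can be isolated as a small C helper (a sketch with an illustrative grid size, not the course's Fortran file): the neighbour count and the life/death conditions mirror those of the loop above.

```c
/* One Game of Life update for an interior cell, mirroring the rules
   above: a live cell with fewer than 2 or more than 3 live neighbours
   dies; a dead cell with exactly 3 live neighbours becomes alive. */
#define N 8   /* illustrative grid size */

int next_state(const int oworld[N][N], int r, int c) {
    int neigh = oworld[r-1][c-1] + oworld[r][c-1] + oworld[r+1][c-1]
              + oworld[r-1][c]                    + oworld[r+1][c]
              + oworld[r-1][c+1] + oworld[r][c+1] + oworld[r+1][c+1];
    if (oworld[r][c] == 1 && (neigh < 2 || neigh > 3))
        return 0;              /* under- or over-population */
    else if (neigh == 3)
        return 1;              /* birth */
    return oworld[r][c];       /* unchanged otherwise */
}
```

Because each cell reads only the previous generation (oworld) and writes only the new one (world), all cells of one generation are independent, which is what makes the nest a good offloading candidate.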

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran
  !$ACC kernels
  do i=1,n
    do j=1,n
      ...
    enddo
    do j=1,n
      ...
    enddo
  enddo
  !$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

  !$ACC kernels
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

  !$ACC parallel
  a = 2           ! Gang-redundant
  !$ACC loop      ! Work sharing
  do i=1,n
    ...
  enddo
  !$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

  !$ACC parallel
  do g=1,generations
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

  program param
    integer :: a(1000)
    integer :: i
    !$ACC parallel num_gangs(10) num_workers(1) &
    !$ACC& vector_length(128)
    print *, "Hello! I am a gang"
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel
  end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

               Work sharing (loop iterations)   Activation of new threads
  loop gang              X
  loop worker            X                                X
  loop vector            X                                X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility
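The collapse clause can be sketched in C/OpenACC syntax (names and sizes are illustrative, and a plain C compiler ignores the pragma): the two tightly nested loops are merged into a single iteration space of n*n independent iterations, which gives the compiler more parallelism to distribute.

```c
#include <stddef.h>

/* Fills an n x n matrix stored row-major. collapse(2) asks the
   compiler to merge both loops into one 0..n*n-1 iteration space. */
void fill_matrix(double *a, size_t n) {
    #pragma acc parallel loop collapse(2) copyout(a[0:n*n])
    for (size_t i = 0; i < n; ++i)
        for (size_t j = 0; j < n; ++j)
            a[i*n + j] = (double)(i + j);
}
```

collapse is most useful when the outer loop alone has too few iterations to fill the device.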

Work sharing - Loops

  !$ACC parallel
  do g=1,generations
    !$ACC loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial
  !$acc serial
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end serial
exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GRWSVS
  !$acc parallel num_gangs(10)
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS
  !$acc parallel num_gangs(10)
  !$acc loop gang
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS
  !$acc parallel num_gangs(10)
  !$acc loop gang worker
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP
  !$acc parallel num_gangs(10)
  !$acc loop gang vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP
  !$acc parallel num_gangs(10)
  !$acc loop gang worker vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS
  !$acc parallel num_gangs(10)
  !$acc loop worker
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP
  !$acc parallel num_gangs(10)
  !$acc loop vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP
  !$acc parallel num_gangs(10)
  !$acc loop worker vector
  do i=1,nx
    A(i)=B(i)*i
  end do
  !$acc end parallel
exemples/loop_GRWPVP.f90
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

  !$acc parallel
  !$acc loop gang
  do i=1,nx
    A(i)=1.0_8
  end do
  !$acc loop gang reduction(+:somme)
  do i=nx,1,-1
    somme=somme+A(i)
  end do
  !$acc end parallel
exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

  !$acc parallel
  !$acc loop worker vector
  do i=1,nx
    A(i)=1.0_8
  end do
  !$acc loop worker vector reduction(+:somme)
  do i=nx,1,-1
    somme=somme+A(i)
  end do
  !$acc end parallel
exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)
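The safe accumulation pattern can be sketched in C/OpenACC syntax (illustrative names; a plain C compiler ignores the pragma and runs the loop sequentially): the reduction clause gives each thread a private partial sum, combined only when the loop ends, so no thread ever races on the shared variable.

```c
#include <stddef.h>

/* Sums an array with an OpenACC reduction: each thread accumulates
   into a private copy of 'sum', and the partial sums are combined
   at the end of the loop. */
double array_sum(const double *a, size_t n) {
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum) copyin(a[0:n])
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```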

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

  program fused
    integer :: a(1000), b(1000), i
    !$ACC kernels <clauses>
    !$ACC loop <clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
    !$ACC end kernels
    ! The construct above is equivalent to
    !$ACC kernels loop <kernels and loop clauses>
    do i=1,1000
      a(i) = b(i)*2
    enddo
  end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators:

  Operation  Effect       Language(s)       Operation  Effect       Language(s)
  +          Sum          Fortran/C(++)     iand       Bitwise and  Fortran
  *          Product      Fortran/C(++)     ior        Bitwise or   Fortran
  max        Maximum      Fortran/C(++)     ieor       Bitwise xor  Fortran
  min        Minimum      Fortran/C(++)     .and.      Logical and  Fortran
  &          Bitwise and  C(++)             .or.       Logical or   Fortran
  |          Bitwise or   C(++)
  &&         Logical and  C(++)
  ||         Logical or   C(++)

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
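The max operator from the table can be sketched the same way as the sum (illustrative names; the pragma is ignored by a plain C compiler): each thread keeps a private running maximum, and the per-thread maxima are combined at the end of the loop.

```c
#include <stddef.h>

/* Maximum of an array using the 'max' reduction operator from the
   table above; n is assumed to be at least 1. */
double array_max(const double *a, size_t n) {
    double m = a[0];
    #pragma acc parallel loop reduction(max:m) copyin(a[0:n])
    for (size_t i = 1; i < n; ++i)
        if (a[i] > m) m = a[i];
    return m;
}
```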

Parallelism - Reductions example

  !$ACC parallel
  do g=1,generations
    !$ACC loop
    do r=1,rows
      do c=1,cols
        oworld(r,c) = world(r,c)
      enddo
    enddo
    do r=1,rows
      do c=1,cols
        neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
        if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
          world(r,c) = 0
        else if (neigh == 3) then
          world(r,c) = 1
        endif
      enddo
    enddo
    cells = 0
    !$ACC loop reduction(+:cells)
    do r=1,rows
      do c=1,cols
        cells = cells + world(r,c)
      enddo
    enddo
    print *, "Cells alive at generation", g, cells
  enddo
  !$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Figures: example with coalescing / example without coalescing)

Parallelism - Memory accesses and coalescing

  !$acc parallel
  !$acc loop gang
  do i=1,nx
    !$acc loop vector
    do j=1,nx
      A(i,j)=1.14_8
    end do
  end do
  !$acc end parallel
exemples/loop_nocoalescing.f90

  !$acc parallel
  !$acc loop gang
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j)=1.14_8
    end do
  end do
  !$acc end parallel
exemples/loop_coalescing.f90

  !$acc parallel
  !$acc loop gang vector
  do i=1,nx
    !$acc loop seq
    do j=1,nx
      A(i,j)=1.14_8
    end do
  end do
  !$acc end parallel
exemples/loop_coalescing.f90

            Without coalescing   With coalescing
  Tps (ms)  43.9                 1.6

The memory coalescing version is 27 times faster.
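The same experiment can be sketched in C (illustrative names; a plain C compiler ignores the pragmas, so both versions compute the same matrix). Note that C arrays are row-major, the opposite of Fortran: here the contiguous, coalescing-friendly index is the last one, so the vector loop should run over j.

```c
#include <stddef.h>

enum { NX = 32 };  /* illustrative size */

/* Stride-NX accesses: the vector loop runs over the FIRST index,
   so consecutive threads touch non-contiguous memory. */
void fill_strided(double a[NX][NX]) {
    #pragma acc parallel loop gang
    for (size_t j = 0; j < NX; ++j) {
        #pragma acc loop vector
        for (size_t i = 0; i < NX; ++i)
            a[i][j] = 1.14;
    }
}

/* Stride-1 accesses: the vector loop runs over the LAST index
   (row-major), so consecutive threads touch contiguous memory. */
void fill_contiguous(double a[NX][NX]) {
    #pragma acc parallel loop gang
    for (size_t i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (size_t j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```

Both produce identical results; on a GPU only the second keeps the memory accesses of a worker mergeable.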

Parallelism - Memory accesses and coalescing

(Diagrams: mapping of threads T0..T7 onto the elements of A for the three variants)
• Without coalescing, gang(i)/vector(j): consecutive threads of a worker access elements separated by a full column stride, so their memory accesses cannot be merged
• With coalescing, gang(j)/vector(i): consecutive threads access contiguous elements (Fortran arrays are column-major), so their accesses are merged
• gang-vector(i)/seq(j): each thread walks sequentially along j while consecutive threads stay contiguous along i, so the accesses are also merged

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

  program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel
    !$ACC loop
    do i=1,s
      call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
  contains
    subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
        array(i,j) = 2
      enddo
    end subroutine fill
  end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

  program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel copyout(array)
    !$ACC loop
    do i=1,s
      call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
  contains
    subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
        array(i,j) = 2
      enddo
    end subroutine fill
  end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
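A sketch combining the movement clauses on one data region (C/OpenACC syntax, illustrative names; a plain C compiler ignores the pragmas and runs the loops on the host): b is only read (copyin), a is only produced (copyout), and tmp is scratch that, on a GPU, would never travel between host and device (create).

```c
#include <stddef.h>

/* b: input only (copyin), a: output only (copyout),
   tmp: device-only scratch (create). */
void transform(const double *b, double *a, double *tmp, size_t n) {
    #pragma acc data copyin(b[0:n]) copyout(a[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            tmp[i] = 2.0 * b[i];    /* intermediate stays on the device */
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            a[i] = tmp[i] + 1.0;
    }
}
```

Choosing the weakest sufficient clause (create instead of copy, copyin instead of copy) is the main lever for cutting transfer time.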

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

  ! We copy a 2d array on the GPU for matrix a
  ! In Fortran we can omit the shape: copy(a)
  !$ACC parallel loop copy(a(1:1000,1:1000))
  do i=1,1000
    do j=1,1000
      a(i,j) = 0
    enddo
  enddo
  ! We copy columns 100 to 199 included
  !$ACC parallel loop copy(a(:,100:199))
  do i=1,1000
    do j=100,199
      a(i,j) = 42
    enddo
  enddo
exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

  // We copy the array a by giving the first element and the size of the array
  #pragma acc parallel loop copy(a[0:1000][0:1000])
  for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
      a[i][j] = 0;
  // Copy columns 99 to 198
  #pragma acc parallel loop copy(a[0:1000][99:100])
  for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
      a[i][j] = 42;
exemples/arrayshape.c
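The C [start:count] convention is a frequent source of off-by-one errors, so here is a 1D sketch (sizes illustrative; the pragma is ignored by a plain C compiler): a[100:50] designates the 50 elements at indices 100..149, not "indices 100 to 50".

```c
/* The data clause mirrors the touched slice with C's [start:count]
   syntax: a[100:50] = 50 elements starting at index 100. */
void fill_slice(int *a) {
    #pragma acc parallel loop copy(a[100:50])
    for (int i = 100; i < 150; ++i)
        a[i] = 42;
}
```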

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

  program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))   ! opens both a data region and a parallel region
    !$ACC loop
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel
  end program para

exemples/parallel_data.f90

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

  program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = i
    enddo
    !$ACC end parallel

    do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
        a(i) = a(i) + 1
      enddo
      !$ACC end parallel
    enddo
  end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  !$ACC data copy(a(1:1000))
  do l=1,100
    !$ACC parallel
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
  !$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.
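The single-data-region pattern, with the region opened before the first kernel as the slide recommends, can be sketched in C (illustrative names; with the pragmas ignored by a plain C compiler the arithmetic is unchanged): on a GPU, 'a' would stay resident on the device across all 101 kernels, with transfers only at the region boundaries.

```c
enum { N_ELEMS = 1000 };  /* illustrative size */

/* One data region enclosing the initialization kernel and the 100
   increment kernels: 'a' is transferred once, at region exit. */
void init_and_increment(int *a) {
    #pragma acc data copyout(a[0:N_ELEMS])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N_ELEMS; ++i)
            a[i] = i;
        for (int l = 0; l < 100; ++l) {
            #pragma acc parallel loop
            for (int i = 0; i < N_ELEMS; ++i)
                a[i] = a[i] + 1;
        }
    }
}
```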

Data on device

Naive version: two successive parallel regions, each with its own implicit data region carrying copyin(A), copyout(B), copy(C) and create(D).
• Transfers: 8
• Allocations: 2

Optimizing data transfers: let's add an explicit data region enclosing both parallel regions, with copyin(A), copyout(B), copy(C) and create(D). A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause on the inner parallel regions. To make sure the updates are done, use the update directive (update device, update self).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

  !$ACC data copyout(a)
  !$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo
  do j=1,42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
  ! print *, "before update self", a(42)
  ! !$ACC update self(a(42:42))
  ! print *, "after update self", a(42)
  !$ACC serial
  a(42) = 42
  !$ACC end serial
  print *, "before end data", a(42)
  !$ACC end data
  print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the array a is not initialized on the host before the end of the data region.

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

The code is the same as on the previous slide, with the update directive active this time:

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in a device clause are updated H→D.


Data on device - Global data regions: enter data / exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the program unit where it appears. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

Figure: contents of a, b, c and d on the host and on the device over the lifetime of a region opened with copyin(a,c) create(b) copyout(d). Starting from host values a=10, b=-4, c=42, d=0: at entry, a and c are copied H→D while b and d are only allocated on the device. Kernels then compute b=26 and d=158 on the device; update self(b) refreshes the host copy of b (D→H) during the region. At exit, only d is copied back (copyout, D→H), so the host ends with a=10, b=26, c=42, d=158.


Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

Figure: executed synchronously, Kernel 1, the data transfer and Kernel 2 follow each other on a single time line; with async on Kernel 1 and the transfer, or with everything async, they overlap.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

Figure: the transfer runs on its own time line and Kernel 2 waits for the transfer before starting.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).

• Example of code that has a race condition:


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)            +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs:

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)            +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life
Let's optimize data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)            +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses with notes)

GPU offloading
  PARALLEL (GR/WS/VS mode, each gang executes the instructions redundantly)  ↔  TEAMS (creates teams of threads that execute redundantly)
  SERIAL (only 1 thread executes the code)  ↔  TARGET (GPU offloading, initial thread only)
  KERNELS (offloading + automatic parallelization)  ↔  does not exist in OpenMP

Implicit transfers H→D or D→H
  scalar variables, non-scalar variables  ↔  scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
  clauses of PARALLEL/SERIAL/KERNELS  ↔  clauses of TARGET (data movement defined from the host perspective)
  copyin() (copy H→D when entering the construct)  ↔  map(to:)
  copyout() (copy D→H when leaving the construct)  ↔  map(from:)
  copy() (copy H→D when entering and D→H when leaving)  ↔  map(tofrom:)
  create() (created on device, no transfers)  ↔  map(alloc:)
  present(), delete()


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses with notes)

Data region
  declare (adds data to the global data region)
  data / end data (inside the same programming unit)  ↔  TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
  enter data / exit data (opening and closing anywhere in the code)  ↔  TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
  update self() (update D→H)  ↔  UPDATE FROM() (copy D→H inside a data region)
  update device() (update H→D)  ↔  UPDATE TO() (copy H→D inside a data region)

Loop parallelization
  kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops)  ↔  does not exist in OpenMP
  loop gang/worker/vector (share iterations among gangs, workers and vectors)  ↔  distribute (share iterations among TEAMS)


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses with notes)

Routines
  routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
  async()/wait (execute kernels and transfers asynchronously, with explicit synchronization)  ↔  nowait/depend (manage asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 =

32times80times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 = 32

times80times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 = 32times80

times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 = 32times80times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - The ways to GPU

Applications

Programming effort and technical expertise

Libraries bull cuBLAS bull cuSPARSE bull cuRAND bull AmgX bull MAGMA

bull Minimum change in the code bull Maximum performance

Directives bull OpenACC bull OpenMP 50

bull Portable simple low intrusiveness

bull Still efficient

Programming languages bull CUDA bull OpenCL

bull Complete rewriting complex bull Non-portable bull Optimal performance

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 6 77

Introduction - Short history

OpenACCbull httpwwwopenaccorgbull Cray NVidia PGI CAPSbull First standard 10 (112011)bull Latest standard 30 (112019)bull Main compilers

minus PGIminus Cray (for Cray hardware)minus GCC (since 57)

OpenMP targetbull httpwwwopenmporgbull First standard OpenMP 45 (112015)bull Latest standard OpenMP 50 (112018)bull Main compilers

minus Cray (for Cray hardware)minus GCC (since 7)minus CLANGminus IBM XLminus PGI support announced

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 7 77

Introduction - Architecture of a converged node on Jean Zay

PLX PLX

900 GBs

16 GBs

128 GBs

25 GBs

124 GBs

1 co

nve

rged

no

de

on

Je

an Z

ay

Ho

st (CP

U)

Device

(GP

U)

Reacuteseau

Architecture of a converged node on Jean Zay (40 cores 192 GB of memory 4 GPU V100)

208 GBs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 8 77

Introduction - Architecture CPU-GPU

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 9 77

Introduction - Execution model OpenACC

Execution controlled by the host

Launch kernel1

Launch kernel2

Launch kernel3

CPU

Acc

In OpenACC the execution is controlled by the host (CPU) It means that kernels data transfersand memory allocation are managed by the host

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 10 77

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded executionbull Coarse grain Gangbull Fine grain workerbull Vectorization vectorbull Sequential seq

By specifying clauses within OpenACC directives your code will be run with a combination of thefollowing modes bull Gang-Redundant (GR) All gangs run the same instructions redundantelybull Gang-Partitioned (GP) Work is shared between the gangs (ex $ACC loop gang)bull Worker-Single (WS) One worker is active in GP or GR modebull Worker-Partitioned (WP) Work is shared between the workers of a gangbull Vector-Single (VS) One vector channel is activebull Vector-Partitioned (VP) Several vector channels are active

These modes must be combined in order to get the best performance from the acclerator

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 11 77

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of theaccelerator These threads are grouped within teams of the same size with one master thread perteam (this defines a gang) Each team is spread on a 2D thread-grid (worker x vector) Oneworker is actually a vector of vectorlength threads So the total number of threads is

nbthreads = nbgangs lowast nbworkers lowast vectorlength

Important notesbull No synchronization is possible between

gangsbull The compiler can decide to synchronize the

threads of a gang (all or part of them)bull The threads of a worker run in SIMT mode

(all threads run the same instruction at thesame time for example on NVidia GPUsgroups of 32 threads are formed)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77

Introduction - Execution model

NVidia P100 restrictions on GangWorkerVectorbull The number of gangs is limited to 231 minus 1bull The thread-grid size (ie nbworkers x vectorlength) is limited to 1024bull Due to register limitations the size of the grid should be less than or equal to 256 if the

programmer wants to be sure that a kernel can be launchedbull The size of a worker (vectorlength) should be a multiple of 32bull PGI limitation In a kernel that contains calls to external subroutines (not seq) the size of a

worker is set at to 32

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77

Introduction - Execution model

ACC parallel (starts gangs)

ACC loop worker

ACC loop vector

ACC parallel (starts gangs)

ACC loop worker vector

ACC kernels num_gangs(2) num_workers(5) vector_length(8)

Active threads2

10

80

10

2

2

80

2

80

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77

Introduction - Porting strategy

When using OpenACC it is recommended tofollow this strategy

1 Identify the compute intensive loops(Analyze)

2 Add OpenACC directives (Parallelization)

3 Optimize data transfers and loops(Optimizations)

4 Repeat steps 1 to 3 until everything is onGPU

Optimize data

Analyze

Parallelizeloops

Optimize loops

Pick a kernel

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77

PGI compiler

During this training course we will use PGI forthe pieces of code containing OpenACCdirectives

Informationbull Company founded in 1989bull Acquired by NVidia in 2013bull Develops compilers debuggers and

profilers

Compilersbull Latest version 1910bull Hands-on version 1910bull C pgccbull C++ pgc++bull Fortran pgf90 pgfortran

Activate OpenACCbull -acc Activates OpenACC supportbull -ta=ltoptionsgt OpenACC optionsbull -Minfo=accel USE IT Displays information

about compilation The compiler will doimplicit operations that you want to be awareof

Toolsbull nvprof CPUGPU profilerbull nvvp nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta

Compute capabilityEach GPU generation has increased capabilitiesFor NVidia hardware this is reflected by anumberbull K80 cc35bull P100 cc60bull V100 cc70

For example if you want to compile for V100 -ta=teslacc70

Memory managementbull pinned The memory location on the host is

pinned It might improve data transfersbull managed The memory of both the host and

the device(s) is unified

Complete documentationhttpswwwpgroupcomresourcesdocs1910x86pvf-user-guideindexhtmta

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

1program loopi n t e g e r a (10000)i n t e g e r i$ACC p a r a l l e l loopdo i =1 10000

6 a ( i ) = ienddo

end program loop

exemplesloopf90

program reduc t ion2 i n t e g e r a (10000)

i n t e g e r ii n t e g e r sum=0$ACC p a r a l l e l loopdo i =1 10000

7 a ( i ) = ienddo$ACC p a r a l l e l loop reduc t ion ( + sum)do i =1 10000

sum = sum + a ( i )12 enddo

end program reduc t ion

exemplesreductionf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

$ACC p a r a l l e l loop2 do i =1 10000

a ( i ) = ienddo$ACC p a r a l l e l loop reduc t ion ( + sum)do i =1 10000

7 sum = sum + a ( i )enddo

exemplesreductionf90

To avoid CPU-GPU communication a data region is added

$ACC data create ( a (1 10000) )$ACC p a r a l l e l loop

3 do i =1 10000a ( i ) = i

enddo$ACC p a r a l l e l loop reduc t ion ( + sum)do i =1 10000

8 sum = sum + a ( i )enddo$ACC end data

exemplesreduction_data_regionf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code

PGI_ACC_TIMEbull Command line toolbull Gives basic information

aboutminus Time spent in kernelsminus Time spent in data

transfersminus How many times a kernel

is executedminus The number of gangs

workers and vector sizemapped to hardware

nvprofbull Command line toolbull Options which can give you

a fine view of your code

nvvppgprofNSightGraphics Profilerbull Graphical interface for

nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers it ispossible to have an automatic profiling of thecode by setting the following environmentvariable PGI_ACC_TIME=1The grid is the number of gangs The block is thesize of one gang ([vectorlength times nbworkers])

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 21 77

Analysis - Code profiling nvprofNvidia provides a command line profiler nvprofIt gives the time spent in the kernels and data transfers for all GPU regionsCaution PGI_ACC_TIME and nvprof are incompatible Ensure that PGI_ACC_TIME=0 beforeusing the tool

$ nvprof ltoptionsgt executable_file ltarguments of executable filegt

1 Type Time() Time Ca l l s Avg Min Max NameGPU a c t i v i t i e s 5351 270662s 500 54132ms 448ns 68519ms [CUDA memcpy HtoD ]

3672 185742s 300 61914ms 384ns 92907ms [CUDA memcpy DtoH ]634 32052ms 100 32052ms 31372ms 36247ms gol_39_gpu262 13239ms 100 13239ms 12727ms 13659ms gol_33_gpu

6 081 40968ms 100 40968us 40122us 41725us gol_52_gpu001 35927us 100 35920us 33920us 40640us gol_55_gpu__red

exemplesprofilsgol_opti2prof

Important notesbull Only the GPU is profiled by defaultbull Option ldquo--cpu-profiling onrdquo activates CPU profilingbull Options ldquo--metrics flop_count_dp --metrics dram_read_throughput --metrics

dram_write_throughputrdquo give respectively the number of operations memory readthroughput and memory write throughput for each kernel running on the CPU

bull To get a file readable by nvvp you have to specify ldquo-o filenamerdquo (add -f in case you want tooverwrite a file)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 22 77

Analysis - Graphical profiler nvvpNvidia also provides a GUI for nvprof nvvpIt gives the time spent in the kernels and data transfers for all GPU regionsCaution PGI_ACC_TIME and nvprof are incompatible Ensure that PGI_ACC_TIME=0 beforeusing the tool

$ nvvp ltoptionsgt executable lt--args arguments of executablegt

or if you want to open a profile generated with nvprof

$ nvvp ltoptionsgt progprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 23 77

Analysis - Graphical profiler NVVP

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 24 77

1 Introduction

2 ParallelismOffloaded Execution

Add directivesCompute constructsSerialKernelsTeamsParallel

Work sharingParallel loopReductionsCoalescing

RoutinesRoutines

Data ManagementAsynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 25 77

Add directives

To activate OpenACC or OpenMP features you need to add directives If the right compileroptions are not set the directives are treated as commentsThe syntax is different for a Fortran or a CC++ code

FortranOpenACC

$ACC d i r e c t i v e ltclauses gt $ACC end d i r e c t i v e

CC++OpenACC

pragma acc d i r e c t i v e ltclauses gt2

Some exampleskernels loop parallel data enter data exit dataIn this presentation the examples are in Fortran

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 26 77

Execution offloading

To run on the device you need to specify one of the 3 following directives (called computeconstruct)

serial

The code runs one singlethread (1 gang with 1 worker ofsize 1) Only one kernel is gen-erated

kernels

The compiler analyzes the codeand decides where parallelismis generatedOne kernel is generated foreach parallel loop enclosed inthe region

parallel

The programmer creates a par-allel region that runs on the de-vice Only one kernel is gener-ated The execution is redun-dant by defaultbull Gang-redundantbull Worker-Singlebull Vector-Single

The programmer has to sharethe work manually

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The directive serial tells the compiler that theenclosed region of code have to be offloaded tothe GPUInstructions are executed only by one threadThis means that one gang with one worker ofsize one is generatedThis is equivalent to a parallel region with thefollowing parameters num_gangs(1)num_workers(1) vector_length(1)

Data Default behaviorbull Arrays present in the serial region not

specified in a data clause (present copyincopyout etc) or a declare directive areassigned to a copy They are SHARED

bull Scalar variables are implicitly assigned to afirstprivate clause The are PRIVATE

Fortran1 $ACC s e r i a l

do i =1 ndo j =1 n enddo

6do j =1 n enddo

enddo11 $ACC end s e r i a l

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77

Offloaded execution - Directive serial

$ACC s e r i a ldo g=1 generat ions

do r =1 rowsdo c=1 co ls

35 oworld ( r c ) = wor ld ( r c )enddo

enddodo r =1 rows

do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

45 else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

enddo50 c e l l s = 0

do r =1 rowsdo c=1 co ls

c e l l s = c e l l s + wor ld ( r c )enddo

55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end s e r i a l

exemplesgol_serialf90

Testbull Size 1000x1000bull Generations 100bull Elapsed time 123640 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that theregion contains instructions to be offloaded onthe deviceEach loop nest will be treated as an independantkernel with its own parameters (number ofgangs workers and vector size)

Data Default behaviorbull Data arrays present inside the kernels

region and not specified inside a dataclause (present copyin copyout etc) orinside a declare are assigned to a copyclause They are SHARED

bull Scalar variables are implicitly assigned to acopy clause They are SHARED

Fortran$ACC kerne ls

2do i =1 ndo j =1 n enddo

7 do j =1 n enddo

enddo$ACC end kerne ls

12

Important notesbull The parameters of the parallel regions are

independent (gangs workers vector size)bull Loop nests are executed consecutively

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77

Execution offloading - Kernels

$ACC kerne lsdo g=1 generat ions

do r =1 rowsdo c=1 co ls

35 oworld ( r c ) = wor ld ( r c )enddo

enddodo r =1 rows

do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

45 else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

enddo50 c e l l s = 0

do r =1 rowsdo c=1 co ls

c e l l s = c e l l s + wor ld ( r c )enddo

55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end kerne ls

exemplesgol_kernelsf90

Testbull Size 1000x1000bull Generations 100bull Execution time 1951 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region onthe device and generates one or more gangs Allgangs execute redundantly the instructions metin the region (gang-redundant mode) Onlyparallel loop nests with a loop directive mighthave their iterations spread among gangs

Data default behaviorbull Data arrays present inside the region and

not specified in a data clause (presentcopyin copyout etc) or a declare directiveare assigned to a copy clause They areSHARED

bull Scalar variables are implicitely assigned to afirstprivate clause They are PRIVATE

$ACC p a r a l l e l2a = 2 Gangminusredundant

$ACC loop Work shar ingdo i = 1 n enddo

7 $ACC end p a r a l l e l

Important notesbull The number of gangs workers and the vector size are constant inside the region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC&         vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared among those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared among those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing        Activation of
              (loop iterations)   new threads
loop gang     X
loop worker   X                   X
loop vector   X                   X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.
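The clauses above can be combined on a single loop directive. A minimal C sketch (not from the course; the function name is illustrative, and the pragma is ignored by a non-OpenACC compiler, so the code also runs sequentially with the same result):

```c
/* collapse(2) merges the two loops into one iteration space, tmp is
   privatized per thread, and total is combined with a + reduction. */
long weighted_sum(int rows, int cols)
{
    long total = 0;
    int tmp;
    #pragma acc parallel loop collapse(2) private(tmp) reduction(+:total)
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            tmp = r * cols + c;   /* private scratch value per thread */
            total += tmp;         /* accumulated across all iterations */
        }
    }
    return total;
}
```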

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of each gang
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at the gang level, especially at the end of a loop directive, so there is a risk of race conditions. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait for the end of the iterations they execute before starting the portion of code located after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in the table below is executed to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
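A minimal sum-reduction sketch in C (not from the course; the function name is illustrative, and the pragma is ignored by a non-OpenACC compiler, so the code also runs sequentially with the same result):

```c
/* Each gang/worker/vector thread gets a private copy of 'cells'
   initialized for the + operator; the copies are combined at the end of
   the region. */
int count_alive(const int *world, int n)
{
    int cells = 0;
    #pragma acc parallel loop reduction(+:cells)
    for (int i = 0; i < n; ++i)
        cells += world[i];
    return cells;
}
```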

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs. example without coalescing]
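The same rule can be sketched in C, keeping in mind that C is row-major (the opposite of Fortran's column-major layout), so the contiguous index is the last one. This sketch is not from the course; the name is illustrative, and the pragmas are ignored by a non-OpenACC compiler, so it also runs sequentially:

```c
#define NX 64

/* The loop with vector parallelism runs over j, the contiguous (last)
   index in C: threads i and i+1 of a worker then touch adjacent memory
   locations, so their accesses can be merged (coalesced). */
void init_coalesced(double a[NX][NX])
{
    #pragma acc loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector   /* j is contiguous in memory */
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```

Swapping the pragmas (vector on i, gang on j) would give each vector thread a stride of NX doubles between accesses, which prevents coalescing.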

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mappings of the array A for the three loop orderings. Without coalescing, gang(i)/vector(j): the vector threads T0-T7 of a worker access elements of the same row A(i,:), which are nx apart in memory, so their accesses cannot be merged. With coalescing, gang(j)/vector(i): the vector threads T0-T7 access consecutive elements of a column A(:,j), so their accesses are merged. gang-vector(i)/seq(j): each thread sweeps its row sequentially, and at every step of the j loop the threads access consecutive elements A(i:i+7,j), so the accesses are also merged.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
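A minimal C sketch combining copyin, create and copyout (not from the course; names are illustrative, and without an OpenACC compiler the pragmas are ignored and the arrays are simply shared, so the result is the same):

```c
#define N 100

/* b is copied H->D at region entry, c is only allocated on the device
   (its host values are never transferred), and a is copied D->H at
   region exit. */
void scale(double *a, const double *b)
{
    double c[N];   /* device-only scratch thanks to create */
    #pragma acc data copyin(b[0:N]) create(c[0:N]) copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            c[i] = 2.0 * b[i];     /* written on the device only */
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = c[i] + 1.0;     /* a is copied back at region exit */
    }
}
```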

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region is enclosed in its own implicit data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version
[Figure: two successive parallel regions, each with its own implicit data region; A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of each region.]
• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Figure: the same two parallel regions, each still moving A (copyin), B (copyout), C (copy) and D (create) separately.]
Let's add a data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Figure: an enclosing data region now holds both parallel regions; A (copyin), B (copyout) and C (copy) are handled at the entry and exit of the data region, and D (create) is allocated once.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Figure: inside the data region, the parallel regions now use present for A, B, C and D; an update device refreshes A and an update self retrieves C during the region.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC&              copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC&              copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
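A minimal C sketch of update device (not from the course; names are illustrative, and without an OpenACC compiler the pragmas are ignored and host and device share the data, so the result is identical):

```c
#define N 10

/* The host changes 'scale' between two kernels; update device pushes the
   new value H->D before the second kernel uses it. */
void run(double *a)
{
    double scale = 2.0;
    #pragma acc data copyout(a[0:N]) copyin(scale)
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = scale * i;
        scale = 10.0;                      /* modified on the host */
        #pragma acc update device(scale)   /* push the new value H->D */
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] += scale;                 /* uses the updated device copy */
    }
}
```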

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device; kernels then modify the device copies; update self(b) copies b D→H during the region; at exit of the data region, d is copied D→H while the host copies of a, b and c keep their current values.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, kernel 1, the data transfer and kernel 2 run one after the other; with kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90
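The same idea can be sketched in C (not from the course; the name is illustrative, and without an OpenACC compiler the pragmas are ignored and everything runs in order, so the result is unchanged):

```c
#define N 1000

/* Two independent kernels are launched on execution threads (queues)
   1 and 2; wait blocks until both complete before the host reads the
   results. */
void fill_async(int *a, int *c)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < N; ++i)
        a[i] = 42;
    #pragma acc parallel loop async(2)
    for (int i = 0; i < N; ++i)
        c[i] = 11;
    #pragma acc wait(1, 2)   /* join both execution threads */
}
```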

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the data transfer to complete before starting.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of a code that has a race condition:

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Translation table: OpenACC 2.7 / OpenMP 5.0 (continued)

Feature: data regions
• OpenACC declare (adds data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Feature: loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)

Translation table: OpenACC 2.7 / OpenMP 5.0 (continued)

Feature: routines
• OpenACC routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism

Feature: asynchronism
• OpenACC async()/wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ OpenMP nowait/depend: manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
      • Serial
      • Kernels/Teams
      • Parallel
    • Work sharing
      • Parallel loop
      • Reductions
      • Coalescing
    • Routines
    • Data Management
    • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts
Page 5: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2×20 = 40 cores
• Accelerated with 4 Nvidia V100: 32×80×4 = 10240 cores


Introduction - The ways to GPU

Porting an application requires increasing programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness; still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable; optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100), showing the host (CPU), the devices (GPU) and the network, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s and 12.4 GB/s]

Introduction - Architecture CPU-GPU


Introduction - Execution model OpenACC

[Figure: timeline showing the CPU successively launching kernel1, kernel2 and kernel3 on the accelerator]

In OpenACC the execution is controlled by the host (CPU): kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (e.g. !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)

Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2³¹ − 1
• The thread-grid size (i.e. nbworkers × vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads along the execution of !$ACC kernels num_gangs(2) num_workers(5) vector_length(8): 2 threads while the gangs execute redundantly, 10 inside !$ACC loop worker, 80 inside !$ACC loop vector and inside !$ACC loop worker vector, back to 2 between the parallel loops]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: porting cycle: pick a kernel → analyze → parallelize loops → optimize data → optimize loops]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1, 10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum = 0
   !$ACC parallel loop
   do i=1, 10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1, 10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler:
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vectorlength × nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler NVVP


1 Introduction

2 ParallelismOffloaded Execution

Add directivesCompute constructsSerialKernelsTeamsParallel

Work sharingParallel loopReductionsCoalescing

RoutinesRoutines

Data ManagementAsynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>
...

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000×1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000×1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2        ! gang-redundant
!$ACC loop   ! work sharing
do i=1, n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000×1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility

Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial:

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS:

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vectorlength
• Number of operations: nx

Execution GP/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vectorlength
• Number of operations: nx

Execution GR/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vectorlength
• Number of operations: 10 × nx

Execution GR/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × workers × vectorlength
• Number of operations: 10 × nx

Work sharing - Loops

Reminder There is no thread synchronization at gang level especially at the end of a loopdirective There is a risk of race condition Such risk does not exist for loop with worker andorvector parallelism since the threads of a gang wait until the end of the iterations they execute tostart a new portion of the code after the loopInside a parallel region if several parallel loops are present and there are some dependenciesbetween them then you must not use gang parallelism

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry — 41/77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both original constructs.

For example with kernels directive

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in table 1 is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C/C++
*           Product       Fortran, C/C++
max         Maximum       Fortran, C/C++
min         Minimum       Fortran, C/C++
&           Bitwise and   C/C++
|           Bitwise or    C/C++
&&          Logical and   C/C++
||          Logical or    C/C++
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)               +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»


Parallelism - Memory accesses and coalescing

Without coalescing:

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

With coalescing:

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)         43.9                1.6

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0..T7 to the elements of A for the three variants]
• Without coalescing: gang(i), vector(j)
• With coalescing: gang(j), vector(i)
• With coalescing: gang-vector(i), seq(j)



Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
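As a minimal sketch of how these clauses combine on one data region (the names a, b, tmp and n are illustrative, not from the course examples):

```fortran
! a is only read, b is only written, tmp never leaves the device
!$ACC data copyin(a(1:n)) copyout(b(1:n)) create(tmp(1:n))
!$ACC parallel loop
do i = 1, n
   tmp(i) = 2.0_8 * a(i)    ! uses the device copy of a (copied H->D at entry)
   b(i)   = tmp(i) + 1.0_8  ! b is copied D->H when leaving the region
end do
!$ACC end data
```

With this split, only a is transferred at entry and only b at exit; tmp is allocated on the device and never transferred.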


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1).
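A small sketch of the omitted-first-index shorthand (the array a and its bounds are illustrative): the two directives below request the same 100 elements.

```fortran
!$ACC data copyin(a(1:100))   ! explicit bounds
!$ACC end data

!$ACC data copyin(a(:100))    ! first index omitted: defaults to 1 in Fortran
!$ACC end data
```

In C/C++ the equivalent shorthand a[:100] starts at index 0.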


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

(The compute construct opens a parallel region and, implicitly, an associated data region.)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!



Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.


Data on device

Naive version: two successive parallel regions (each with its own implicit data region) handle four variables with the clauses copyin(A), copyout(B), copy(C) and create(D), repeated for each region.
• Transfers: 8
• Allocations: 2


Data on device

Optimizing data transfers: the same clauses (copyin(A), copyout(B), copy(C), create(D)) are still attached to each parallel region. Let's add an enclosing data region.


Data on device

With an enclosing data region carrying copyin(A), copyout(B), copy(C) and create(D), A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1


Data on device

Inside the data region, the parallel regions now use the present clause for A, B, C and D, with update device and update self directives where refreshes are needed. The clauses check for data presence anyway, so using present makes the code clearer; to make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update:

The a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update:

The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
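A minimal sketch of this clause (the array b and the size n are illustrative): after modifying b on the host inside a data region, push the new values to the device with update device before the next kernel reads them.

```fortran
!$ACC data copyin(b(1:n))
   b(1:n) = b(1:n) + 1        ! host-side modification: the device copy is now stale
!$ACC update device(b(1:n))   ! refresh the device copy H->D
!$ACC parallel loop
   do i = 1, n
      b(i) = 2 * b(i)         ! operates on the updated values
   end do
!$ACC end data
```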


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: host/device time line for a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b is allocated on the device; the device then computes new values; update self(b) copies b D→H; at exit of the region, d is copied D→H while the host copies of a, b and c keep their last values.]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: time lines of kernel 1, a data transfer and kernel 2 for three cases — fully synchronous; kernel 1 and the transfer asynchronous; everything asynchronous]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the data transfer to complete before starting]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90



Debugging - PGI autocompare

• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:



Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)               +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ! ... (listing truncated in the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)               +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)               +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.



Feature: GPU offloading

OpenACC 2.7                                              OpenMP 5.0
PARALLEL — GR-WS-VS mode; each gang executes             TEAMS — creates teams of threads that
instructions redundantly                                 execute redundantly
SERIAL — only 1 thread executes the code                 TARGET — GPU offloading, initial thread only
KERNELS — offloading + automatic parallelization         (does not exist in OpenMP)

Feature: implicit transfer H→D or D→H

OpenACC 2.7: scalar variables, non-scalar variables
OpenMP 5.0: scalar variables, non-scalar variables

Feature: explicit transfer H→D or D→H inside explicit parallel regions
(OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective)

OpenACC 2.7                                              OpenMP 5.0
copyin() — copy H→D when entering the construct          map(to:)
copyout() — copy D→H when leaving the construct          map(from:)
copy() — copy H→D when entering and D→H when leaving     map(tofrom:)
create() — created on device, no transfers               map(alloc:)
present(), delete()


Feature: data region

OpenACC 2.7                                              OpenMP 5.0
declare() — add data to the global data region
data / end data — inside the same programming unit       TARGET DATA MAP(...) / END TARGET DATA —
                                                         inside the same programming unit
enter data / exit data — opening and closing             TARGET ENTER DATA / TARGET EXIT DATA —
anywhere in the code                                     opening and closing anywhere in the code
update self() — update D→H                               UPDATE FROM — copy D→H inside a data region
update device() — update H→D                             UPDATE TO — copy H→D inside a data region

Feature: loop parallelization

OpenACC 2.7                                              OpenMP 5.0
kernels — offloading + automatic parallelization of      (does not exist in OpenMP)
suitable loops, sync at the end of loops
loop gang/worker/vector — share iterations among         distribute — share iterations among TEAMS
gangs, workers and vectors


Feature: routines

OpenACC 2.7                                              OpenMP 5.0
routine [seq|vector|worker|gang] — compile a device
routine with the specified level of parallelism

Feature: asynchronism

OpenACC 2.7                                              OpenMP 5.0
async()/wait — execute kernels/transfers                 nowait/depend — manages asynchronism and
asynchronously, with explicit sync                       task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels
    • Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts
Page 6: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism


Introduction - Comparison of CPUGPU cores

The number of compute cores on machines with GPUs is much greater than on a classicalmachineIDRISrsquo Jean-Zay has 2 partitions

bull Non-accelerated 2times20 = 40 coresbull Accelerated with 4 Nvidia V100 = 32times80times4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77

Introduction - The ways to GPU

Several approaches are available to port applications, with increasing programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA):
  − Minimum change in the code
  − Maximum performance
• Directives (OpenACC, OpenMP 5.0):
  − Portable, simple, low intrusiveness
  − Still efficient
• Programming languages (CUDA, OpenCL):
  − Complete rewriting, complex
  − Non-portable
  − Optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI support announced

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100), showing the host (CPU), the devices (GPU) and the network, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s.]

Introduction - Architecture CPU-GPU

[Figure: comparison of CPU and GPU architectures.]

Introduction - Execution model OpenACC

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocation are managed by the host.

[Figure: timeline of the host (CPU) launching kernel1, kernel2 and kernel3 on the accelerator.]

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs × nbworkers × vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2³¹ − 1
• The thread-grid size (i.e. nbworkers × vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads in a region opened with num_gangs(2), num_workers(5), vector_length(8). In gang-redundant mode only the 2 gang leaders are active; an !$ACC loop worker activates the workers of each gang (2×5 = 10 active threads); an !$ACC loop worker vector (or a loop vector nested in a loop worker) activates the full grid (2×5×8 = 80 active threads).]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! It displays information about compilation; the compiler will do implicit operations that you want to be aware of.

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

(exemples/reduction.f90)

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vectorlength × nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   4.0968ms  100    40.968us  40.122us  41.725us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Screenshot: the NVVP graphical interface.]

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>
{ ... }

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test:
• Size: 1000×1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test:
• Size: 1000×1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1,n
  ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test:
• Size: 1000×1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

Without work sharing, the sequential code is executed redundantly by all gangs. The compiler option noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang!"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                               X
loop vector                X                               X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)
• Active threads: 1
• Number of operations: nx

Execution GRWSVS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vectorlength
• Number of operations: nx

Execution GPWPVP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vectorlength
• Number of operations: nx

Execution GRWPVS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vectorlength
• Number of operations: 10 × nx

Execution GRWPVP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vectorlength
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)
• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)
• Result: somme = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous access to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: memory access patterns with and without coalescing.]
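The examples that follow are in Fortran, where arrays are column-major, so the vector loop should run over the first index. In C the layout is row-major, so the preferred loop order flips; a hedged sketch (the function name and size are illustrative; with a non-OpenACC compiler the pragmas are ignored and the loops run sequentially):

```c
#define NX 64

/* Coalesced in C: the vector loop runs over j, the last index, so
   consecutive threads touch m[i][j], m[i][j+1], ... which are
   contiguous in row-major memory */
void fill_coalesced(double m[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; i++) {
        #pragma acc loop vector
        for (int j = 0; j < NX; j++)
            m[i][j] = 1.14;
    }
}
```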

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three versions. Without coalescing — gang(i)/vector(j): the threads T0..T7 of a worker access elements that are far apart in memory, so the accesses cannot be merged. With coalescing — gang(j)/vector(i): threads T0..T7 access consecutive elements. gang-vector(i)/seq(j): each thread walks sequentially over the second index while consecutive threads stay on consecutive elements of the first index.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. It gives the information to the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine_wrong.f90)

Parallelism - Routines

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine.f90)

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: The variable is copied H→D; the memory is allocated when entering the region
• copyout: The variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: The memory is allocated when entering the region
• present: The variable is already on the device
• delete: Frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version

[Figure: two successive parallel regions, each carrying its own implicit data region. In each region A is copyin, B is copyout, C is copy and D is create, so every transfer/allocation happens twice.]

• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers

[Figure: the same two parallel regions; A (copyin), B (copyout), C (copy) and D (create) are still handled once per region.]

Let's add a data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers

[Figure: a data region now encloses the two parallel regions; A (copyin), B (copyout), C (copy) and D (create) are handled at the boundaries of the data region.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers

[Figure: inside the enclosing data region the parallel regions now see A, B, C and D as present; update device and update self directives are inserted where the host and device copies must be synchronized.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where they are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Scope
Module: the program
Function: the function
Subroutine: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of host and device copies across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are only allocated on the device; the kernels then update the device copies; update self(b) copies b D→H in the middle of the region; at the exit of the data region only d is copied D→H, so the other host variables keep their last host-side values.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is created. The kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlaps between:
• Computation and data transfers
• kernel/kernel if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. Kernel 1 and transfer async: Kernel 1 and the data transfer overlap. Everything is async: Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 runs on execution thread 1 while the data transfer runs on execution thread 2; Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60,autocompare)

• Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) ↔ OpenMP TEAMS (create teams of threads; execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses of PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• OpenACC declare() (add data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops; sync at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously; explicit sync) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 7: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2x20 = 40 cores
• Accelerated with 4 Nvidia V100: 32x80x4 = 10240 cores

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 5 77


Introduction - The ways to GPU

[Figure: three ways to port applications to GPUs, ordered by increasing programming effort and technical expertise]

Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA)
• Minimum change in the code
• Maximum performance

Directives (OpenACC, OpenMP 5.0)
• Portable, simple, low intrusiveness
• Still efficient

Programming languages (CUDA, OpenCL)
• Complete rewriting, complex
• Non-portable
• Optimal performance

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 6 77

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 7 77

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100). The host (CPU) is linked to the devices (GPU) through PLX switches and to the network; the indicated bandwidths are 900 GB/s (GPU memory), 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s and 12.4 GB/s.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 8 77

Introduction - Architecture CPU-GPU

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 9 77

Introduction - Execution model OpenACC

Execution controlled by the host

[Figure: the host (CPU) successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocations are managed by the host.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 10 77

Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): All gangs run the same instructions redundantly
• Gang-Partitioned (GP): Work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): One worker is active in GP or GR mode
• Worker-Partitioned (WP): Work is shared between the workers of a gang
• Vector-Single (VS): One vector channel is active
• Vector-Partitioned (VP): Several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 11 77

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77

Introduction - Execution model

[Figure: number of active threads inside a region launched with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). When !$ACC parallel starts the gangs, 2 threads are active; !$ACC loop worker raises the count to 10; !$ACC loop vector to 80; !$ACC loop worker vector activates 80 at once; between the loops the count falls back to 2.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: porting cycle - Pick a kernel, Analyze, Parallelize loops, Optimize data, Optimize loops]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: Activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: The memory location on the host is pinned. It might improve data transfers
• managed: The memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vector length x nb workers]).

Pierre-François Lavallée, Thibaut Véry 21/77

Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
      Serial
      Kernels
      Teams
      Parallel
    Work sharing
      Parallel loop
      Reductions
      Coalescing
    Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):

!$ACC directive <clauses,...>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses,...>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation, the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop).
It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                              X
loop vector                 X                              X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility
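As a sketch of how these clauses combine (the array a, bounds n and m, and the scalars tmp and total are assumptions for illustration, not taken from the course examples), a tightly nested loop pair can be collapsed into one iteration space while privatizing a temporary and reducing a sum:

```fortran
! Sketch, assuming a(n,m), tmp and total are declared:
! collapse(2) merges the i and j loops, private(tmp) gives each
! iteration its own copy, reduction(+:total) combines partial sums.
!$ACC parallel loop collapse(2) private(tmp) reduction(+:total)
do i=1,n
  do j=1,m
    tmp = a(i,j)*2
    total = total + tmp
  enddo
enddo
```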

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: There is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Operation  Effect       Language(s)
+          Sum          Fortran/C(++)
*          Product      Fortran/C(++)
max        Maximum      Fortran/C(++)
min        Minimum      Fortran/C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Reduction operators

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)  439                  16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three variants. Without coalescing: gang(i), vector(j). With coalescing: gang(j), vector(i) and gang-vector(i), seq(j).]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading:
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime:
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
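A minimal sketch combining these clauses (the arrays a, b, c and the bound n are assumptions for illustration, not from the course examples):

```fortran
! Sketch, assuming a, b, c are arrays of size n already set up on the host:
! a is only read on the device (copyin), b is only produced there (copyout),
! c is a device-only temporary (create, never transferred).
!$ACC data copyin(a(1:n)) copyout(b(1:n)) create(c(1:n))
!$ACC parallel loop present(a, b, c)
do i=1,n
  c(i) = a(i)*2    ! c lives only on the device
  b(i) = c(i)+1    ! b is copied D->H when the data region ends
enddo
!$ACC end data
```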

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran:
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++:
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region carries its own implicit data region.]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached: 100 times
• Data region reached: 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached: 100 times
• Data region reached: 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Optimal? No! You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two parallel regions, each with its own implicit data region. On each region, A is in copyin, B in copyout, C in copy and D in create.

• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the two parallel regions (with their implicit data regions) still handle A (copyin), B (copyout), C (copy) and D (create) separately. Let's add a data region.

Data on device

Optimize data transfers: with an enclosing data region, A, B and C are now transferred at entry and exit of the data region, and D is allocated once.

• Transfers: 4
• Allocation: 1

Data on device

Optimize data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocation: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0.
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

(Diagram: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are allocated on the device without transfer. Kernels then update the device copies while the host values stay unchanged; update self(b) copies the device value of b back to the host during the region; d is only copied back D→H by copyout(d) when the data region ends.)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; it is possible to specify a number inside the clause to create several execution threads.

(Diagram: without async, Kernel 1, the data transfer and Kernel 2 run one after the other; with async on Kernel 1 and the transfer, they overlap; with async everywhere, all three overlap.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1.
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of the wait directive or clause is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Diagram: Kernel 2 waits for the completion of the transfer before starting, while Kernel 1 runs asynchronously.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1.
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Introduction - Comparison of CPU/GPU cores

The number of compute cores on machines with GPUs is much greater than on a classical machine. IDRIS' Jean Zay has 2 partitions:

• Non-accelerated: 2×20 = 40 cores
• Accelerated, with 4 NVidia V100: 32×80×4 = 10240 cores

Introduction - The ways to GPU

Applications can use a GPU in three ways, with increasing programming effort and technical expertise:

• Libraries (cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA): minimum change in the code, maximum performance
• Directives (OpenACC, OpenMP 5.0): portable, simple, low intrusiveness, still efficient
• Programming languages (CUDA, OpenCL): complete rewriting, complex, non-portable, optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org/
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  - PGI
  - Cray (for Cray hardware)
  - GCC (since 5.7)

OpenMP target
• http://www.openmp.org/
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  - Cray (for Cray hardware)
  - GCC (since 7)
  - CLANG
  - IBM XL
  - PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

(Diagram: one converged node on Jean Zay, 40 cores, 192 GB of memory and 4 GPU V100, showing the host (CPU), the devices (GPU) and the network, with the bandwidths of the different links: from 16 GB/s for PCIe up to 900 GB/s for the GPU memory.)

Introduction - Architecture CPU-GPU


Introduction - Execution model: OpenACC

(Diagram: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.)

In OpenACC the execution is controlled by the host (CPU): kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model: OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker × vector). One worker is actually a vector of vector_length threads. So the total number of threads is:

nb_threads = nb_gangs * nb_workers * vector_length

Important notes:
• No synchronization is possible between gangs.
• The compiler can decide to synchronize the threads of a gang (all or part of them).
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).

Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 - 1.
• The thread-grid size (i.e. nb_workers × vector_length) is limited to 1024.
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
• The size of a worker (vector_length) should be a multiple of 32.
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.

Introduction - Execution model

(Diagram: number of active threads inside regions created with num_gangs(2), num_workers(5), vector_length(8): 2 threads while only the gangs are active (gang-redundant mode, WS/VS); 10 when a loop worker shares the work among the workers; 80 when a loop vector or a loop worker vector activates all the vector channels; and back to 2 after each loop.)

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

(Diagram: cycle: Pick a kernel → Analyze → Parallelize loops → Optimize data → Optimize loops.)

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability: each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
   integer :: a(10000)
   integer :: i
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
!$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector_length × nb_workers]).

Analysis - Code profiling: nvprof

NVidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions. Caution: PGI_ACC_TIME and nvprof are incompatible; ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type            Time(%)   Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s    500  5.4132ms     448ns  68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s    300  6.1914ms     384ns  92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

NVidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions. Caution: PGI_ACC_TIME and nvprof are incompatible; ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
    Serial
    Kernels/Teams
    Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs with one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: in the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, the number of workers and the vector size are constant inside the region.

Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause        Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.

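As a hypothetical C sketch of the collapse and private clauses (the function name and sizes are invented; with a non-OpenACC compiler the pragmas are ignored and the loops run sequentially):

```c
/* collapse(2) fuses the loop nest into one iteration space before
   distributing it; 'tmp' is privatized so each iteration works on
   its own copy instead of a shared variable. */
void scale_matrix(int rows, int cols, double *m, double factor)
{
    double tmp;
    #pragma acc parallel loop collapse(2) private(tmp) copy(m[0:rows*cols])
    for (int r = 0; r < rows; ++r) {
        for (int c = 0; c < cols; ++c) {
            tmp = m[r * cols + c] * factor;
            m[r * cols + c] = tmp;
        }
    }
}
```

Collapsing is mostly useful when the outer loop alone does not expose enough iterations to fill the gangs.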

Work sharing - Loops

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

Execution serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

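A minimal C sketch of a sum reduction (hypothetical function name; with a compiler that ignores the pragma, the loop simply runs sequentially and produces the same result):

```c
/* Each thread accumulates into a private copy of 'total'; the partial
   sums are combined with '+' when the loop finishes. */
double acc_sum(int n, const double *v)
{
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total) copyin(v[0:n])
    for (int i = 0; i < n; ++i)
        total += v[i];
    return total;
}
```

This is the portable replacement for the race-prone gang-level accumulation shown on the previous slide.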

Parallelism - Reductions example

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs. example without coalescing]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
!$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
!$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
!$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Tps (ms)   439                  16

The memory-coalescing version is 27 times faster.

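C stores arrays row-major, the opposite of Fortran's column-major layout, so the roles of the indices flip: for coalesced accesses the vector loop should run over the last index. A hypothetical sketch (function name invented; without an OpenACC compiler the pragmas are ignored and the loops run sequentially):

```c
#define N 512

/* Row-major C array: consecutive j elements are contiguous in memory,
   so putting vector parallelism on j gives coalesced accesses
   (a[i][j], a[i][j+1], ... touched by consecutive threads). */
void init_coalesced(double a[N][N])
{
    #pragma acc parallel loop gang copyout(a[0:N][0:N])
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}
```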

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three variants: without coalescing, gang(i)/vector(j); with coalescing, gang(j)/vector(i); gang-vector(i)/seq(j)]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. It informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. It informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

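The same pattern in C, as a hypothetical sketch (function names invented; with a non-OpenACC compiler the pragmas are ignored and the code runs sequentially):

```c
/* routine seq asks the compiler to also generate a device version of
   'saxpy_elem'; seq states that it contains no further parallelism. */
#pragma acc routine seq
static double saxpy_elem(double alpha, double x, double y)
{
    return alpha * x + y;
}

void saxpy(int n, double alpha, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = saxpy_elem(alpha, x[i], y[i]);
}
```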

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

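A small hypothetical C illustration of the movement clauses (names invented; the pragma is ignored by a non-OpenACC compiler, so the loop runs sequentially with the same result):

```c
/* b is only read on the device -> copyin (H->D, no copy back);
   a is only produced on the device -> copyout (allocated on entry,
   copied D->H on exit). A pure scratch array would use create. */
void add_offset(int n, double *a, const double *b, double offset)
{
    #pragma acc parallel loop copyout(a[0:n]) copyin(b[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + offset;
}
```

Choosing the weakest clause that is correct (copyin instead of copy, create instead of copyin) removes useless transfers.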

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language default (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its implicit data region]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1, 100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
!$ACC parallel
!$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.



Data on device

Naive version

[Figure: two successive parallel regions, each with its implicit data region; the arrays A (copyin), B (copyout), C (copy) and D (create) are handled at each region]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers

[Figure: the same two parallel regions; A (copyin), B (copyout), C (copy) and D (create) are still handled at each region]

Let's add a data region.


Data on device

Optimize data transfers

[Figure: a data region now encloses both parallel regions; A (copyin), B (copyout), C (copy) and D (create) are handled once at its boundaries]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers

[Figure: inside the data region the parallel regions see A, B, C and D as present; update device (for A) and update self (for C) refresh the copies where needed]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1


Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.


Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.


Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

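The constructor/destructor style mentioned above can be sketched in C as well (hypothetical helper names; with a non-OpenACC compiler the pragmas are ignored and the code runs sequentially on the host copy):

```c
#include <stdlib.h>

/* Unstructured data lifetime: the device allocation is decoupled from
   the compute region, as a C++ constructor/destructor pair would do. */
double *vec_create(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])   /* allocate on the device */
    return v;
}

void vec_fill(double *v, int n, double value)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = value;
    #pragma acc update self(v[0:n])         /* bring the values back to the host */
}

void vec_destroy(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])    /* free the device copy */
    free(v);
}
```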

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: host/device time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b is allocated on the device; during the region, update self(b) copies b D→H; at exit of the data region, copyout(d) copies d D→H.]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines for kernel 1, the data transfer and kernel 2: fully synchronous execution; kernel 1 and the transfer asynchronous; everything asynchronous]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90

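A hypothetical C sketch combining async queues with the wait clause described on the next slide (function name invented; without an OpenACC compiler everything simply runs in order and yields the same values):

```c
/* Two independent kernels on queues 1 and 2, a dependent kernel that
   waits on queue 2, and a final wait before copying results back. */
void async_demo(int n, double *a, double *b)
{
    #pragma acc enter data create(a[0:n]) copyin(b[0:n])

    #pragma acc parallel loop async(1)         /* queue 1: independent */
    for (int i = 0; i < n; ++i)
        a[i] = 42.0;

    #pragma acc parallel loop async(2)         /* queue 2: independent of queue 1 */
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 1.0;

    #pragma acc parallel loop wait(2) async(2) /* needs the updated b */
    for (int i = 0; i < n; ++i)
        b[i] = 2.0 * b[i];

    #pragma acc wait                           /* block until all queues finish */
    #pragma acc exit data copyout(a[0:n], b[0:n])
}
```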

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution-thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait directive or wait clause is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: time line where kernel 2 waits for the data transfer to complete before starting]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld,world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 (directives/clauses, notes) vs OpenMP 5.0 (directives/clauses, notes)

GPU offloading:
• PARALLEL — GR/WS/VS mode: each team executes instructions redundantly | TEAMS — create teams of threads, execute redundantly
• SERIAL — only 1 thread executes the code | TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization | X — does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses of PARALLEL/SERIAL/KERNELS, data movement defined from the host perspective | OpenMP: clauses of TARGET):
• copyin() — copy H→D when entering construct | map(to:)
• copyout() — copy D→H when leaving construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()
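As an illustration of the copy-clause correspondence above, a minimal sketch (array names and sizes are hypothetical, not from the original slides):

```fortran
! Hypothetical sketch: the same explicit H<->D transfers written with
! OpenACC 2.7 data clauses and their OpenMP 5.0 map equivalents.
real(8) :: a(1000), b(1000)

! OpenACC: a is copied H->D on entry, b is copied D->H on exit
!$ACC data copyin(a) copyout(b)
! ... offloaded kernels reading a and writing b ...
!$ACC end data

! OpenMP: the same data movement expressed with map clauses
!$OMP target data map(to:a) map(from:b)
! ... offloaded kernels reading a and writing b ...
!$OMP end target data
```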

Features: OpenACC 2.7 (directives/clauses, notes) vs OpenMP 5.0 (directives/clauses, notes)

Data region:
• declare() — add data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM — copy D→H inside a data region
• update device() — update H→D | UPDATE TO — copy H→D inside a data region

Loop parallelization:
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | X — does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS

Features: OpenACC 2.7 (directives/clauses, notes) vs OpenMP 5.0 (directives/clauses, notes)

• routines — routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism — async()/wait — execute kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=114_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=114_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=114_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1D_OMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=114_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2D_OMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=114_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=114_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3D_OMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 9: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - The ways to GPU

Applications

(Programming effort and technical expertise increase from libraries to programming languages.)

• Libraries: cuBLAS, cuSPARSE, cuRAND, AmgX, MAGMA
  − Minimum change in the code
  − Maximum performance
• Directives: OpenACC, OpenMP 5.0
  − Portable, simple, low intrusiveness
  − Still efficient
• Programming languages: CUDA, OpenCL
  − Complete rewriting, complex
  − Non-portable
  − Optimal performance

Introduction - Short history

OpenACC
• http://www.openacc.org
• Cray, NVidia, PGI, CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI (support announced)

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100) — host (CPU), devices (GPU) and network; the bandwidths shown on the links are 900 GB/s, 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s and 12.4 GB/s]

Introduction - Architecture CPU-GPU


Introduction - Execution model: OpenACC

[Figure: execution controlled by the host — the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator]

In OpenACC the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model: OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
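The formula above can be made concrete with the values used later in this course (num_gangs(2), num_workers(5), vector_length(8)); this is only an illustrative sketch:

```fortran
! nbthreads = nbgangs * nbworkers * vectorlength
!           = 2       * 5         * 8            = 80
!$ACC parallel num_gangs(2) num_workers(5) vector_length(8)
! ... at full worker+vector parallelism, 80 threads are active here;
! in gang-redundant parts only the 2 gang leaders execute ...
!$ACC end parallel
```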

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 − 1
• The thread-grid size (ie nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads for a region created with num_gangs(2), num_workers(5), vector_length(8) — 2 active threads in gang-redundant parts (!$ACC parallel), 10 when worker parallelism is active (loop worker), 80 when worker and vector parallelism are both active (loop worker vector, or nested loop worker / loop vector)]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: iterative cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability:
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler:
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
      Serial
      Kernels/Teams
      Parallel
    Work sharing
      Parallel loop
      Reductions
      Coalescing
    Routines
    Data Management
    Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs with one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU.
Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device.
Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
  a = 2           ! Gang-redundant
  !$ACC loop      ! Work sharing
  do i=1,n
    ...
  enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *,"Hello! I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (ie the iterations of the associated loop).
It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do (kept for compatibility)

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *,"Cells alive at generation ",g,cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x #workers
• Number of operations: nx

Execution GP/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vectorlength
• Number of operations: nx

Execution GP/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x #workers x vectorlength
• Number of operations: nx

Execution GR/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x #workers
• Number of operations: 10 x nx

Execution GR/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vectorlength
• Number of operations: 10 x nx

Execution GR/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 x #workers x vectorlength
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels

  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation  Effect        Language(s)
+          Sum           Fortran/C(++)
*          Product       Fortran/C(++)
max        Maximum       Fortran/C(++)
min        Minimum       Fortran/C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged (coalesced). This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with coalescing / example without coalescing]

Pierre-François Lavallée, Thibaut Véry 45/77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46/77

Parallelism - Memory accesses and coalescing

[Figure: thread-to-iteration mapping for the three versions. Without coalescing: gang(i), vector(j). With coalescing: gang(j), vector(i), or gang-vector(i), seq(j).]

Pierre-François Lavallée, Thibaut Véry 47/77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on a device, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) have an associated local data region.

Global region
An implicit data region is opened for the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called, and lives for the duration of the procedure. To make data visible there, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77

Data on device - Data clauses

H: Host, D: Device

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already present on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check whether the variable is already on the device; if so, no action is taken. You may also see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50/77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses; you must give the first and the last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Annotations on the slide: the construct opens a data region enclosing the parallel region.]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device - Local data regions

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device

Naive version: two successive parallel regions, each with its own implicit data region. In each region, A is copyin, B is copyout, C is copy and D is create.

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: the same two parallel regions, each still carrying its own clauses (A copyin, B copyout, C copy, D create). Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: with an enclosing data region carrying the clauses (A copyin, B copyout, C copy, D create), A, B and C are now transferred at the entry and exit of the data region.

• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: the enclosing data region keeps the clauses (A copyin, B copyout, C copy, D create); inside it, the parallel regions now use present clauses, together with update device for A and update self for C where the copies must be refreshed.

The data clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo

! print *, "before update self", a(42)
! $ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Time-line figure: a data region is opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D while b and d are allocated on the device; during the region, update self(b) copies b D→H; at exit of the data region, d is copied D→H.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel constructs, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device:

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel execution, if the kernels are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagrams: fully synchronous execution (kernel 1, data transfer, kernel 2 one after the other); kernel 1 and the transfer asynchronous; everything asynchronous.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Diagram: kernel 2 waits for the data transfer to complete.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition (shown as a screenshot on the original slide).

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
...

exemples/optigol_opti1.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Feature: GPU offloading

OpenACC 2.7   Notes                                     OpenMP 5.0   Notes
PARALLEL      GR/WS/VS mode; each gang executes         TEAMS        creates teams of threads that
              the instructions redundantly                           execute redundantly
SERIAL        only 1 thread executes the code           TARGET       GPU offloading, initial thread only
KERNELS       offloading + automatic parallelization    (none)       does not exist in OpenMP

Feature: implicit transfers H→D or D→H

Both models: scalar variables and non-scalar variables.

Feature: explicit transfers H→D or D→H inside explicit parallel regions

OpenACC: clauses of the PARALLEL/SERIAL/KERNELS constructs. OpenMP: clauses of the TARGET construct (data movement defined from the host perspective).

OpenACC 2.7           Notes                                          OpenMP 5.0
copyin()              copy H→D when entering the construct           map(to:)
copyout()             copy D→H when leaving the construct            map(from:)
copy()                copy H→D when entering and D→H when leaving    map(tofrom:)
create()              allocated on the device, no transfers          map(alloc:)
present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Feature: data regions

OpenACC 2.7              Notes                                       OpenMP 5.0                              Notes
declare()                add data to the global data region
data / end data          inside the same programming unit            TARGET DATA MAP(...) /                  inside the same
                                                                     END TARGET DATA                         programming unit
enter data / exit data   opening and closing anywhere in the code    TARGET ENTER DATA /                     opening and closing
                                                                     TARGET EXIT DATA                        anywhere in the code
update self()            update D→H                                  UPDATE FROM                             copy D→H inside a data region
update device()          update H→D                                  UPDATE TO                               copy H→D inside a data region

Feature: loop parallelization

OpenACC 2.7               Notes                                                        OpenMP 5.0   Notes
kernels                   offloading + automatic parallelization of suitable loops;    (none)       does not exist in OpenMP
                          synchronization at the end of loops
loop gang/worker/vector   share iterations among gangs, workers and vectors            distribute   share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72/77

Feature: routines and asynchronism

OpenACC 2.7                        Notes                                                   OpenMP 5.0      Notes
routine [seq|vector|worker|gang]   compile a device routine with the specified level
                                   of parallelism
async()/wait                       execute kernels/transfers asynchronously,               nowait/depend   manages asynchronism and
                                   with explicit synchronization                                           task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP.

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

Page 10: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Short history

OpenACC
• http://www.openacc.org/
• Initiated by Cray, NVidia, PGI and CAPS
• First standard: 1.0 (11/2011)
• Latest standard: 3.0 (11/2019)
• Main compilers:
  − PGI
  − Cray (for Cray hardware)
  − GCC (since 5.7)

OpenMP target
• http://www.openmp.org/
• First standard: OpenMP 4.5 (11/2015)
• Latest standard: OpenMP 5.0 (11/2018)
• Main compilers:
  − Cray (for Cray hardware)
  − GCC (since 7)
  − CLANG
  − IBM XL
  − PGI (support announced)

Pierre-François Lavallée, Thibaut Véry 7/77

Introduction - Architecture of a converged node on Jean Zay

[Figure: architecture of a converged node on Jean Zay (40 cores, 192 GB of memory, 4 GPU V100). The diagram shows the host (CPU) and the devices (GPU) connected through PLX switches, plus the network ("Réseau"), with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 124 GB/s, 25 GB/s and 16 GB/s.]

Pierre-François Lavallée, Thibaut Véry 8/77

Introduction - Architecture CPU-GPU

[Figure: comparison of CPU and GPU architectures.]

Pierre-François Lavallée, Thibaut Véry 9/77

Introduction - Execution model: OpenACC

Execution controlled by the host

[Diagram: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). This means that kernel launches, data transfers and memory allocations are managed by the host.

Pierre-François Lavallée, Thibaut Véry 10/77

Introduction - Execution model: OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Pierre-François Lavallée, Thibaut Véry 11/77

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads, so the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs.
• The compiler can decide to synchronize the threads of a gang (all or part of them).
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
- The number of gangs is limited to 2^31 - 1
- The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
- Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
- The size of a worker (vectorlength) should be a multiple of 32
- PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

[Figure: number of active threads inside a region created with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8): 2 threads during gang-redundant execution (!$ACC parallel starts the gangs), 10 when worker parallelism is activated (!$ACC loop worker), 80 when vector parallelism is activated (!$ACC loop vector or !$ACC loop worker vector)]

Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on GPU

[Figure: porting cycle: Analyze, Pick a kernel, Parallelize loops, Optimize data, Optimize loops]

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information:
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers

Compilers:
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90, pgfortran

Activate OpenACC:
- -acc: Activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools:
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
- pinned: The memory location on the host is pinned; it might improve data transfers
- managed: The memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

(exemples/reduction.f90)

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
- Command line tool
- Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and vector size mapped to hardware

nvprof
- Command line tool
- Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
- Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes:
- Only the GPU is profiled by default
- Option "--cpu-profiling on" activates CPU profiling
- Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
- To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Screenshot of the NVVP graphical interface]

1. Introduction
2. Parallelism
   - Offloaded Execution: Add directives; Compute constructs (Serial, Kernels, Teams, Parallel)
   - Work sharing: Parallel loop; Reductions; Coalescing
   - Routines
   - Data Management
   - Asynchronism
3. GPU Debugging
4. Annex: Optimization example
5. Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6. Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:
!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-Redundant
- Worker-Single
- Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
- The parameters of the parallel regions are independent (gangs, workers, vector size)
- Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
- Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
  a = 2            ! Gang-redundant
  !$ACC loop       ! Work sharing
  do i=1,n
    ...
  enddo
!$ACC end parallel

Important notes:
- The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
- num_gangs: The number of gangs
- num_workers: The number of workers
- vector_length: The vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang!"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
- The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses: level of parallelism
- gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
- worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
- vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
- seq: the iterations are executed sequentially on the accelerator
- auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

               Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                              X
loop vector                 X                              X

Work sharing - Loops

Clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do for compatibility

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)

- Active threads: 1
- Number of operations: nx

Execution GRWSVS:

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)

- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GPWSVS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx

Execution GPWPVS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)

- Iterations are shared among the active workers of each gang
- Active threads: 10 x nbworkers
- Number of operations: nx

Execution GPWSVP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)

- Iterations are shared among the threads of the worker of all gangs
- Active threads: 10 x vectorlength
- Number of operations: nx

Execution GPWPVP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x nbworkers x vectorlength
- Number of operations: nx

Execution GRWPVS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)

- Each gang is assigned all the iterations, which are shared among the workers
- Active threads: 10 x nbworkers
- Number of operations: 10 x nx

Execution GRWSVP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)

- Each gang is assigned all the iterations, which are shared on the threads of the active worker
- Active threads: 10 x vectorlength
- Number of operations: 10 x nx

Execution GRWPVP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)

- Each gang is assigned all the iterations, and the grid of threads distributes the work
- Active threads: 10 x nbworkers x vectorlength
- Number of operations: 10 x nx

Work sharing - Loops

Reminder: There is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)

- Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)

- Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +              oworld(r+1,c)  + &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles:
- Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing / example without coalescing]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: mapping of the threads T0..T7 onto the array elements for the three variants above: gang(i)/vector(j) (without coalescing, each thread walks a row, strided accesses), gang(j)/vector(i) (with coalescing, consecutive threads touch consecutive elements of a column), and gang-vector(i)/seq(j) (each thread walks a row sequentially)]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. It gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine_wrong.f90)

Parallelism - Routines

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine.f90)

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
- serial
- parallel
- kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement:
- copyin: The variable is copied H->D; the memory is allocated when entering the region
- copyout: The variable is copied D->H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement:
- create: The memory is allocated when entering the region
- present: The variable is already on the device
- delete: Frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ (or p_) for OpenACC 2.0 compatibility.

Other clauses:
- no_create
- deviceptr
- attach
- detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

(exemples/arrayshape.f90)

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

(exemples/arrayshape.c)

Data on device - Shape of arrays

Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified

Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0

Fortran1)

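The C/C++ restriction can be sketched as follows (hypothetical make_iota helper; the compiler cannot see the size of a malloc'd array, so the element count must be given explicitly — the pragma is ignored without an OpenACC compiler):

```c
#include <stdlib.h>
#include <assert.h>

/* For a dynamically allocated array, write a[0:n]: first index and
   number of elements. Plain "copy(a)" would only copy the pointer. */
double *make_iota(int n)
{
    double *a = malloc(n * sizeof *a);
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = i;
    return a;
}
```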

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

    program para
      integer :: a(1000)
      integer :: i, l
    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
      do i=1,1000
        a(i) = i
      enddo
    !$ACC end parallel
    end program para

exemples/parallel_data.f90

(figure: the parallel region is enclosed in an implicit data region)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

    program para
      integer :: a(1000)
      integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
      do i=1,1000
        a(i) = i
      enddo
    !$ACC end parallel

      do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
        do i=1,1000
          a(i) = a(i) + 1
        enddo
    !$ACC end parallel
      enddo
    end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
      do i=1,1000
        a(i) = i
      enddo
    !$ACC end parallel

    !$ACC data copy(a(1:1000))
      do l=1,100
    !$ACC parallel
    !$ACC loop
        do i=1,1000
          a(i) = a(i) + 1
        enddo
    !$ACC end parallel
      enddo
    !$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version
(figure: two parallel regions, each with its own implicit data region, acting on four arrays A, B, C and D with the clauses copyin, copyout, copy and create repeated on each region)
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers
(figure: the same two parallel regions, each still carrying its own copyin, copyout, copy and create clauses for A, B, C and D)
Let's add a data region.

Data on device

Optimize data transfers
(figure: a data region now encloses the two parallel regions and carries the copyin, copyout, copy and create clauses)
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers
(figure: the enclosing data region keeps the data clauses; the parallel regions inside only use present, with an update device before the first region and an update self after the second)
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
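The pattern above can be sketched in C (hypothetical step routine; one enclosing data region, kernels that only assert presence, and an explicit update — without an OpenACC compiler the pragmas are ignored and the code runs on the host):

```c
#include <assert.h>

/* One H->D/D->H round trip for the whole iteration loop instead of
   one per kernel launch. */
void step(double *a, int n, int iters)
{
    #pragma acc data copy(a[0:n])
    {
        for (int it = 0; it < iters; ++it) {
            #pragma acc parallel loop present(a[0:n])
            for (int i = 0; i < n; ++i)
                a[i] += 1.0;
        }
        #pragma acc update self(a[0:n])  /* D->H inside the region */
    }
}
```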

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
      do i=1,1000
        a(i) = 0
      enddo
      do j=1,42
        call random_number(test)
        rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
        do i=1,1000
          a(i) = a(i) + rng
        enddo
      enddo
      print *, "before update self:", a(42)
    ! !$ACC update self(a(42:42))   ! directive commented out
      print *, "after update self:", a(42)
    !$ACC serial
      a(42) = 42
    !$ACC end serial
      print *, "before end data:", a(42)
    !$ACC end data
      print *, "after end data:", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
      do i=1,1000
        a(i) = 0
      enddo
      do j=1,42
        call random_number(test)
        rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
        do i=1,1000
          a(i) = a(i) + rng
        enddo
      enddo
      print *, "before update self:", a(42)
    !$ACC update self(a(42:42))
      print *, "after update self:", a(42)
    !$ACC serial
      a(42) = 42
    !$ACC end serial
      print *, "before end data:", a(42)
    !$ACC end data
      print *, "after end data:", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
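A minimal C sketch of update device (hypothetical refresh routine; the host writes new values, then pushes them H→D before the next kernel — pragmas are ignored without an OpenACC compiler, so the code also runs sequentially):

```c
#include <assert.h>

void refresh(double *b, int n)
{
    #pragma acc data create(b[0:n])
    {
        for (int i = 0; i < n; ++i)   /* host-side initialization */
            b[i] = i;
        #pragma acc update device(b[0:n])  /* H->D */
        #pragma acc parallel loop present(b[0:n])
        for (int i = 0; i < n; ++i)
            b[i] *= 2.0;
        #pragma acc update self(b[0:n])    /* D->H */
    }
}
```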

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

    program enterdata
      integer :: s = 10000
      real*8, allocatable, dimension(:) :: vec
      allocate(vec(s))
    !$ACC enter data create(vec(1:s))
      call compute
    !$ACC exit data delete(vec(1:s))
    contains
      subroutine compute
        integer :: i
    !$ACC parallel loop
        do i=1,s
          vec(i) = 12
        enddo
      end subroutine
    end program enterdata

exemples/enter_data.f90
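The constructor/destructor style mentioned above can be sketched in C (hypothetical vec_* helpers; mapping happens in one place, release in another, kernels elsewhere — pragmas are ignored without an OpenACC compiler):

```c
#include <stdlib.h>
#include <assert.h>

double *vec_create(int n)            /* "constructor": allocate + map */
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])
    return v;
}

void vec_fill(double *v, int n, double x)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = x;
    #pragma acc update self(v[0:n])
}

void vec_destroy(double *v, int n)   /* "destructor": unmap + free */
{
    #pragma acc exit data delete(v[0:n])
    free(v);
}
```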

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

    Zone         Scope
    Module       the program
    Function     the function
    Subroutine   the subroutine

    module declare
      integer :: rows = 1000
      integer :: cols = 1000
      integer :: generations = 100
    !$ACC declare copyin(rows, cols, generations)
    end module

exemples/declare.f90

Data on device - Time line

(figure: time line of host and device values across a data region opened with copyin(a,c) create(b) copyout(d) — the device copies of a and c are initialized at entry, b exists only on the device until update self(b) copies it D→H, and d reaches the host only at exit through copyout(d))

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

    ! Update the value of a located on the host with the value on the device
    !$ACC update self(a)
    ! Update the value of b located on the device with the value on the host
    !$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(figure: synchronous execution runs kernel 1, the data transfer and kernel 2 in sequence; with "kernel 1 and transfer async" the transfer overlaps kernel 1; with everything async all three overlap)

    ! b is initialized on host
    do i=1,s
      b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
      a(i) = 42
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop async(3)
    do i=1,s
      c(i) = 11
    enddo

exemples/async_slides.f90
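A C sketch of the same idea (hypothetical independent_ops routine; two independent kernels on separate async queues, then a bare wait to synchronize — without an OpenACC compiler the pragmas are ignored and everything simply runs in order):

```c
#include <assert.h>

void independent_ops(double *a, double *c, int n)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i)
        a[i] = 42.0;
    #pragma acc parallel loop async(2)   /* independent of queue 1 */
    for (int i = 0; i < n; ++i)
        c[i] = 11.0;
    #pragma acc wait                     /* join all queues */
}
```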

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(figure: kernel 2 waits for the transfer before starting; kernel 1 runs concurrently)

    ! b is initialized on host
    do i=1,s
      b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
      a(i) = 42
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop wait(2)
    do i=1,s
      b(i) = b(i) + 11
    enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

    do g=1,generations
      cells = 0
    !$ACC parallel
      do r=1,rows
        do c=1,cols
          oworld(r,c) = world(r,c)
        enddo
      enddo
    !$ACC end parallel
    !$ACC parallel
      do r=1,rows
        do c=1,cols
          neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                  oworld(r-1,c)            +oworld(r+1,c)+ &
                  oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
          if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
          else if (neigh == 3) then
            world(r,c) = 1
          end if
        enddo
      enddo
    !$ACC end parallel
    !$ACC parallel
      do r=1,rows
        do c=1,cols
          ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life
Let's share the work among the active gangs.

    do g=1,generations
      cells = 0
    !$ACC parallel loop
      do r=1,rows
        do c=1,cols
          oworld(r,c) = world(r,c)
        enddo
      enddo
    !$ACC parallel loop
      do r=1,rows
        do c=1,cols
          neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                  oworld(r-1,c)            +oworld(r+1,c)+ &
                  oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
          if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
          else if (neigh == 3) then
            world(r,c) = 1
          end if
        enddo
      enddo
    !$ACC parallel loop reduction(+:cells)
      do r=1,rows
        do c=1,cols
          cells = cells + world(r,c)
        enddo
      enddo
      print *, "Cells alive at generation ", g, ": ", cells
    enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life
Let's optimize data transfers.

    cells = 0
    !$ACC data copy(oworld, world)
    do g=1,generations
      cells = 0
    !$ACC parallel loop
      do c=1,cols
        do r=1,rows
          oworld(r,c) = world(r,c)
        enddo
      enddo
    !$ACC parallel loop
      do c=1,cols
        do r=1,rows
          neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                  oworld(r-1,c)            +oworld(r+1,c)+ &
                  oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
          if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
          else if (neigh == 3) then
            world(r,c) = 1
          end if
        enddo
      enddo
    !$ACC parallel loop reduction(+:cells)
      do c=1,cols
        do r=1,rows
          cells = cells + world(r,c)
        enddo
      enddo
      print *, "Cells alive at generation ", g, ": ", cells
    enddo
    !$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Features: OpenACC 2.7 | OpenMP 5.0

GPU offloading
  PARALLEL  (GR/WS/VS mode; each gang executes instructions redundantly) | TEAMS  (creates teams of threads that execute redundantly)
  SERIAL    (only 1 thread executes the code)                            | TARGET (GPU offloading, initial thread only)
  KERNELS   (offloading + automatic parallelization)                     | X      (does not exist in OpenMP)

Implicit transfer H→D or D→H
  scalar variables, non-scalar variables                                 | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
  clauses on PARALLEL/SERIAL/KERNELS                                     | clauses on TARGET (data movement defined from the host perspective)
  copyin()   (copy H→D when entering the construct)                      | map(to:)
  copyout()  (copy D→H when leaving the construct)                       | map(from:)
  copy()     (copy H→D when entering and D→H when leaving)               | map(tofrom:)
  create()   (created on device, no transfers)                           | map(alloc:)
  present(), delete()                                                    |

Features: OpenACC 2.7 | OpenMP 5.0

Data region
  declare()               (add data to the global data region)           |
  data / end data         (inside the same programming unit)             | TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
  enter data / exit data  (opening and closing anywhere in the code)     | TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
  update self()           (update D→H)                                   | UPDATE FROM (copy D→H inside a data region)
  update device()         (update H→D)                                   | UPDATE TO   (copy H→D inside a data region)

Loop parallelization
  kernels                 (offloading + automatic parallelization of     | X (does not exist in OpenMP)
                           suitable loops; sync at the end of loops)
  loop gang/worker/vector (share iterations among gangs, workers and     | distribute (share iterations among TEAMS)
                           vectors)

Features: OpenACC 2.7 | OpenMP 5.0

Routines
  routine [seq|vector|worker|gang]  (compile a device routine with the specified level of parallelism)

Asynchronism
  async()/wait  (execute kernels/transfers asynchronously, with explicit sync) | nowait/depend (manages asynchronism and task dependencies)
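The first table row can be illustrated with the same loop written both ways (hypothetical fill_acc/fill_omp names; without -acc / -fopenmp both pragmas are ignored, so the loops run sequentially and produce identical results):

```c
#include <assert.h>

void fill_acc(double *a, int n)   /* OpenACC version */
{
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}

void fill_omp(double *a, int n)   /* OpenMP translation */
{
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}
```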

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

    !$acc parallel
    !$acc loop gang worker vector
    do i=1,nx
      A(i) = 1.14_8*i
    end do
    !$acc end parallel

exemples/loop1D.f90

    !$acc parallel
    !$acc loop gang worker
    do j=1,nx
    !$acc loop vector
      do i=1,nx
        A(i,j) = 1.14_8
      end do
    end do
    !$acc end parallel

exemples/loop2D.f90

OpenMP

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i=1,nx
      A(i) = 1.14_8*i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

exemples/loop1DOMP.f90

    !$OMP TARGET TEAMS DISTRIBUTE
    do j=1,nx
    !$OMP PARALLEL DO SIMD
      do i=1,nx
        A(i,j) = 1.14_8
      end do
    !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

    !$acc parallel
    !$acc loop gang
    do k=1,nx
    !$acc loop worker
      do j=1,nx
    !$acc loop vector
        do i=1,nx
          A(i,j,k) = 1.14_8
        end do
      end do
    end do
    !$acc end parallel

exemples/loop3D.f90

OpenMP

    !$OMP TARGET TEAMS DISTRIBUTE
    do k=1,nx
    !$OMP PARALLEL DO
      do j=1,nx
    !$OMP SIMD
        do i=1,nx
          A(i,j,k) = 1.14_8
        end do
    !$OMP END SIMD
      end do
    !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Introduction - Architecture of a converged node on Jean Zay

(figure: architecture of a converged node on Jean Zay — 40 cores, 192 GB of memory, 4 GPU V100 — host, devices and network connected through PLX switches, with link bandwidths of 900 GB/s, 208 GB/s, 128 GB/s, 25 GB/s, 16 GB/s and 12.4 GB/s)

Introduction - Architecture CPU-GPU


Introduction - Execution model: OpenACC

(figure: execution controlled by the host — the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator)

In OpenACC, the execution is controlled by the host (CPU). It means that kernels, data transfers and memory allocation are managed by the host.

Introduction - Execution model: OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vector_length threads. So the total number of threads is:

    nb_threads = nb_gangs * nb_workers * vector_length

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)
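The formula above can be checked with a trivial helper (hypothetical name), using the launch parameters of the example two slides below, num_gangs(2) num_workers(5) vector_length(8):

```c
#include <assert.h>

/* nb_threads = nb_gangs * nb_workers * vector_length */
int total_threads(int nb_gangs, int nb_workers, int vector_length)
{
    return nb_gangs * nb_workers * vector_length;
}
```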

Introduction - Execution model

NVidia P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nb_workers x vector_length) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vector_length) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32

Introduction - Execution model

(figure: number of active threads at each point of a region launched with num_gangs(2), num_workers(5) and vector_length(8) — 2 active threads after !$ACC parallel starts the gangs (gang-redundant mode), 10 inside an !$ACC loop worker, 80 inside an !$ACC loop vector or an !$ACC loop worker vector, falling back to 2 between the loops)

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

(figure: cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops)

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation. The compiler will do implicit operations that you want to be aware of.

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel:

    program loop
      integer :: a(10000)
      integer :: i
    !$ACC parallel loop
      do i=1,10000
        a(i) = i
      enddo
    end program loop

exemples/loop.f90

    program reduction
      integer :: a(10000)
      integer :: i
      integer :: sum=0
    !$ACC parallel loop
      do i=1,10000
        a(i) = i
      enddo
    !$ACC parallel loop reduction(+:sum)
      do i=1,10000
        sum = sum + a(i)
      enddo
    end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel:

    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
    !$ACC parallel loop reduction(+:sum)
    do i=1,10000
      sum = sum + a(i)
    enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

    !$ACC data create(a(1:10000))
    !$ACC parallel loop
    do i=1,10000
      a(i) = i
    enddo
    !$ACC parallel loop reduction(+:sum)
    do i=1,10000
      sum = sum + a(i)
    enddo
    !$ACC end data

exemples/reduction_data_region.f90
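The reduction pattern can be sketched in C as well (hypothetical sum_first_n routine; without a reduction clause the concurrent updates to sum would race on the device — with the pragma ignored, the loop runs sequentially):

```c
#include <assert.h>

/* Each thread accumulates a private partial sum that the runtime
   combines with the + operator at the end of the loop. */
long sum_first_n(int n)
{
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```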

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1.
The grid is the number of gangs. The block is the size of one gang ([vector_length x nb_workers]).

Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

    $ nvprof <options> executable_file <arguments of executable file>

    Type             Time(%)  Time      Calls  Avg       Min       Max       Name
    GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                     36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                      6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                      2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                      0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                      0.01%   359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, memory read throughput and memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

    $ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

    $ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

(screenshot of the NVVP timeline)

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
  Work sharing
    Parallel loop, Reductions, Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between a Fortran and a C/C++ code.

Fortran (OpenACC)

    !$ACC directive <clauses>
    !$ACC end directive

C/C++ (OpenACC)

    #pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation, the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU.
Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1), num_workers(1), vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

    !$ACC serial
    do i=1,n
      do j=1,n
        ...
      enddo
      do j=1,n
        ...
      enddo
    enddo
    !$ACC end serial

Offloaded execution - Directive serial

    !$ACC serial
    do g=1,generations
      do r=1,rows
        do c=1,cols
          oworld(r,c) = world(r,c)
        enddo
      enddo
      do r=1,rows
        do c=1,cols
          neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                  oworld(r-1,c)            +oworld(r+1,c)+ &
                  oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
          if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
          else if (neigh == 3) then
            world(r,c) = 1
          end if
        enddo
      enddo
      cells = 0
      do r=1,rows
        do c=1,cols
          cells = cells + world(r,c)
        enddo
      enddo
      print *, "Cells alive at generation ", g, ": ", cells
    enddo
    !$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region
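The gang-redundant vs. work-shared distinction above can be sketched in C as well (the slides use Fortran; the function name below is ours). Note that when the pragmas are ignored by a non-OpenACC compiler, the code simply runs sequentially, which OpenACC is designed to keep correct:

```c
/* Minimal sketch, not from the slides: without the inner "#pragma acc loop",
 * every gang would execute the whole loop redundantly; with it, the
 * iterations are shared among the gangs. */
void fill_parallel(double *a, int n)
{
    #pragma acc parallel copyout(a[0:n])
    {
        #pragma acc loop /* work sharing: iterations spread among gangs */
        for (int i = 0; i < n; ++i)
            a[i] = 2.0 * (double)i;
    }
}
```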

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                             X
loop vector                 X                             X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do (kept for compatibility)
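The collapse and reduction clauses above can be combined on one loop directive. A minimal C sketch (the function name and problem are ours, not from the slides); collapse(2) merges the two tightly nested loops into a single iteration space, and sum is reduced across all iterations. Compiled without OpenACC support, the pragma is ignored and the same result is computed sequentially:

```c
/* Hypothetical sketch: count the "black" squares of a rows x cols
 * checkerboard with a collapsed, reduced loop nest. */
long checkerboard_sum(int rows, int cols)
{
    long sum = 0;
    #pragma acc parallel loop collapse(2) reduction(+:sum)
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            sum += (r + c) % 2;   /* 1 on squares where r+c is odd */
    return sum;
}
```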

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x number of workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x number of workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x number of workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x number of workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
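The max and min operators of the table can be combined on one loop. A minimal C sketch (function name ours, not from the slides); each reduction variable is privatized per thread and combined at the end of the loop. With the pragma ignored, the sequential fallback computes the same result:

```c
/* Hypothetical sketch: min and max reductions over an array, matching the
 * min/max operators of the reduction table. */
void minmax(const double *a, int n, double *lo, double *hi)
{
    double mn = a[0], mx = a[0];
    #pragma acc parallel loop reduction(min:mn) reduction(max:mx)
    for (int i = 0; i < n; ++i) {
        if (a[i] < mn) mn = a[i];
        if (a[i] > mx) mx = a[i];
    }
    *lo = mn;
    *hi = mx;
}
```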

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: an example with coalescing and an example without coalescing]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

             Without Coalescing   With Coalescing
Time (ms)    439                  16

The memory coalescing version is 27 times faster.
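The same principle can be sketched in C, with one caveat (the function names and the array size below are ours): C is row-major, so the contiguous index is the last one, the opposite of Fortran. The "coalescing-friendly" loop order therefore puts j (the last index) in the innermost/vector loop. Both versions compute the same array; only the memory-access pattern differs:

```c
#define N 64

/* Hypothetical sketch: non-contiguous inner accesses (stride N doubles). */
void fill_strided(double a[N][N])
{
    #pragma acc parallel loop gang
    for (int j = 0; j < N; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < N; ++i)
            a[i][j] = 1.14;
    }
}

/* Hypothetical sketch: contiguous (coalesced) inner accesses. */
void fill_contiguous(double a[N][N])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}
```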

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0-T7 onto the elements of A for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads access elements separated by a stride. With coalescing, gang(j)/vector(i): consecutive threads access consecutive elements. gang-vector(i)/seq(j): each thread walks a row sequentially.]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
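The same corrected pattern can be sketched in C (function names ours, not from the slides): routine seq asks the compiler to build a sequential device version of the callee, so it can be called from inside the offloaded loop. Without an OpenACC compiler, the pragmas are ignored and the sequential fallback is still correct:

```c
/* Hypothetical sketch: a device-callable helper declared with routine seq. */
#pragma acc routine seq
void fill_row(int *array, int cols, int row)
{
    for (int j = 0; j < cols; ++j)
        array[row * cols + j] = 2;
}

void fill_all(int *array, int rows, int cols)
{
    #pragma acc parallel loop copyout(array[0:rows*cols])
    for (int i = 0; i < rows; ++i)
        fill_row(array, cols, i);   /* one sequential call per iteration */
}
```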

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
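The clauses above can be combined on one data region. A minimal C sketch (function and variable names ours, not from the slides): b only travels H→D (copyin), r only travels back D→H (copyout), and tmp exists only for the duration of the region (create). With the pragmas ignored, tmp is an ordinary host array and the result is the same:

```c
/* Hypothetical sketch: one data region serving two kernels.
 * b: copyin (H→D), r: copyout (D→H), tmp: create (device-only scratch). */
void scale(const double *b, double *r, double *tmp, int n, double factor)
{
    #pragma acc data copyin(b[0:n]) copyout(r[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop present(b, tmp)
        for (int i = 0; i < n; ++i)
            tmp[i] = factor * b[i];

        #pragma acc parallel loop present(tmp, r)
        for (int i = 0; i < n; ++i)
            r[i] = tmp[i] + 1.0;
    }
}
```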

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran, we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))   ! data region + parallel region
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal!

Data on device - Local data regions

Optimal? No! You still have to move the beginning of the data region before the first parallel loop, so that its transfer of a is covered by the same region.

Data on device

Naive version: two parallel regions, each with its own implicit data region. Each region transfers A (copyin), B (copyout) and C (copy), and allocates D (create).
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
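The same decoupling can be sketched in C (names ours, not from the slides): the enter data / exit data pair brackets the device lifetime of vec in one place, while the kernel that uses it (with a present clause) lives elsewhere. With the pragmas ignored, vec is an ordinary host array and the result is unchanged:

```c
#include <stdlib.h>

/* Hypothetical sketch of the enter data / exit data pattern. */
static double *vec;
static int s = 10000;

static void compute(void)
{
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;
}

double run(void)
{
    vec = malloc(s * sizeof *vec);
    #pragma acc enter data create(vec[0:s])
    compute();
    #pragma acc update self(vec[0:s])   /* bring the results back D→H */
    double v = vec[42];
    #pragma acc exit data delete(vec[0:s])
    free(vec);
    return v;
}
```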

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device; changes made on one side are invisible on the other until a transfer occurs. An update self(b) during the region copies b D→H; at exit, only d is copied back D→H (copyout).]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

[Figure: three timelines — kernel 1, a data transfer and kernel 2 executed one after the other (synchronous); kernel 1 overlapping the transfer (kernel 1 and transfer async); all three overlapping (everything async).]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: timeline where kernel 1 runs on one queue while the data transfer runs on another; kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)

bull Example of code that has a race condition


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ...

exemples/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.


Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features — OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features — OpenACC 2.7 vs OpenMP 5.0

Data regions
• declare (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)


Features — OpenACC 2.7 vs OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manage asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Introduction - Architecture CPU-GPU


Introduction - Execution model OpenACC

Execution controlled by the host

[Figure: the CPU launches kernel1, kernel2 and kernel3 in sequence on the accelerator.]

In OpenACC the execution is controlled by the host (CPU). This means that kernels, data transfers and memory allocations are managed by the host.


Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.


Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed)


Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vectorlength) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32


Introduction - Execution model

[Figure: number of active threads in a region compiled with num_gangs(2), num_workers(5), vector_length(8): 2 active threads in gang-redundant code after ACC parallel (starts gangs); 10 inside an ACC loop worker; 80 inside a nested ACC loop vector or an ACC loop worker vector; back to 2 after each loop.]


Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)

2. Add OpenACC directives (Parallelization)

3. Optimize data transfers and loops (Optimizations)

4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]


PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI


PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/1910/x86/pvf-user-guide/index.htm#ta


PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1, 10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1, 10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1, 10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90


PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1, 10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90


Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof


Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).


Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• The option "--cpu-profiling on" activates CPU profiling
• The options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)


Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
    Work sharing: Parallel loop, Reductions, Coalescing
    Routines
    Data Management
    Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.




Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause      | Work sharing (loop iterations) | Activation of new threads
loop gang   | X                              |
loop worker | X                              | X
loop vector | X                              | X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility


Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vectorlength
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vectorlength
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vectorlength
• Number of operations: 10 nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x workers x vectorlength
• Number of operations: 10 nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code that follows the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213.0 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000.0 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
do i=1,1000
   a(i) = b(i)*2
enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
do i=1,1000
   a(i) = b(i)*2
enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to obtain the final value of the variable.

Operations:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Reduction operators

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop with vector parallelism should access memory contiguously.

Examples with and without coalescing follow.

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms):        439         |       16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping of A for the three variants — gang(i)/vector(j) without coalescing, gang(j)/vector(i) with coalescing, and gang-vector(i)/seq(j). In the coalesced case, the threads T0..T7 of a worker access consecutive memory elements; in the non-coalesced case, each thread strides across the array.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
integer :: s = 10000
integer, allocatable :: array(:,:)
allocate(array(s,s))
!$ACC parallel
!$ACC loop
do i=1,s
   call fill(array(:,:), s, i)
enddo
!$ACC end parallel
print *, array(1,:10)
contains
subroutine fill(array, s, i)
   integer, intent(out) :: array(:,:)
   integer, intent(in)  :: s, i
   integer :: j
   do j=1,s
      array(i,j) = 2
   enddo
end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

Correct version, with the routine directive inside the subroutine:

program routine
integer :: s = 10000
integer, allocatable :: array(:,:)
allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
do i=1,s
   call fill(array(:,:), s, i)
enddo
!$ACC end parallel
print *, array(1,:10)
contains
subroutine fill(array, s, i)
   !$ACC routine seq
   integer, intent(out) :: array(:,:)
   integer, intent(in)  :: s, i
   integer :: j
   do j=1,s
      array(i,j) = 2
   enddo
end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) each have an associated local data region.

Global region
An implicit data region is open during the lifetime of the program. It is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already present on the device; if so, no action is taken. You may still encounter clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses, with the first and last index:

! We copy a 2D array to the GPU for matrix a
! (in Fortran we can omit the shape: copy(a))
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets, with the first index and the number of elements:

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
integer :: a(1000)
integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel region implicitly carries its own data region.)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
integer :: a(1000)
integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

do l=1,100
   !$ACC parallel copy(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (entry and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in its clauses visible on the device. You have to use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (entry and exit, so potentially 2 data transfers)

Optimal?

Data on device - Local data regions

Optimal? No: the first parallel region still performs its own copyout of a. You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two parallel regions, each with its own implicit data region. The variables A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated at each region.

• Transfers: 8
• Allocations: 2

Data on device

Optimizing data transfers: the same two parallel regions, each still handling A (copyin), B (copyout), C (copy) and D (create) separately. Let's add a data region around them.

Data on device

With an enclosing data region, A, B and C are now transferred only at entry and exit of the data region.

• Transfers: 4
• Allocations: 1

Data on device

Inside the enclosing data region, the parallel regions only check for data presence. To make the code clearer, you can therefore use the present clause. To make sure the updates are actually done, use the update directive (update device for A, update self for C).

• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Same code with the update directive active (exemples/update_dir.f90):

!$ACC update self(a(42:42))

With update:
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of the enter data and exit data directives is that data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
integer :: s = 10000
real*8, allocatable, dimension(:) :: vec
allocate(vec(s))
!$ACC enter data create(vec(1:s))
call compute
!$ACC exit data delete(vec(1:s))
contains
subroutine compute
   integer :: i
   !$ACC parallel loop
   do i=1,s
      vec(i) = 12
   enddo
end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
integer :: rows = 1000
integer :: cols = 1000
integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of host and device values across a data region opened with copyin(a,c), create(b) and copyout(d). At entry, a and c are transferred H→D and b is allocated on the device; kernels then modify the device copies; update self(b) copies b back D→H in the middle of the region; at exit, only d is copied D→H (copyout), so later host-side writes to a are never seen by the device.]

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. Specifying different numbers inside the clause creates several execution threads.

[Figure: synchronously, kernel 1, the data transfer and kernel 2 run back to back; with async on kernel 1 and the transfer, they overlap; with everything async, all three overlap.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers identifying execution threads. For example, wait(1,2) waits for the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer issued on queue 2 before running.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Debugging - PGI autocompare

• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60,autocompare).
• Example: a code that has a race condition.

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs:

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar and non-scalar variables ↔ scalar and non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET; data movement is defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()

Data regions
• declare (adds data to the global data region) ↔ —
• data / end data (open and close inside the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (open and close anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (open and close anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism) ↔ —

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 13: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Introduction - Execution model OpenACC

Execution controlled by the host

Diagram: the CPU successively launches kernel1, kernel2 and kernel3 on the accelerator.

In OpenACC the execution is controlled by the host (CPU). This means that kernel launches, data transfers and memory allocations are managed by the host.


Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 11 77

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vector_length threads. So the total number of threads is:

nb_threads = nb_gangs * nb_workers * vector_length

Important notes:
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs groups of 32 threads are formed)


Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nb_workers x vector_length) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vector_length) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32


Introduction - Execution model

Diagram: number of active threads per construct, for num_gangs(2), num_workers(5), vector_length(8). With "acc parallel" only the gang leaders execute, redundantly (2 active threads); adding "acc loop worker" activates the workers of each gang (10 threads); "acc loop vector" or "acc loop worker vector" activates the full thread-grid (80 threads); "acc kernels num_gangs(2) num_workers(5) vector_length(8)" behaves likewise.


Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

Diagram: iterative porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops.


PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI


PGI Compiler - -ta options

Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
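Putting these flags together, a typical compile line for a V100 node could look like the sketch below (file names are illustrative; the exact compiler setup depends on the machine):

```shell
# Fortran source with OpenACC directives, targeting a V100 (cc70),
# with compiler feedback enabled
pgf90 -acc -ta=tesla:cc70 -Minfo=accel -o my_prog my_prog.f90
```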


PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90


PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90


Analysis - Code profiling

Here are 3 tools available to profile your code

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof


Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vector_length x nb_workers]).


Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)
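Typical invocations combining the options above might look like this sketch (binary and file names are illustrative):

```shell
# GPU-only profile (default behavior), then a run with CPU profiling
# and an nvvp-readable output file
nvprof ./gol
nvprof --cpu-profiling on -o gol.prof -f ./gol
```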


Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs
      Serial
      Kernels/Teams
      Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran / OpenACC

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ / OpenACC

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s


Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang!"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility


Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous access to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Figure: example with «coalescing» vs. example with «no coalescing».


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without Coalescing   With Coalescing
Time (ms)   43.9                 1.6

The memory coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

Diagram: mapping of threads T0-T7 to the elements of A for the three variants — gang(i)/vector(j) (without coalescing), gang(j)/vector(i) (with coalescing), and gang-vector(i)/seq(j).





Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H Host D Device variable scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/array_shape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/array_shape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language's convention (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region is enclosed in an implicit data region.


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No! You have to move the beginning of the data region before the first loop.


Data on device

Naive version: each of the two parallel regions carries its own implicit data region, so the variables A (copyin), B (copyout), C (copy) and D (create) are handled at each region.
• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: the same two parallel regions, each still with its own implicit data region handling A (copyin), B (copyout), C (copy) and D (create). Let's add a data region.


Data on device

Optimize data transfers: an enclosing data region now surrounds both parallel regions. A, B and C are transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers: inside the enclosing data region, the clauses on the parallel regions only check for data presence (A, B, C and D are present). To make the code clearer you can use the present clause. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.



!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

Time line (host vs device) for a region opened with copyin(a,c) create(b) copyout(d):
• Entering the region: a and c are copied H→D; b and d are allocated on the device without any transfer.
• While kernels modify the device copies, the host values remain unchanged.
• update self(b): the host copy of b is refreshed D→H.
• Exiting the region: only d is copied back D→H (copyout); a, b and c are not.


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

Synchronous execution: Kernel 1, the data transfer and Kernel 2 run one after the other. With async on Kernel 1 and the transfer, they overlap. With async on everything, Kernel 1, the data transfer and Kernel 2 all overlap.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete.

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)  +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features — OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading:
• PARALLEL (GR/WS/VS mode — each gang executes instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions:
• clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features — OpenACC 2.7 ↔ OpenMP 5.0

Data regions:
• declare (add data to the global data region) ↔ —
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)


Features — OpenACC 2.7 ↔ OpenMP 5.0

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism) ↔ —

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts


• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Introduction - Execution model OpenACC

OpenACC has 4 levels of parallelism for offloaded execution:
• Coarse grain: gang
• Fine grain: worker
• Vectorization: vector
• Sequential: seq

By specifying clauses within OpenACC directives, your code will be run with a combination of the following modes:
• Gang-Redundant (GR): all gangs run the same instructions redundantly
• Gang-Partitioned (GP): work is shared between the gangs (ex: !$ACC loop gang)
• Worker-Single (WS): one worker is active in GP or GR mode
• Worker-Partitioned (WP): work is shared between the workers of a gang
• Vector-Single (VS): one vector channel is active
• Vector-Partitioned (VP): several vector channels are active

These modes must be combined in order to get the best performance from the accelerator.


Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vector_length threads. So the total number of threads is:

nb_threads = nb_gangs * nb_workers * vector_length

Important notes
• No synchronization is possible between gangs
• The compiler can decide to synchronize the threads of a gang (all or part of them)
• The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVIDIA GPUs groups of 32 threads are formed)


Introduction - Execution model

NVIDIA P100 restrictions on gang/worker/vector:
• The number of gangs is limited to 2^31 - 1
• The thread-grid size (i.e. nb_workers x vector_length) is limited to 1024
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched
• The size of a worker (vector_length) should be a multiple of 32
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32


Introduction - Execution model

Active threads for !$ACC kernels num_gangs(2) num_workers(5) vector_length(8):
• parallel (starts gangs): 2 active threads (gang-redundant mode)
• loop worker: 10 active threads
• loop vector: 80 active threads
• loop worker vector: 80 active threads


Introduction - Porting strategy

When using OpenACC it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

The cycle: pick a kernel, analyze, parallelize loops, optimize data, optimize loops.


PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVIDIA in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI


PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVIDIA hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta


PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90


PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90


Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof


Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector_length x nb_workers]).


Analysis - Code profiling: nvprof

NVIDIA provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)


Analysis - Graphical profiler: nvvp

NVIDIA also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
    Serial
    Kernels/Teams
    Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello! I am a gang."
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                             X
loop vector                 X                             X


Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do for compatibility


Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)         43.9                 1.6

The memory coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: element-to-thread mapping for the three variants. Without coalescing: gang(i)/vector(j), each thread strides through memory. With coalescing: gang(j)/vector(i), consecutive threads touch consecutive elements; gang-vector(i)/seq(j) also yields contiguous accesses.]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. It gives the information to the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. It gives the information to the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading:
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime:
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran:
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++:
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its associated data region]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two parallel regions (each with its own implicit data region) working on arrays A, B, C and D.

[Figure: each parallel region separately transfers A (copyin), B (copyout), C (copy) and allocates D (create)]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: the two parallel regions (with their implicit data regions) still transfer A (copyin), B (copyout), C (copy) and allocate D (create) independently.

Let's add a data region.


Data on device

Optimize data transfers: an enclosing data region now carries the clauses copyin(A), copyout(B), copy(C) and create(D) for both parallel regions.

A, B and C are now transferred at entry and exit of the data region:
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers: the enclosing data region keeps copyin(A), copyout(B), copy(C) and create(D); the parallel regions now mark the arrays as present, with an update device for A and an update self for C where refreshed values are needed.

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device:
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data / exit data

Important notes:
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Timeline figure: host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are transferred H→D while b and d are only allocated on the device. An update self(b) copies b D→H in the middle of the region. On exit of the data region, only d is copied back D→H (copyout); device-side changes to a, b and c are not propagated to the host.]


Data on device - Important notes

- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
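For instance, inside a data region, host code cannot see device results until an update is issued (a sketch; array a and size n are illustrative):

```fortran
!$ACC data create(a(1:n))
!$ACC parallel loop
do i = 1, n
  a(i) = i
enddo
!$ACC update self(a(1:n))   ! without this, the host copy of a is stale
print *, a(n)
!$ACC end data
```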

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
- computation and data transfers
- kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three run concurrently.]

! b is initialized on host
do i = 1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i = 1, s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
- The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 1 runs on execution thread 1 while the data transfer runs on thread 2; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i = 1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i = 1, s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
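The standalone wait directive blocks the host until the listed execution threads complete, e.g. (a sketch; arrays a and c are illustrative):

```fortran
!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo
!$ACC parallel loop async(2)
do i = 1, s
  c(i) = 11
enddo
!$ACC wait(1, 2)   ! the host blocks until both execution threads complete
```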

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77


Debugging - PGI autocompare

- Automatic detection of result differences when comparing GPU and CPU executions.
- How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

- To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare).
- Example of code that has a race condition:
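A typical pattern triggering such a race (a hypothetical sketch, not the original example) is a gang-level loop reading values written by another gang-level loop in the same parallel region, since gangs are never synchronized:

```fortran
!$ACC parallel
!$ACC loop gang
do i = 1, n
  a(i) = i
enddo
!$ACC loop gang
do i = 2, n
  b(i) = a(i-1)   ! may read a(i-1) before the first loop has written it
enddo
!$ACC end parallel
```

With autocompare, the GPU and CPU results are compared after each kernel and execution stops as soon as they differ.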

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g = 1, generations
  cells = 0
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      ...

exemples/opti/gol_opti1.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs:

do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/opti/gol_opti2.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 9.5 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize the data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c = 1, cols
    do r = 1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
- Size: 10000x5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77


| Features | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
| GPU offloading | PARALLEL | GR/WS/VS mode: each gang executes instructions redundantly | TEAMS | creates teams of threads that execute redundantly |
| | SERIAL | only 1 thread executes the code | TARGET | GPU offloading, initial thread only |
| | KERNELS | offloading + automatic parallelization | X | does not exist in OpenMP |
| Implicit transfer H→D or D→H | | scalar variables, non-scalar variables | | scalar variables, non-scalar variables |
| Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL/SERIAL/KERNELS | | clauses on TARGET | data movement defined from the host perspective |
| | copyin() | copy H→D when entering the construct | map(to:) | |
| | copyout() | copy D→H when leaving the construct | map(from:) | |
| | copy() | copy H→D when entering and D→H when leaving | map(tofrom:) | |
| | create() | created on device, no transfers | map(alloc:) | |
| | present(), delete() | | | |

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

| Features | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
| Data region | declare() | add data to the global data region | | |
| | data / end data | inside the same programming unit | TARGET DATA MAP(to:/from:) / END TARGET DATA | inside the same programming unit |
| | enter data / exit data | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code |
| | update self() | update D→H | UPDATE FROM | copy D→H inside a data region |
| | update device() | update H→D | UPDATE TO | copy H→D inside a data region |
| Loop parallelization | kernels | offloading + automatic parallelization of suitable loops, sync at the end of loops | X | does not exist in OpenMP |
| | loop gang/worker/vector | share iterations among gangs, workers and vectors | distribute | share iterations among TEAMS |

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

| Features | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
| routines | routine [seq/vector/worker/gang] | compile a device routine with the specified level of parallelism | | |
| asynchronism | async()/wait | execute kernels/transfers asynchronously, explicit sync | nowait/depend | manages asynchronism and task dependencies |
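As a sketch of the OpenMP side, nowait and depend express the same kind of overlap (a hypothetical example, not taken from the course's files):

```fortran
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO NOWAIT DEPEND(OUT:a)
do i = 1, n
  a(i) = 42
enddo
!$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
! This second kernel only starts once a has been produced
!$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO NOWAIT DEPEND(IN:a)
do i = 1, n
  b(i) = a(i) + 1
enddo
!$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
!$OMP TASKWAIT   ! host waits for the deferred target tasks
```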

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
  !$acc loop vector
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
  !$OMP PARALLEL DO SIMD
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC:

!$acc parallel
!$acc loop gang
do k = 1, nx
  !$acc loop worker
  do j = 1, nx
    !$acc loop vector
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
  !$OMP PARALLEL DO
  do j = 1, nx
    !$OMP SIMD
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77


Contacts

Contacts

- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

Page 15: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Execution model

The execution of a kernel uses a set of threads that are mapped on the hardware resources of the accelerator. These threads are grouped within teams of the same size, with one master thread per team (this defines a gang). Each team is spread on a 2D thread-grid (worker x vector). One worker is actually a vector of vectorlength threads. So the total number of threads is:

nbthreads = nbgangs * nbworkers * vectorlength

Important notes:
- No synchronization is possible between gangs.
- The compiler can decide to synchronize the threads of a gang (all or part of them).
- The threads of a worker run in SIMT mode (all threads run the same instruction at the same time; for example, on NVidia GPUs, groups of 32 threads are formed).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 12 77

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector:
- The number of gangs is limited to 2^31 - 1.
- The thread-grid size (i.e. nbworkers x vectorlength) is limited to 1024.
- Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
- The size of a worker (vectorlength) should be a multiple of 32.
- PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77

Introduction - Execution model

[Diagram: active threads with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). An !$ACC parallel region starts 2 gangs: 2 active threads (gang-redundant mode). An !$ACC loop worker raises the count to 10 active threads (2 gangs x 5 workers); a nested !$ACC loop vector raises it to 80 (2 x 5 x 8); after each loop the count drops back to 2. A single !$ACC loop worker vector goes directly from 2 to 80 active threads.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Diagram: iterative porting cycle - pick a kernel, analyze, parallelize loops, optimize data, optimize loops.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information:
- Company founded in 1989
- Acquired by NVidia in 2013
- Develops compilers, debuggers and profilers

Compilers:
- Latest version: 19.10
- Hands-on version: 19.10
- C: pgcc
- C++: pgc++
- Fortran: pgf90, pgfortran

Activate OpenACC:
- -acc: activates OpenACC support
- -ta=<options>: OpenACC options
- -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of

Tools:
- nvprof: CPU/GPU profiler
- nvvp: nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta:

Compute capability:
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
- K80: cc35
- P100: cc60
- V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
- pinned: the memory location on the host is pinned; it might improve data transfers
- managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/1910/x86/pvf-user-guide/index.htm#ta
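Putting the options together, a typical compilation line for this setup might look as follows (the source file name is illustrative):

```shell
# Compile a Fortran OpenACC code for a V100 GPU and report what the compiler did
pgfortran -acc -ta=tesla:cc70 -Minfo=accel prog.f90 -o prog
```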

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i = 1, 10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum = 0
  !$ACC parallel loop
  do i = 1, 10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i = 1, 10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i = 1, 10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i = 1, 10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code

PGI_ACC_TIME:
- Command line tool
- Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof:
- Command line tool
- Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler:
- Graphical interfaces for nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vectorlength x nbworkers]).
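For instance (the executable name is illustrative):

```shell
export PGI_ACC_TIME=1   # enable the built-in profile
./gol_opti2             # the profile is printed when the program exits
```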

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 21 77

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities: 53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
- Only the GPU is profiled by default.
- Option "--cpu-profiling on" activates CPU profiling.
- Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
- To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).
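A typical invocation combining the options above might be (executable and output names are illustrative):

```shell
export PGI_ACC_TIME=0                        # nvprof and PGI_ACC_TIME are incompatible
nvprof --cpu-profiling on -o gol.prof ./gol_opti2
```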

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 22 77

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 23 77

Analysis - Graphical profiler NVVP

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 24 77


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 26 77

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial:
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels:
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel:
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-redundant
- Worker-Single
- Vector-Single
The programmer has to share the work manually.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i = 1, n
  do j = 1, n
    ...
  enddo
  do j = 1, n
    ...
  enddo
enddo
!$ACC end serial

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77

Offloaded execution - Directive serial

!$ACC serial
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
- Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i = 1, n
  do j = 1, n
    ...
  enddo
  do j = 1, n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
- The parameters of the parallel regions are independent (gangs, workers, vector size).
- Loop nests are executed consecutively.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77

Execution offloading - Kernels

!$ACC kernels
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 1.951 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive have their iterations spread among gangs.

Data: default behavior
- Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2            ! Gang-redundant
!$ACC loop       ! Work sharing
do i = 1, n
  ...
enddo
!$ACC end parallel

Important notes:
- The number of gangs, workers and the vector size are constant inside the region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77

Execution offloading - parallel directive

!$ACC parallel
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
- The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses:
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

| Clause      | Work sharing (loop iterations) | Activation of new threads |
| loop gang   | X                              |                           |
| loop worker | X                              | X                         |
| loop vector | X                              | X                         |

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77

Work sharing - Loops

Clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do, kept for compatibility.
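For instance, collapse and private can be combined on one loop directive (a sketch; the array names are illustrative):

```fortran
! collapse merges the two loops into a single nx*ny iteration space,
! and each thread gets its own private copy of tmp
!$ACC parallel loop collapse(2) private(tmp)
do j = 1, ny
  do i = 1, nx
    tmp = 2.0_8 * b(i, j)
    a(i, j) = tmp
  enddo
enddo
```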

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77

Work sharing - Loops

!$ACC parallel
do g = 1, generations
  !$ACC loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial:

!$acc serial
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

- Active threads: 1
- Number of operations: nx

Execution GR/WS/VS:

!$acc parallel num_gangs(10)
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

- Redundant execution by gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GP/WS/VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GP/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

- Iterations are shared among the active workers of each gang
- Active threads: 10 x nbworkers
- Number of operations: nx

Execution GP/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

- Iterations are shared among the threads of the single active worker of each gang
- Active threads: 10 x vectorlength
- Number of operations: nx

Execution GP/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x nbworkers x vectorlength
- Number of operations: nx

Execution GR/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 x nbworkers
- Number of operations: 10 x nx

Execution GR/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 x vectorlength
- Number of operations: 10 x nx

Execution GR/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

- Each gang is assigned all the iterations and its grid of threads distributes the work
- Active threads: 10 x nbworkers x vectorlength
- Number of operations: 10 x nx

Work sharing - Loops

Reminder There is no thread synchronization at gang level especially at the end of a loopdirective There is a risk of race condition Such risk does not exist for loop with worker andorvector parallelism since the threads of a gang wait until the end of the iterations they execute tostart a new portion of the code after the loopInside a parallel region if several parallel loops are present and there are some dependenciesbetween them then you must not use gang parallelism

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemplesloop_pb_syncf90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemplesloop_pb_sync_corrf90

• Result: somme = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41/77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example with kernels directive

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in Table 1 is executed to get the final value of the variable.

Operations

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Table 1: Reduction operators

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          | Without coalescing | With coalescing
Time (ms) | 439                | 16

The memory-coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figure: memory-access pattern of threads T0–T7 on the array A.
 Without coalescing — gang(i), vector(j): each thread strides through memory, so accesses cannot be merged.
 With coalescing — gang(j), vector(i): consecutive threads access consecutive memory locations, so accesses are merged.
 gang-vector(i), seq(j): also gives coalesced accesses.]

Pierre-François Lavallée, Thibaut Véry 47/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 (included) to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;

// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered to be the default of the language (C/C++: 0, Fortran: 1).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77


Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version:
[Figure: two successive parallel regions, each with its own implicit data region; the arrays A (copyin), B (copyout), C (copy) and D (create) are handled again at each region.]

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers:
[Figure: the same two parallel regions; the copyin/copyout/copy/create clauses are still repeated on each region.]

Let's add a data region.

Data on device

Optimize data transfers:
[Figure: a data region now encloses both parallel regions; A (copyin), B (copyout) and C (copy) are transferred at entry and exit of the data region, and D is created there.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers:
[Figure: inside the data region, the parallel regions now use the present clause for A, B, C and D; update device and update self directives perform the H→D and D→H updates where needed.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))   ! directive commented out in this version
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers to the device can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d).
 At entry, a and c are copied to the device and b and d are allocated there.
 The host and device copies then evolve independently; update self(b) copies the device value of b back to the host during the region; at exit of the region, only d is copied back to the host (copyout).]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: execution time lines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. With Kernel 1 and the transfer async: they overlap. With everything async: Kernel 1, the data transfer and Kernel 2 all run concurrently.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Figure: Kernel 1 runs on queue 1 while the data transfer runs on queue 2; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition:

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL — GR,WS,VS mode: each gang executes instructions redundantly ↔ TEAMS — creates teams of threads, which execute redundantly
• SERIAL — only 1 thread executes the code ↔ TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization ↔ X — does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables (both models)

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create() — created on device, no transfers ↔ map(alloc:)
• present(), delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() — adds data to the global data region
• data / end data — inside the same programming unit ↔ TARGET DATA MAP(to:/from:/...) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H ↔ UPDATE FROM — copy D→H inside a data region
• update device() — update H→D ↔ UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ X — does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features | OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang] — compiles a device routine with the specified level of parallelism
• asynchronism: async()/wait — executes kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend — manages asynchronism and task dependencies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex: OpenACC 2.7/OpenMP 5.0 translation table
                      • Contacts
Page 16: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Introduction - Execution model

NVidia P100 restrictions on Gang/Worker/Vector
• The number of gangs is limited to 2^31 − 1.
• The thread-grid size (i.e. nb_workers × vector_length) is limited to 1024.
• Due to register limitations, the size of the grid should be less than or equal to 256 if the programmer wants to be sure that a kernel can be launched.
• The size of a worker (vector_length) should be a multiple of 32.
• PGI limitation: in a kernel that contains calls to external subroutines (not seq), the size of a worker is set to 32.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 13 77

Introduction - Execution model

[Figure: active threads for each construct, with !$ACC kernels num_gangs(2) num_workers(5) vector_length(8). A parallel construct alone starts 2 gangs (2 active threads, gang-redundant); a loop worker directive activates 2 × 5 = 10 threads; loop vector and loop worker vector activate the full 2 × 5 × 8 = 80 threads.]

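The thread counts above can be reproduced with a minimal sketch (an illustration, assuming a PGI/NVHPC OpenACC compiler; the array size nx is arbitrary):

```fortran
program active_threads
  implicit none
  integer, parameter :: nx = 80
  integer :: a(nx), i
  ! 2 gangs x 5 workers x vector length 8 = up to 80 active threads
  !$ACC parallel num_gangs(2) num_workers(5) vector_length(8)
  !$ACC loop gang worker vector
  do i = 1, nx
     a(i) = i   ! iterations spread over the whole grid of threads
  end do
  !$ACC end parallel
  print *, a(nx)
end program active_threads
```

With only !$ACC loop worker, just the 2 × 5 = 10 worker threads would be active; with no loop directive at all, the gangs would execute the loop redundantly.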

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

[Figure: iterative porting cycle — pick a kernel, analyze, parallelize loops, optimize data, optimize loops]

PGI compiler

During this training course we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of.

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

PGI Compiler - Compiler information

Information available with -Minfo=accel.

program loop
  integer :: a(10000)
  integer :: i
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

(exemples/loop.f90)

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum=0
  !$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
  !$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

(exemples/reduction.f90)

PGI Compiler - Compiler information

Information available with -Minfo=accel.

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

(exemples/reduction.f90)

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

(exemples/reduction_data_region.f90)

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  − Time spent in kernels
  − Time spent in data transfers
  − How many times a kernel is executed
  − The number of gangs, workers and vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vector_length × nb_workers]).

Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

            Type  Time(%)      Time  Calls       Avg       Min       Max  Name
 GPU activities:   53.51%  2.70662s    500  5.4132ms     448ns  68.519ms  [CUDA memcpy HtoD]
                   36.72%  1.85742s    300  6.1914ms     384ns  92.907ms  [CUDA memcpy DtoH]
                    6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                    2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                    0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                    0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

(exemples/profils/gol_opti2.prof)

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler NVVP

[Screenshot of the NVVP graphical interface]

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
    Work sharing: Parallel loop, Reductions, Coalescing
    Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.

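For comparison, OpenMP GPU offloading uses the same comment-directive mechanism; a minimal sketch of the Fortran syntax (the full OpenACC/OpenMP correspondence is given in the annex translation table):

```fortran
program omp_offload
  implicit none
  integer :: i, a(1000)
  ! OpenMP 4.5+/5.0 offload: target + teams + work sharing
  !$OMP TARGET TEAMS DISTRIBUTE PARALLEL DO MAP(from:a)
  do i = 1, 1000
     a(i) = 2*i
  end do
  !$OMP END TARGET TEAMS DISTRIBUTE PARALLEL DO
  print *, a(1000)
end program omp_offload
```

Compiled without OpenMP offload support enabled, these directives are simply ignored as comments.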

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

(exemples/gol_serial.f90)

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

(exemples/gol_kernels.f90)

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel.f90)

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option -acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang."
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

(exemples/parametres_paral.f90)

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                             X
loop vector                 X                             X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.

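The collapse and private clauses can be illustrated with a small sketch (illustrative only, not one of the course's example files):

```fortran
program collapse_demo
  implicit none
  integer, parameter :: n = 100
  integer :: i, j, tmp
  integer :: b(n,n)
  ! collapse(2) merges the two tightly nested loops into a single
  ! iteration space of n*n; tmp is privatized for each thread
  !$ACC parallel loop collapse(2) private(tmp)
  do j = 1, n
     do i = 1, n
        tmp = i + j
        b(i,j) = tmp
     end do
  end do
  print *, b(n,n)
end program collapse_demo
```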

Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)

• Active threads: 1
• Number of operations: nx

Execution GR-WS-VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP-WS-VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP-WP-VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)

• Iterations are shared among the active workers of each gang
• Active threads: 10 × nb_workers
• Number of operations: nx

Execution GP-WS-VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector_length
• Number of operations: nx

Execution GP-WP-VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × nb_workers × vector_length
• Number of operations: nx

Execution GR-WP-VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × nb_workers
• Number of operations: 10 × nx

Execution GR-WS-VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector_length
• Number of operations: 10 × nx

Execution GR-WP-VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × nb_workers × vector_length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive. There is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described below is executed to get the final value of the variable.

Reduction operators

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

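A minimal sketch of a max reduction from the table (illustrative, not one of the course's example files):

```fortran
program reduc_max
  implicit none
  integer, parameter :: n = 1000
  integer :: i, a(n), amax
  do i = 1, n
     a(i) = mod(37*i, 1000)
  end do
  amax = 0
  ! each thread keeps a private maximum; the partial results are
  ! combined with max at the end of the loop
  !$ACC parallel loop reduction(max:amax)
  do i = 1, n
     amax = max(amax, a(i))
  end do
  print *, amax
end program reduc_max
```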

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)               +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: access patterns with and without coalescing]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0–T7 onto array elements for the three variants. Without coalescing — gang(i)/vector(j): each thread walks along a row, so consecutive threads access memory locations far apart. With coalescing — gang(j)/vector(i): consecutive threads access consecutive elements of a column. gang-vector(i)/seq(j): each thread sequentially walks its own row.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine_wrong.f90)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine.f90)

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

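The global region's unstructured directives can be sketched as follows (a minimal illustration; the array name and sizes are placeholders):

```fortran
program unstructured_data
  implicit none
  integer, allocatable :: a(:)
  integer :: i
  allocate(a(1000))
  ! allocate a on the device for the rest of the program
  !$ACC enter data create(a)
  !$ACC parallel loop present(a)
  do i = 1, 1000
     a(i) = i
  end do
  ! copy the result back to the host and free the device memory
  !$ACC exit data copyout(a)
  print *, a(1000)
end program unstructured_data
```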

Data on device - Data clauses

H: Host, D: Device, var: variable, scalar or array

Data movement
• copyin(var): the variable is copied H→D; the memory is allocated when entering the region.
• copyout(var): the variable is copied D→H; the memory is allocated when entering the region.
• copy(var): copyin + copyout.

No data movement
• create(var): the memory is allocated when entering the region.
• present(var): the variable is already on the device.
• delete(var): frees the memory allocated on the device for this variable.

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

(exemples/arrayshape.f90)

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

(exemples/arrayshape.c)

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

(exemples/parallel_data.f90)

[Diagram: the parallel region is enclosed in an implicit data region]

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
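The corrected placement suggested above can be sketched in C (illustrative function name; without an OpenACC compiler the pragmas are ignored and the code runs sequentially with the same result). The data region opens before the first kernel, so the array is allocated on the device once and copied back once.

```c
/* Open the data region before the first kernel: one allocation,
 * one copyout at the end, no per-iteration transfers. */
void init_and_iterate(int *a, int n, int iters) {
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i;
        for (int l = 0; l < iters; ++l) {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] = a[i] + 1;
        }
    }
}
```
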

Data on device

Naive version: each of the two successive parallel regions has its own implicit data region, so the arrays A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of both regions.
• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: the same two parallel regions (each with its implicit data region) still handle A (copyin), B (copyout), C (copy) and D (create) separately.

Let's add a data region!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: a data region now encloses the two parallel regions, and the clauses copyin, copyout, copy and create are moved onto it.

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: inside the data region, the data clauses of the parallel regions only check for data presence (A, B, C and D are already on the device). To make the code clearer, you can therefore use the present clause on the inner regions. To make sure the updates are actually done, use the update directive (update device for a host-side change, update self for a device-side result).
• Transfers: 6
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update:
the a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update:
the a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes:
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at a different place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
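The constructor/destructor idea mentioned above can be sketched in C by tying enter data / exit data to the host allocation's lifetime. The helper names (vec_create, vec_fill, vec_destroy) are hypothetical; without an OpenACC compiler, the pragmas are ignored and the code still works on the host.

```c
#include <stdlib.h>

/* Device allocation follows the host allocation's lifetime. */
double *vec_create(int n) {
    double *v = malloc(n * sizeof(double));
    #pragma acc enter data create(v[0:n])
    return v;
}

void vec_fill(double *v, int n, double value) {
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = value;
    /* Make the device result visible on the host. */
    #pragma acc update self(v[0:n])
}

void vec_destroy(double *v, int n) {
    #pragma acc exit data delete(v[0:n])
    free(v);
}
```
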

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is declared. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Time line for a data region with copyin(a,c) create(b) copyout(d):
• At entry, a and c are copied H→D; b and d are allocated on the device without any transfer.
• Kernels running on the device then modify b and d; the host copies are not affected.
• update self(b) copies the device value of b back to the host (D→H).
• At exit of the data region, only d is copied back to the host (copyout); changes made on the host in the meantime do not reach the device without update device.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
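The two directions of update can be combined in one routine, sketched here in C (illustrative function name; with a non-OpenACC compiler the pragmas are ignored and the result is the same).

```c
/* update self copies device -> host; update device copies host -> device.
 * Both are used inside an open data region, outside any compute region. */
int sum_with_host_bias(int *a, int n, int bias) {
    int total = 0;
    #pragma acc data copy(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i;
        /* Make the device results visible on the host... */
        #pragma acc update self(a[0:n])
        for (int i = 0; i < n; ++i)   /* host-side modification */
            a[i] += bias;
        /* ...then push the host modification back to the device. */
        #pragma acc update device(a[0:n])
        #pragma acc parallel loop reduction(+:total)
        for (int i = 0; i < n; ++i)
            total += a[i];
    }
    return total;
}
```
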

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

Time lines:
• Synchronous: Kernel 1, then the data transfer, then Kernel 2 run one after the other.
• Kernel 1 and transfer async: Kernel 1 overlaps with the data transfer; Kernel 2 runs afterwards.
• Everything is async: Kernel 1, the data transfer and Kernel 2 all overlap.

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

Time line: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer before starting.

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
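The same async/wait pattern can be written in C (sketch with an illustrative function name; it assumes b is already present on the device, e.g. via an enclosing data region, and a standalone wait at the end synchronizes all queues). Compiled without OpenACC support, the pragmas are ignored and everything runs sequentially with identical results.

```c
/* Queues 1-3 run independent work; the last kernel waits on queue 2
 * because it reads b, which the async update on queue 2 produces. */
void async_demo(int *a, int *b, int *c, int n) {
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i)
        a[i] = 42;
    #pragma acc update device(b[0:n]) async(2)
    #pragma acc parallel loop async(3)
    for (int i = 0; i < n; ++i)
        c[i] = 11;
    /* The update of b on queue 2 must complete before b is read. */
    #pragma acc parallel loop wait(2)
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 11;
    #pragma acc wait
}
```
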

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features            | OpenACC 2.7           | Notes                                        | OpenMP 5.0  | Notes
GPU offloading      | PARALLEL              | GR/WS/VS mode - each gang executes the instructions redundantly | TEAMS  | create teams of threads, execute redundantly
                    | SERIAL                | only 1 thread executes the code              | TARGET      | GPU offloading, initial thread only
                    | KERNELS               | offloading + automatic parallelization       | X           | does not exist in OpenMP
Implicit transfer H→D or D→H | scalar variables, non-scalar variables |                     | scalar variables, non-scalar variables |
Explicit transfer H→D or D→H inside explicit parallel regions | clauses of PARALLEL/SERIAL/KERNELS |  | clauses of TARGET | data movement defined from the host perspective
                    | copyin()              | copy H→D when entering the construct         | map(to:)    |
                    | copyout()             | copy D→H when leaving the construct          | map(from:)  |
                    | copy()                | copy H→D when entering and D→H when leaving  | map(tofrom:)|
                    | create()              | created on device, no transfers              | map(alloc:) |
                    | present() / delete()  |                                              |             |

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77
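The clause correspondence in the table can be shown on one loop, offloaded once with the OpenACC copy() clause and once with the OpenMP map(tofrom:) clause (sketch with illustrative function names; with a non-offloading compiler both pragmas are ignored and the loops run identically on the host).

```c
/* Same computation, two directive dialects: copy() <-> map(tofrom:). */
void scale_acc(double *a, int n) {
    #pragma acc parallel loop copy(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0;
}

void scale_omp(double *a, int n) {
    #pragma omp target teams distribute parallel for map(tofrom: a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] *= 2.0;
}
```
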

Features             | OpenACC 2.7             | Notes                                    | OpenMP 5.0 | Notes
Data region          | declare()               | add data to the global data region       |            |
                     | data / end data         | inside the same programming unit         | TARGET DATA MAP(tofrom:) / END TARGET DATA | inside the same programming unit
                     | enter data / exit data  | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code
                     | update self()           | update D→H                               | UPDATE FROM | copy D→H inside a data region
                     | update device()         | update H→D                               | UPDATE TO   | copy H→D inside a data region
Loop parallelization | kernels                 | offloading + automatic parallelization of suitable loops, sync at the end of loops | X | does not exist in OpenMP
                     | loop gang/worker/vector | share iterations among gangs, workers and vectors | distribute | share iterations among TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features      | OpenACC 2.7                      | Notes                                                        | OpenMP 5.0     | Notes
Routines      | routine [seq|vector|worker|gang] | compile a device routine with the specified level of parallelism |            |
Asynchronism  | async()/wait                     | execute kernels/transfers asynchronously, with explicit sync | nowait/depend  | manages asynchronism and task dependencies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1D_OMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2D_OMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3D_OMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 17: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Execution model

Example with num_gangs(2), num_workers(5), vector_length(8):
• !$ACC parallel (starts gangs): 2 threads are active (gang-redundant mode, one per gang).
• !$ACC loop worker: the workers are activated, 2 x 5 = 10 active threads.
• !$ACC loop vector: the vector lanes are activated, 2 x 5 x 8 = 80 active threads.
• !$ACC loop worker vector: workers and vector lanes are activated at once, going directly from 2 to 80 active threads.
After each loop, execution returns to 2 active threads (gang-redundant mode).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 14 77

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)

2. Add OpenACC directives (Parallelization)

3. Optimize data transfers and loops (Optimizations)

4. Repeat steps 1 to 3 until everything is on the GPU

(Cycle: pick a kernel → analyze → parallelize loops → optimize data → optimize loops)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 15 77

PGI compiler

During this training course, we will use the PGI compilers for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about the compilation; the compiler performs implicit operations that you want to be aware of.

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta:

Compute capability:
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
!$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
!$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
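The same initialization-plus-reduction pattern inside a data region can be sketched in C (illustrative function name; without an OpenACC compiler the pragmas are ignored and the loops run sequentially with the same result). Without the reduction clause, the accumulation into sum would be a race on the device.

```c
/* a stays on the device between the two kernels (data create);
 * only the reduced scalar comes back to the host. */
int sum_first(int n) {
    int a[10000];
    int sum = 0;
    #pragma acc data create(a[0:10000])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i + 1;
        #pragma acc parallel loop reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum = sum + a[i];
    }
    return sum;   /* n*(n+1)/2 */
}
```
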

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME:
• Command line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and the vector size mapped to hardware

nvprof:
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler:
• Graphical interface for nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling PGI_ACC_TIME

For codes generated with the PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector_length x nb_workers]).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 21 77

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f if you want to overwrite a file)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 22 77

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or if you want to open a profile generated with nvprof

$ nvvp <options> prog.prof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 23 77

Analysis - Graphical profiler NVVP

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 24 77

1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs
    Serial
    Kernels/Teams
    Parallel
  Work sharing
    Parallel loop
    Reductions
    Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 25 77

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>
{
...
}

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation, the examples are in Fortran.
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 26 77

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial:
The code runs single-threaded (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels:
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel:
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo

   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels: In the compiler we trust!

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo

   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions (gangs, workers, vector size) are independent for each loop nest.
• Loop nests are executed consecutively.
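The "one kernel per loop nest, executed consecutively" behavior can be sketched in C (illustrative function name; compiled without OpenACC support, the pragma is ignored and the loops run sequentially, which is also the observable ordering on the device).

```c
/* Inside kernels, each loop nest becomes its own kernel; the second
 * kernel runs after the first, so reading a[] in it is safe. */
void kernels_demo(int *a, int *b, int n) {
    #pragma acc kernels
    {
        for (int i = 0; i < n; ++i)
            a[i] = i;
        for (int i = 0; i < n; ++i)   /* second kernel, after the first */
            b[i] = 2 * a[i];
    }
}
```
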

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notesbull The number of gangs workers and the vector size are constant inside the region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

                Work sharing (loop iterations)   Activation of new threads
loop gang                    X
loop worker                  X                              X
loop vector                  X                              X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77

Work sharing - Loops

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute to start a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 10._8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 10._8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions: example

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with coalescing; example without coalescing]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
!$acc loop vector
   do j=1, nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
!$acc loop vector
   do i=1, nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
!$acc loop seq
   do j=1, nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)          439                 16

The memory coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mappings for the three variants. Without coalescing: gang(i), vector(j); with coalescing: gang(j), vector(i); gang-vector(i), seq(j)]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens the data region
!$ACC loop                          ! parallel region
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1, 100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
!$ACC parallel
!$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

Optimal? No: the data region still begins after the first parallel region, which keeps its own copyout. You have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version
[Diagram: two successive parallel regions, each with its own implicit data region. The arrays A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated again for each region.]
• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Diagram: the same two parallel regions, each still carrying its own copyin/copyout/copy/create clauses.]

Let's add a data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Diagram: a data region now encloses both parallel regions and carries the copyin (A), copyout (B), copy (C) and create (D) clauses.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers
[Diagram: the enclosing data region keeps the copyin (A), copyout (B), copy (C) and create (D) clauses; both parallel regions use present for A, B, C and D, with an update device for A and an update self for C between the regions.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update (the update directive is commented out)
The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c), create(b) and copyout(d). At entry, a and c are copied H→D while b and d are allocated on the device. During the region, an update self(b) copies the device value of b back to the host. At exit of the region, only d is copied back D→H: device-side modifications of a, b and c are discarded, and host-side modifications made during the region are never seen by the device unless update device is used.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: time lines. Fully synchronous: kernel 1, then data transfer, then kernel 2. Kernel 1 and transfer async: the transfer overlaps kernel 1. Everything async: kernel 1, the transfer and kernel 2 overlap.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: time line where kernel 2 waits for the transfer]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel loop
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)  +              oworld(r+1,c)  +&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• PARALLEL — GR/WS/VS mode; each gang executes instructions redundantly | TEAMS — creates teams of threads that execute redundantly
• SERIAL — only 1 thread executes the code | TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization | X — does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions:
• clauses of PARALLEL/SERIAL/KERNELS | clauses of TARGET — data movement defined from the host perspective
• copyin() — copy H→D when entering the construct | map(to:)
• copyout() — copy D→H when leaving the construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Data region:
• declare() — add data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM — copy D→H inside a data region
• update device() — update H→D | UPDATE TO — copy H→D inside a data region

Loop parallelization:
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | X — does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

• routines: routine [seq|vector|worker|gang] — compiles a device routine with the specified level of parallelism
• asynchronism: async()/wait — executes kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies
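As a complement to the table above, here is a minimal Fortran sketch of OpenACC asynchronism; the array names, bounds and queue identifiers are illustrative, not taken from the course examples:

```fortran
! Two independent kernels placed on different async queues;
! the host continues immediately and waits for both at the end.
!$ACC parallel loop async(1)
do i = 1, n
   a(i) = 2.0_8 * i
end do
!$ACC parallel loop async(2)
do i = 1, n
   b(i) = 3.0_8 * i
end do
!$ACC wait   ! explicit synchronization on all queues
```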

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
!$OMP PARALLEL DO SIMD
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k = 1, nx
!$acc loop worker
   do j = 1, nx
!$acc loop vector
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
!$OMP PARALLEL DO
   do j = 1, nx
!$OMP SIMD
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 18: Introduction to OpenACC and OpenMP GPU - IDRIS

Introduction - Porting strategy

When using OpenACC, it is recommended to follow this strategy:

1. Identify the compute-intensive loops (Analyze)
2. Add OpenACC directives (Parallelization)
3. Optimize data transfers and loops (Optimizations)
4. Repeat steps 1 to 3 until everything is on the GPU

(Figure: iterative cycle — Pick a kernel → Analyze → Parallelize loops → Optimize data → Optimize loops)

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information:
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers:
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC:
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler performs implicit operations that you want to be aware of

Tools:
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

PGI Compiler - -ta options

Important options for -ta:

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware, this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70
For example, if you want to compile for V100: -ta=tesla:cc70

Memory management:
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta
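A typical compile line combining the flags above might look as follows; this is a sketch, and the source and executable names are placeholders:

```shell
# Compile an OpenACC Fortran code for a V100, with compiler feedback.
pgfortran -acc -ta=tesla:cc70 -Minfo=accel -o prog prog.f90
```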

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
!$ACC parallel loop
  do i = 1, 10000
     a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum = 0
!$ACC parallel loop
  do i = 1, 10000
     a(i) = i
  enddo
!$ACC parallel loop reduction(+:sum)
  do i = 1, 10000
     sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i = 1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i = 1, 10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i = 1, 10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command-line tool
• Gives basic information about:
  − time spent in kernels
  − time spent in data transfers
  − how many times a kernel is executed
  − the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command-line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector length × nb workers]).
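In practice, enabling this profiling is just a matter of setting the environment before the run; this is a sketch, and the executable name is a placeholder:

```shell
export PGI_ACC_TIME=1   # print kernel/transfer statistics at program exit
./prog
export PGI_ACC_TIME=0   # disable before using nvprof (the two are incompatible)
```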

Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes:
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give, respectively, the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

(Screenshot of the NVVP interface)

1 Introduction

2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels/Teams, Parallel
      Work sharing: Parallel loop, Reductions, Coalescing
      Routines
      Data Management
      Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC):
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):
#pragma acc directive <clauses>
{ ... }

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation, the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial: the code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels: the compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel: the programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i = 1, n
   do j = 1, n
      ...
   enddo
   do j = 1, n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i = 1, n
   do j = 1, n
      ...
   enddo
   do j = 1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello! I am a gang"
  do i = 1, 1000
     a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang     X                              |
loop worker   X                              | X
loop vector   X                              | X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility
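A minimal sketch combining these clauses; the loop bounds, the arrays and the scratch variable tmp are illustrative, not taken from the course examples:

```fortran
! collapse(2) merges both loops into one iteration space; tmp is
! private to each iteration; total is reduced across all threads.
!$ACC parallel loop collapse(2) private(tmp) reduction(+:total)
do j = 1, ny
   do i = 1, nx
      tmp = b(i,j) + c(i,j)
      a(i,j) = 2.0_8 * tmp
      total = total + a(i,j)
   end do
end do
```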

Work sharing - Loops

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial:

!$acc serial
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS:

!$acc parallel num_gangs(10)
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the single worker of each gang
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i = 1, 1000
     a(i) = b(i) * 2
  enddo
!$ACC end kernels

! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i = 1, 1000
     a(i) = b(i) * 2
  enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Figures: an example with coalescing and an example without coalescing)

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i = 1, nx
!$acc loop vector
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i = 1, nx
!$acc loop seq
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms)  439                | 16

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

(Figure, shown step by step in the original slides: the mapping of threads T0–T7 to the elements of array A for the three variants. Without coalescing — gang(i)/vector(j) — consecutive threads of a worker access elements that are far apart in memory, so the accesses cannot be merged. With coalescing — gang(j)/vector(i), or gang vector(i) with seq(j) — consecutive threads access contiguous elements and the accesses are merged.)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer :: i
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i = 1, s
     call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j = 1, s
       array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
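The C spelling of the same rule can be sketched as follows (function names invented for this example; compiled without OpenACC the pragmas are ignored and the code runs on the host):

```c
/* A function called from an offloaded loop must itself carry
 * "#pragma acc routine" with its level of parallelism, so that the
 * compiler also generates a device version of it.  Here seq: each
 * call runs sequentially inside one thread of the loop. */
#pragma acc routine seq
static void fill_row(int *row, int n, int value)
{
    for (int j = 0; j < n; ++j)
        row[j] = value;
}

void fill_matrix(int *a, int n)
{
    #pragma acc parallel loop copyout(a[0:n*n])
    for (int i = 0; i < n; ++i)
        fill_row(&a[i * n], n, 2);
}
```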

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
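As a sketch of how these clauses combine on one loop (an illustrative function, not one of the course examples): b is only read on the device, c is only written, and a is both read and overwritten.

```c
/* b is only read on the device      -> copyin  (H->D only)
 * c is only written on the device   -> copyout (D->H only)
 * a is read and then overwritten    -> copy    (H->D and D->H) */
void combine(double *a, const double *b, double *c, int n)
{
    #pragma acc parallel loop copy(a[0:n]) copyin(b[0:n]) copyout(c[0:n])
    for (int i = 0; i < n; ++i) {
        a[i] = a[i] + b[i];   /* needs the old host value of a */
        c[i] = 2.0 * b[i];    /* never read: no H->D transfer needed */
    }
}
```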

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel construct opens a parallel region together with its implicit data region]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Optimize data transfers

[Figure: four arrays A, B, C and D used by two successive parallel regions; A appears in copyin, B in copyout, C in copy and D in create clauses. Shown in three steps:]

• Naive version: each parallel region carries its own implicit data region, so every clause acts for each region. Transfers: 8, allocations: 2.
• With an enclosing data region: A, B and C are now transferred at entry and exit of the data region only. Transfers: 4, allocations: 1.
• Inside the data region the clauses only check for data presence, so to make the code clearer you can use the present clause; to make sure the updates are done, use the update directive (update device, update self). Transfers: 6, allocations: 1.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
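The same unstructured lifetime can be sketched in C (function names invented for this example; compiled without OpenACC the pragmas are ignored and the code runs on the host):

```c
#include <stdlib.h>

/* The device allocation is opened in one function and closed in
 * another: this is what enter data / exit data allow, and why they
 * fit C++ constructors and destructors well. */
double *vec_create(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])   /* device allocation, no copy */
    return v;
}

void vec_compute(double *v, int n)
{
    /* present: the data region was opened elsewhere (vec_create) */
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = 1.2;
}

void vec_destroy(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])    /* free the device copy */
    free(v);
}
```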

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: timeline of the host and device copies of a, b, c and d across a data region opened with copyin(a,c), create(b) and copyout(d). The host and device values diverge while kernels run on the device; update self(b) copies b D→H during the region; d is copied back D→H only when the data region ends]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait.
In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three timelines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. "Kernel 1 and transfer async": Kernel 1 overlaps the transfer. "Everything is async": Kernel 1, the transfer and Kernel 2 all overlap]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: timeline where Kernel 2 waits for the data transfer to complete before starting]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
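The same pattern can be sketched in C (an illustrative function, not one of the course examples; it assumes a, b and c were made present by an enclosing data region, and built without OpenACC the pragmas are ignored so everything simply runs in order):

```c
/* Three independent operations get their own queues (async(1..3));
 * the last loop reuses b, so it must wait for queue 2.  The final
 * wait joins all remaining queues before returning. */
void pipeline(int *a, int *b, int *c, int n)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) a[i] = 42;

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop async(3)
    for (int i = 0; i < n; ++i) c[i] = 11;

    #pragma acc parallel loop wait(2)      /* needs the fresh b */
    for (int i = 0; i < n; ++i) b[i] = b[i] + 1;

    #pragma acc wait                       /* join queues 1 and 3 */
}
```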

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
!$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )              +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )              +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )              +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77
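For reference, one generation of the same update can be sketched in C. This mirrors the loop nest of the slides but is not the exact course example (row-major indexing, with a halo of dead cells around the grid):

```c
/* One generation on a (rows+2) x (cols+2) grid stored row-major;
 * row 0, row rows+1, column 0 and column cols+1 are a dead halo.
 * world must enter as a copy of oworld: cells with exactly 2 live
 * neighbours keep their previous state because neither branch runs. */
void gol_step(const int *oworld, int *world, int rows, int cols)
{
    int w = cols + 2;                  /* pitch of one row */
    #pragma acc parallel loop copyin(oworld[0:(rows + 2) * w]) \
                              copy(world[0:(rows + 2) * w])
    for (int r = 1; r <= rows; ++r) {
        #pragma acc loop vector
        for (int c = 1; c <= cols; ++c) {
            int neigh =
                oworld[(r-1)*w + c-1] + oworld[r*w + c-1] + oworld[(r+1)*w + c-1]
              + oworld[(r-1)*w + c  ]                     + oworld[(r+1)*w + c  ]
              + oworld[(r-1)*w + c+1] + oworld[r*w + c+1] + oworld[(r+1)*w + c+1];
            if (oworld[r*w + c] == 1 && (neigh < 2 || neigh > 3))
                world[r*w + c] = 0;
            else if (neigh == 3)
                world[r*w + c] = 1;
        }
    }
}
```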

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• PARALLEL (OpenACC): GR-WS-VS mode, each gang executes the instructions redundantly ↔ TEAMS (OpenMP): creates teams of threads that execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code ↔ TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)

Data movement clauses
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): memory allocated on the device, no transfers ↔ map(alloc:)
• present(), delete(): no OpenMP equivalent listed

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions
• declare(): add data to the global data region (no OpenMP counterpart listed)
• data / end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
• update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, synchronization at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait: execute kernels and transfers asynchronously, with explicit synchronization ↔ nowait/depend: manage asynchronism and task dependencies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 19: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

PGI compiler

During this training course, we will use PGI for the pieces of code containing OpenACC directives.

Information
• Company founded in 1989
• Acquired by NVidia in 2013
• Develops compilers, debuggers and profilers

Compilers
• Latest version: 19.10
• Hands-on version: 19.10
• C: pgcc
• C++: pgc++
• Fortran: pgf90, pgfortran

Activate OpenACC
• -acc: activates OpenACC support
• -ta=<options>: OpenACC options
• -Minfo=accel: USE IT! Displays information about compilation; the compiler will do implicit operations that you want to be aware of

Tools
• nvprof: CPU/GPU profiler
• nvvp: nvprof GUI

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 16 77

PGI Compiler - -ta options

Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVidia hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned; it might improve data transfers
• managed: the memory of both the host and the device(s) is unified

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 17 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
  integer :: a(10000)
  integer :: i
!$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
end program loop

exemples/loop.f90

program reduction
  integer :: a(10000)
  integer :: i
  integer :: sum = 0
!$ACC parallel loop
  do i=1,10000
    a(i) = i
  enddo
!$ACC parallel loop reduction(+:sum)
  do i=1,10000
    sum = sum + a(i)
  enddo
end program reduction

exemples/reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 18 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
  a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
  sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90
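The same reduction pattern can be sketched in C (illustrative function name; without the reduction clause the partial sums of the gangs would race, with it they are combined at the end of the loop):

```c
/* Sum of 1..n offloaded with a reduction: each gang accumulates a
 * private partial sum, and the partial sums are combined when the
 * loop finishes. */
long sum_up_to(int n)
{
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += i;
    return sum;
}
```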

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 19 77

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 20 77

Analysis - Code profiling PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vector_length x nb_workers]).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 21 77

Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> ./executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file)
Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 22 77

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> ./executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 23 77

Analysis - Graphical profiler NVVP

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 24 77

1 Introduction

2 ParallelismOffloaded Execution

Add directivesCompute constructsSerialKernelsTeamsParallel

Work sharingParallel loopReductionsCoalescing

RoutinesRoutines

Data ManagementAsynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 25 77

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between Fortran and C/C++.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 26 77
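A minimal C illustration of that syntax (a saxpy loop invented for this example; without OpenACC support the pragma is simply ignored and the loop runs on the host):

```c
/* One compute construct with data clauses, written with the
 * C/C++ "#pragma acc" spelling. */
void saxpy(int n, float alpha, const float *x, float *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```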

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU.
Instructions are executed by only one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo

  do j=1,n
    ...
  enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
bull Size: 1000x1000
bull Generations: 100
bull Elapsed time: 123.640 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
bull Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
bull Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
bull The parameters of the parallel regions (gangs, workers, vector size) are independent.
bull Loop nests are executed consecutively.


Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
bull Size: 1000x1000
bull Generations: 100
bull Execution time: 1.951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
bull Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
bull Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1, n
   ...
enddo
!$ACC end parallel

Important notes:
bull The number of gangs, workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
bull Size: 1000x1000
bull Generations: 100
bull Execution time: 120.467 s (noautopar)
bull Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
bull num_gangs: the number of gangs
bull num_workers: the number of workers
bull vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
bull These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
bull The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
bull Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

             | Work sharing (loop iterations) | Activation of new threads
loop gang    |               X                |
loop worker  |               X                |            X
loop vector  |               X                |            X


Work sharing - Loops

Clauses
bull Merge tightly nested loops: collapse(loops)
bull Tell the compiler that iterations are independent: independent (useful for kernels)
bull Privatize variables: private(variable-list)
bull Reduction: reduction(operation:variable-list)

Notes
bull You might see the directive do for compatibility.


Work sharing - Loops

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

Execution: serial

!$acc serial
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

bull Active threads: 1
bull Number of operations: nx

Execution: GR/WS/VS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

bull Redundant execution by gang leaders
bull Active threads: 10
bull Number of operations: 10 x nx

Execution: GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

bull Each gang executes a different block of iterations
bull Active threads: 10
bull Number of operations: nx


Execution: GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

bull Iterations are shared among the active workers of each gang
bull Active threads: 10 x workers
bull Number of operations: nx

Execution: GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

bull Iterations are shared among the threads of the worker of all gangs
bull Active threads: 10 x vector length
bull Number of operations: nx

Execution: GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

bull Iterations are shared on the grid of threads of all gangs
bull Active threads: 10 x workers x vector length
bull Number of operations: nx

Execution: GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

bull Each gang is assigned all the iterations, which are shared among the workers
bull Active threads: 10 x workers
bull Number of operations: 10 x nx

Execution: GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

bull Each gang is assigned all the iterations, which are shared on the threads of the active worker
bull Active threads: 10 x vector length
bull Number of operations: 10 x nx

Execution: GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

bull Each gang is assigned all iterations and the grid of threads distributes the work
bull Active threads: 10 x workers x vector length
bull Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

bull Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

bull Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions: the variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions: example

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
bull Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
bull This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
bull For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with «coalescing» vs example with «no coalescing»]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
!$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
!$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
!$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          | Without coalescing | With coalescing
Time (ms) |        43.9        |       1.6

The memory coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three variants. Without coalescing, gang(i)/vector(j): the threads T0..T7 of a worker access elements of a row, which are strided in memory. With coalescing, gang(j)/vector(i): threads T0..T7 access consecutive elements of a column, contiguous in memory. gang-vector(i)/seq(j): each thread sweeps its row sequentially.]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime: a data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
bull copyin: the variable is copied H→D; the memory is allocated when entering the region
bull copyout: the variable is copied D→H; the memory is allocated when entering the region
bull copy: copyin + copyout

No data movement
bull create: the memory is allocated when entering the region
bull present: the variable is already on the device
bull delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
bull no_create
bull deviceptr
bull attach
bull detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
bull In Fortran, the last index of an assumed-size dummy array must be specified.
bull In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
bull The shape must be specified when using a slice.
bull If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its own data region.]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1, 100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
bull Compute region reached 100 times
bull Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
!$ACC parallel
!$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
bull Compute region reached 100 times
bull Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Optimal? No: you have to move the beginning of the data region before the first loop, so that a is transferred only once.


Data on device

[Figure: naive version. Two parallel regions, each with its own implicit data region, both using the arrays A (copyin), B (copyout), C (copy) and D (create); each region transfers/allocates its data independently.]

bull Transfers: 8
bull Allocations: 2


Data on device

Optimize data transfers

[Figure: the same two parallel regions, each still carrying A (copyin), B (copyout), C (copy) and D (create).]

Let's add a data region.


Data on device

Optimize data transfers

[Figure: a data region now encloses both parallel regions; A (copyin), B (copyout), C (copy) and D (create) are handled by the enclosing data region.]

A, B and C are now transferred at entry and exit of the data region.
bull Transfers: 4
bull Allocation: 1


Data on device

Optimize data transfers

[Figure: the enclosing data region keeps A (copyin), B (copyout), C (copy) and D (create); the inner parallel regions use present for all four arrays, with update device for A and update self for C where fresh values are needed.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

bull Transfers: 6
bull Allocation: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo
do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1, s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: time line of a data region with copyin(a,c), create(b), copyout(d). The host starts with a=10, b=-4, c=42, d=0. On entry, a and c are copied to the device while b and d are only allocated there. Kernels update the device copies (a=10, b=26, c=42, d=158); update self(b) brings b=26 back to the host, leaving the host's a, c and d untouched. Later device work (a=42, b=42) stays invisible to the host. On exit, only d is copied out: the host ends with a=10, b=26, c=42, d=158.]


Data on device - Important notes

bull Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
bull Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
bull computation and data transfers
bull kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs Kernel 1, the data transfer and Kernel 2 back to back; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or the wait directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: time line where Kernel 2 waits for the transfer issued on execution thread 2 before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
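A minimal sketch of the standalone wait directive (thread numbers and array names are illustrative): after launching work on several execution threads, a bare wait synchronizes with them before the results are combined:

```fortran
program wait_sketch
   implicit none
   integer :: i
   real(8) :: a(1000), b(1000)
   !$ACC data create(a, b)
   !$ACC parallel loop async(1)
   do i = 1, 1000
      a(i) = 1.0_8
   enddo
   !$ACC parallel loop async(2)
   do i = 1, 1000
      b(i) = 2.0_8
   enddo
   !$ACC wait(1, 2)   ! block until both execution threads have completed
   !$ACC parallel loop
   do i = 1, 1000
      a(i) = a(i) + b(i)
   enddo
   !$ACC end data
end program wait_sketch
```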



Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
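The slide's race-condition example is not reproduced in this transcript; as an illustrative sketch (a made-up program, not the original), a gang-level sum written without a reduction clause races on the shared variable, which autocompare would flag when the GPU result diverges from the CPU one:

```fortran
program race_sketch
   implicit none
   integer :: i
   real(8) :: a(1000000), s
   a(:) = 1.0_8
   s = 0.0_8
   ! BUG (on purpose): s is updated by many threads without a
   ! reduction(+:s) clause, so the GPU result is nondeterministic
   ! and will generally differ from the CPU result.
   !$ACC parallel loop copy(s)
   do i = 1, 1000000
      s = s + a(i)
   enddo
   print *, "sum =", s
end program race_sketch
```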



Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.



Features: OpenACC 2.7 / OpenMP 5.0 translation table (1/3)

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses of PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features: OpenACC 2.7 / OpenMP 5.0 translation table (2/3)

Data regions
• declare (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)


Features: OpenACC 2.7 / OpenMP 5.0 translation table (3/3)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manage asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


PGI Compiler - -ta options

Important options for -ta

Compute capability
Each GPU generation has increased capabilities. For NVIDIA hardware this is reflected by a number:
• K80: cc35
• P100: cc60
• V100: cc70

For example, if you want to compile for V100: -ta=tesla:cc70

Memory management
• pinned: the memory location on the host is pinned. It might improve data transfers.
• managed: the memory of both the host and the device(s) is unified.

Complete documentation:
https://www.pgroup.com/resources/docs/19.10/x86/pvf-user-guide/index.htm#ta


PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum=0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90


PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90


Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof


Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1
The grid is the number of gangs. The block is the size of one gang ([vector length x nb workers]).


Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%    320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%    132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%    40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%    3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU
• To get a file readable by nvvp you have to specify "-o filename" (add -f in case you want to overwrite a file)


Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler: NVVP

[Screenshot of the NVVP graphical interface]



Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data
In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang."
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                             X
loop vector                X                             X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility

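As an illustrative sketch of the collapse clause (the array name and sizes are made up), two tightly nested loops are merged into a single iteration space before the work is shared:

```fortran
program collapse_sketch
   implicit none
   integer :: i, j
   real(8) :: a(512, 512)
   ! collapse(2) merges the two loops into one 512*512 iteration
   ! space, giving the runtime more parallelism to distribute.
   !$ACC parallel loop collapse(2)
   do j = 1, 512
      do i = 1, 512
         a(i, j) = real(i + j, 8)
      enddo
   enddo
end program collapse_sketch
```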

Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 * nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 * workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 * vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 * workers * vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 * workers
• Number of operations: 10 * nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 * vector length
• Number of operations: 10 * nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and its grid of threads distributes the work
• Active threads: 10 * workers * vector length
• Number of operations: 10 * nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 978452.13 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 1000000.00 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to get the final value of the variable.

Reduction operators

Operation  Effect       Language(s)
+          Sum          Fortran, C(++)
*          Product      Fortran, C(++)
max        Maximum      Fortran, C(++)
min        Minimum      Fortran, C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions: example

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the
  use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory
  location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to
  memory.

Example with «Coalescing» / Example with «no Coalescing»

Pierre-François Lavallée, Thibaut Véry 45 / 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without Coalescing   With Coalescing
Time (ms)   43.9                 1.6

The memory coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46 / 77

Parallelism - Memory accesses and coalescing

[Figure: memory access patterns of threads T0-T7 for the three variants.
Without coalescing: gang(i), vector(j), each thread walks a non-contiguous stride.
With coalescing: gang(j), vector(i), neighbouring threads touch neighbouring locations.
gang-vector(i), seq(j): each thread walks its own row sequentially.]

Pierre-François Lavallée, Thibaut Véry 47 / 77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a
routine. This tells the compiler that a device version of the function/subroutine has to be
generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker,
vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a
routine. This tells the compiler that a device version of the function/subroutine has to be
generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker,
vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data
regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this
region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive
inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available
during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement
• copyin: The variable is copied H→D; the memory is allocated when entering the region.
• copyout: The variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: The memory is allocated when entering the region.
• present: The variable is already on the device.
• delete: Frees the memory allocated on the device for this variable.

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50 / 77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran
and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the
number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables
necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel construct opens both a data region and a parallel region.)

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables
necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside
the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53 / 77


Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device

Naive version

[Figure: two successive parallel regions (each with its own implicit data region) use the arrays
A (copyin), B (copyout), C (copy) and D (create) independently; every region performs its own
transfers and allocation]

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Figure: the two parallel regions (each with its own implicit data region) still transfer A (copyin),
B (copyout), C (copy) and create D separately]

Let's add a data region!

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Figure: a data region now encloses the two parallel regions; the clauses copyin(A), copyout(B),
copy(C) and create(D) are moved onto it]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Figure: inside the enclosing data region, the parallel regions use present clauses for A, B, C and D;
update device and update self directives perform the transfers that are still needed]

The clauses check for data presence, so to make the code clearer you can use the present clause.
To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible
to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible
to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible
to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the
device at another place than where they are used. It is especially useful when using C++
constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region
where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: time line of the host and device copies of the variables a, b, c, d across a data region
opened with copyin(a,c), create(b), copyout(d). The values of a and c are transferred to the device
at entry; b and d are allocated but undefined on the device; update self(b) copies the device value
of b back to the host during the region; d is copied back to the host at the exit of the region.]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these
  transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened
  with data and it encloses the parallel regions, you should use update to avoid unexpected
  behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e.
one after the other. The accelerator is able to manage several execution threads running
concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of
these directives: parallel, kernels, serial, enter data, exit data, update and wait.
In all cases the argument of async is optional.
It is possible to specify a number inside the clause to create several execution threads.

[Figure: execution time lines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after
the other. With Kernel 1 and the transfer async: they overlap. With everything async: all three
overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of
others. Then you have to add a wait(execution thread number) clause to the directive that needs
to wait. A wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread
  numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer on
execution thread 2 before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution
  is stopped.

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation
  (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life.
Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all
gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features: OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading
• OpenACC PARALLEL: GR/WS/VS mode, each gang executes instructions redundantly.
  OpenMP TEAMS: create teams of threads, execute redundantly.
• OpenACC SERIAL: only 1 thread executes the code.
  OpenMP TARGET: GPU offloading, initial thread only.
• OpenACC KERNELS: offloading + automatic parallelization. Does not exist in OpenMP.

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables.
• OpenMP: scalar variables, non-scalar variables.

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS.
  OpenMP: clauses on TARGET; data movement is defined from the host perspective.
• OpenACC copyin(): copy H→D when entering the construct. OpenMP: map(to:)
• OpenACC copyout(): copy D→H when leaving the construct. OpenMP: map(from:)
• OpenACC copy(): copy H→D when entering and D→H when leaving. OpenMP: map(tofrom:)
• OpenACC create(): created on device, no transfers. OpenMP: map(alloc:)
• OpenACC present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features: OpenACC 2.7 / OpenMP 5.0 translation table (continued)

Data region
• OpenACC declare(): add data to the global data region.
• OpenACC data / end data (inside the same programming unit).
  OpenMP: TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit).
• OpenACC enter data / exit data (opening and closing anywhere in the code).
  OpenMP: TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code).
• OpenACC update self(): update D→H. OpenMP UPDATE FROM: copy D→H inside a data region.
• OpenACC update device(): update H→D. OpenMP UPDATE TO: copy H→D inside a data region.

Loop parallelization
• OpenACC kernels: offloading + automatic parallelization of suitable loops, sync at the end of
  the loops. Does not exist in OpenMP.
• OpenACC loop gang/worker/vector: share iterations among gangs, workers and vectors.
  OpenMP distribute: share iterations among TEAMS.

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features: OpenACC 2.7 / OpenMP 5.0 translation table (continued)

• Routines: OpenACC routine [seq|vector|worker|gang] compiles a device routine with the
  specified level of parallelism.
• Asynchronism: OpenACC async()/wait executes kernels/transfers asynchronously with
  explicit synchronization. OpenMP nowait/depend manages asynchronism and task
  dependencies.

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

Contacts
• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 21: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

PGI Compiler - Compiler information

Information available with -Minfo=accel

program loop
   integer :: a(10000)
   integer :: i
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
end program loop

exemples/loop.f90

program reduction
   integer :: a(10000)
   integer :: i
   integer :: sum = 0
   !$ACC parallel loop
   do i=1,10000
      a(i) = i
   enddo
   !$ACC parallel loop reduction(+:sum)
   do i=1,10000
      sum = sum + a(i)
   enddo
end program reduction

exemples/reduction.f90

Pierre-François Lavallée, Thibaut Véry 18 / 77

PGI Compiler - Compiler information

Information available with -Minfo=accel

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Pierre-François Lavallée, Thibaut Véry 19 / 77

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - Time spent in kernels
  - Time spent in data transfers
  - How many times a kernel is executed
  - The number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp, pgprof, NSight Graphics Profiler
• Graphical interface for nvprof

Pierre-François Lavallée, Thibaut Véry 20 / 77

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the
code by setting the following environment variable: PGI_ACC_TIME=1.
The grid is the number of gangs. The block is the size of one gang ([vector length x nb workers]).

Pierre-François Lavallée, Thibaut Véry 21 / 77

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler: nvprof.
It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before
using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time       Calls  Avg        Min        Max        Name
GPU activities: 53.51%   2.70662s   500    5.4132ms   448ns      68.519ms   [CUDA memcpy HtoD]
                36.72%   1.85742s   300    6.1914ms   384ns      9.2907ms   [CUDA memcpy DtoH]
                6.34%    320.52ms   100    3.2052ms   3.1372ms   3.6247ms   gol_39_gpu
                2.62%    132.39ms   100    1.3239ms   1.2727ms   1.3659ms   gol_33_gpu
                0.81%    40.968ms   100    409.68us   401.22us   417.25us   gol_52_gpu
                0.01%    3.5927ms   100    35.920us   33.920us   40.640us   gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and
  "--metrics dram_write_throughput" give respectively the number of operations, the memory
  read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to
  overwrite a file).

Pierre-François Lavallée, Thibaut Véry 22 / 77

Analysis - Graphical profiler nvvpNvidia also provides a GUI for nvprof nvvpIt gives the time spent in the kernels and data transfers for all GPU regionsCaution PGI_ACC_TIME and nvprof are incompatible Ensure that PGI_ACC_TIME=0 beforeusing the tool

$ nvvp ltoptionsgt executable lt--args arguments of executablegt

or if you want to open a profile generated with nvprof

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs: Serial, Kernels, Teams, Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>
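As a minimal sketch of the C/C++ syntax (the `saxpy` function and its arguments are made up for illustration, not taken from the course examples): with an OpenACC compiler (e.g. `pgcc -acc`) the loop is offloaded, while a regular compiler simply ignores the unknown pragma and runs the loop on the host.

```c
#include <assert.h>

/* Hypothetical example: offload a simple loop with the kernels construct.
   Without an OpenACC compiler the pragma is ignored and the loop runs
   sequentially on the host, producing the same result. */
void saxpy(int n, float a, const float *x, float *y) {
    #pragma acc kernels
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```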

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.


Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
- Gang-redundant
- Worker-Single
- Vector-Single
The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the parameters num_gangs(1) num_workers(1) vector_length(1).

Data: Default behavior
- Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
- Size: 1000x1000
- Generations: 100
- Elapsed time: 123.640 s


Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
- Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes:
- The parameters of the generated parallel regions are independent (gangs, workers, vector size)
- Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 1.951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among the gangs.

Data: default behavior
- Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
- Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes:
- The number of gangs, the number of workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
- Size: 1000x1000
- Generations: 100
- Execution time: 120.467 s (noautopar)
- Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option -acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
- num_gangs: the number of gangs
- num_workers: the number of workers
- vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC&         vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
- These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
- The optimal number of gangs is highly dependent on the architecture; use num_gangs with care


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
- Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang                   X                |
loop worker                 X                |            X
loop vector                 X                |            X


Work sharing - Loops

Clauses:
- Merge tightly nested loops: collapse(loops)
- Tell the compiler that iterations are independent: independent (useful for kernels)
- Privatize variables: private(variable-list)
- Reduction: reduction(operation:variable-list)

Notes:
- You might see the directive do instead of loop, kept for compatibility
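The collapse and private clauses can be sketched in C as follows (the `init` function is a made-up illustration, not one of the course's `exemples/` files). collapse(2) merges the two tightly nested loops into a single iteration space, and `tmp` is privatized so that every iteration has its own copy; a plain compiler ignores the pragma and runs the loops sequentially with the same result.

```c
#include <assert.h>

/* Hypothetical sketch of collapse(2) and private(tmp): the two loops form
   one iteration space of n*m elements, each with a private tmp. */
void init(int n, int m, int a[n][m]) {
    int tmp;
    #pragma acc parallel loop collapse(2) private(tmp)
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < m; ++j) {
            tmp = i * m + j;   /* each iteration works on its own tmp */
            a[i][j] = tmp;
        }
}
```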


Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

- Active threads: 1
- Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

- Redundant execution by the gang leaders
- Active threads: 10
- Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

- Each gang executes a different block of iterations
- Active threads: 10
- Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

- Iterations are shared among the active workers of each gang
- Active threads: 10 x #workers
- Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

- Iterations are shared among the vector threads of the active worker of all gangs
- Active threads: 10 x vector length
- Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

- Iterations are shared on the grid of threads of all gangs
- Active threads: 10 x #workers x vector length
- Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

- Each gang is assigned all the iterations, which are shared among its workers
- Active threads: 10 x #workers
- Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

- Each gang is assigned all the iterations, which are shared among the threads of the active worker
- Active threads: 10 x vector length
- Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

- Each gang is assigned all the iterations, and its grid of threads distributes the work
- Active threads: 10 x #workers x vector length
- Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

- Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

- Result: somme = 100000000 (right value)


Parallelism - Merging kernels/parallel directives with loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into kernels loop or parallel loop. The clauses available for this combined construct are those of both constructs.

For example with the kernels directive:

program fused
integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
do i=1,1000
   a(i) = b(i)*2
enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
do i=1,1000
   a(i) = b(i)*2
enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to the private copies to get the final value of the variable.

Reduction operators:

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
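The semantics above can be sketched with a sum reduction in C (`sum_array` is a made-up illustration): each thread accumulates into a private copy of `total`, and the partial sums are combined with `+` at the end of the loop. A sequential build, which ignores the pragma, produces the same result.

```c
#include <assert.h>

/* Hypothetical sketch of reduction(+:total): private partial sums are
   combined at the end of the loop. */
long sum_array(int n, const int *v) {
    long total = 0;
    #pragma acc parallel loop reduction(+:total)
    for (int i = 0; i < n; ++i)
        total += v[i];
    return total;
}
```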


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )             +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
- Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: access pattern with coalescing vs access pattern without coalescing]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.
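The same trade-off can be sketched in C (the `fill` function and size `NX` are hypothetical). Note that C arrays are row-major, so the contiguous index is the LAST one: for coalesced accesses the vector loop should run over j, the mirror image of the Fortran examples above, where the first index is contiguous.

```c
#include <assert.h>

#define NX 64

/* Hypothetical sketch: in C, consecutive vector threads should touch
   a[i][j], a[i][j+1], ... which are contiguous in memory. */
void fill(double a[NX][NX]) {
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector   /* j varies fastest over contiguous memory */
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```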


Parallelism - Memory accesses and coalescing

[Figure: step-by-step thread-to-element mapping for the three versions. Without coalescing (gang(i), vector(j)), at each step the threads T0..T7 of a worker access memory locations that are far apart. With coalescing (gang(j), vector(i)), at each step the threads T0..T7 access contiguous memory locations. The gang-vector(i), seq(j) version also yields contiguous accesses across the gangs.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading:
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region:
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions:
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime:
A data region is created when a procedure (function or subroutine) is called and is available during its lifetime. To make data visible, use the declare directive.

Notes:
The actions taken for the data inside these regions depend on the clauses in which they appear.


Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement:
- copyin: the variable is copied H→D; the memory is allocated when entering the region
- copyout: the variable is copied D→H; the memory is allocated when entering the region
- copy: copyin + copyout

No data movement:
- create: the memory is allocated when entering the region
- present: the variable is already on the device
- delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken.
You may see clauses prefixed with present_or_ or p_, kept for OpenACC 2.0 compatibility.

Other clauses:
- no_create
- deviceptr
- attach
- detach
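These clauses can be combined as in the following C sketch (the `axpy1` function is illustrative, not from the course): `x` is only read on the device (copyin), `y` is transferred both ways (copy), and `tmp` exists only on the device (create). Without an OpenACC compiler the pragmas are ignored and everything stays on the host, with the same numerical result.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical sketch of copyin / copy / create on a data region. */
void axpy1(int n, const double *x, double *y) {
    double *tmp = malloc(n * sizeof *tmp);
    #pragma acc data copyin(x[0:n]) copy(y[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0 * x[i];   /* tmp is never transferred to the host */
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] += tmp[i];
    }
    free(tmp);
}
```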


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and the last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions:
- In Fortran, the last index of an assumed-size dummy array must be specified
- In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
- The shape must be specified when using a slice
- If the first index is omitted, it is the default of the language (C/C++: 0, Fortran: 1)
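The C restriction can be sketched as follows (`fill_slice` is a made-up name): since a pointer argument carries no size information, the `[first:count]` shape must be given explicitly in the data clause, here restricting the transfer to a slice of the array.

```c
/* Hypothetical sketch: only elements 10..19 (start 10, count 10) of the
   dynamically sized array are moved by the copy clause. A non-OpenACC
   compiler ignores the pragma and runs the loop on the host. */
void fill_slice(int n, int *a) {
    #pragma acc parallel loop copy(a[10:10])
    for (int i = 10; i < 20 && i < n; ++i)
        a[i] = 42;
}
```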


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens both the data region and the parallel region
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
- Compute region reached 100 times
- Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
- Compute region reached 100 times
- Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? Not quite: you should move the beginning of the data region before the first loop.


Data on device

Naive version: two successive parallel regions, each with its own implicit data region; on each region, A is copyin, B is copyout, C is copy and D is create.
- Transfers: 8
- Allocations: 2

Data on device

Optimize data transfers: the same two parallel regions with the same clauses (A copyin, B copyout, C copy, D create). Let's add a data region enclosing them.

Data on device

With an enclosing data region carrying the clauses (A copyin, B copyout, C copy, D create), A, B and C are now transferred only at entry and exit of the data region.
- Transfers: 4
- Allocations: 1

Data on device

The data clauses check for data presence, so to make the code clearer you can use the present clause on the parallel regions. To make sure the updates are done, use the update directive (update device after the host writes, update self before the host reads).
- Transfers: 6
- Allocations: 1

Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
The variables included in a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update:
The a array is not initialized on the host before the end of the data region.


Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host):
The variables included in a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update:
The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device:
The variables included in a device clause are updated H→D.


Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77
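The same unstructured data lifetime can be sketched in C. This is a minimal illustration, not from the course: the function name and sizes are mine, and a compiler without OpenACC support simply ignores the pragmas and runs the loop on the host, giving the same result.

```c
#include <stdlib.h>

/* Unstructured data lifetime around a device computation.
 * fill_vec and its parameters are illustrative, not from the course. */
double fill_vec(int s)
{
    double *vec = malloc(s * sizeof *vec);
    #pragma acc enter data create(vec[0:s])   /* allocate on the device */

    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;

    /* copyout brings the device values back before the device copy dies;
     * delete (as on the slide) would discard them instead. */
    #pragma acc exit data copyout(vec[0:s])

    double last = vec[s - 1];
    free(vec);
    return last;
}
```

Using exit data copyout here (instead of the slide's delete) is a deliberate variation so that the host can check the computed values.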

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d. Opening the region with copyin(a,c) create(b) copyout(d) transfers a and c H→D and allocates b and d on the device; update self(b) copies b D→H while kernels keep running; leaving the region copies d D→H (copyout(d)). Writes made on one side inside the region are not seen by the other side until an update or the region end.]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses those parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77
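The update pair can also be written in C. The sketch below is mine (array size and tested index are illustrative): with OpenACC active, the host copy of a[41] would stay stale without update self; with the pragmas ignored, the same value is read directly.

```c
/* Sketch of update self / update device inside a data region.
 * The array, its size and the tested index are illustrative. */
double update_self_demo(void)
{
    static double a[1000];
    double seen;

    #pragma acc data copyout(a)
    {
        #pragma acc parallel loop
        for (int i = 0; i < 1000; ++i)
            a[i] = i + 1.0;

        /* D->H for one element: without it, the host copy of a[41]
         * is stale until the end of the data region. */
        #pragma acc update self(a[41:1])
        seen = a[41];

        a[41] = -1.0;                       /* host-side write...   */
        #pragma acc update device(a[41:1])  /* ...pushed H->D       */
    }
    return seen;
}
```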

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines. Synchronous: Kernel 1, then the data transfer, then Kernel 2. "Kernel 1 and transfer async": the transfer overlaps Kernel 1. "Everything is async": Kernel 1, the transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers identifying execution threads. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77
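A C sketch of the same queue pattern, with assumptions flagged: the function name, sizes and the enter data/exit data framing are mine (the slide relies on data being present already). When a non-OpenACC compiler ignores the pragmas, the statements simply run in program order and produce the same values.

```c
/* async queues 1 and 2, with the second kernel waiting on queue 2 only.
 * Function name and sizes are illustrative, not from the course. */
double async_wait_demo(int s, double *a, double *b)
{
    for (int i = 0; i < s; ++i)                 /* b initialized on host */
        b[i] = i;

    #pragma acc enter data create(a[0:s], b[0:s])

    #pragma acc parallel loop async(1)          /* queue 1: independent kernel */
    for (int i = 0; i < s; ++i)
        a[i] = 42.0;

    #pragma acc update device(b[0:s]) async(2)  /* queue 2: H->D transfer */

    #pragma acc parallel loop wait(2)           /* needs b: wait for queue 2 */
    for (int i = 0; i < s; ++i)
        b[i] = b[i] + 11.0;

    #pragma acc wait                            /* drain every queue */
    #pragma acc exit data copyout(b[0:s]) delete(a[0:s])
    return b[s - 1];
}
```

Note that wait(2) lets this kernel overlap with the queue-1 kernel: it only orders it after the transfer it actually depends on.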

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (OpenACC): GR/WS/VS mode - each gang executes instructions redundantly ↔ TEAMS (OpenMP): create teams of threads, which execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code ↔ TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• Both models: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• Clauses on PARALLEL/SERIAL/KERNELS (OpenACC) ↔ clauses on TARGET (OpenMP); data movement is defined from the host perspective
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on device, no transfers ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

Data region
• declare() (OpenACC): add data to the global data region
• data / end data (OpenACC): inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (OpenMP): inside the same programming unit
• enter data / exit data (OpenACC): opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA (OpenMP): opening and closing anywhere in the code
• update self() (OpenACC): update D→H ↔ UPDATE FROM (OpenMP): copy D→H inside a data region
• update device() (OpenACC): update H→D ↔ UPDATE TO (OpenMP): copy H→D inside a data region

Loop parallelization
• kernels (OpenACC): offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector (OpenACC): share iterations among gangs, workers and vectors ↔ distribute (OpenMP): share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (OpenACC): compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait (OpenACC): execute kernels/transfers asynchronously, with explicit sync ↔ nowait/depend (OpenMP): manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

Page 22: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

PGI Compiler - Compiler information

Information available with -Minfo=accel:

!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo

exemples/reduction.f90

To avoid CPU-GPU communication, a data region is added:

!$ACC data create(a(1:10000))
!$ACC parallel loop
do i=1,10000
   a(i) = i
enddo
!$ACC parallel loop reduction(+:sum)
do i=1,10000
   sum = sum + a(i)
enddo
!$ACC end data

exemples/reduction_data_region.f90

Pierre-François Lavallée, Thibaut Véry 19 / 77

Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp / pgprof / NSight Graphics Profiler
• Graphical interface for nvprof

Pierre-François Lavallée, Thibaut Véry 20 / 77

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the following environment variable: PGI_ACC_TIME=1. The grid is the number of gangs. The block is the size of one gang ([vector length x nb workers]).

Pierre-François Lavallée, Thibaut Véry 21 / 77

Analysis - Code profiling: nvprof
Nvidia provides a command line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   359.27us  100    3.5920us  3.3920us  4.0640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Pierre-François Lavallée, Thibaut Véry 22 / 77

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Pierre-François Lavallée, Thibaut Véry 23 / 77

Analysis - Graphical profiler: NVVP

Pierre-François Lavallée, Thibaut Véry 24 / 77

1 Introduction

2 ParallelismOffloaded Execution

Add directivesCompute constructsSerialKernelsTeamsParallel

Work sharingParallel loopReductionsCoalescing

RoutinesRoutines

Data ManagementAsynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 25 / 77

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.

Pierre-François Lavallée, Thibaut Véry 26 / 77

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.

Pierre-François Lavallée, Thibaut Véry 27 / 77

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Pierre-François Lavallée, Thibaut Véry 28 / 77

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Pierre-François Lavallée, Thibaut Véry 29 / 77

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Pierre-François Lavallée, Thibaut Véry 30 / 77

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Pierre-François Lavallée, Thibaut Véry 31 / 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.

Pierre-François Lavallée, Thibaut Véry 32 / 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-François Lavallée, Thibaut Véry 33 / 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-François Lavallée, Thibaut Véry 34 / 77

Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X

Pierre-François Lavallée, Thibaut Véry 35 / 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.

Pierre-François Lavallée, Thibaut Véry 36 / 77
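A small C sketch of two of these clauses (mine, not from the course; the array shape and values are illustrative): collapse(2) fuses both loop levels into one iteration space, and private(tmp) prevents a race on the scratch scalar.

```c
/* collapse(2) fuses the two loop levels into a single iteration space;
 * private(tmp) gives each iteration its own copy of the scratch scalar.
 * Without OpenACC support the pragma is ignored and the loops run
 * sequentially on the host with the same result. */
double collapse_demo(void)
{
    enum { N = 64 };
    static double a[N][N];
    double tmp;

    #pragma acc parallel loop collapse(2) private(tmp) copyout(a)
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            tmp = i * (double)N + j;   /* would race without private() */
            a[i][j] = tmp;
        }
    return a[N - 1][N - 1];
}
```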

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-François Lavallée, Thibaut Véry 37 / 77

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-François Lavallée, Thibaut Véry 38 / 77

Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x number of workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the single worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x number of workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x number of workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x number of workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41 / 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both merged constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-François Lavallée, Thibaut Véry 42 / 77

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to get the final value of the variable.

Reduction operators

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-François Lavallée, Thibaut Véry 43 / 77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged ("coalesced"). This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs example without coalescing.]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          Without coalescing   With coalescing
Tps (ms)  439                  16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mappings for the three variants over array A. With gang(i)/vector(j), threads T0-T7 of a vector access elements that are not contiguous in memory (no coalescing). With gang(j)/vector(i), threads T0-T7 access contiguous elements (coalescing). With gang-vector(i)/seq(j), each thread walks sequentially over j while consecutive threads access contiguous elements in i.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the routine (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l
!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel region is enclosed in its own data region.)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device - Optimizing transfers

[Figure: two successive parallel regions (each with its own implicit data region) using variables A, B, C and D with the clauses copyin, copyout, copy and create.]

• Naive version: each parallel region applies its own clauses, so every variable is handled twice. Transfers: 8, allocations: 2.
• Let's add a data region enclosing both parallel regions: A, B and C are now transferred at entry and exit of the data region only. Transfers: 4, allocations: 1.
• The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions; to make sure the updates are done where needed, use the update directive (update device, update self). Transfers: 6, allocations: 1.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
! $ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host right after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
Variables listed in device are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is that data transfers to the device can be managed at another place than where the data is used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: host/device time line for a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device. While the device computes, update self(b) copies the device value of b back to the host. At the end of the region, d is copied D→H; later host-side changes (for example to a) are not reflected on the device, and device-side values of a, b, c are discarded.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses with kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with "Kernel 1 and transfer async" they overlap; with everything async, all three run concurrently.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 runs on its own queue; Kernel 2 waits for the data transfer to complete before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• [Figure: example of code that has a race condition, caught by autocompare.]


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      ! (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Feature-by-feature comparison: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables (in both models)

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Data region
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP() / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=11.4_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=11.4_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=11.4_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=11.4_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k)=11.4_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k)=11.4_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr


Analysis - Code profiling

Here are 3 tools available to profile your code:

PGI_ACC_TIME
• Command line tool
• Gives basic information about:
  - time spent in kernels
  - time spent in data transfers
  - how many times a kernel is executed
  - the number of gangs, workers and the vector size mapped to hardware

nvprof
• Command line tool
• Options which can give you a fine view of your code

nvvp/pgprof/NSight Graphics Profiler
• Graphical interface for nvprof

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vectorlength x nbworkers]).

Analysis - Code profiling: nvprof

Nvidia provides a command line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type             Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities:  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                 36.72%   1.85742s  300    6.1914ms  384ns     92.907ms  [CUDA memcpy DtoH]
                  6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• Option "--cpu-profiling on" activates CPU profiling.
• Options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Make sure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of the executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution: Add directives; Compute constructs (Serial, Kernels, Teams, Parallel)
  Work sharing: Parallel loop; Reductions; Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate the OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments.
The syntax differs between a Fortran and a C/C++ code.

Fortran/OpenACC

!$ACC directive <clauses>
...
!$ACC end directive

C/C++/OpenACC

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.


Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs with one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s


Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region, not specified inside a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions (gangs, workers, vector size) are independent
• The loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1.951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, which is kept for compatibility


Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• The iterations are shared among the active workers of each gang
• Active threads: 10 x number of workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• The iterations are shared among the threads of the single active worker of each gang
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• The iterations are shared on the whole grid of threads of all gangs
• Active threads: 10 x number of workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x number of workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x number of workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)


Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to get the final value of the variable.

Reduction operators

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c  )             +oworld(r+1,c  )+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: an example with coalescing and an example without coalescing]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for the three strategies. Without coalescing: gang(i), vector(j). With coalescing: gang(j), vector(i), or gang-vector(i), seq(j).]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
The offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with the lifetime of a programming unit
A data region is created when a procedure (function or subroutine) is called; it is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken.
You may see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it takes the default of the language (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

The compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region is enclosed in an implicit data region.


Data on device - Parallel regions

The compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure; by doing so you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No: you have to move the beginning of the data region before the first loop.


Data on device

Naive version

[Diagram: two successive parallel regions, each with its own implicit data region; both carry copyin(A), copyout(B), copy(C) and create(D).]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers

[Diagram: the same two parallel regions, still with copyin(A), copyout(B), copy(C) and create(D).]

Let's add a data region.


Data on device

Optimize data transfers

[Diagram: an enclosing data region carries copyin(A), copyout(B), copy(C) and create(D) around the two parallel regions.]

A, B and C are now transferred only at the entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers

[Diagram: the enclosing data region keeps the transfer clauses; the inner parallel regions use present for A, B, C and D, with an update device for A and an update self for C.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The array a is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The array a is updated on the host by the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at another place than where the data is used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
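The same unstructured lifetime can be sketched in C (the deck's examples are Fortran; the function names here are ours, not from the slides). Without an OpenACC compiler the pragmas are ignored and everything simply runs on the host, which is why the sketch is also valid serial C.

```c
#include <stdlib.h>

/* Hypothetical C analogue of exemples/enter_data.f90: the device copy of
 * vec is created by enter data and freed by exit data, away from the code
 * that uses it (handy in C++ constructors/destructors). */
double *alloc_on_device(int n) {
    double *vec = malloc((size_t)n * sizeof *vec);
    #pragma acc enter data create(vec[0:n])   /* device allocation, no transfer */
    return vec;
}

void compute(double *vec, int n) {
    #pragma acc parallel loop present(vec[0:n])
    for (int i = 0; i < n; ++i)
        vec[i] = 1.2;
    #pragma acc update self(vec[0:n])         /* results needed on the host */
}

void free_on_device(double *vec, int n) {
    #pragma acc exit data delete(vec[0:n])    /* drop the device copy */
    free(vec);
}
```
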

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: timeline of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are only allocated on the device; the kernels then update the device copies (b=2.6, d=1.58 in the example); an update self(b) during the region copies b D→H while d stays unchanged on the host; at exit, only d is copied D→H (copyout) and the other device copies are discarded.]

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90
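A minimal C sketch of both update directions (our own example, not from the slides). With the pragmas compiled away, host and device views coincide, so the serial result matches the synchronized OpenACC one.

```c
#define N 1000

/* Sketch: inside a data region, update self(...) copies D->H and
 * update device(...) copies H->D. */
double a[N];

void fill_and_sync(void) {
    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = 42.0;
        /* Make the device values visible on the host before the region ends. */
        #pragma acc update self(a[0:N])
        /* ...host-side work on a..., then push the host change back: */
        a[0] = 1.0;
        #pragma acc update device(a[0:N])
    }
}
```
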

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: without async, Kernel 1, the data transfer and Kernel 2 run one after the other; with async on Kernel 1 and the transfer, they overlap; with async everywhere, all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1.
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1.
enddo

exemples/async_wait_slides.f90
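The async/wait pattern above can be sketched in C (our own example; queue numbers are illustrative). Compiled without OpenACC, the statements just execute in order, which is exactly the synchronized result the wait clause guarantees.

```c
#define N 1000

/* Sketch: the kernel on queue 1 and the H->D update on queue 2 may overlap;
 * the second loop must wait for queue 2 because it reads b. */
double a[N], b[N];

void async_pipeline(void) {
    for (int i = 0; i < N; ++i)        /* b initialized on the host */
        b[i] = (double)i;
    #pragma acc parallel loop async(1)
    for (int i = 0; i < N; ++i)
        a[i] = 42.0;
    #pragma acc update device(b[0:N]) async(2)
    #pragma acc parallel loop wait(2)  /* needs the transfer to be done */
    for (int i = 0; i < N; ++i)
        b[i] = b[i] + 1.0;
    #pragma acc wait                   /* join every queue before returning */
}
```
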


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of use: code that has a race condition.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
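The key idea — hoisting the data region outside the generation loop so arrays cross the PCIe bus once instead of every kernel launch — can be sketched in C (our own cut-down example; the update rule is elided and the sizes are ours). Serially, the pragmas compile away and the count is unchanged.

```c
#define ROWS 64
#define COLS 64

/* Sketch: one data region wraps all generations, so world/oworld are
 * transferred once; each parallel loop then reuses the device copies. */
int world[ROWS][COLS], oworld[ROWS][COLS];

int run(int generations) {
    int cells = 0;
    #pragma acc data copy(world, oworld)
    {
        for (int g = 0; g < generations; ++g) {
            cells = 0;
            #pragma acc parallel loop
            for (int r = 0; r < ROWS; ++r)
                for (int c = 0; c < COLS; ++c)
                    oworld[r][c] = world[r][c];
            /* (Game-of-Life update rule elided in this sketch.) */
            #pragma acc parallel loop reduction(+:cells)
            for (int r = 0; r < ROWS; ++r)
                for (int c = 0; c < COLS; ++c)
                    cells += world[r][c];
        }
    }
    return cells;
}
```
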

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes instructions redundantly) = OpenMP TEAMS (creates teams of threads, executes redundantly)
• OpenACC SERIAL (only 1 thread executes the code) = OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) = does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables = OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses of PARALLEL/SERIAL/KERNELS = OpenMP: clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) = map(to:)
• copyout() (copy D→H when leaving the construct) = map(from:)
• copy() (copy H→D when entering and D→H when leaving) = map(tofrom:)
• create() (created on device, no transfers) = map(alloc:)
• present(), delete()

Data region
• OpenACC declare() (add data to the global data region)
• OpenACC data / end data (inside the same programming unit) = OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) = OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) = OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) = OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) = does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) = OpenMP distribute (share iterations among TEAMS)

Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously, explicit sync) = OpenMP nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
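The same OpenACC/OpenMP correspondence holds in C (a sketch with our own function names; built without offload support, both pragmas are ignored and the loops run identically on the host).

```c
/* The 1D loop of the table written with both models. */
void fill_acc(int n, double *a) {
    #pragma acc parallel loop gang worker vector copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}

void fill_omp(int n, double *a) {
    #pragma omp target teams distribute parallel for simd map(from:a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 1.14 * i;
}
```
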


Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr

Page 24: Introduction to OpenACC and OpenMP GPU - IDRIS

Analysis - Code profiling: PGI_ACC_TIME

For codes generated with PGI compilers, it is possible to have an automatic profiling of the code by setting the environment variable PGI_ACC_TIME=1. The grid is the number of gangs; the block is the size of one gang ([vector length x nb workers]).

Analysis - Code profiling: nvprof

Nvidia provides a command-line profiler, nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities  53.51%   2.70662s  500    5.4132ms  448ns     68.519ms  [CUDA memcpy HtoD]
                36.72%   1.85742s  300    6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                 6.34%   320.52ms  100    3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                 2.62%   132.39ms  100    1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                 0.81%   40.968ms  100    409.68us  401.22us  417.25us  gol_52_gpu
                 0.01%   3.5927ms  100    35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default.
• The option "--cpu-profiling on" activates CPU profiling.
• The options "--metrics flop_count_dp", "--metrics dram_read_throughput" and "--metrics dram_write_throughput" give, respectively, the number of operations, the memory read throughput and the memory write throughput for each kernel running on the GPU.
• To get a file readable by nvvp, you have to specify "-o filename" (add -f in case you want to overwrite a file).

Analysis - Graphical profiler: nvvp

Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

[Screenshot: the NVVP timeline view]


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran, OpenACC:
!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:
#pragma acc directive <clauses>
...

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation, the examples are in Fortran.
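For the C/C++ form, the pragma applies to the statement or block that follows it. A minimal sketch (names are ours; compiled without OpenACC support the pragma is ignored and the loop runs sequentially on the host, which is why the function remains plain valid C):

```c
/* Offload (or, serially, just run) a scaling loop. */
void scale(int n, const double *b, double *a) {
    #pragma acc kernels loop copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = 2.0 * b[i];
}
```
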

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• gang-redundant
• worker-single
• vector-single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran:
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions met in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.
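The same tuning clauses can be written in C (a sketch; the clause values are illustrative, not a recommendation). With the pragma compiled away, the loop is plain serial C; with OpenACC, the runtime would launch 10 gangs of 1 worker x 128 vector lanes.

```c
/* Sketch of explicit launch-configuration clauses on a parallel loop. */
void init(int n, int *a) {
    #pragma acc parallel loop num_gangs(10) num_workers(1) vector_length(128)
    for (int i = 0; i < n; ++i)
        a[i] = i;
}
```
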

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.

Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators

Operation  Effect       Language(s)
+          Sum          Fortran, C(++)
*          Product      Fortran, C(++)
max        Maximum      Fortran, C(++)
min        Minimum      Fortran, C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
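A short C sketch of two of these operators combined on one loop (our own example). Each thread would get private copies of s and m, merged with + and max at the end of the loop; a serial build (pragma ignored) computes the same values.

```c
/* Sum and maximum of an array via reduction clauses. */
void sum_and_max(int n, const double *a, double *sum, double *mx) {
    double s = 0.0, m = a[0];
    #pragma acc parallel loop reduction(+:s) reduction(max:m)
    for (int i = 0; i < n; ++i) {
        s += a[i];
        if (a[i] > m) m = a[i];
    }
    *sum = s;
    *mx = m;
}
```
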

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example with «no coalescing»

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)        43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: memory-access patterns of threads T0–T7 across the array]
• Without coalescing: gang(i), vector(j)
• With coalescing: gang(j), vector(i), or gang vector(i) with seq(j)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is open during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 (included) back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region has an associated (implicit) data region.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Optimal? No! You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions (each with its own implicit data region), with the clauses copyin(A), copyout(B), copy(C) and create(D) applied to each region.
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the same two parallel regions (each with its own implicit data region), each still carrying copyin(A), copyout(B), copy(C) and create(D). Let's add an enclosing data region.

Data on device

Optimize data transfers: the clauses copyin(A), copyout(B), copy(C) and create(D) are moved to the enclosing data region. A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers: inside the data region the clauses only check for data presence, so to make the code clearer you can use the present clause in the parallel regions. To make sure the updates are done, use the update directive (update device before a region that must see host changes, update self after a region whose results the host needs).
• Transfers: 6
• Allocations: 1

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not up to date on the host before the end of the data region.

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is up to date on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes: one benefit of using the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d)]
• Entering the region: a and c are copied H→D; b and d are allocated on the device without transfer
• The device updates its copies (here b=26, d=158) while the host copies are unchanged
• update self(b) copies the device value of b back to the host during the region
• Host-side modifications made inside the region are not visible on the device
• Exiting the region: d is copied D→H (copyout)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with async on Kernel 1 and the transfer they overlap; with everything async all three overlap]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add wait(execution-thread number) to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the data transfer to complete before starting]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR/WS/VS mode, each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization): does not exist in OpenMP

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0

Data region
• declare(): add data to the global data region
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(...) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of the loops): does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait (execute kernels/transfers asynchronously, explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • Kernels/Teams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 2.7/OpenMP 5.0 translation table
                      • Contacts

Analysis - Code profiling: nvprof
Nvidia provides a command-line profiler: nvprof. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvprof <options> executable_file <arguments of executable file>

Type            Time(%)  Time      Calls  Avg       Min       Max       Name
GPU activities   53.51%  2.70662s    500  5.4132ms  448ns     6.8519ms  [CUDA memcpy HtoD]
                 36.72%  1.85742s    300  6.1914ms  384ns     9.2907ms  [CUDA memcpy DtoH]
                  6.34%  320.52ms    100  3.2052ms  3.1372ms  3.6247ms  gol_39_gpu
                  2.62%  132.39ms    100  1.3239ms  1.2727ms  1.3659ms  gol_33_gpu
                  0.81%  40.968ms    100  409.68us  401.22us  417.25us  gol_52_gpu
                  0.01%  3.5927ms    100  35.920us  33.920us  40.640us  gol_55_gpu__red

exemples/profils/gol_opti2.prof

Important notes
• Only the GPU is profiled by default
• Option "--cpu-profiling on" activates CPU profiling
• Options "--metrics flop_count_dp --metrics dram_read_throughput --metrics dram_write_throughput" give respectively the number of double-precision operations, the memory read throughput and the memory write throughput for each kernel
• To get a file readable by nvvp, specify "-o filename" (add -f in case you want to overwrite a file)

Analysis - Graphical profiler: nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof

Analysis - Graphical profiler: NVVP

[Figure: NVVP screenshot]

1 Introduction

2 Parallelism
   Offloaded Execution
      Add directives
      Compute constructs
      Serial
      Kernels/Teams
      Parallel
   Work sharing
      Parallel loop
      Reductions
      Coalescing
   Routines
   Data Management
   Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran (OpenACC):

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC):

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called computeconstruct)

serial

The code runs one singlethread (1 gang with 1 worker ofsize 1) Only one kernel is gen-erated

kernels

The compiler analyzes the code and decides where parallelism is generated.
One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 27 77

Offloaded execution - Directive serial

The directive serial tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed only by one thread. This means that one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: Default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 28 77

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 30 77

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 32 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (acc=noautopar / acc=autopar)
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 33 77


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 34 77

Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop).
It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 35 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 36 77

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example with kernels directive

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region an operation among those described in the table below is executed to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)

Operation   Effect        Language(s)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC specification 2.7 reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)+oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 44 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «Coalescing» / Example with «no Coalescing»

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without Coalescing   With Coalescing
Tps (ms)        43.9                 1.6

The memory coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads to array elements for the three variants — without coalescing: gang(i), vector(j); with coalescing: gang(j), vector(i) and gang-vector(i), seq(j)]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).
Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device; the clauses apply to a variable, scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

  do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop, so that its transfer is covered as well.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version: two successive parallel regions (each with its implicit data region) transfer the arrays A, B, C and D themselves — each region performs its own copyin, copyout, copy and create.
• Transfers: 8
• Allocations: 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: with the two parallel regions (and their data regions) still carrying their own copyin, copyout, copy and create clauses for A, B, C and D, nothing improves yet. Let's add a data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: with an enclosing data region carrying the copyin, copyout, copy and create clauses, A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfers: inside the enclosing data region, the parallel regions now use present clauses, and explicit transfers are requested with update device (H→D) and update self (D→H). The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Time line figure: a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D while b and d are allocated on the device. During the region, update self(b) copies the device value of b back to the host. On exit of the data region, only d is copied back D→H (copyout); other host and device values stay out of sync unless updated.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: without async, Kernel 1, the data transfer and Kernel 2 run one after the other. With "Kernel 1 and transfer async", Kernel 1 overlaps with the data transfer. With everything async, Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Diagram: Kernel 1 runs on execution thread 1 while the data transfer runs on thread 2; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on the host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition:


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.


Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features | OpenACC 2.7 | OpenMP 5.0
GPU offloading | PARALLEL: GR/WS/VS mode, each gang executes the instructions redundantly | TEAMS: create teams of threads, execute redundantly
 | SERIAL: only 1 thread executes the code | TARGET: GPU offloading, initial thread only
 | KERNELS: offloading + automatic parallelization | does not exist in OpenMP
Implicit transfer H→D or D→H | scalar variables, non-scalar variables | scalar variables, non-scalar variables
Explicit transfer H→D or D→H inside explicit parallel regions | clauses of PARALLEL/SERIAL/KERNELS | clauses of TARGET: data movement defined from the host perspective
 | copyin(): copy H→D when entering the construct | map(to:)
 | copyout(): copy D→H when leaving the construct | map(from:)
 | copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
 | create(): created on the device, no transfers | map(alloc:)
 | present(), delete() |

Features | OpenACC 2.7 | OpenMP 5.0
Data region | declare: add data to the global data region |
 | data / end data: inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
 | enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
 | update self(): update D→H | UPDATE FROM(): copy D→H inside a data region
 | update device(): update H→D | UPDATE TO(): copy H→D inside a data region
Loop parallelization | kernels: offloading + automatic parallelization of suitable loops, sync at the end of the loops | does not exist in OpenMP
 | loop gang/worker/vector: share iterations among gangs, workers and vectors | distribute: share iterations among TEAMS

Features | OpenACC 2.7 | OpenMP 5.0
Routines | routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism |
Asynchronism | async()/wait: execute kernels/transfers asynchronously, with explicit sync | nowait/depend: manage asynchronism and task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Analysis - Graphical profiler nvvp
Nvidia also provides a GUI for nvprof: nvvp. It gives the time spent in the kernels and data transfers for all GPU regions.
Caution: PGI_ACC_TIME and nvprof are incompatible. Ensure that PGI_ACC_TIME=0 before using the tool.

$ nvvp <options> executable <--args arguments of executable>

or, if you want to open a profile generated with nvprof:

$ nvvp <options> prog.prof


Analysis - Graphical profiler NVVP


1 Introduction

2 Parallelism
  Offloaded Execution
    Add directives
    Compute constructs: Serial, Kernels/Teams, Parallel
  Work sharing: Parallel loop, Reductions, Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++.

Fortran, OpenACC:

!$ACC directive <clauses>
...
!$ACC end directive

C/C++, OpenACC:

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation, the examples are in Fortran.


Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region that are not specified in a data clause (present, copyin, copyout, etc.) or in a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test:
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123640 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region that are not specified in a data clause (present, copyin, copyout, etc.) or in a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes:
• The parameters of the parallel regions are independent (gangs, workers, vector size).
• Loop nests are executed consecutively.


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 1951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive have their iterations spread among the gangs.

Data: default behavior
• Arrays present inside the region that are not specified in a data clause (present, copyin, copyout, etc.) or in a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2  ! Gang-redundant
!$ACC loop  ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes:
• The number of gangs, the number of workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect the dependencies

            | Work sharing (loop iterations) | Activation of new threads
loop gang   | X                              |
loop worker | X                              | X
loop vector | X                              | X


Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable list)
• Reduction: reduction(operation:variable list)

Notes:
• You might encounter the directive do, kept for compatibility.


Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into kernels loop or parallel loop. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is applied to obtain the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions: the variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged (coalesced). This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: an example with coalescing and an example without coalescing.]


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

          | Without coalescing | With coalescing
Time (ms) | 439                | 16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Diagrams: thread-to-element mapping for the three variants above. Without coalescing: gang(i), vector(j). With coalescing: gang(j), vector(i), and gang-vector(i), seq(j).]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

With coalescing

gang(j)vector(i)T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

T1

T1

X

X

T2

T2

X

X

T3

T3

X

X

T4

T4

X

X

T5

T5

X

X

T6

T6

X

X

T7

T7

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

X

X

T1

T1

X

X

X

X

T2

T2

X

X

X

X

T3

T3

X

X

X

X

T4

T4

X

X

X

X

T5

T5

X

X

X

X

T6

T6

X

X

X

X

T7

T7

X

X

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77
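The examples above are Fortran, where arrays are column-major. In C/C++ arrays are row-major, so the contiguous index is the last one and the index choice is reversed. A hedged sketch (array and function names are hypothetical, not from the course; without OpenACC support the pragmas are ignored and the loop runs on the host):

```cpp
#include <cstddef>

constexpr int NX = 64;

// In C/C++ the LAST index is contiguous, so coalesced accesses are
// obtained when the vector loop runs over j here -- the opposite
// index choice from the Fortran examples above.
void fill_coalesced(double (&a)[NX][NX]) {
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j) {
            a[i][j] = 1.14;  // neighbouring j values sit at neighbouring addresses
        }
    }
}
```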

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

Correct version, with the routine directive inside the subroutine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90
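A C/C++ analogue of the Fortran example could look like this sketch (names hypothetical). The acc routine seq pragma on the callee asks the compiler to also generate a sequential device version of it:

```cpp
// The callee is compiled for the device too; "seq" declares that it
// contains no parallelism of its own.
#pragma acc routine seq
void fill_row(int *array, int n, int i) {
    for (int j = 0; j < n; ++j)
        array[i * n + j] = 2;  // row i of a flattened n x n array
}

void fill_all(int *array, int n) {
    #pragma acc parallel loop copyout(array[0:n*n])
    for (int i = 0; i < n; ++i)
        fill_row(array, n, i);  // one device-side call per iteration
}
```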

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p, for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
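As an illustration of the movement clauses in C/C++ (a sketch, not from the course examples; names hypothetical — when OpenACC is not enabled the pragmas are ignored and the code runs on the host):

```cpp
// x is only read on the device (copyin), y is only produced (copyout),
// and tmp is device-only scratch storage (create: allocation, no transfer).
void axpy1(const double *x, double *y, double a, int n) {
    double *tmp = new double[n];
    #pragma acc data copyin(x[0:n]) copyout(y[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop  // x, y, tmp are already present here
        for (int i = 0; i < n; ++i)
            tmp[i] = a * x[i];
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            y[i] = tmp[i] + 1.0;
    }
    delete[] tmp;
}
```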

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered to be the default of the language (C/C++: 0, Fortran: 1)
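For instance, with a dynamically allocated C/C++ array the compiler cannot deduce the extent, so the element count must be written out (a minimal sketch, names hypothetical):

```cpp
// p points to heap storage: its shape is unknown to the compiler, so
// p[0:n] (first index 0, n elements) is mandatory in the data clause.
void zero_fill(double *p, int n) {
    #pragma acc parallel loop copyout(p[0:n])
    for (int i = 0; i < n; ++i)
        p[i] = 0.0;
}
```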

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l
  !$ACC parallel copyout(a(1:1000))  ! opens both the data region and the parallel region
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data on device - Parallel regions

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: each of the two parallel regions carries its own implicit data region, so the clauses (copyin for A, copyout for B, copy for C, create for D) are applied at each region.
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the same clauses (copyin, copyout, copy, create) are repeated on both parallel regions. Let's add a data region.

Data on device

With a data region enclosing both parallel regions, A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device at entry, update self at exit).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
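A minimal C/C++ sketch of the device clause (hypothetical names, not a course example): after the host modifies b inside the data region, update device pushes the new values H→D before the next kernel reads them.

```cpp
// On a real device run, omitting the update would make the kernel sum
// the stale values copied in when the region opened; sequentially the
// pragmas are simply ignored.
double sum_with_refresh(double *b, int n) {
    double s = 0.0;
    #pragma acc data copyin(b[0:n]) copy(s)
    {
        for (int i = 0; i < n; ++i)  // host-side modification
            b[i] += 1.0;
        #pragma acc update device(b[0:n])  // H -> D
        #pragma acc parallel loop reduction(+:s)
        for (int i = 0; i < n; ++i)
            s += b[i];
    }
    return s;
}
```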

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
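The C++ constructor/destructor use case mentioned above could look like this sketch (class and member names are hypothetical): the device allocation follows the object's lifetime.

```cpp
struct DeviceBuffer {
    double *data;
    int n;
    explicit DeviceBuffer(int size) : data(new double[size]), n(size) {
        #pragma acc enter data create(data[0:n])  // allocate on the device
    }
    ~DeviceBuffer() {
        #pragma acc exit data delete(data[0:n])   // free on the device
        delete[] data;
    }
    void fill(double v) {
        #pragma acc parallel loop present(data[0:n])
        for (int i = 0; i < n; ++i)
            data[i] = v;
        #pragma acc update self(data[0:n])        // D -> H for the host copy
    }
};
```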

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b and d are allocated on the device; kernels then modify the device copies; update self(b) copies b D→H in the middle of the region; at exit of the data region, only d is copied D→H, while the device-side changes to a, b and c are not.]
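The time line can be sketched in code (a hedged C/C++ illustration with invented values, not the slide's program):

```cpp
// copyin(a, c): H->D at entry; create(b): device allocation only;
// copyout(d): D->H at exit. update self(b) pulls b back mid-region.
void timeline(double &a, double &b, double &c, double &d) {
    #pragma acc data copyin(a, c) create(b) copyout(d)
    {
        #pragma acc serial
        {
            b = a + c;   // b exists only on the device until updated
            d = a * 2.0;
        }
        #pragma acc update self(b)  // D -> H: the host now sees b
    }
    // here d holds the device result; a and c were never copied back
}
```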

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer on execution thread 2 before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example: code that has a race condition (shown on the slide)
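The slide's code is only a screenshot; as a hypothetical stand-in, here is the kind of loop-carried dependence autocompare would flag. Run in parallel on the GPU, the updates to total race, so the GPU and CPU results diverge and execution stops:

```cpp
// WRONG on purpose: "total" is carried across iterations, so the loop
// is not parallel as written (it would need reduction(+:total)).
double racy_sum(const double *x, int n) {
    double total = 0.0;
    #pragma acc parallel loop copyin(x[0:n])
    for (int i = 0; i < n; ++i)
        total = total + x[i];  // data race on the device
    return total;
}
```

Executed sequentially (pragmas ignored), the function still returns the correct sum, which is exactly why only a GPU/CPU comparison reveals the bug.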

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• OpenACC PARALLEL (GR, WS, VS mode: each gang executes instructions redundantly)  |  OpenMP TEAMS (creates teams of threads, which execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code)  |  OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization)  |  no OpenMP equivalent

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables  |  OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (clauses of PARALLEL/SERIAL/KERNELS in OpenACC, of TARGET in OpenMP; data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct)  |  map(to:)
• copyout() (copy D→H when leaving the construct)  |  map(from:)
• copy() (copy H→D when entering and D→H when leaving)  |  map(tofrom:)
• create() (created on device, no transfers)  |  map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions
• declare (add data to the global data region)
• data ... end data (inside the same programming unit)  |  TARGET DATA MAP(tofrom:) ... END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code)  |  TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H)  |  UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D)  |  UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops)  |  no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors)  |  distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync)  |  nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts
Page 27: Introduction to OpenACC and OpenMP GPU - IDRIS

Analysis - Graphical profiler NVVP

[Screenshot of the NVVP graphical interface]

1 Introduction

2 Parallelism
  Offloaded Execution: Add directives, Compute constructs (Serial, Kernels/Teams, Parallel)
  Work sharing: Parallel loop, Reductions, Coalescing
  Routines
  Data Management
  Asynchronism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Add directives

To activate OpenACC or OpenMP features you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax is different for a Fortran or a C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.

Execution offloading

To run on the device you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123.640 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded to the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (with compiler option acc=noautopar):
• Size: 1000×1000
• Generations: 100
• Execution time: 120.467 s (noautopar)
• Execution time: 5.067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC&         vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  – gang: the iterations of the subsequent loop are distributed block-wise among the gangs (going from GR to GP)
  – worker: within the gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  – vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  – seq: the iterations are executed sequentially on the accelerator
  – auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect the dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.


Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR,WS,VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP,WS,VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP,WP,VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP,WS,VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP,WP,VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR,WP,VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR,WS,VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR,WP,VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to those private copies to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
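The table lists the C(++) operators but the course only shows a Fortran reduction; here is a minimal C++ sketch (not from the course) combining two of them on one loop. Without an OpenACC compiler the pragma is ignored and the loop runs sequentially, which produces the same result as the offloaded version.

```cpp
#include <cassert>

// Sketch: sum (+) and max reductions from the table above applied to
// an integer array. Each gang/worker/vector thread gets a private
// copy of `sum` and `mx`; the final values are combined at the end
// of the loop. With the pragma ignored, the loop is just sequential.
int sum_and_max(const int *v, int n, int *max_out) {
    int sum = 0;
    int mx  = v[0];
    #pragma acc parallel loop reduction(+:sum) reduction(max:mx)
    for (int i = 0; i < n; ++i) {
        sum += v[i];
        if (v[i] > mx) mx = v[i];
    }
    *max_out = mx;
    return sum;
}
```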


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.
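The next slide shows the Fortran variants; as a complement, here is a C++ sketch (an assumption, not course material) of the same idea. C arrays are row-major, the opposite of Fortran, so in C the inner (vector) loop must run over the last index for consecutive threads to touch consecutive addresses. With the pragmas ignored both variants fill the array identically; on a GPU only the first one coalesces.

```cpp
#include <cassert>

const int N = 64;

// Coalesced in C: the vector loop runs over j, the last index, so
// thread j and thread j+1 touch adjacent addresses a[i][j], a[i][j+1].
void fill_coalesced(double a[N][N]) {
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}

// Not coalesced in C: the vector loop runs over i, the first index,
// so consecutive threads are separated by a stride of N doubles.
void fill_uncoalesced(double a[N][N]) {
    #pragma acc parallel loop gang
    for (int j = 0; j < N; ++j) {
        #pragma acc loop vector
        for (int i = 0; i < N; ++i)
            a[i][j] = 1.14;
    }
}
```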

Example with «coalescing» — Example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)          439                 16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: memory-access diagrams for the three variants above. Without coalescing, gang(i)/vector(j): consecutive threads T0...T7 access memory locations that are far apart, so the accesses cannot be merged. With coalescing, gang(j)/vector(i): consecutive threads access consecutive memory locations and the accesses are merged. gang-vector(i)/seq(j): each thread walks sequentially through its own row of the array.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector). Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector). Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You might see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
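The clauses above can be combined on one construct; here is a minimal C++ sketch (an assumption, not course material) of the three most common ones on a saxpy-style loop. Compiled without OpenACC the pragma is ignored and the sequential result is identical.

```cpp
#include <cassert>

// Sketch: x is only read on the device (copyin, H->D only), y is read
// and written (copy, H->D then D->H). A scratch array would instead
// use create (allocation only, no transfer). With the pragma ignored,
// the loop simply runs on the host.
void saxpy(int n, double alpha, const double *x, double *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```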


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


No! You have to move the beginning of the data region before the first loop.

Data on device

[Figure: two parallel regions, each with its own implicit data region, using four arrays: A (copyin), B (copyout), C (copy) and D (create).]

Naive version: each parallel region transfers its own data.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Inside the data region the clauses of the parallel regions only check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive (update device after the copyin, update self before the copyout).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC&          copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC&          copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
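No example is shown for this direction on the slide, so here is a C++ sketch (an assumption, not course material): the host changes a value inside a data region, and update device pushes it H→D before the next kernel reads it. Compiled without OpenACC the pragmas are ignored and the result is the same.

```cpp
#include <cassert>

// Sketch: `factor` is copied in once; after the first kernel the host
// changes it, so an `update device` is needed for the second kernel
// to see the new value. Run sequentially, the code computes a[i] * 2
// then * 10 for every element.
void scale_twice(int n, double *a) {
    double factor = 2.0;
    #pragma acc data copy(a[0:n]) copyin(factor)
    {
        #pragma acc parallel loop present(a, factor)
        for (int i = 0; i < n; ++i) a[i] *= factor;

        factor = 10.0;                     // changed on the host only
        #pragma acc update device(factor)  // push the new value H->D
        #pragma acc parallel loop present(a, factor)
        for (int i = 0; i < n; ++i) a[i] *= factor;
    }
}
```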


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data is used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
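The C++ constructor/destructor use case mentioned in the notes can be sketched as follows (hypothetical class, not from the course): enter data in the constructor and exit data in the destructor tie the device allocation to the object's lifetime. Without an OpenACC compiler the pragmas are ignored and everything runs on the host.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical RAII wrapper: the constructor mirrors the host vector
// on the device with `enter data create`, the destructor releases it
// with `exit data delete`, so the device copy lives exactly as long
// as the object.
class DeviceVector {
public:
    explicit DeviceVector(std::size_t n) : data_(n, 0.0) {
        double *p = data_.data();
        (void)p;  // only used by the pragma when OpenACC is enabled
        #pragma acc enter data create(p[0:n])
    }
    ~DeviceVector() {
        double *p = data_.data();
        std::size_t n = data_.size();
        (void)p; (void)n;
        #pragma acc exit data delete(p[0:n])
    }
    void fill(double v) {
        double *p = data_.data();
        std::size_t n = data_.size();
        #pragma acc parallel loop present(p[0:n])
        for (std::size_t i = 0; i < n; ++i) p[i] = v;
    }
    double front() const { return data_.front(); }
private:
    std::vector<double> data_;
};
```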


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). On the host: a=10, b=-4, c=42, d=0. At entry, a and c are copied H→D while b and d are only allocated on the device. The device computes b=26 and d=158; update self(b) brings b=26 back to the host during the region; at exit of the region, copyout(d) transfers d=158 to the host.]

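The time line above can be written as code. Here is a minimal C++ sketch (an assumption, not the course's example) with the same clauses and values; compiled without OpenACC the pragmas are ignored, so the host sees every value, while on a GPU only b (via update self) and d (via copyout) would come back.

```cpp
#include <cassert>

// Sketch of the time line: copyin(a,c) create(b) copyout(d), an
// `update self(b)` in the middle, and the copyout of d at region
// exit. The numeric values follow the figure.
void timeline(double *a, double *b, double *c, double *d) {
    *a = 10.0; *b = -4.0; *c = 42.0; *d = 0.0;  // host initialization
    #pragma acc data copyin(a[0:1], c[0:1]) create(b[0:1]) copyout(d[0:1])
    {
        #pragma acc serial present(a, b, c, d)
        {
            *b = 26.0;             // computed on the device
            *d = 158.0;
        }
        #pragma acc update self(b[0:1])  // b: D->H during the region
    }   // region exit: d is copied out D->H
}
```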

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)

! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines showing Kernel 1, a data transfer and Kernel 2. Fully synchronous: they run one after the other. Kernel 1 and the transfer async: they overlap. Everything async: all three may overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
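The same async/wait pattern can be sketched in C++ (an assumption, not course material): queue 2 uploads b while queue 1 fills a, and the last loop waits for queue 2 before reading b. Compiled without OpenACC the pragmas are ignored, execution is simply sequential and the result is identical.

```cpp
#include <cassert>

// Sketch: two asynchronous execution threads (queues 1 and 2) and a
// wait(2) on the kernel that depends on the upload of b. The update
// self at the end brings the results back for checking.
void async_wait(int n, double *a, double *b) {
    for (int i = 0; i < n; ++i) b[i] = i;   // b initialized on host
    #pragma acc data create(a[0:n]) copyin(b[0:n])
    {
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i) a[i] = 42.0;
        #pragma acc update device(b[0:n]) async(2)
        #pragma acc parallel loop wait(2)   // needs the upload of b
        for (int i = 0; i < n; ++i) b[i] = b[i] + 11.0;
        #pragma acc update self(a[0:n], b[0:n])
    }
}
```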



Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation option (-ta=tesla:cc60:autocompare)
• Example shown: a code that has a race condition



Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info:
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemplesoptigol_opti2f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS


Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts


• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
  • Parallelism
    • Offloaded Execution
      • Add directives
      • Compute constructs
      • Serial
      • Kernels/Teams
      • Parallel
    • Work sharing
      • Parallel loop
      • Reductions
      • Coalescing
    • Routines
    • Data Management
    • Asynchronism
  • GPU Debugging
  • Annex: Optimization example
  • Annex: OpenACC 2.7 / OpenMP 5.0 translation table
  • Contacts

1 Introduction

2 Parallelism
  Offloaded Execution

    Add directives, Compute constructs (Serial, Kernels/Teams, Parallel)

  Work sharing (Parallel loop, Reductions, Coalescing)

  Routines

  Data Management, Asynchronism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC)

!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)

#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data.
In this presentation the examples are in Fortran.


Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.


Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated.
This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1)

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.

• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123640 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.

• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1, n
   do j=1, n
      ...
   enddo
   do j=1, n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1951 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.

• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region


Execution offloading - parallel directive

!$ACC parallel
do g=1, generations
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs.
The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care


Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                   X
loop worker                 X                            X
loop vector                 X                            X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility
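A small illustration of how these clauses combine (an assumed C example, not from the slides; with a plain C compiler the pragma is ignored and the function still computes the same value):

```c
/* Sums a 0/1 checkerboard pattern over an n-by-n domain.
   collapse(2) merges both loops into a single iteration space,
   tmp is privatized so each thread has its own copy, and the
   partial sums are combined with a + reduction at the end. */
double checkerboard_sum(int n) {
    double total = 0.0;
    double tmp;
    #pragma acc parallel loop collapse(2) private(tmp) reduction(+:total)
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            tmp = (i + j) % 2;   /* 1 on "black" squares, 0 on "white" */
            total += tmp;
        }
    }
    return total;
}
```

For an even n the result is n*n/2, whether the loop runs offloaded or serially.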


Work sharing - Loops

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 * nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 * workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 * vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 * workers * vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 * workers
• Number of operations: 10 * nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 * vector length
• Number of operations: 10 * nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i) * i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and its grid of threads distributes the work
• Active threads: 10 * workers * vector length
• Number of operations: 10 * nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop.
Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels

   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Operations

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)  + &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «Coalescing»    Example with «no Coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without Coalescing   With Coalescing
Time (ms)         439                  16

The memory coalescing version is 27 times faster.
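The same idea transposes to C, where the rightmost index is the contiguous one (the mirror of the Fortran listings above). A sketch with an assumed array size:

```c
#define NX 64
static double A[NX][NX];

/* Coalesced-friendly order for C: the vector loop runs over j, so the
   threads of a worker touch A[i][j], A[i][j+1], ... which sit next to
   each other in memory. With a plain C compiler the pragmas are ignored
   and the loops run serially. */
void init_coalesced(void) {
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; i++) {
        #pragma acc loop vector
        for (int j = 0; j < NX; j++)
            A[i][j] = 1.14;
    }
}
```

Swapping the two loops (vector over i) would make each thread stride by NX doubles and lose the coalescing.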


Parallelism - Memory accesses and coalescing

[Diagram slides: thread-to-element access patterns for the three variants. Without coalescing, gang(i)/vector(j): the threads T0..T7 of a worker access elements that are far apart in memory. With coalescing, gang(j)/vector(i): the threads T0..T7 access contiguous elements. gang-vector(i)/seq(j): each thread walks its own row sequentially.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Diagram: the parallel region is enclosed in its own (implicit) data region.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

No! You have to move the beginning of the data region to before the first loop, so that the copyout of the first parallel region is also covered by it.

Data on device - Optimizing transfers (diagram summary)

Naive version: two successive parallel regions, each with its own implicit data region. The four variables are handled twice: A (copyin), B (copyout), C (copy) and D (create) in each region.
• Transfers: 8
• Allocations: 2

Let's add a data region enclosing both parallel regions: A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1

Since the clauses check for data presence, to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
!!$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line (diagram summary)

The figure follows four variables through a region opened with copyin(a,c) create(b) copyout(d):
• Data transfer to the device: at entry, a and c are copied H→D; b and d are only allocated on the device.
• While kernels run, the host and device copies of a variable evolve independently.
• Update data D→H: update self(b) copies the device value of b back to the host.
• Exit data region: d is copied D→H (copyout); the device copies of a, b and c are discarded, so device-side changes to them are lost unless updated beforehand.

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

Timeline (diagram summary): executed synchronously, Kernel 1, the data transfer and Kernel 2 follow one another; with "Kernel 1 and transfer async", Kernel 1 and the data transfer overlap; with everything async, Kernel 1, the transfer and Kernel 2 all run concurrently.

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

Timeline (diagram summary): Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete.

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example: code that has a race condition.


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• Both models handle scalar variables and non-scalar variables implicitly.

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses of PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0

Data region
• OpenACC declare() (adds data to the global data region)
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0

Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously, explicit sync) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 29: Introduction to OpenACC and OpenMP GPU - IDRIS

Add directives

To activate OpenACC or OpenMP features, you need to add directives. If the right compiler options are not set, the directives are treated as comments. The syntax differs between Fortran and C/C++ code.

Fortran (OpenACC)
!$ACC directive <clauses>
...
!$ACC end directive

C/C++ (OpenACC)
#pragma acc directive <clauses>

Some examples: kernels, loop, parallel, data, enter data, exit data. In this presentation the examples are in Fortran.

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial
The code runs on one single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels
The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel
The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-Redundant
• Worker-Single
• Vector-Single
The programmer has to share the work manually.

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest is treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions met in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i = 1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i

!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang                  X                 |
loop worker                X                 |            X
loop vector                X                 |            X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do instead of loop, kept for compatibility

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR-WS-VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 nx

Execution GP-WS-VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP-WP-VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 * workers
• Number of operations: nx

Execution GP-WS-VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 * vector length
• Number of operations: nx

Execution GP-WP-VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 * workers * vector length
• Number of operations: nx

Execution GR-WP-VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 * workers
• Number of operations: 10 nx

Execution GR-WS-VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 * vector length
• Number of operations: 10 nx

Execution GR-WP-VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and its grid of threads distributes the work
• Active threads: 10 * workers * vector length
• Number of operations: 10 nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code that follows the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to obtain the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
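As a sketch, the same reduction pattern can be written in C. With an OpenACC compiler (e.g. PGI/NVHPC) the partial sums of the private copies are combined on the device; with any other compiler the unknown pragma is ignored and the loop simply runs sequentially, with the same result:

```c
#include <assert.h>

/* Sum of squares with an OpenACC "+" reduction: each thread keeps a
   private copy of `sum`, and the partial sums are combined when the
   loop finishes. Without OpenACC support the pragma is ignored and
   the loop runs sequentially on the host. */
long sum_of_squares(int n)
{
    long sum = 0;
    #pragma acc parallel loop reduction(+:sum)
    for (int i = 1; i <= n; ++i)
        sum += (long)i * i;
    return sum;
}
```

The result is identical whether or not the loop is offloaded, which is precisely what the reduction clause guarantees.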

Parallelism - Reductions example

!$ACC parallel
do g=1, generations
!$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged; this optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

[Figure: example with «coalescing» vs example with «no coalescing»]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure, shown in several animation steps: access patterns of threads T0–T7 on a 2D array. Without coalescing — gang(i), vector(j) — consecutive threads of a worker touch non-contiguous memory locations. With coalescing — gang(j), vector(i) — consecutive threads touch contiguous locations; the gang-vector(i), seq(j) variant also yields contiguous accesses.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and it is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.
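A local (structured) data region can be sketched in C as follows; with an OpenACC compiler the array lives on the device across both kernels, while with any other compiler the pragmas are ignored and the code runs sequentially on the host with the same result:

```c
#include <assert.h>

#define N 1000

/* A structured data region: `a` is allocated on the device once,
   used by two offloaded loops, then copied back a single time at
   the closing brace (copyout). Ignored pragmas leave two plain
   host loops. */
void fill_and_increment(int *a)
{
    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = i;

        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] += 1;
    }
}
```

Keeping both loops inside one data region avoids a round trip of `a` between the two kernels.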

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check if the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
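A minimal sketch of copyin versus copy on an axpy-style kernel: the input vector is only read on the device (copyin), while the output vector is both read and written (copy). With the pragma ignored by a non-OpenACC compiler, the loop runs on the host with identical results:

```c
#include <assert.h>

#define N 4

/* y = alpha*x + y: `x` needs only an H->D transfer (copyin),
   `y` needs H->D at entry and D->H at exit (copy). */
void axpy(double alpha, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:N]) copy(y[0:N])
    for (int i = 0; i < N; ++i)
        y[i] = alpha * x[i] + y[i];
}
```

Choosing the weakest sufficient clause (copyin instead of copy for read-only data) halves the traffic for that variable.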

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).
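The C/C++ restriction above can be illustrated on a dynamically allocated array, where the `[start:count]` shape is mandatory because the compiler cannot deduce the length from the pointer. As before, a non-OpenACC compiler ignores the pragmas and runs the loops on the host:

```c
#include <assert.h>
#include <stdlib.h>

/* `v` is heap-allocated, so the transfer length must be given
   explicitly: v[0:n] means "n elements starting at index 0". */
double sum_init(int n)
{
    double *v = malloc(n * sizeof *v);
    double s = 0.0;

    #pragma acc parallel loop copyout(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = 1.0;

    #pragma acc parallel loop reduction(+:s) copyin(v[0:n])
    for (int i = 0; i < n; ++i)
        s += v[i];

    free(v);
    return s;
}
```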

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region implicitly opens a data region.]

Data on device - Parallel regions

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No: you have to move the beginning of the data region to before the first loop.

Data on device

[Figure, shown in several steps: two parallel regions, each with its implicit data region, using variables A (copyin), B (copyout), C (copy) and D (create).]

Naive version — each parallel region handles its own transfers:
• Transfers: 8
• Allocations: 2

Let's add a data region enclosing both parallel regions. A, B and C are then transferred only at entry and exit of the data region:
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device, update self):
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90), the a array is not up to date on the host before the end of the data region. With it, a(42) holds the device value on the host right after the update self directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
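Both update directions can be sketched together in C. This is a hypothetical round trip, assuming the arrays are made present by the enclosing data region; with a non-OpenACC compiler the pragmas are ignored, execution stays on the host, and the assertions below hold either way:

```c
#include <assert.h>

#define N 8

/* Inside the data region, `update self` refreshes the host copy of
   `a` (D -> H), and `update device` refreshes the device copy of `b`
   (H -> D) after a host-side modification. */
void refresh(int *a, int *b)
{
    #pragma acc data copy(a[0:N], b[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = 2 * b[i];

        #pragma acc update self(a[0:N])   /* D -> H */

        for (int i = 0; i < N; ++i)       /* host-side change */
            b[i] += 1;

        #pragma acc update device(b[0:N]) /* H -> D */
    }
}
```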

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is that the data transfers to the device can be managed at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: timeline of host and device copies for a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device without transfer. The device then computes new values. update self(b) copies the device value of b back to the host. When the region is exited, d is copied D→H (copyout) and the device copies are released; the other host variables keep their pre-region values unless explicitly updated.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution runs "Kernel 1 / Data transfer / Kernel 2" back to back; making the first kernel and the transfer async overlaps them; making everything async overlaps all three.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer issued on execution thread 2.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
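The same async/wait pattern can be sketched in C. This mirrors the Fortran fragment above and assumes `b` is made present by an enclosing data region when a real OpenACC compiler is used; with the pragmas ignored, execution is sequential and the result is identical:

```c
#include <assert.h>

#define N 100

/* The kernel on queue 1 and the update on queue 2 may run
   concurrently; the last loop waits for queue 2 before reading `b`,
   and the final `wait` drains all queues. */
void pipeline(int *a, int *b)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < N; ++i)
        a[i] = 42;

    #pragma acc update device(b[0:N]) async(2)

    #pragma acc parallel loop wait(2)
    for (int i = 0; i < N; ++i)
        b[i] = b[i] + 1;

    #pragma acc wait
}
```

Only the dependency that actually exists (kernel 3 on the transfer of `b`) is synchronized; the first kernel stays free to overlap.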


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the target options at compilation (-ta=tesla:cc60,autocompare).

[Screenshot: example run on a code that has a race condition.]


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
!$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1, rows
    do c=1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel loop
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c  )                 + oworld(r+1,c  ) + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading
  OpenACC: PARALLEL — GR-WS-VS mode; each gang executes the instructions redundantly
  OpenMP:  TEAMS — creates teams of threads that execute redundantly
  OpenACC: SERIAL — only 1 thread executes the code
  OpenMP:  TARGET — GPU offloading, initial thread only
  OpenACC: KERNELS — offloading + automatic parallelization
  OpenMP:  — (does not exist in OpenMP)

Implicit transfer H→D or D→H
  OpenACC: scalar variables, non-scalar variables
  OpenMP:  scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
  OpenACC: clauses of PARALLEL/SERIAL/KERNELS
  OpenMP:  clauses of TARGET (data movement defined from the host perspective)

  copyin()  — copy H→D when entering the construct             → map(to:)
  copyout() — copy D→H when leaving the construct              → map(from:)
  copy()    — copy H→D when entering and D→H when leaving      → map(tofrom:)
  create()  — created on device, no transfers                  → map(alloc:)
  present(), delete()

Data region
  OpenACC: declare() — add data to the global data region
  OpenACC: data / end data — inside the same programming unit
  OpenMP:  TARGET DATA MAP(to:/from:) / END TARGET DATA — inside the same programming unit
  OpenACC: enter data / exit data — opening and closing anywhere in the code
  OpenMP:  TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
  OpenACC: update self() — update D→H
  OpenMP:  UPDATE FROM — copy D→H inside a data region
  OpenACC: update device() — update H→D
  OpenMP:  UPDATE TO — copy H→D inside a data region

Loop parallelization
  OpenACC: kernels — offloading + automatic parallelization of suitable loops, sync at the end of the loops
  OpenMP:  — (does not exist in OpenMP)
  OpenACC: loop gang/worker/vector — share iterations among gangs, workers and vectors
  OpenMP:  distribute — share iterations among TEAMS

Routines
  OpenACC: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism

Asynchronism
  OpenACC: async()/wait — execute kernels/transfers asynchronously, with explicit synchronization
  OpenMP:  nowait/depend — manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 30: Introduction to OpenACC and OpenMP GPU - IDRIS

Execution offloading

To run on the device, you need to specify one of the 3 following directives (called compute constructs):

serial

The code runs on a single thread (1 gang with 1 worker of size 1). Only one kernel is generated.

kernels

The compiler analyzes the code and decides where parallelism is generated. One kernel is generated for each parallel loop enclosed in the region.

parallel

The programmer creates a parallel region that runs on the device. Only one kernel is generated. The execution is redundant by default:
• Gang-redundant
• Worker-Single
• Vector-Single

The programmer has to share the work manually.
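The three compute constructs can be compared on a toy loop. Below is a minimal C sketch (function names are hypothetical, not from the course); with a compiler that lacks OpenACC support the pragmas are simply ignored and the code runs sequentially on the host, producing the same results.

```c
#include <stddef.h>

/* serial: one gang, one worker, vector length 1 -- a single device thread */
void init_serial(double *a, size_t n) {
    #pragma acc serial copyout(a[0:n])
    {
        for (size_t i = 0; i < n; ++i)
            a[i] = 1.0;
    }
}

/* kernels: the compiler analyzes the region and builds one kernel per loop */
void init_kernels(double *a, double *b, size_t n) {
    #pragma acc kernels copyout(a[0:n], b[0:n])
    {
        for (size_t i = 0; i < n; ++i) a[i] = 2.0 * i;
        for (size_t i = 0; i < n; ++i) b[i] = a[i] + 1.0;
    }
}

/* parallel: the programmer asserts parallelism; loop shares the iterations */
void init_parallel(double *a, size_t n) {
    #pragma acc parallel loop copyout(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = 3.0 * i;
}
```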

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread: one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1) num_workers(1) vector_length(1).

Data: Default behavior
• Arrays present in the serial region not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran

!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial

Offloaded execution - Directive serial

$ACC s e r i a ldo g=1 generat ions

do r =1 rowsdo c=1 co ls

35 oworld ( r c ) = wor ld ( r c )enddo

enddodo r =1 rows

do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

45 else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

enddo50 c e l l s = 0

do r =1 rowsdo c=1 co ls

c e l l s = c e l l s + wor ld ( r c )enddo

55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end s e r i a l

exemplesgol_serialf90

Testbull Size 1000x1000bull Generations 100bull Elapsed time 123640 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 29 77

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
• Arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

$ACC kerne lsdo g=1 generat ions

do r =1 rowsdo c=1 co ls

35 oworld ( r c ) = wor ld ( r c )enddo

enddodo r =1 rows

do c=1 co ls40 neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

45 else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

enddo50 c e l l s = 0

do r =1 rowsdo c=1 co ls

c e l l s = c e l l s + wor ld ( r c )enddo

55 enddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end kerne ls

exemplesgol_kernelsf90

Testbull Size 1000x1000bull Generations 100bull Execution time 1951 s

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 31 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region
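The gang-redundant vs. work-sharing distinction can be sketched in C (hypothetical example, not from the course): the assignment to f sits in the parallel region outside any loop directive, so every gang executes it redundantly, while the loop iterations are split among gangs. Compiled without OpenACC, the pragmas are ignored and the result is unchanged.

```c
#include <stddef.h>

/* Multiply every element by a factor computed inside the parallel region. */
void scale(double *a, size_t n) {
    #pragma acc parallel copy(a[0:n])
    {
        double f = 2.0;             /* gang-redundant: every gang sets f   */
        #pragma acc loop            /* work sharing: iterations are shared */
        for (size_t i = 0; i < n; ++i)
            a[i] *= f;
    }
}
```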

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                              X
loop vector                X                              X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive written do for compatibility
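The collapse, private and reduction clauses can be combined on one loop directive. A hedged C sketch (hypothetical function, not from the course; sequential when the pragmas are ignored): collapse(2) merges the two tightly nested loops into a single iteration space, tmp is privatized per thread, and total is combined with + at the end.

```c
#include <stddef.h>

/* Fill a row-major rows x cols matrix with its linear index and sum it. */
double fill_and_sum(double *a, size_t rows, size_t cols) {
    double total = 0.0;
    double tmp;
    #pragma acc parallel loop collapse(2) private(tmp) reduction(+:total) \
            copyout(a[0:rows*cols])
    for (size_t r = 0; r < rows; ++r)
        for (size_t c = 0; c < cols; ++c) {
            tmp = (double)(r * cols + c);   /* private scratch per thread */
            a[r * cols + c] = tmp;
            total += tmp;
        }
    return total;
}
```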

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is executed to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
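Several reductions can be attached to the same loop. A hedged C sketch (hypothetical function, not from the course): each thread gets private copies of s and m, and the partial results are combined with + and max when the loop completes. Run sequentially (pragmas ignored), the result is the same.

```c
#include <stddef.h>

/* Compute the sum and the maximum of a vector in a single pass. */
void sum_max(const double *a, size_t n, double *sum, double *maxval) {
    double s = 0.0, m = a[0];
    #pragma acc parallel loop copyin(a[0:n]) reduction(+:s) reduction(max:m)
    for (size_t i = 0; i < n; ++i) {
        s += a[i];
        if (a[i] > m) m = a[i];
    }
    *sum = s;
    *maxval = m;
}
```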

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i + 1 reaches memory location n + 1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: access pattern with coalescing vs. without coalescing]
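The same principle in C (hypothetical example, not from the course): C is row-major, so the innermost index j walks contiguous memory, and putting vector parallelism on the j loop lets consecutive threads of a worker touch consecutive addresses so the accesses coalesce. Without an OpenACC compiler the pragmas are ignored and the fill still runs correctly.

```c
#define N 64

/* Coalesced fill: gang over rows, vector over the contiguous column index. */
void fill_coalesced(double a[N][N]) {
    #pragma acc parallel loop gang copyout(a[0:N][0:N])
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector   /* contiguous: thread t writes a[i][t] */
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14;
    }
}
```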

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)         43.9                1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping (threads T0..T7) for the three variants. Without coalescing — gang(i)/vector(j): consecutive threads of a worker access memory locations that are far apart. With coalescing — gang(j)/vector(i): consecutive threads access contiguous memory locations (in Fortran the first index is contiguous). gang-vector(i)/seq(j): each thread walks its own row sequentially, so accesses of consecutive threads stay contiguous.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
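The C equivalent of the Fortran example uses `#pragma acc routine` on the callee. A hedged sketch (hypothetical names, not from the course): `routine seq` states that the callee contains no parallelism of its own, so each device thread runs it sequentially. With the pragmas ignored, this is an ordinary sequential program.

```c
#include <stddef.h>

/* Compiled for the device too; each thread executes it sequentially. */
#pragma acc routine seq
double squared(double x) { return x * x; }

void square_all(double *a, size_t n) {
    #pragma acc parallel loop copy(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = squared(a[i]);
}
```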

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
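The movement clauses can be combined on one data region. A hedged C sketch (hypothetical function, not from the course): b is only read on the device (copyin), c is only written back (copyout), and tmp lives on the device alone (create), so tmp never crosses the host/device boundary. Run sequentially, the host arrays are used directly and the result is identical.

```c
#include <stddef.h>

/* c[i] = 2*b[i] + 1, staged through a device-only scratch array tmp. */
void stage(const double *b, double *c, double *tmp, size_t n) {
    #pragma acc data copyin(b[0:n]) copyout(c[0:n]) create(tmp[0:n])
    {
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            tmp[i] = 2.0 * b[i];        /* written on the device only */
        #pragma acc parallel loop
        for (size_t i = 0; i < n; ++i)
            c[i] = tmp[i] + 1.0;
    }
}
```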

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is taken to be the default of the language (C/C++: 0, Fortran: 1)
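The dynamic-allocation case and slicing can be sketched together (hypothetical function, not from the course): the compiler cannot see the size of a malloc'd array, so the data clause must give [first:count] explicitly, for the whole vector and then for a slice. With the pragmas ignored, the fills run sequentially on the host array.

```c
#include <stdlib.h>
#include <stddef.h>

/* Allocate an n-element vector, zero it, then set elements 10..14 to 42. */
double *make_filled(size_t n) {           /* requires n >= 15 */
    double *v = malloc(n * sizeof *v);
    #pragma acc parallel loop copyout(v[0:n])   /* whole array          */
    for (size_t i = 0; i < n; ++i)
        v[i] = 0.0;
    #pragma acc parallel loop copy(v[10:5])     /* slice: 5 elems at 10 */
    for (size_t i = 10; i < 15; ++i)
        v[i] = 42.0;
    return v;
}
```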

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region implicitly opens a data region with the same extent.]

Data on device - Parallel regions

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!
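Hoisting the data region out of the iteration loop, so the array crosses the host/device boundary once instead of once per iteration, looks like this in C (hypothetical function, not from the course; sequential when the pragmas are ignored):

```c
#include <stddef.h>

/* iters relaxation sweeps; a is transferred once for the whole loop. */
void relax(double *a, size_t n, int iters) {
    #pragma acc data copy(a[0:n])
    {
        for (int it = 0; it < iters; ++it) {
            #pragma acc parallel loop   /* a already present: no transfer */
            for (size_t i = 0; i < n; ++i)
                a[i] += 1.0;
        }
    }
}
```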

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region, using copyin(A), copyout(B), copy(C) and create(D) in each region.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions, carrying copyin(A), copyout(B), copy(C) and create(D). A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done between the regions, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90, where it is commented out), the a array is not updated on the host before the end of the data region. With update, the host copy of a is updated as soon as the directive executes.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
function   | the function
subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Timeline diagram, host vs device: at entry to the data region, copyin(a,c) and create(b) allocate/transfer the variables to the device and copyout(d) allocates d; during the region the device updates the values; update self(b) copies b back D→H; at exit, copyout(d) returns d to the host]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three run concurrently]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.
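As a hedged sketch (array and bound names are illustrative), wait can appear either as a clause on a directive or as a standalone directive:

```fortran
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
! Clause form: this kernel starts only after queue 1 has completed
!$ACC parallel loop async(2) wait(1)
do i=1,s
   a(i) = a(i) + 1
enddo
! Directive form: the host blocks until queues 1 and 2 have completed
!$ACC wait(1,2)
```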

[Diagram: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer to complete before starting]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation (-ta=tesla:cc60:autocompare)

• Example of code that has a race condition:
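As an illustrative sketch (file and binary names are assumptions), the feature is enabled entirely at compile time with the PGI compiler:

```shell
# Hypothetical compile line: run each kernel on both CPU and GPU and compare results
pgfortran -ta=tesla:cc60:autocompare -o gol_autocompare gol.f90
```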


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs


Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features — OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ X (does not exist in OpenMP)

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions — clauses of PARALLEL/SERIAL/KERNELS ↔ clause of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features — OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops; sync at the end of loops) ↔ X (does not exist in OpenMP)
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)


Features — OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Offloaded execution - Directive serial

The serial directive tells the compiler that the enclosed region of code has to be offloaded to the GPU. Instructions are executed by only one thread. This means that one gang with one worker of size one is generated. This is equivalent to a parallel region with the following parameters: num_gangs(1), num_workers(1), vector_length(1).

Data: Default behavior
• Arrays present in the serial region, not specified in a data clause (present, copyin, copyout, etc.) or a declare directive, are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

Fortran
!$ACC serial
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end serial


Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 1236.40 s


Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: Default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels


Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively


Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute redundantly the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2          ! Gang-redundant
!$ACC loop     ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (acc=noautopar)
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters inside kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the worker's threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

            | Work sharing (loop iterations) | Activation of new threads
loop gang   | X                              |
loop worker | X                              | X
loop vector | X                              | X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility

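The clauses above can be combined on a single loop directive; a minimal sketch, with illustrative names:

```fortran
!$ACC parallel
!$ACC loop gang worker vector collapse(2) private(tmp) reduction(+:total)
do j=1,n
   do i=1,n
      tmp = a(i,j)*2          ! tmp is private to each thread
      total = total + tmp     ! partial sums are combined at the end of the loop
   enddo
enddo
!$ACC end parallel
```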

Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting a new portion of the code after the loop. Inside a parallel region, if several parallel loops are present and there are some dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in the table below to get the final value of the variable.

Operations

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC specification 2.7, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
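A minimal sketch of a scalar reduction (names illustrative): each thread accumulates into a private copy of total, and the + operation combines the private copies at the end of the loop:

```fortran
total = 0.0_8
!$ACC parallel loop reduction(+:total)
do i=1,n
   total = total + a(i)
enddo
print *, "sum = ", total
```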


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» — Example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

Time (ms): without coalescing 439, with coalescing 16.

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Diagram, shown over four animation slides: memory-access patterns of threads T0–T7 for the three loop mappings. With gang(i)/vector(j) (no coalescing), consecutive threads of a worker access non-contiguous elements of A; with gang(j)/vector(i) (coalescing), consecutive threads access consecutive elements; with gang-vector(i)/seq(j), at each step of the sequential j loop consecutive threads access consecutive elements (coalesced)]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program r o u t i n ei n t e g e r s = 10000

3 in teger a l l o c a t a b l e a r ray ( )a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l$ACC loopdo i =1 s

8 c a l l f i l l ( a r ray ( ) s i )enddo$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta ins

13 subrou t ine f i l l ( array s i )i n teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

18 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutine_wrongf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it must be declared as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken.
You might see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50 / 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region implicitly opens a data region around it]

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device

Naive version

[Diagram: two successive parallel regions, each with its own implicit data region; arrays A (copyin), B (copyout), C (copy) and D (create) appear in the clauses of both regions]

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Diagram: the same two parallel regions, each still carrying its own copyin, copyout, copy and create clauses for A, B, C and D]

Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Diagram: the two parallel regions are now enclosed in a data region carrying the copyin, copyout, copy and create clauses for A, B, C and D]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers

[Diagram: inside the data region, the parallel regions now use present clauses for A, B, C and D; update device (for A) and update self (for C) refresh the values where needed]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Diagram: host and device copies of a, b, c, d along the execution of a data region]
• Entering the region with copyin(a,c) create(b) copyout(d): a and c are copied H→D; b and d are allocated on the device without transfer
• Kernels modify the values on the device
• update self(b): the device value of b is copied D→H
• Exiting the region: d is copied D→H (copyout); the device memory is freed without further transfer for a, b and c

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer on execution thread 2 before starting]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

GPU offloading
• PARALLEL — GR/WS/VS mode; each gang executes instructions redundantly ↔ TEAMS — creates teams of threads, executed redundantly
• SERIAL — only 1 thread executes the code ↔ TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables (on both sides)

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET — data movement defined from the host perspective
• copyin() — copy H→D when entering the construct ↔ map(to:)
• copyout() — copy D→H when leaving the construct ↔ map(from:)
• copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create() — created on device, no transfers ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

Data region
• declare() — add data to the global data region
• data / end data — inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H ↔ UPDATE FROM — copy D→H inside a data region
• update device() — update H→D ↔ UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit sync ↔ nowait/depend — manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • Kernels/Teams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 2.7/OpenMP 5.0 translation table
                      • Contacts
Page 32: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Offloaded execution - Directive serial

!$ACC serial
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end serial

exemples/gol_serial.f90

Test
• Size: 1000x1000
• Generations: 100
• Elapsed time: 123640 s

Pierre-François Lavallée, Thibaut Véry 29 / 77

Execution offloading - Kernels: In the compiler we trust

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran

!$ACC kernels
do i=1,n
  do j=1,n
    ...
  enddo
  do j=1,n
    ...
  enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Pierre-François Lavallée, Thibaut Véry 30 / 77

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1951 s

Pierre-François Lavallée, Thibaut Véry 31 / 77

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions met in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive may have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i = 1, n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Pierre-François Lavallée, Thibaut Véry 32 / 77

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)          +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

acc=noautopar
Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-François Lavallée, Thibaut Véry 33 / 77

Execution offloading - parallel directive

acc=autopar activated / acc=noautopar
Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-François Lavallée, Thibaut Véry 33 / 77

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Pierre-François Lavallée, Thibaut Véry 34 / 77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang     X
loop worker   X                                X
loop vector   X                                X

Pierre-François Lavallée, Thibaut Véry 35 / 77

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility


Work sharing - Loops

!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait for the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213.0 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000.0 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to:
  !$ACC kernels loop <kernels and loop clauses>
  do i=1, 1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is applied to the private copies to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are also possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

Example with «coalescing» / Example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
  !$acc loop vector
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
  !$acc loop seq
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Tps (ms)   439                  16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mappings for the three variants. Without coalescing — gang(i)/vector(j): the threads T0..T7 of a worker access elements that are strided in memory. With coalescing — gang(j)/vector(i): the threads access contiguous elements. gang-vector(i)/seq(j): each thread walks its own row sequentially]


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened for the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called, and is available for the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken.
You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
  do j=1, 1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
  do j=100, 199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its implicit data region]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables in its clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables in its clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.


Data on device

Naive version
[Figure: two successive parallel regions, each with its own implicit data region, acting on the variables A (copyin), B (copyout), C (copy) and D (create)]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers
[Figure: the same two parallel regions and their copyin/copyout/copy/create clauses, before enclosing them in a common data region]

Let's add a data region!


Data on device

Optimize data transfers
[Figure: a data region now encloses both parallel regions; the copyin, copyout, copy and create clauses are moved to the data region]

A, B and C are now transferred at entry and exit of the data region:
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers
[Figure: inside the data region the parallel regions now use the present clause for A, B, C and D; update device refreshes A on the device and update self refreshes C on the host]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo
do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo
do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1, s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: time line of a data region created with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D while b and d are only allocated on the device; the host copies keep their old values while the device computes new ones. During the region, update self(b) copies the device value of b back to the host. On exit, copyout(d) copies d D→H and the device values of a, b and c are discarded]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. Specifying different numbers creates several execution threads.

[Figure: synchronous execution runs kernel 1, the data transfer and kernel 2 one after the other; with "kernel 1 and transfer async" the first kernel overlaps the transfer; with everything async all three overlap]

! b is initialized on host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of the wait clause or wait directive is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: kernel 1 and the data transfer run asynchronously; kernel 2 waits for the transfer before starting]

! b is initialized on host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features / OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
(OpenACC: clauses of PARALLEL/SERIAL/KERNELS; OpenMP: clauses of TARGET — data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features / OpenACC 2.7 / OpenMP 5.0 translation table (continued)

Data region
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) ↔ no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async() / wait (execute kernels/transfers asynchronously, explicit synchronization) → nowait / depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 114_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 114_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 114_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 114_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 114_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Execution offloading - Kernels: "In the compiler we trust"

The kernels directive tells the compiler that the region contains instructions to be offloaded on the device. Each loop nest will be treated as an independent kernel with its own parameters (number of gangs, workers and vector size).

Data: default behavior
• Data arrays present inside the kernels region and not specified inside a data clause (present, copyin, copyout, etc.) or inside a declare are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a copy clause. They are SHARED.

Fortran:
!$ACC kernels
do i=1,n
   do j=1,n
      ...
   enddo
   do j=1,n
      ...
   enddo
enddo
!$ACC end kernels

Important notes
• The parameters of the parallel regions are independent (gangs, workers, vector size)
• Loop nests are executed consecutively

Execution offloading - Kernels

!$ACC kernels
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s

Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs execute the instructions in the region redundantly (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data: default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2         ! Gang-redundant
!$ACC loop    ! Work sharing
do i=1,n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test (acc=noautopar)
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.


Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture: use num_gangs with care

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, where the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing       Activation of
              (loop iterations)  new threads
loop gang          X
loop worker        X                  X
loop vector        X                  X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code that follows the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: somme = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
!  The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is applied to get the final value of the variable.

Reduction operators

Operation  Effect       Language(s)
+          Sum          Fortran, C(++)
*          Product      Fortran, C(++)
max        Maximum      Fortran, C(++)
min        Minimum      Fortran, C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)+oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop with vector parallelism should have contiguous access to memory.

[Figure: example with «coalescing» vs example without «coalescing»]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j) = 114_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)         43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: access patterns of threads T0–T7 for the three mappings: without coalescing, gang(i)/vector(j); with coalescing, gang(j)/vector(i); and gang-vector(i)/seq(j). In the coalesced mappings, consecutive threads touch consecutive array elements at each step, so their accesses can be merged; in the non-coalesced mapping, each thread strides through memory.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called and is available during its lifetime. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region with its associated data region]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device - Optimizing data transfers

Naive version: two consecutive parallel regions, each with its own implicit data region, using variables A (copyin), B (copyout), C (copy) and D (create); every variable is transferred or allocated at each region.
• Transfers: 8
• Allocations: 2

Let's add a data region enclosing both parallel regions: A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device for the copyin variable, update self for the copy variable).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self:", a(42)
!$ACC update self(a(42:42))
print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90, where the update line is commented out), the a array is not initialized on the host before the end of the data region. With it, the a array is up to date on the host right after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions enter data exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i = 1, s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Timeline figure, host vs. device: a data region is opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b is allocated on the device; the kernels then modify the device copies while the host values stay unchanged. An update self(b) copies b D→H in the middle of the region. When the region ends, only d is copied back D→H (copyout).]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Timeline figure: in the synchronous case, Kernel 1, the data transfer and Kernel 2 run one after the other. With "Kernel 1 and transfer async", the kernel and the transfer overlap. When everything is async, Kernel 1, the data transfer and Kernel 2 all run concurrently.]

! b is initialized on host
do i = 1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i = 1, s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, you have to add a wait(execution-thread number) clause to the directive that has to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers identifying execution threads. For example, wait(1,2) waits until execution threads 1 and 2 have completed.

[Timeline figure: Kernel 1 and the data transfer run on their own execution threads; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i = 1, s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i = 1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)

• Example of code that has a race condition



Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g = 1, generations
   cells = 0
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r = 1, rows
      do c = 1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g = 1, generations
   cells = 0
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel loop
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
   cells = 0
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c = 1, cols
      do r = 1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c = 1, cols
      do r = 1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.



Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL — GR/WS/VS mode, each gang executes the instructions redundantly | TEAMS — creates teams of threads which execute redundantly
• SERIAL — only 1 thread executes the code | TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization | does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
(OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct | map(to:)
• copyout() — copy D→H when leaving the construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()


Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() — add data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM — copy D→H inside a data region
• update device() — update H→D | UPDATE TO — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS


Features | OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
!$OMP PARALLEL DO SIMD
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k = 1, nx
!$acc loop worker
   do j = 1, nx
!$acc loop vector
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
!$OMP PARALLEL DO
   do j = 1, nx
!$OMP SIMD
      do i = 1, nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 34: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Execution offloading - Kernels

!$ACC kernels
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end kernels

exemples/gol_kernels.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 19.51 s


Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only parallel loop nests with a loop directive might have their iterations spread among gangs.

Data default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2 ! Gang-redundant
!$ACC loop ! Work sharing
do i = 1, n
   ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g = 1, generations
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
  print *, "Hello, I am a gang"
  do i = 1, 1000
     a(i) = i
  enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                             X
loop vector                X                             X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do for compatibility.


Work sharing - Loops

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i = 1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i = nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i = 1, 1000
     a(i) = b(i)*2
  enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i = 1, 1000
     a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described below is applied to get the final value of the variable.

Operations

Operation  Effect        Language(s)
+          Sum           Fortran/C(++)
*          Product       Fortran/C(++)
max        Maximum       Fortran/C(++)
min        Minimum       Fortran/C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g = 1, generations
!$ACC loop
   do r = 1, rows
      do c = 1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r = 1, rows
      do c = 1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r = 1, rows
      do c = 1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

Example with «coalescing» / Example without «coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i = 1, nx
!$acc loop vector
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j = 1, nx
!$acc loop vector
   do i = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i = 1, nx
!$acc loop seq
   do j = 1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure, shown step by step over four slides: thread-to-memory mapping diagrams. Without coalescing — gang(i)/vector(j) — consecutive threads T0..T7 touch memory locations that are far apart, so their accesses cannot be merged. With coalescing — gang(j)/vector(i) — consecutive threads touch consecutive memory locations and the accesses are merged. The gang-vector(i)/seq(j) variant also yields contiguous accesses.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i = 1, s
     call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j = 1, s
       array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i = 1, s
     call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j = 1, s
       array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region: serial, parallel, kernels.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already present on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i = 1, 1000
  do j = 1, 1000
    a(i,j) = 0
  enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i = 1, 1000
  do j = 100, 199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
  for (int j = 99; j < 199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l = 1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i = 1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i = 1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l = 1, 100
  !$ACC parallel
  !$ACC loop
  do i = 1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No: you have to move the beginning of the data region to before the first loop.


Data on device

Naive version: each of the two parallel regions carries its own implicit data region, so the variables A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated for each region.
• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: with two separate parallel regions, A (copyin), B (copyout), C (copy) and D (create) are still handled twice. Let's add a data region.


Data on device

Optimize data transfers: with an enclosing data region, A, B and C are now transferred at entry and exit of the data region only, and D is allocated once.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers: inside the enclosing data region, the parallel regions see A, B, C and D as present, and the copies are refreshed with update device / update self where needed. The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1


Data on device - update self
Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo

do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.



!$ACC data copyout(a)
!$ACC parallel loop
do i = 1, 1000
  a(i) = 0
enddo

do j = 1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i = 1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i = 1, s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone | Scope
module | the program
function | the function
subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D while b and d are allocated on the device without transfer; during the region, update self(b) copies b D→H; on exit of the data region, only d is copied back D→H, and the host keeps its own values for the other variables.


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

Figure: without async, Kernel 1, the data transfer and Kernel 2 execute one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, the two kernels and the transfer can all run concurrently.

! b is initialized on the host
do i = 1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i = 1, s
  c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete.

! b is initialized on the host
do i = 1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i = 1, s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i = 1, s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.


Debugging - PGI autocompare

• To activate this feature, just add the autocompare sub-option at compilation (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition
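A minimal compile-and-run sketch (the executable and source file names are placeholders; the sub-option is the one named on the slide):

```shell
# Build with OpenACC offloading plus the autocompare sub-option (PGI compiler):
pgfortran -ta=tesla:cc60,autocompare -o gol gol.f90

# Run as usual: each kernel executes on both CPU and GPU, and the program
# stops as soon as the two sets of results differ.
./gol
```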


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g = 1, generations
  cells = 0
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r = 1, rows
    do c = 1, cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loop iterations are distributed among gangs.


Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g = 1, generations
  cells = 0
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c = 1, cols
    do r = 1, rows
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c = 1, cols
    do r = 1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL — GR/WS/VS mode, each gang executes instructions redundantly | TEAMS — creates teams of threads that execute redundantly
• SERIAL — only 1 thread executes the code | TARGET — GPU offloading, initial thread only
• KERNELS — offloading + automatic parallelization | does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables | scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses on PARALLEL/SERIAL/KERNELS | clauses on TARGET (data movement defined from the host perspective)
• copyin() — copy H→D when entering the construct | map(to:)
• copyout() — copy D→H when leaving the construct | map(from:)
• copy() — copy H→D when entering and D→H when leaving | map(tofrom:)
• create() — created on device, no transfers | map(alloc:)
• present(), delete()


Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() — add data to the global data region
• data / end data — inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
• enter data / exit data — opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
• update self() — update D→H | UPDATE FROM: — copy D→H inside a data region
• update device() — update H→D | UPDATE TO: — copy H→D inside a data region

Loop parallelization
• kernels — offloading + automatic parallelization of suitable loops, sync at the end of loops | does not exist in OpenMP
• loop gang/worker/vector — share iterations among gangs, workers and vectors | distribute — share iterations among TEAMS


Features | OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
• asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit sync | nowait/depend — manages asynchronism and task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i = 1, nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j = 1, nx
  !$acc loop vector
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i = 1, nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j = 1, nx
  !$OMP PARALLEL DO SIMD
  do i = 1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k = 1, nx
  !$acc loop worker
  do j = 1, nx
    !$acc loop vector
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k = 1, nx
  !$OMP PARALLEL DO
  do j = 1, nx
    !$OMP SIMD
    do i = 1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Execution offloading - parallel directive

The parallel directive opens a parallel region on the device and generates one or more gangs. All gangs redundantly execute the instructions met in the region (gang-redundant mode). Only loop nests with a loop directive have their iterations spread among gangs.

Data default behavior
• Data arrays present inside the region and not specified in a data clause (present, copyin, copyout, etc.) or a declare directive are assigned to a copy clause. They are SHARED.
• Scalar variables are implicitly assigned to a firstprivate clause. They are PRIVATE.

!$ACC parallel
a = 2           ! Gang-redundant
!$ACC loop      ! Work sharing
do i = 1, n
  ...
enddo
!$ACC end parallel

Important notes
• The number of gangs, the number of workers and the vector size are constant inside the region.


Execution offloading - parallel directive

!$ACC parallel
do g = 1, generations
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

acc=noautopar

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 1204.67 s (noautopar)
• Execution time: 50.67 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.




Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size; the number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer can set those parameters on the kernels and parallel directives with clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
  integer :: a(1000)
  integer :: i

  !$ACC parallel num_gangs(10) num_workers(1) &
  !$ACC& vector_length(128)
  print *, "Hello! I am a gang."
  do i = 1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.


Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared among those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared among those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect dependencies

Clause | Work sharing (loop iterations) | Activation of new threads
loop gang | X |
loop worker | X | X
loop vector | X | X


Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(<number of loops>)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(<variable list>)
• Reduction: reduction(<operation>:<variable list>)

Notes
• You might encounter the do directive, kept for compatibility.


Work sharing - Loops

!$ACC parallel
do g = 1, generations
  !$ACC loop
  do r = 1, rows
    do c = 1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r = 1, rows
    do c = 1, cols
      neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
              oworld(r-1,c)                   + oworld(r+1,c)   + &
              oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  do r = 1, rows
    do c = 1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i = 1, nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i = 1, nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i = nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i = 1, nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i = nx, 1, -1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels

! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction clause is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to obtain the final value of the variable.

Reduction operators

Operation  Effect        Language(s)
+          Sum           Fortran, C(++)
*          Product       Fortran, C(++)
max        Maximum       Fortran, C(++)
min        Minimum       Fortran, C(++)
&          Bitwise and   C(++)
|          Bitwise or    C(++)
&&         Logical and   C(++)
||         Logical or    C(++)
iand       Bitwise and   Fortran
ior        Bitwise or    Fortran
ieor       Bitwise xor   Fortran
.and.      Logical and   Fortran
.or.       Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• In loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / example with «no coalescing»


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)         439                 16

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Diagram (access patterns of threads T0..T7 on the 2D array A):]
• Without coalescing, gang(i)/vector(j): the vector threads run over j, the slowest-varying index in Fortran, so consecutive threads access non-contiguous memory locations.
• With coalescing, gang(j)/vector(i): the vector threads run over i, the contiguous index, so consecutive threads access consecutive memory locations.
• gang-vector(i)/seq(j): each thread sweeps its row sequentially.

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).
Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism inside the function (seq, gang, worker, vector).
Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called, and lives for the duration of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device; if so, no action is taken.
You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1).


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel region is enclosed in its implicit data region]


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No: you have to move the beginning of the data region before the first loop.


Data on device

Naive version

[Diagram: two parallel regions, each with its own implicit data region; at each region, A is copyin, B is copyout, C is copy and D is create]

• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers

[Diagram: the same two parallel regions, still with A copyin, B copyout, C copy and D create at each region]

Let's add a data region.


Data on device

Optimize data transfers

[Diagram: both parallel regions are now enclosed in a single data region, which handles A (copyin), B (copyout), C (copy) and D (create) once]

A, B and C are now transferred at the entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers

[Diagram: inside the data region the parallel regions use present for A, B, C and D; update device refreshes A and update self refreshes C where needed]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that data transfers can be managed in a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Diagram (time line of the host and device copies of a, b, c, d): entering the region, copyin(a,c) create(b) copyout(d) transfers a and c H→D and allocates b and d on the device; after the kernels have run, update self(b) copies b D→H; exiting the region, copyout(d) copies d D→H]


Data on device - Important notes

• Data transfers between the host and the device are costly; it is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device:

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; it is possible to specify a number inside the clause to create several execution threads.

[Diagram: with no async, kernel 1, the data transfer and kernel 2 execute one after the other; with async on kernel 1 and the transfer, the two overlap; with async on everything, kernel 1, the transfer and kernel 2 all overlap]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Diagram: kernel 1 and the data transfer run concurrently; kernel 2 waits for the transfer to complete]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare)
• Example of code that has a race condition:


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading
• PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads, which execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables and non-scalar variables (both models)

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()


Features | OpenACC 2.7 | OpenMP 5.0

Data region
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of the loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share the iterations among gangs, workers and vectors) ↔ distribute (share the iterations among TEAMS)


Features | OpenACC 2.7 | OpenMP 5.0

• routines: routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)
• asynchronism: async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Execution offloading - parallel directive

!$ACC parallel
do g=1,generations
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel.f90

Test
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.



Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello! I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
• The optimal number of gangs is highly dependent on the architecture. Use num_gangs with care.

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses
• Level of parallelism:
  − gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  − worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  − vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  − seq: the iterations are executed sequentially on the accelerator
  − auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

                 Work sharing (loop iterations)   Activation of new threads
  loop gang                    X
  loop worker                  X                             X
  loop vector                  X                             X

Work sharing - Loops

Clauses
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes
• You might see the directive do, kept for compatibility.

Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90
• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90
• Result: sum = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs. example without coalescing]

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure, shown over several animation steps: memory access patterns of threads T0–T7 for the three loop mappings. Without coalescing — gang(i)/vector(j): consecutive threads access memory locations that are a full column apart. With coalescing — gang(j)/vector(i): consecutive threads access consecutive memory locations. gang-vector(i)/seq(j): each thread walks sequentially through its own row.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to that of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))   ! data region + parallel region
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

[Diagram, shown over several animation steps: two parallel regions (each with its implicit data region) using variables A, B, C, D through the clauses copyin, copyout, copy and create.]

• Naive version: each parallel region transfers its own data. Transfers: 8, allocations: 2.
• Let's add a data region enclosing both parallel regions: A, B and C are now transferred at entry and exit of the data region. Transfers: 4, allocations: 1.
• The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device, update self). Transfers: 6, allocations: 1.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90), the a array is not up to date on the host before the end of the data region. With it, a(42) is up to date on the host right after the update self directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of the enter data and exit data directives is to manage the data transfers at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are allocated on the device. Kernels then modify the device copies; update self(b) copies b back D→H during the region. At exit, d is copied D→H, and device-side changes to a and b are discarded unless an update was done.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. Specifying different numbers inside the clause creates several execution threads.

[Diagram: synchronous execution runs kernel 1, the data transfer and kernel 2 one after the other; with async, kernel 1 and the transfer overlap; when everything is async, all three run concurrently.]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Diagram: kernel 2 waits for the data transfer before starting]

! b is initialized on the host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.
• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60:autocompare).
• Example: a code that has a race condition.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
      ...

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)                +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses with notes)

GPU offloading:
• PARALLEL (GR/WS/VS mode; each gang executes instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET; data movement is defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features: OpenACC 2.7 ↔ OpenMP 5.0 (continued)

Data region:
• declare() (adds data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops; sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) ↔ distribute (shares iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

Features: OpenACC 2.7 ↔ OpenMP 5.0 (continued)

Routines:
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism:
• async()/wait (executes kernels/transfers asynchronously; explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77


Execution offloading - parallel directive

acc=autopar activated / acc=noautopar

Test:
• Size: 1000x1000
• Generations: 100
• Execution time: 120467 s (noautopar)
• Execution time: 5067 s (autopar)

The sequential code is executed redundantly by all gangs. The compiler option acc=noautopar is activated to reproduce the expected behavior of the OpenACC specification.

Pierre-François Lavallée, Thibaut Véry 33/77

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer might set those parameters on the kernels and parallel directives with the clauses:
• num_gangs: the number of gangs
• num_workers: the number of workers
• vector_length: the vector size

program param
   integer :: a(1000)
   integer :: i
   !$ACC parallel num_gangs(10) num_workers(1) &
   !$ACC& vector_length(128)
   print *, "Hello, I am a gang"
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
• These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze
• The optimal number of gangs is highly dependent on the architecture; use num_gangs with care

Pierre-François Lavallée, Thibaut Véry 34/77

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference between OpenACC and OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among gangs (going from GR to GP)
  - worker: within gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang                  X                 |
loop worker                X                 |             X
loop vector                X                 |             X

Pierre-François Lavallée, Thibaut Véry 35/77

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility

Pierre-François Lavallée, Thibaut Véry 36/77

Work sharing - Loops

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )+             oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Pierre-François Lavallée, Thibaut Véry 37/77

Work sharing - Loops

serial

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 × nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-François Lavallée, Thibaut Véry 38/77

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all iterations and the grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41/77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-François Lavallée, Thibaut Véry 42/77

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to obtain the final value of the variable.

Reduction operators:

Operation  Effect       Language(s)
+          Sum          Fortran, C(++)
*          Product      Fortran, C(++)
max        Maximum      Fortran, C(++)
min        Minimum      Fortran, C(++)
&          Bitwise and  C(++)
|          Bitwise or   C(++)
&&         Logical and  C(++)
||         Logical or   C(++)
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions:
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-François Lavallée, Thibaut Véry 43/77

Parallelism - Reductions: example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )+             oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged; this optimizes the use of memory bandwidth
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
• For loop nests, the loop which has vector parallelism should have contiguous access to memory

[Figure: example with «coalescing» vs example with «no coalescing»]

Pierre-François Lavallée, Thibaut Véry 45/77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms)         439         |       16

The memory-coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46/77

Parallelism - Memory accesses and coalescing

[Figure, shown step by step over several slides: assignment of threads T0–T7 to array elements for the three variants: gang(i)/vector(j) (without coalescing), gang(j)/vector(i) (with coalescing), and gang-vector(i)/seq(j).]

Pierre-François Lavallée, Thibaut Véry 47/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50/77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region enclosed in its implicit data region]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device - Optimizing transfers

[Figure, shown step by step over several slides: two successive parallel regions, each with its implicit data region, using arrays A, B, C and D through copyin, copyout, copy and create clauses.]

• Naive version: each parallel region transfers its own data — transfers: 8, allocations: 2
• Let's add a data region: A, B and C are now transferred at entry and exit of the data region — transfers: 4, allocations: 1
• The clauses check for data presence, so to make the code clearer you can use the present clause; to make sure the updates are done, use the update directive (update device, update self) — transfers: 6, allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is that data transfers to the device can be managed at a different place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
module     | the program
function   | the function
subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Figure: timeline of host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). The values of a and c are transferred H→D and b is allocated at entry; update self(b) copies b D→H during the region; d is copied back D→H at exit.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

bull Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
bull Since data clauses can also appear on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
bull computation and data transfers;
bull kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify different numbers inside the clause to create several execution threads.

(Figure: in the synchronous case, kernel 1, the data transfer and kernel 2 run one after the other; with kernel 1 and the transfer async, they overlap; with everything async, all three overlap.)

! b is initialized on host
      do i=1,s
         b(i) = i
      enddo
!$ACC parallel loop async(1)
      do i=1,s
         a(i) = 42
      enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
      do i=1,s
         c(i) = 11
      enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution-thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
bull The argument of wait is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Figure: kernel 1 runs on its own queue while the data transfer proceeds; kernel 2 waits for the transfer before starting.)

! b is initialized on host
      do i=1,s
         b(i) = i
      enddo
!$ACC parallel loop async(1)
      do i=1,s
         a(i) = 42
      enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
      do i=1,s
         b(i) = b(i) + 11
      enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU execution.
bull How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.

Debugging - PGI autocompare

bull To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
bull Example of code that has a race condition:

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

      do g=1,generations
         cells = 0
!$ACC parallel
         do r=1,rows
            do c=1,cols
               oworld(r,c) = world(r,c)
            enddo
         enddo
!$ACC end parallel
!$ACC parallel
         do r=1,rows
            do c=1,cols
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
!$ACC end parallel
!$ACC parallel
         do r=1,rows
            do c=1,cols
               ...

exemples/opti/gol_opti1.f90

Info:
bull Size: 10000 x 5000
bull Generations: 100
bull Serial time: 26 s
bull Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life
Let's share the work among the active gangs.

      do g=1,generations
         cells = 0
!$ACC parallel loop
         do r=1,rows
            do c=1,cols
               oworld(r,c) = world(r,c)
            enddo
         enddo
!$ACC parallel loop
         do r=1,rows
            do c=1,cols
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
!$ACC parallel loop reduction(+:cells)
         do r=1,rows
            do c=1,cols
               cells = cells + world(r,c)
            enddo
         enddo
!$ACC end parallel
         print *, "Cells alive at generation ", g, cells
      enddo

exemples/opti/gol_opti2.f90

Info:
bull Size: 10000 x 5000
bull Generations: 100
bull Serial time: 26 s
bull Time: 9.5 s

The loops are now distributed among the gangs.

Optimization - Game of Life
Let's optimize the data transfers.

      cells = 0
!$ACC data copy(oworld, world)
      do g=1,generations
         cells = 0
!$ACC parallel loop
         do c=1,cols
            do r=1,rows
               oworld(r,c) = world(r,c)
            enddo
         enddo
!$ACC parallel loop
         do c=1,cols
            do r=1,rows
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
!$ACC parallel loop reduction(+:cells)
         do c=1,cols
            do r=1,rows
               cells = cells + world(r,c)
            enddo
         enddo
         print *, "Cells alive at generation ", g, cells
      enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
bull Size: 10000 x 5000
bull Generations: 100
bull Serial time: 26 s
bull Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
bull PARALLEL — GR, WS, VS mode; each gang executes the instructions redundantly ↔ TEAMS — creates teams of threads that execute redundantly
bull SERIAL — only 1 thread executes the code ↔ TARGET — GPU offloading, initial thread only
bull KERNELS — offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
bull scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions — clauses on PARALLEL/SERIAL/KERNELS ↔ clause on TARGET (data movement defined from the host perspective):
bull copyin() — copy H→D when entering the construct ↔ map(to:)
bull copyout() — copy D→H when leaving the construct ↔ map(from:)
bull copy() — copy H→D when entering and D→H when leaving ↔ map(tofrom:)
bull create() — created on the device, no transfers ↔ map(alloc:)
bull present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions:
bull declare() — add data to the global data region
bull data / end data — inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA — inside the same programming unit
bull enter data / exit data — opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA — opening and closing anywhere in the code
bull update self() — update D→H ↔ UPDATE FROM() — copy D→H inside a data region
bull update device() — update H→D ↔ UPDATE TO() — copy H→D inside a data region

Loop parallelization:
bull kernels — offloading + automatic parallelization of suitable loops, synchronization at the end of the loops ↔ does not exist in OpenMP
bull loop gang/worker/vector — share iterations among gangs, workers and vectors ↔ distribute — share iterations among TEAMS

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

bull Routines: routine [seq|vector|worker|gang] — compile a device routine with the specified level of parallelism
bull Asynchronism: async()/wait — execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend — manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

Contacts

bull Pierre-François Lavallée — pierre-francois.lavallee@idris.fr
bull Thibaut Véry — thibaut.very@idris.fr
bull Rémi Lacroix — remi.lacroix@idris.fr

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
  • Parallelism
    • Offloaded Execution
      • Add directives
      • Compute constructs
      • Serial
      • Kernels
      • Teams
      • Parallel
    • Work sharing
      • Parallel loop
      • Reductions
      • Coalescing
    • Routines
    • Data Management
    • Asynchronism
  • GPU Debugging
  • Annex: Optimization example
  • Annex: OpenACC 2.7 / OpenMP 5.0 translation table
  • Contacts

Parallelization control

The default behavior is to let the compiler decide how many workers are generated and their vector size. The number of gangs is set at execution time by the runtime (memory is usually the limiting criterion).

Nonetheless, the programmer may set those parameters on the kernels and parallel directives with the clauses:
bull num_gangs: the number of gangs
bull num_workers: the number of workers
bull vector_length: the vector size

program param
      integer :: a(1000)
      integer :: i
!$ACC parallel num_gangs(10) num_workers(1) &
!$ACC& vector_length(128)
      print *, "Hello, I am a gang"
      do i=1,1000
         a(i) = i
      enddo
!$ACC end parallel
end program

exemples/parametres_paral.f90

Important notes:
bull These clauses are mainly useful if the code uses a data structure which is difficult for the compiler to analyze.
bull The optimal number of gangs is highly dependent on the architecture: use num_gangs with care.

Work sharing - Loops
Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP GPU, for which the creation of threads relies on the programmer.

Clauses:
bull Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among the gangs (going from GR to GP)
  - worker: within the gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyzes the subsequent loop and decides which mode is the most suitable to respect the dependencies

              Work sharing (loop iterations) | Activation of new threads
loop gang                   X                |
loop worker                 X                |            X
loop vector                 X                |            X

Work sharing - Loops

Clauses:
bull Merge tightly nested loops: collapse(number of loops)
bull Tell the compiler that the iterations are independent: independent (useful for kernels)
bull Privatize variables: private(variable-list)
bull Reduction: reduction(operation:variable-list)

Notes:
bull You might see the directive do instead of loop, kept for compatibility.

Work sharing - Loops

!$ACC parallel
      do g=1,generations
!$ACC loop
         do r=1,rows
            do c=1,cols
               oworld(r,c) = world(r,c)
            enddo
         enddo
         do r=1,rows
            do c=1,cols
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
         cells = 0
         do r=1,rows
            do c=1,cols
               cells = cells + world(r,c)
            enddo
         enddo
         print *, "Cells alive at generation ", g, cells
      enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

bull Active threads: 1
bull Number of operations: nx

Execution GR, WS, VS:

!$acc parallel num_gangs(10)
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

bull Redundant execution by the gang leaders
bull Active threads: 10
bull Number of operations: 10 x nx

Execution GP, WS, VS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

bull Each gang executes a different block of iterations
bull Active threads: 10
bull Number of operations: nx

Execution GP, WP, VS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

bull Iterations are shared among the active workers of each gang
bull Active threads: 10 x workers
bull Number of operations: nx

Execution GP, WS, VP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

bull Iterations are shared among the threads of the single worker of each gang
bull Active threads: 10 x vector length
bull Number of operations: nx

Execution GP, WP, VP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

bull Iterations are shared on the whole grid of threads of all gangs
bull Active threads: 10 x workers x vector length
bull Number of operations: nx

Execution GR, WP, VS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

bull Each gang is assigned all the iterations, which are shared among its workers
bull Active threads: 10 x workers
bull Number of operations: 10 x nx

Execution GR, WS, VP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

bull Each gang is assigned all the iterations, which are shared among the threads of the active worker
bull Active threads: 10 x vector length
bull Number of operations: 10 x nx

Execution GR, WP, VP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

bull Each gang is assigned all the iterations, and its grid of threads distributes the work
bull Active threads: 10 x workers x vector length
bull Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code located after the loop. Inside a parallel region, if several parallel loops are present with dependencies between them, you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

bull Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

bull Result: somme = 100000000 (right value)

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
      integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
      do i=1,1000
         a(i) = b(i)*2
      enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
      do i=1,1000
         a(i) = b(i)*2
      enddo
end program fused

exemples/fused.f90

Parallelism - Reductions
A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to obtain the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions: example

!$ACC parallel
      do g=1,generations
!$ACC loop
         do r=1,rows
            do c=1,cols
               oworld(r,c) = world(r,c)
            enddo
         enddo
         do r=1,rows
            do c=1,cols
               neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                       oworld(r-1,c  )              +oworld(r+1,c  )+ &
                       oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
               if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                  world(r,c) = 0
               else if (neigh == 3) then
                  world(r,c) = 1
               endif
            enddo
         enddo
         cells = 0
!$ACC loop reduction(+:cells)
         do r=1,rows
            do c=1,cols
               cells = cells + world(r,c)
            enddo
         enddo
         print *, "Cells alive at generation ", g, cells
      enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
bull Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
bull This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
bull In loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Figure: example with coalescing vs. example without coalescing.)

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing | With coalescing
Time (ms)        43.9         |      1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

(Figure, built up step by step over several slides: the thread-to-element mapping for the three variants. Without coalescing — gang(i), vector(j) — the threads T0..T7 of a worker access elements separated by a stride of nx, so their accesses cannot be merged. With coalescing — gang(j), vector(i), or gang vector(i) with seq(j) — the threads access consecutive elements and the accesses are merged.)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
      integer :: s = 10000
      integer, allocatable :: array(:,:)
      allocate(array(s,s))
!$ACC parallel
!$ACC loop
      do i=1,s
         call fill(array(:,:), s, i)
      enddo
!$ACC end parallel
      print *, array(1,1:10)
contains
      subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
      end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the level of parallelism used inside the function (seq, gang, worker, vector).

Correct version:

program routine
      integer :: s = 10000
      integer, allocatable :: array(:,:)
      allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
      do i=1,s
         call fill(array(:,:), s, i)
      enddo
!$ACC end parallel
      print *, array(1,1:10)
contains
      subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
      end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading: offloading regions have a local data region associated with them:
bull serial
bull parallel
bull kernels

Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and it is available for the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array.

Data movement:
bull copyin: the variable is copied H→D; the memory is allocated when entering the region
bull copyout: the variable is copied D→H; the memory is allocated when entering the region
bull copy: copyin + copyout

No data movement:
bull create: the memory is allocated when entering the region
bull present: the variable is already on the device
bull delete: frees the memory allocated on the device for this variable

By default, the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses:
bull no_create
bull deviceptr
bull attach
bull detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array to the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
      do i=1,1000
         do j=1,1000
            a(i,j) = 0
         enddo
      enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
      do i=1,1000
         do j=100,199
            a(i,j) = 42
         enddo
      enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element
// and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions:
bull In Fortran, the last index of an assumed-size dummy array must be specified.
bull In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
bull The shape must be specified when using a slice.
bull If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
      integer :: a(1000)
      integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = i
      enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

(Figure: the parallel region is enclosed in its implicit data region.)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
      integer :: a(1000)
      integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = i
      enddo
!$ACC end parallel

      do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
         do i=1,1000
            a(i) = a(i) + 1
         enddo
!$ACC end parallel
      enddo
end program para

exemples/parallel_data_multi.f90

Notes:
bull Compute region reached 100 times
bull Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
  do i=1,1000
    a(i) = i
  enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
  do l=1,100
!$ACC parallel
!$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
!$ACC end parallel
  enddo
!$ACC end data

(exemples/parallel_data_single.f90)

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two parallel regions (each with its own implicit data region) use four arrays, A (copyin), B (copyout), C (copy) and D (create); every transfer and allocation is repeated for each region.
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the same two parallel regions, each still carrying its own clauses for A (copyin), B (copyout), C (copy) and D (create). Let's add a data region.

Data on device

With an enclosing data region, A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.
• Transfers: 4
• Allocations: 1

Data on device

Inside the data region, the data clauses on the parallel regions only check for data presence; so, to make the code clearer, you can use the present clause instead. To make sure the updates are actually done, use the update directive (update device after entry, update self before exit).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo

  do j=1,42
    call random_number(test)
    rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
!  print *, "before update self: ", a(42)
!  !$ACC update self(a(42:42))
!  print *, "after update self: ", a(42)
!$ACC serial
  a(42) = 42
!$ACC end serial
  print *, "before end data: ", a(42)
!$ACC end data
  print *, "after end data: ", a(42)

(exemples/update_dir_comment.f90)

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
  do i=1,1000
    a(i) = 0
  enddo

  do j=1,42
    call random_number(test)
    rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
    do i=1,1000
      a(i) = a(i) + rng
    enddo
  enddo
  print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
  print *, "after update self: ", a(42)
!$ACC serial
  a(42) = 42
!$ACC end serial
  print *, "before end data: ", a(42)
!$ACC end data
  print *, "after end data: ", a(42)

(exemples/update_dir.f90)

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions enter data exit data

Important notes: one benefit of the enter data and exit data directives is that the data transfers to the device can be managed at a different place than where the data are used. This is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
!$ACC enter data create(vec(1:s))
  call compute
!$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
!$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

(exemples/enter_data.f90)
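A C analogue of the pattern above can be sketched as follows; the function names are illustrative, and without an OpenACC compiler the pragmas are simply ignored, so the code runs on the host and the logic can still be checked.

```c
#include <assert.h>
#include <stdlib.h>

/* enter data / exit data let device allocation happen away from the
   compute code, e.g. in an init routine (hypothetical sketch). */
static double *vec;
static int s = 10000;

static void compute(void) {
    /* vec is already on the device thanks to the enter data directive */
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;
}

double enterdata_demo(void) {
    vec = malloc(s * sizeof *vec);
    #pragma acc enter data create(vec[0:s])   /* allocate on device */
    compute();
    #pragma acc update self(vec[0:s])         /* bring results back D->H */
    #pragma acc exit data delete(vec[0:s])    /* free device memory */
    double v = vec[s - 1];
    free(vec);
    return v;
}
```

The update self before exit data is needed because exit data delete frees the device copy without transferring it back.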

Data on device - Global data regions declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

(exemples/declare.f90)

Data on device - Time line

(Diagram: for a region opened with copyin(a,c) create(b) copyout(d), a and c are copied H→D at entry, b is only allocated on the device, and d is copied D→H at exit. In the middle of the region, an update self(b) refreshes the host copy of b with the device value; host-side changes made afterwards are not visible on the device.)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

(exemples/update.f90)
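The same idea can be sketched in C; the helper name is hypothetical, and with a non-OpenACC compiler the pragmas are inert, so the function runs on the host and still produces the expected value.

```c
#include <assert.h>

/* A data region with explicit updates in both directions (sketch). */
int last_element_after_updates(int n) {
    int a[1000];
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i;
        /* Refresh the host copy of the last element (D -> H). */
        #pragma acc update self(a[n-1:1])
        /* Change it on the host, then push it back (H -> D). */
        a[n - 1] = 42;
        #pragma acc update device(a[n-1:1])
    }
    return a[n - 1];
}
```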

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Diagram: synchronous execution runs kernel 1, the data transfer and kernel 2 one after the other; with kernel 1 and the transfer async, they overlap; with everything async, all three run concurrently.)

! b is initialized on host
  do i=1,s
    b(i) = i
  enddo

!$ACC parallel loop async(1)
  do i=1,s
    a(i) = 42
  enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
  do i=1,s
    c(i) = 11
  enddo

(exemples/async_slides.f90)

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Diagram: kernel 2 waits for the transfer on execution thread 2 before starting.)

! b is initialized on host
  do i=1,s
    b(i) = i
  enddo

!$ACC parallel loop async(1)
  do i=1,s
    a(i) = 42
  enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
  do i=1,s
    b(i) = b(i) + 11
  enddo

(exemples/async_wait_slides.f90)
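A C sketch of the async/wait pattern; the function name is illustrative, and since a non-OpenACC compiler ignores the pragmas, execution is then simply sequential, which is exactly the result the asynchronous version must also produce.

```c
#include <assert.h>

/* Two independent kernels on streams 1 and 3, a transfer on stream 2,
   and a kernel that waits for the transfer before using b (sketch). */
int async_wait_demo(int n) {
    int a[100], b[100], c[100];
    for (int i = 0; i < n; ++i) b[i] = i;      /* b initialized on host */

    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i) a[i] = 42;

    #pragma acc update device(b[0:n]) async(2) /* H -> D on stream 2 */

    #pragma acc parallel loop async(3)
    for (int i = 0; i < n; ++i) c[i] = 11;

    /* This kernel must see the updated b: wait for stream 2. */
    #pragma acc parallel loop wait(2)
    for (int i = 0; i < n; ++i) b[i] = b[i] + 11;

    #pragma acc wait  /* join all streams before using results on host */
    return a[0] + b[0] + c[0];                 /* 42 + 11 + 11 */
}
```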


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example: code containing a race condition.


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

(exemples/opti/gol_opti1.f90)

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
!$ACC end parallel
  print *, "Cells alive at generation ", g, ": ", cells
enddo

(exemples/opti/gol_opti2.f90)

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end data

(exemples/opti/gol_opti3.f90)

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions:
• clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Data region:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop2D.f90)

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1DOMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2DOMP.f90)

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

(exemples/loop3D.f90)

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3DOMP.f90)


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts

Work sharing - Loops

Loops are at the heart of OpenACC parallelism. The loop directive is responsible for sharing the work (i.e. the iterations of the associated loop). It can also activate another level of parallelism. This point (automatic nesting) is a critical difference from OpenMP-GPU, for which the creation of threads relies on the programmer.

Clauses:
• Level of parallelism:
  - gang: the iterations of the subsequent loop are distributed block-wise among the gangs (going from GR to GP)
  - worker: within the gangs, the workers' threads are activated (going from WS to WP) and the iterations are shared between those threads
  - vector: each worker activates its SIMT threads (going from VS to VP) and the work is shared between those threads
  - seq: the iterations are executed sequentially on the accelerator
  - auto: the compiler analyses the subsequent loop and decides which mode is the most suitable to respect the dependencies

               Work sharing (loop iterations)   Activation of new threads
loop gang                  X
loop worker                X                               X
loop vector                X                               X

Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(<number of loops>)
• Tell the compiler that the iterations are independent: independent (useful for kernels)
• Privatize variables: private(<variable list>)
• Reduction: reduction(<operation>:<variable list>)

Note:
• You might see the directive do for compatibility.
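The clauses above can be sketched together on one C loop nest; the function name is illustrative, and without an OpenACC compiler the pragmas are ignored, so the code runs serially and still yields the same result.

```c
#include <assert.h>

/* collapse(2) merges the r and c loops into one iteration space,
   tmp is privatized per iteration, and total is reduced with +. */
int clause_demo(int rows, int cols) {
    int total = 0;
    int tmp;
    #pragma acc parallel loop collapse(2) private(tmp) reduction(+:total)
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c) {
            tmp = r * cols + c;     /* private scratch per iteration */
            total += tmp % 2;       /* count odd linear indices */
        }
    return total;
}
```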

Work sharing - Loops

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop.f90)

Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end serial

(exemples/loop_serial.f90)

• Active threads: 1
• Number of operations: nx

Execution GR/WS/VS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVS.f90)

• Redundant execution by the gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GP/WS/VS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVS.f90)

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Execution GP/WP/VS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVS.f90)

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GP/WS/VP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWSVP.f90)

• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GP/WP/VP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GPWPVP.f90)

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GR/WP/VS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVS.f90)

• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GR/WS/VP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWSVP.f90)

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GR/WP/VP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i)=B(i)*i
end do
!$acc end parallel

(exemples/loop_GRWPVP.f90)

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx
Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync.f90)

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme=somme+A(i)
end do
!$acc end parallel

(exemples/loop_pb_sync_corr.f90)

• Result: somme = 100000000 (right value)

Parallelism - Merging the kernels/parallel and loop directives

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
!$ACC end kernels

! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

(exemples/fused.f90)

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions: the variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). The OpenACC 2.7 specification allows reductions on arrays, but the implementation is lacking in PGI (for the moment).
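Two of the operators from the table can be sketched in C; the function names are illustrative, and with the pragmas ignored by a non-OpenACC compiler, the loops run serially and produce the same reduced values.

```c
#include <assert.h>

/* A + reduction: each thread accumulates a private copy of s,
   and the copies are summed at the end of the region. */
int sum_1_to_n(int n) {
    int s = 0;
    #pragma acc parallel loop reduction(+:s)
    for (int i = 1; i <= n; ++i)
        s += i;
    return s;
}

/* A max reduction over an array. */
int max_of(const int *a, int n) {
    int m = a[0];
    #pragma acc parallel loop reduction(max:m)
    for (int i = 1; i < n; ++i)
        if (a[i] > m) m = a[i];
    return m;
}
```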

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
!$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, ": ", cells
enddo
!$ACC end parallel

(exemples/gol_parallel_loop_reduction.f90)

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

(Diagrams: example with coalescing / example without coalescing.)

Parallelism - Memory accesses and coalescing

Without coalescing:

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_nocoalescing.f90)

With coalescing:

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
  do j=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

(exemples/loop_coalescing.f90)

             Without coalescing   With coalescing
Time (ms)    43.9                 1.6

The memory-coalescing version is 27 times faster.
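Note that C arrays are row-major, so the contiguous index is the last one, the opposite of the Fortran examples above. A hedged C sketch (function name illustrative; pragmas inert without an OpenACC compiler):

```c
#include <assert.h>
#define NX 64

/* In C, a[i][j] is contiguous in j. So, unlike in Fortran, the vector
   loop should be the j loop for coalesced accesses: consecutive
   threads then touch a[i][j], a[i][j+1], ... */
double fill_coalesced(void) {
    static double a[NX][NX];
    #pragma acc parallel loop gang copyout(a[0:NX][0:NX])
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
    return a[NX - 1][NX - 1];
}
```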

Parallelism - Memory accesses and coalescing

(Diagrams: the three mappings applied to the 2D array, showing which element each thread T0..T7 accesses at each step. With gang(i)/vector(j), consecutive threads of a worker access elements of the same row, which are not contiguous in Fortran: no coalescing. With gang(j)/vector(i), consecutive threads access consecutive elements of a column: the accesses are coalesced. With gang-vector(i)/seq(j), each thread walks a row sequentially while consecutive threads stay on consecutive rows: at each step of the seq loop the accesses are coalesced.)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine_wrong.f90)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
!$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
!$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

(exemples/routine.f90)
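The same pattern can be sketched in C; the names are illustrative, and a compiler without OpenACC support simply ignores the routine pragma and runs the loop on the host.

```c
#include <assert.h>

/* fill_row is declared "acc routine seq" so that an OpenACC compiler
   generates a device version callable from the offloaded loop. */
#pragma acc routine seq
static void fill_row(int *row, int n, int value) {
    for (int j = 0; j < n; ++j)
        row[j] = value;
}

int routine_demo(int n) {
    static int array[100][100];
    #pragma acc parallel loop copyout(array[0:n][0:n])
    for (int i = 0; i < n; ++i)
        fill_row(array[i], n, 2);   /* device routine called per row */
    return array[n - 1][n - 1];
}
```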

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
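As an illustration of the data-movement clauses, here is a minimal C++ sketch (the function name and sizes are hypothetical, not from the slides): the input array is only read on the device, so copyin is enough, while the result is only written, so copyout avoids a useless H→D transfer. Without an OpenACC compiler the pragma is ignored and the loop simply runs on the host.

```cpp
#include <vector>

// Hypothetical helper: "pin" is read-only on the device (copyin),
// "pout" is write-only (copyout). Serially the pragma is ignored.
std::vector<double> scale(const std::vector<double>& in, double factor) {
    int n = static_cast<int>(in.size());
    std::vector<double> out(n);
    const double* pin = in.data();
    double* pout = out.data();
    #pragma acc parallel loop copyin(pin[0:n]) copyout(pout[0:n])
    for (int i = 0; i < n; ++i)
        pout[i] = factor * pin[i];
    return out;
}
```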


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)
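The C/C++ restriction above can be sketched as follows (hypothetical function, not from the slides): for memory reached through a plain pointer the compiler cannot deduce the extent, so the clause must give the first element and the count, a[0:n]. Serially (pragma ignored) the code behaves identically.

```cpp
// Sketch: a dynamically allocated array needs an explicit shape a[0:n]
// in the data clause, because its size is unknown to the compiler.
double fill_and_sum(int n) {
    double* a = new double[n];
    double s = 0.0;
    #pragma acc parallel loop copy(a[0:n]) reduction(+:s)
    for (int i = 0; i < n; ++i) {
        a[i] = 1.0;   // initialize on the device
        s += a[i];    // reduced into s
    }
    delete[] a;
    return s;
}
```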


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region implicitly opens a data region.


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



Optimal? No! You have to move the beginning of the data region before the first loop.


Data on device

Naive version: each of the two parallel regions opens its own implicit data region and carries its own copyin/copyout/copy/create clauses for the variables A, B, C and D.
• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: both parallel regions (each with its implicit data region) still carry their own copyin/copyout/copy/create clauses for A, B, C and D. Let's add a data region.


Data on device

Optimize data transfers: an enclosing data region now carries the copyin/copyout/copy/create clauses. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1


Data on device

Optimize data transfers: inside the enclosing data region, the parallel regions use present clauses, and update device / update self directives keep the host and device copies consistent. The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))   ! the update is commented out in this version
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.


Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
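This slide gives no listing, so here is a hedged C++ sketch of update device (function name hypothetical): the host copy of factor changes inside the data region, and the directive pushes the new value H→D before the next kernel uses it. Compiled without OpenACC the pragmas are ignored and the result is the same.

```cpp
#include <vector>

// Sketch: "factor" is modified on the host inside the data region;
// "update device" makes the device copy consistent before the second kernel.
std::vector<double> scaled_twice(int n) {
    std::vector<double> a(n);
    double factor = 1.0;
    double* pa = a.data();
    #pragma acc data copyout(pa[0:n]) copyin(factor)
    {
        #pragma acc parallel loop present(pa[0:n], factor)
        for (int i = 0; i < n; ++i) pa[i] = factor;

        factor = 2.0;                      // changed on the host only
        #pragma acc update device(factor)  // H -> D

        #pragma acc parallel loop present(pa[0:n], factor)
        for (int i = 0; i < n; ++i) pa[i] *= factor;
    }
    return a;
}
```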


Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90
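For the C++ constructor/destructor case mentioned in the notes, a hypothetical sketch (not from the slides) could look like this: the device allocation lives exactly as long as the object, and the kernels elsewhere only assume the data is present. Without an OpenACC compiler the pragmas are ignored and everything runs on the host.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical class: "enter data" in the constructor, "exit data" in the
// destructor, so device residency follows the object's lifetime.
class DeviceVector {
public:
    explicit DeviceVector(std::size_t n) : data_(n) {
        double* p = data_.data();
        std::size_t len = data_.size();
        #pragma acc enter data create(p[0:len])
        (void)p; (void)len;  // silence unused warnings when pragmas are ignored
    }
    ~DeviceVector() {
        double* p = data_.data();
        std::size_t len = data_.size();
        #pragma acc exit data delete(p[0:len])
        (void)p; (void)len;
    }
    void fill(double v) {
        double* p = data_.data();
        std::size_t len = data_.size();
        #pragma acc parallel loop present(p[0:len])
        for (std::size_t i = 0; i < len; ++i)
            p[i] = v;
    }
    double front() const { return data_.front(); }
private:
    std::vector<double> data_;
};
```

Note that with a real OpenACC compiler, fill() writes the device copy only; an update self would be needed before reading the data on the host.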


Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Scope
• Module: the program
• Function: the function
• Subroutine: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Figure: timeline of the host and device copies of a, b, c and d. At entry of a data region opened with copyin(a,c) create(b) copyout(d), a and c are transferred H→D while b and d are only allocated. Kernels then modify the device copies; update self(b) copies b D→H during the region; when the region ends, only d is copied back D→H.]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default only one execution thread is created, and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel execution, if the kernels are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: without async, Kernel 1, the data transfer and Kernel 2 execute one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, all three run concurrently.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or wait directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the transfer of b run asynchronously; Kernel 2 waits for the completion of the transfer (execution thread 2) before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90



Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition
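The original slide showed a screenshot, so here is a hedged reconstruction of the kind of bug autocompare catches (function name hypothetical): the loop carries a dependency on the previous element, so the parallel GPU run diverges from the sequential CPU run and autocompare stops the execution. Compiled serially (pragma ignored), it simply computes prefix sums.

```cpp
#include <vector>

// Race condition: iteration i reads p[i-1], which another iteration writes.
// Correct sequentially, wrong (and non-deterministic) when parallelized.
std::vector<int> prefix_buggy(int n) {
    std::vector<int> a(n, 1);
    int* p = a.data();
    #pragma acc parallel loop copy(p[0:n])
    for (int i = 1; i < n; ++i)
        p[i] = p[i] + p[i-1];
    return a;
}
```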



Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info:
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000 x 5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.



Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• PARALLEL (OpenACC): GRWSVS mode, each team executes instructions redundantly ↔ TEAMS (OpenMP): creates teams of threads that execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code ↔ TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables and non-scalar variables, in both models

Explicit transfers H→D or D→H inside explicit parallel regions (data movement defined from the host perspective):
• clauses of PARALLEL/SERIAL/KERNELS (OpenACC) ↔ clauses of TARGET (OpenMP)
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on device, no transfers ↔ map(alloc:)
• present(), delete()


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

Data region:
• declare() (OpenACC): add data to the global data region
• data / end data (OpenACC): inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (OpenMP): inside the same programming unit
• enter data / exit data (OpenACC): opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA (OpenMP): opening and closing anywhere in the code
• update self() (OpenACC): update D→H ↔ UPDATE FROM (OpenMP): copy D→H inside a data region
• update device() (OpenACC): update H→D ↔ UPDATE TO (OpenMP): copy H→D inside a data region

Loop parallelization:
• kernels (OpenACC): offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector (OpenACC): share iterations among gangs, workers and vectors ↔ distribute (OpenMP): share iterations among TEAMS


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

• routines: routine [seq|vector|worker|gang] (OpenACC): compile a device routine with the specified level of parallelism
• asynchronism: async()/wait (OpenACC): execute kernels/transfers asynchronously, with explicit sync ↔ nowait/depend (OpenMP): manages asynchronism and task dependencies


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr



Work sharing - Loops

Clauses:
• Merge tightly nested loops: collapse(number of loops)
• Tell the compiler that iterations are independent: independent (useful for kernels)
• Privatize variables: private(variable-list)
• Reduction: reduction(operation:variable-list)

Notes:
• You might see the directive do, kept for compatibility


Work sharing - Loops

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop.f90


Work sharing - Loops

serial

!$acc serial
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90

• Active threads: 1
• Number of operations: nx

Execution GRWSVS

!$acc parallel num_gangs(10)
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90

• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90

• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx


Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared on the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
  A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: There is no thread synchronization at gang level, especially at the end of a loop directive, so there is a risk of race condition. Such risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
  A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
  somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
  integer :: a(1000), b(1000), i
  !$ACC kernels <clauses>
  !$ACC loop <clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
  !$ACC end kernels
  ! The construct above is equivalent to
  !$ACC kernels loop <kernels and loop clauses>
  do i=1,1000
    a(i) = b(i)*2
  enddo
end program fused

exemples/fused.f90


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation among those described in the table below is executed to get the final value of the variable.

Reduction operators:

+ : Sum (Fortran, C/C++)
* : Product (Fortran, C/C++)
max : Maximum (Fortran, C/C++)
min : Minimum (Fortran, C/C++)
& : Bitwise and (C/C++)
| : Bitwise or (C/C++)
&& : Logical and (C/C++)
|| : Logical or (C/C++)
iand : Bitwise and (Fortran)
ior : Bitwise or (Fortran)
ieor : Bitwise xor (Fortran)
.and. : Logical and (Fortran)
.or. : Logical or (Fortran)

Restrictions: The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
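The mechanism above can be sketched in a few lines of C++ (function name hypothetical): each thread gets a private copy of total, and the partial results are combined with the + operator at the end of the loop. Without an OpenACC compiler the pragma is ignored and the loop runs serially, with the same result.

```cpp
// Minimal reduction sketch: "total" is privatized per thread and the
// partial sums are combined with "+" at the end of the parallel loop.
double sum_squares(int n) {
    double total = 0.0;
    #pragma acc parallel loop reduction(+:total)
    for (int i = 1; i <= n; ++i)
        total += static_cast<double>(i) * i;
    return total;
}
```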


Parallelism - Reductions example

!$ACC parallel
do g=1,generations
  !$ACC loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged; this optimizes the use of memory bandwidth
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on
• For loop nests, the loop which has vector parallelism should have contiguous access to memory

Examples with and without coalescing:


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

Time without coalescing: 43.9 ms; with coalescing: 1.6 ms.
The memory coalescing version is 27 times faster!


Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0..T7 onto the elements of A for the three configurations. With gang(i)/vector(j), the consecutive threads of a vector access non-contiguous elements: no coalescing. With gang(j)/vector(i), or with gang vector(i)/seq(j), consecutive threads access contiguous elements: coalescing.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H: Host, D: Device, ⋆: variable, scalar or array

Data movement
• copyin(⋆): the variable is copied H→D; the memory is allocated when entering the region
• copyout(⋆): the variable is copied D→H; the memory is allocated when entering the region
• copy(⋆): copyin + copyout

No data movement
• create(⋆): the memory is allocated when entering the region
• present(⋆): the variable is already on the device
• delete(⋆): frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-François Lavallée, Thibaut Véry 50 / 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array to the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it takes the default of the language (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[The parallel construct opens both a data region and a parallel region]

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device

Naive version
[Figure: two successive parallel regions, each with its own implicit data region; the variables A (copyin), B (copyout), C (copy) and D (create) are transferred or allocated at each region]
• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers
[Figure: the same two parallel regions; each still moves A (copyin), B (copyout), C (copy) and allocates D (create)]

Let's add a data region!

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers
[Figure: an enclosing data region now holds A (copyin), B (copyout), C (copy) and D (create); the two parallel regions inside it trigger no transfers]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers
[Figure: the data region keeps copyin(A), copyout(B), copy(C), create(D); the parallel regions use present(...) and the needed refreshes are done with update device (A) and update self (C)]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where they are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: host/device timeline for a region opened with copyin(a,c) create(b) copyout(d): a and c are copied H→D at entry and b and d are allocated; an update self(b) during the region copies b D→H; at exit, copyout(d) copies d D→H while later host-side changes to a are not reflected on the device]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronous execution (Kernel 1, then data transfer, then Kernel 2) vs. "Kernel 1 and transfer async" vs. "everything is async"]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c)               +oworld(r+1,c)+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features: OpenACC 2.7 | OpenMP 5.0

GPU offloading
• OpenACC PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) | OpenMP TEAMS (create teams of threads, which execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) | OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) | OpenMP: does not exist

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables | OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS | OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) | map(to:)
• copyout() (copy D→H when leaving the construct) | map(from:)
• copy() (copy H→D when entering and D→H when leaving) | map(tofrom:)
• create() (created on device, no transfers) | map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features: OpenACC 2.7 | OpenMP 5.0

Data region
• OpenACC declare() (add data to the global data region)
• OpenACC data / end data (inside the same programming unit) | OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC enter data / exit data (opening and closing anywhere in the code) | OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC update self() (update D→H) | OpenMP UPDATE FROM (copy D→H inside a data region)
• OpenACC update device() (update H→D) | OpenMP UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) | OpenMP: does not exist
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) | OpenMP distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features: OpenACC 2.7 | OpenMP 5.0

Routines
• OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC async()/wait (execute kernels/transfers asynchronously, with explicit sync) | OpenMP nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 41: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Work sharing - Loops

$ACC p a r a l l e ldo g=1 generat ions$ACC loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddodo r =1 rows

40 do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) then45 world ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i fenddo

50 enddoc e l l s = 0do r =1 rows

do c=1 co lsc e l l s = c e l l s + wor ld ( r c )

55 enddoenddop r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end p a r a l l e l

exemplesgol_parallel_loopf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 37 77

Work sharing - Loops

serial

$acc s e r i a ldo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end s e r i a l

exemplesloop_serialf90

bull Active threads 1bull Number of operations nx

Execution GRWSVS

10 $acc p a r a l l e l num_gangs (10)do i =1 nx

A( i )=B( i )lowast iend do $acc end p a r a l l e l

exemplesloop_GRWSVSf90

bull Redundant execution bygang leaders

bull Active threads 10bull Number of operations 10

nx

Execution GPWSVS

10 $acc p a r a l l e l num_gangs (10) $acc loop gangdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GPWSVSf90

bull Each gang executes adifferent block ofiterations

bull Active threads 10bull Number of operations nx

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 38 77

Execution GPWPVS

10 $acc p a r a l l e l num_gangs (10) $acc loop gang workerdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GPWPVSf90

bull Iterations are sharedamong the active workersof each gang

bull Active threads 10 workers

bull Number of operations nx

Execution GPWSVP

10 $acc p a r a l l e l num_gangs (10) $acc loop gang vec to rdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GPWSVPf90

bull Active threads 10 workers

bull Iterations are sharedamong the threads of theworker of all gangs

bull Active threads 10 vectorlength

bull Number of operations nx

Execution GPWPVP

10 $acc p a r a l l e l num_gangs (10) $acc loop gang worker vec to rdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GPWPVPf90

bull Iterations are shared onthe grid of threads of allgangs

bull Active threads 10 workers vectorlength

bull Number of operations nx

Execution GRWPVS

10 $acc p a r a l l e l num_gangs (10) $acc loop workerdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GRWPVSf90

bull Each gang is assigned allthe iterations which areshared among theworkers

bull Active threads 10 workers

bull Number of operations 10 nx

Execution GRWSVP

10 $acc p a r a l l e l num_gangs (10) $acc loop vec to rdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GRWSVPf90

bull Each gang is assigned allthe iterations which areshared on the threads ofthe active worker

bull Active threads 10 vectorlength

bull Number of operations 10 nx

Execution GRWPVP

10 $acc p a r a l l e l num_gangs (10) $acc loop worker vec to rdo i =1 nx

A( i )=B( i )lowast iend do

15 $acc end p a r a l l e l

exemplesloop_GRWPVPf90

bull Each gang is assigned alliterations and the grid ofthreads distributes thework

bull Active threads 10 workers vectorlength

bull Number of operations 10 nx

Work sharing - Loops

Reminder There is no thread synchronization at gang level especially at the end of a loopdirective There is a risk of race condition Such risk does not exist for loop with worker andorvector parallelism since the threads of a gang wait until the end of the iterations they execute tostart a new portion of the code after the loopInside a parallel region if several parallel loops are present and there are some dependenciesbetween them then you must not use gang parallelism

10 $acc p a r a l l e l $acc loop gangdo i =1 nx

A( i )=10 _8end do

15 $acc loop gang reduc t ion ( + somme)do i =nx1minus1

somme=somme+A( i )end do $acc end p a r a l l e l

exemplesloop_pb_syncf90

bull Result sum = 97845213 (Wrong value)

10 $acc p a r a l l e l $acc loop worker vec to rdo i =1 nx

A( i )=10 _8end do

15 $acc loop worker vec to r reduc t ion ( + somme)do i =nx1minus1

somme=somme+A( i )end do $acc end p a r a l l e l

exemplesloop_pb_sync_corrf90

bull Result sum = 100000000 (Right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernelsparallel directives and loop

It is quite common to open a parallelregion just before a loop In this case itis possible to merge both directivesinto a kernels loop or parallel loopThe clauses available for this constructare those of both constructs

For example with kernels directive

program fusedi n t e g e r a (1000) b (1000) i$ACC kerne ls ltclauses gt$ACC loop ltclauses gt

5 do i =1 1000a ( i ) = b ( i )lowast2

enddo$ACC end kerne ls

The cons t ruc t above i s equ iva len t to10 $ACC kerne ls loop ltkerne ls and loop clauses gt

do i =1 1000a ( i ) = b ( i )lowast2

enddoend program fused

exemplesfusedf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations listed below is applied to obtain the final value of the variable.

Reduction operators:

Operation | Effect      | Language(s)
+         | Sum         | Fortran, C(++)
*         | Product     | Fortran, C(++)
max       | Maximum     | Fortran, C(++)
min       | Minimum     | Fortran, C(++)
&         | Bitwise and | C(++)
|         | Bitwise or  | C(++)
&&        | Logical and | C(++)
||        | Logical or  | C(++)
iand      | Bitwise and | Fortran
ior       | Bitwise or  | Fortran
ieor      | Bitwise xor | Fortran
.and.     | Logical and | Fortran
.or.      | Logical or  | Fortran

Restrictions:
• The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
• In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
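As an illustration (this example is ours, not from the original slides; variable names are hypothetical), two of the Fortran-only operators from the table can be combined on one loop:

```fortran
program reduction_ops
   implicit none
   integer :: i, mask
   logical :: all_pos
   integer :: v(8) = [1, 3, 5, 7, 2, 4, 6, 8]

   mask    = 0
   all_pos = .true.
   ! Fortran-only reduction operators: ior (bitwise or) and .and. (logical and)
   !$ACC parallel loop reduction(ior:mask) reduction(.and.:all_pos)
   do i = 1, 8
      mask    = ior(mask, v(i))
      all_pos = all_pos .and. (v(i) > 0)
   end do
   print *, mask, all_pos   ! mask = 15, all_pos = T
end program reduction_ops
```

Each thread works on a private copy of mask and all_pos; the runtime combines the copies with the named operator at the end of the loop.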

Pierre-François Lavallée, Thibaut Véry 43/77

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figures: example with coalescing / example without coalescing]

Pierre-François Lavallée, Thibaut Véry 45/77

Parallelism - Memory accesses and coalescing

$acc p a r a l l e l10 $acc loop gang

do i =1 nx $acc loop vec to rdo j =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop_nocoalescingf90

$acc p a r a l l e l20 $acc loop gang

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_825 end do

end do $acc end p a r a l l e l

exemplesloop_coalescingf90

$acc p a r a l l e l30 $acc loop gang vec to r

do i =1 nx $acc loop seqdo j =1 nx

A( i j )=114_835 end do

end do $acc end p a r a l l e l

exemplesloop_coalescingf90

Without Coalescing With CoalescingTps (ms) 439 16

The memory coalescing version is 27 times faster

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

[Figures: mapping of threads T0-T7 onto the elements of A for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads of a worker access memory locations separated by a stride, so their accesses cannot be merged. With coalescing, gang(j)/vector(i) and gang-vector(i)/seq(j): consecutive threads access consecutive memory locations and the accesses are merged.]

Pierre-François Lavallée, Thibaut Véry 47/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer :: i
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer :: i
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading: offloading regions are associated with a local data region (serial, parallel, kernels).

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77

Data on device - Data clauses

H: host, D: device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.
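A minimal sketch combining these clauses (the array names b, c and tmp are illustrative, not from the slides):

```fortran
! b is only read on the device, c is only produced there,
! and tmp never needs to exist on the host
!$ACC data copyin(b) copyout(c) create(tmp)
!$ACC parallel loop
do i = 1, n
   tmp(i) = 2*b(i)
end do
!$ACC parallel loop
do i = 1, n
   c(i) = tmp(i) + 1   ! c is copied D->H when the region ends
end do
!$ACC end data
```

Only b travels H→D and only c travels D→H; tmp lives exclusively in device memory.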

Pierre-François Lavallée, Thibaut Véry 50/77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens both a data region and a parallel region
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53/77

Optimal? No: you have to move the beginning of the data region before the first loop, so that the first parallel region also finds the data already on the device.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device

[Figure: naive version. Two successive parallel regions (each with its own implicit data region), each using A (copyin), B (copyout), C (copy) and D (create).]

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: the same two parallel regions, each still carrying A (copyin), B (copyout), C (copy) and D (create). Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: with an enclosing data region carrying copyin(A), copyout(B), copy(C) and create(D), A, B and C are now transferred only at entry and exit of the data region.

• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause on the inner parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.
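A sketch of the symmetric case, reusing the variables of the update self example (our example, assuming rng belongs to an enclosing data region):

```fortran
call random_number(test)
rng = floor(test*100)
!$ACC update device(rng)   ! push the new host value H->D
!$ACC parallel loop
do i = 1, 1000
   a(i) = a(i) + rng
enddo
```

Without the update device, the kernel would keep using the stale device copy of rng.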

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place in the code than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
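The note about C++ constructors and destructors can be sketched as an RAII wrapper (the class and member names are hypothetical, not from the slides); with a non-OpenACC compiler the pragmas are simply ignored:

```cpp
#include <cstddef>

// Device lifetime tied to object lifetime: enter data in the
// constructor, exit data in the destructor (RAII).
class DeviceVector {
public:
    explicit DeviceVector(std::size_t n) : p_(new double[n]), n_(n) {
        #pragma acc enter data create(p_[0:n_])
    }
    ~DeviceVector() {
        #pragma acc exit data delete(p_[0:n_])
        delete[] p_;
    }
    double* data() { return p_; }
    std::size_t size() const { return n_; }
private:
    double* p_;
    std::size_t n_;
};
```

Any kernel launched between construction and destruction can rely on the device allocation being present.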

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Figure: time line of host and device values across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device; kernels then modify the values on the device; update self(b) copies b D→H during the region; at exit, d is copied D→H while the device copies of a, b and c are discarded.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Figure: in the synchronous case, Kernel 1, the data transfer and Kernel 2 run one after the other; with "Kernel 1 and transfer async", Kernel 1 overlaps the transfer; with everything async, all three overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait() or of the wait clause is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer before starting, while Kernel 1 runs asynchronously.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90
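The standalone form mentioned above can be sketched like this (our example, not from the slides):

```fortran
!$ACC parallel loop async(1)
do i = 1, s
   a(i) = 42
enddo
!$ACC parallel loop async(2)
do i = 1, s
   c(i) = 11
enddo
! Standalone directive: block here until execution threads 1 and 2 complete
!$ACC wait(1, 2)
print *, a(1), c(1)
```

The host continues past both async launches and only synchronizes at the wait directive.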

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition:

[Figure: screenshot of the code and of the autocompare output reporting the diverging values]

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs:

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh<2 .or. neigh>3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
• OpenACC PARALLEL (GRWSVS mode: each gang executes instructions redundantly) — OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) — OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) — does not exist in OpenMP

Implicit transfer H→D or D→H:
• OpenACC: scalar variables, non-scalar variables — OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses of PARALLEL/SERIAL/KERNELS; OpenMP: clauses of TARGET; data movement is defined from the host perspective):
• copyin() (copy H→D when entering the construct) — map(to:)
• copyout() (copy D→H when leaving the construct) — map(from:)
• copy() (copy H→D when entering and D→H when leaving) — map(tofrom:)
• create() (created on device, no transfers) — map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) — TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) — TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) — UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) — UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) — does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) — distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) — nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

Page 42: Introduction to OpenACC and OpenMP GPU - IDRIS

Work sharing - Loops

serial:

!$acc serial
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end serial

exemples/loop_serial.f90
• Active threads: 1
• Number of operations: nx

Execution GRWSVS:

!$acc parallel num_gangs(10)
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVS.f90
• Redundant execution by gang leaders
• Active threads: 10
• Number of operations: 10 x nx

Execution GPWSVS:

!$acc parallel num_gangs(10)
!$acc loop gang
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVS.f90
• Each gang executes a different block of iterations
• Active threads: 10
• Number of operations: nx

Pierre-François Lavallée, Thibaut Véry 38/77

Execution GPWPVS:

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90
• Iterations are shared among the active workers of each gang
• Active threads: 10 x workers
• Number of operations: nx

Execution GPWSVP:

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90
• Iterations are shared among the threads of the active worker of all gangs
• Active threads: 10 x vector length
• Number of operations: nx

Execution GPWPVP:

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90
• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 x workers x vector length
• Number of operations: nx

Execution GRWPVS:

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90
• Each gang is assigned all the iterations, which are shared among its workers
• Active threads: 10 x workers
• Number of operations: 10 x nx

Execution GRWSVP:

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90
• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 x vector length
• Number of operations: 10 x nx

Execution GRWPVP:

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90
• Each gang is assigned all the iterations, and the grid of threads distributes the work
• Active threads: 10 x workers x vector length
• Number of operations: 10 x nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code located after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213. (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000. (right value)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 41 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 42 77
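The merged form also exists in the C binding of OpenACC. A minimal sketch (the `scale` function name is ours; a compiler without OpenACC support simply ignores the pragma and runs the loop serially on the host, producing the same result):

```c
#include <stddef.h>

/* Merged "kernels loop" construct: directive and loop in one place,
   with the data clauses of both constructs available.  Without an
   OpenACC compiler the pragma is ignored and the loop runs serially. */
void scale(const int *b, int *a, size_t n)
{
    #pragma acc kernels loop copyin(b[0:n]) copyout(a[0:n])
    for (size_t i = 0; i < n; ++i)
        a[i] = b[i] * 2;
}
```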

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations listed below is executed to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran/C(++)
*           Product       Fortran/C(++)
max         Maximum       Fortran/C(++)
min         Minimum       Fortran/C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Reduction operators

Restrictions
The variable must be a scalar with a numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77
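The privatize-then-combine semantics can be sketched in C (a hypothetical `sum_reduction` helper; with the pragma ignored by a non-OpenACC compiler, the loop runs serially and yields the same value):

```c
#include <stddef.h>

/* "+" reduction: each thread gets a private copy of `sum`, and the
   partial sums are combined at the end of the loop.  The serial
   fallback (pragma ignored) produces the same result. */
double sum_reduction(const double *a, size_t n)
{
    double sum = 0.0;
    #pragma acc parallel loop reduction(+:sum)
    for (size_t i = 0; i < n; ++i)
        sum += a[i];
    return sum;
}
```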

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44 / 77

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous accesses to memory.

Example with «coalescing» / Example with «no coalescing»

Pierre-François Lavallée, Thibaut Véry 45 / 77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j)=11.4_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=11.4_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j)=11.4_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   439                  16

The memory coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77
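The same rule transposed to C, where arrays are row-major, so the contiguous index is the last one (the opposite of Fortran). A hedged sketch with a hypothetical `fill_coalesced` function; without an OpenACC compiler the pragmas are ignored and the loops run serially:

```c
#define NX 16

/* Coalescing rule in C (row-major): the vector loop runs over the last
   index, so consecutive threads touch consecutive memory locations
   a[i][j], a[i][j+1], ...  In Fortran (column-major) the contiguous
   index is the first one, as in loop_coalescing.f90. */
void fill_coalesced(double a[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector   /* contiguous accesses within a row */
        for (int j = 0; j < NX; ++j)
            a[i][j] = 11.4;
    }
}
```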

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0–T7 onto the elements of A for the three versions. Without coalescing, gang(i)/vector(j): consecutive threads touch elements one row apart. With coalescing, gang(j)/vector(i) and gang-vector(i)/seq(j): consecutive threads touch consecutive elements.]

Pierre-François Lavallée, Thibaut Véry 47 / 77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77
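The routine directive works the same way in C. A minimal sketch (the function names `fill_row`/`fill_all` are ours; with the pragmas ignored by a non-OpenACC compiler, everything runs serially with the same result):

```c
/* The routine directive asks for a device version of fill_row with
   sequential parallelism, so each iteration of the offloaded loop
   can call it. */
#pragma acc routine seq
void fill_row(int *array, int s, int i)
{
    for (int j = 0; j < s; ++j)
        array[i * s + j] = 2;
}

void fill_all(int *array, int s)
{
    #pragma acc parallel loop copyout(array[0:s*s])
    for (int i = 0; i < s; ++i)
        fill_row(array, s, i);
}
```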

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H: host, D: device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77
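The three movement behaviours can be combined on one compute construct. A hedged sketch (the `axpy` function and its scratch array are ours, not from the course; with the pragmas ignored, the code runs serially on the host arrays):

```c
/* x is only read (copyin), y is read and written back (copy), and
   tmp is device-only scratch (create: allocated, never transferred). */
void axpy(int n, float alpha, const float *x, float *y, float *tmp)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n]) create(tmp[0:n])
    for (int i = 0; i < n; ++i) {
        tmp[i] = alpha * x[i];
        y[i] = y[i] + tmp[i];
    }
}
```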

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array to the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[The parallel construct opens both a data region and a parallel region.]

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device.
You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you would have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77
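The same placement in C, as a minimal sketch (the `iterate` function is ours; when the pragmas are ignored, the `data` braces are just an ordinary block and the loops run serially with the same result):

```c
/* One data region around the whole iteration loop: a[] crosses the
   host-device link twice in total, instead of twice per parallel
   region (2*iters times). */
void iterate(int *a, int n, int iters)
{
    #pragma acc data copy(a[0:n])
    {
        for (int l = 0; l < iters; ++l) {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                a[i] = a[i] + 1;
        }
    }
}
```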

Data on device

Naive version: two successive parallel regions, each with its own implicit data region; the variables A (copyin), B (copyout), C (copy) and D (create) appear with the same clauses in both regions.

• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: the two parallel regions keep the same clauses on A (copyin), B (copyout), C (copy) and D (create). Let's add a data region enclosing them.

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: with the clauses moved to the enclosing data region, A, B and C are now transferred only at entry and exit of the data region, and D is allocated once.

• Transfers: 4
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Inside the data region, the clauses on the parallel regions only check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are actually done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not updated on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is updated on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers to the device can be managed at a different place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77
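The same split between allocation site and compute site, sketched in C. This is our own illustrative variant (it uses copyout rather than the delete of the Fortran example, so the result comes back to the host); with the pragmas ignored, the code works directly on the host array:

```c
/* Allocation on the device happens in run(), while the kernel in
   compute() only asserts presence; the result is copied back when
   the global data region is closed. */
static void compute(double *vec, int s)
{
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 1.2;
}

void run(double *vec, int s)
{
    #pragma acc enter data create(vec[0:s])
    compute(vec, s);
    #pragma acc exit data copyout(vec[0:s])
}
```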

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device; kernels then modify the device copies; update self(b) copies b back D→H during the region; at exit, d is copied D→H, while changes made afterwards on the host are not reflected on the device and vice versa.]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default only one execution thread is created, and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, Kernel 2 overlaps as well.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77
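The wait pattern above in C, as a hedged sketch (the function name is ours, and on a real device a and b would have to live in an enclosing data region for the update to be legal; with the pragmas ignored, everything runs serially in program order, which is also a valid, if unoverlapped, schedule):

```c
/* The last loop consumes b, so it must wait for the asynchronous
   transfer queued on execution thread 2; the loop on thread 1 can
   overlap with that transfer. */
void async_wait_demo(double *a, double *b, int s)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < s; ++i)
        a[i] = 42.0;

    #pragma acc update device(b[0:s]) async(2)

    #pragma acc parallel loop wait(2)
    for (int i = 0; i < s; ++i)
        b[i] = b[i] + 1.0;
}
```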

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among the gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Feature                         OpenACC 2.7                               OpenMP 5.0
GPU offloading                  PARALLEL (GRWSVS mode, each gang          TEAMS (create teams of threads,
                                executes instructions redundantly)        execute redundantly)
                                SERIAL (only 1 thread executes            TARGET (GPU offloading, initial
                                the code)                                 thread only)
                                KERNELS (offloading + automatic           does not exist in OpenMP
                                parallelization)
Implicit transfer H→D or D→H    scalar variables, non-scalar              scalar variables, non-scalar
                                variables                                 variables
Explicit transfer H→D or D→H    clauses on PARALLEL / SERIAL /            clauses on TARGET (data movement
inside explicit parallel        KERNELS                                   defined from the host perspective)
regions                         copyin() (copy H→D when entering          map(to:)
                                the construct)
                                copyout() (copy D→H when leaving          map(from:)
                                the construct)
                                copy() (copy H→D when entering,           map(tofrom:)
                                D→H when leaving)
                                create() (created on device, no           map(alloc:)
                                transfers)
                                present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Feature                         OpenACC 2.7                               OpenMP 5.0
Data region                     declare() (add data to the global
                                data region)
                                data / end data (inside the same          TARGET DATA MAP(to:/from:) /
                                programming unit)                         END TARGET DATA (inside the same
                                                                          programming unit)
                                enter data / exit data (opening           TARGET ENTER DATA / TARGET EXIT
                                and closing anywhere in the code)         DATA (opening and closing anywhere
                                                                          in the code)
                                update self() (update D→H)                UPDATE FROM (copy D→H inside a
                                                                          data region)
                                update device() (update H→D)              UPDATE TO (copy H→D inside a
                                                                          data region)
Loop parallelization            kernels (offloading + automatic           does not exist in OpenMP
                                parallelization of suitable loops,
                                sync at the end of loops)
                                loop gang/worker/vector (share            distribute (share iterations
                                iterations among gangs, workers           among TEAMS)
                                and vectors)

Pierre-François Lavallée, Thibaut Véry 72 / 77

Feature                         OpenACC 2.7                               OpenMP 5.0
Routines                        routine [seq|vector|worker|gang]
                                (compile a device routine with the
                                specified level of parallelism)
Asynchronism                    async()/wait (execute kernels and         nowait/depend (manage asynchronism
                                transfers asynchronously, with            and task dependencies)
                                explicit sync)

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=11.4_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=11.4_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=11.4_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=11.4_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=11.4_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=11.4_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

Page 43: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Execution GPWPVS

!$acc parallel num_gangs(10)
!$acc loop gang worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVS.f90

• Iterations are shared among the active workers of each gang
• Active threads: 10 × workers
• Number of operations: nx

Execution GPWSVP

!$acc parallel num_gangs(10)
!$acc loop gang vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWSVP.f90

• Iterations are shared among the threads of the worker of all gangs
• Active threads: 10 × vector length
• Number of operations: nx

Execution GPWPVP

!$acc parallel num_gangs(10)
!$acc loop gang worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GPWPVP.f90

• Iterations are shared on the grid of threads of all gangs
• Active threads: 10 × workers × vector length
• Number of operations: nx

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1,nx
   A(i)=B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and its grid of threads distributes the work
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: somme = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i)=1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme=somme+A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: somme = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41 / 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case, it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this combined construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
   !$ACC kernels <clauses>
   !$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
   !$ACC end kernels
   ! The construct above is equivalent to:
   !$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is applied to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
   !$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «Coalescing» / Example with «no Coalescing»

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
   !$acc loop vector
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
   !$acc loop seq
   do j=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0-T7 to the elements of A for the three variants: gang(i)/vector(j) (without coalescing), gang(j)/vector(i) (with coalescing) and gang-vector(i)/seq(j). Only the coalesced version makes consecutive threads access consecutive memory locations.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p (for OpenACC 2.0 compatibility).

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the default of the language is used (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[The parallel construct opens both a parallel region and its associated data region.]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No! You have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region, use the variables A (copyin), B (copyout), C (copy) and D (create). Each region performs its own transfers and allocations.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's enclose both parallel regions in a single data region carrying the copyin/copyout/copy/create clauses. A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Since the data clauses check for data presence, you can use the present clause on the inner regions to make the code clearer. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self ", a(42)
!$ACC update self(a(42:42))
print *, "after update self ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data ", a(42)
!$ACC end data
print *, "after end data ", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a region opened with copyin(a,c) create(b) copyout(d). The device values evolve independently of the host until an update self(b) refreshes b on the host; d is copied back D→H only when the region exits.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async they overlap; with everything async, both kernels and the transfer overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60,autocompare)
• Example: a code that has a race condition

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                 oworld(r-1,c  )              +oworld(r+1,c  )+&
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 / OpenMP 5.0

GPU offloading
• PARALLEL (GR-WS-VS mode) ↔ TEAMS (creates teams of threads; each team executes the instructions redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 / OpenMP 5.0

Data regions
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 / OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, explicit synchronization) ↔ nowait/depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 44: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Execution GRWPVS

!$acc parallel num_gangs(10)
!$acc loop worker
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVS.f90

• Each gang is assigned all the iterations, which are shared among the workers.
• Active threads: 10 × workers
• Number of operations: 10 × nx

Execution GRWSVP

!$acc parallel num_gangs(10)
!$acc loop vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWSVP.f90

• Each gang is assigned all the iterations, which are shared among the threads of the active worker.
• Active threads: 10 × vector length
• Number of operations: 10 × nx

Execution GRWPVP

!$acc parallel num_gangs(10)
!$acc loop worker vector
do i=1, nx
   A(i) = B(i)*i
end do
!$acc end parallel

exemples/loop_GRWPVP.f90

• Each gang is assigned all the iterations, and the grid of threads distributes the work.
• Active threads: 10 × workers × vector length
• Number of operations: 10 × nx
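The gang, worker and vector sizes used above can be bounded explicitly with the num_gangs, num_workers and vector_length clauses. A minimal C sketch (hypothetical example, not from the course files; compiled without an OpenACC compiler the pragma is ignored and the loop runs serially):

```c
#include <stddef.h>

/* Hypothetical sketch: bound all three parallelism levels explicitly.
   Without OpenACC support the pragma is ignored, so the loop runs
   serially and the result is easy to check on the host. */
void scale(const double *b, double *a, size_t nx)
{
    #pragma acc parallel loop gang worker vector \
                num_gangs(10) num_workers(4) vector_length(32)
    for (size_t i = 0; i < nx; ++i)
        a[i] = b[i] * (double)(i + 1);   /* same work as A(i) = B(i)*i */
}
```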

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race condition. Such a risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop.
Inside a parallel region, if several parallel loops are present and there are dependencies between them, then you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1, nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx, 1, -1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41/77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
! The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1, 1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90
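The same merging exists in C with a combined pragma; a hypothetical sketch (not one of the course files; serially the pragma is ignored):

```c
/* Hypothetical C analogue of fused.f90: "kernels" and "loop" merged
   into a single combined directive. Compiled without OpenACC, the
   pragma is ignored and the loop simply runs on the host. */
void double_array(const int *b, int *a, int n)
{
    #pragma acc kernels loop
    for (int i = 0; i < n; ++i)
        a[i] = b[i] * 2;
}
```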


Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations below is executed to get the final value of the variable.

Reduction operators:

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Restrictions:
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
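As an illustration, a dot product in C with a sum reduction (hypothetical sketch, not a course file; compiled without OpenACC the pragma is ignored and the plain serial sum is computed):

```c
/* Each lane gets a private copy of s; the partial sums are combined
   with "+" when the loop ends. Without OpenACC the pragma is ignored
   and the loop runs serially with the same result. */
double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    #pragma acc parallel loop reduction(+:s)
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
```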


Parallelism - Reductions example

!$ACC parallel
do g=1, generations
!$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90


Parallelism - Memory accesses and coalescing

Principles:
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / example with «no coalescing»:


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
!$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
!$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
!$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.
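In C the situation is mirrored: arrays are row-major, so coalesced access requires the vector loop to run over the last index. A hypothetical sketch (not a course file; serially the pragmas are ignored):

```c
#define NX 64

/* Row-major C: consecutive j values are adjacent in memory, so putting
   vector parallelism on the inner j loop gives coalesced accesses --
   the mirror image of the column-major Fortran example above. */
void fill_coalesced(double a[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```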


Parallelism - Memory accesses and coalescing

[Figure: mapping of threads T0–T7 onto the elements of A for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads access elements that are far apart in memory. With coalescing, gang(j)/vector(i): consecutive threads access consecutive elements. With gang-vector(i)/seq(j): each thread walks one row of A sequentially.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90
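The C equivalent uses `#pragma acc routine`; a hypothetical sketch, assuming a flattened 1D layout instead of the Fortran 2D array (serially the pragmas are ignored):

```c
/* The routine pragma asks the compiler for a device version of
   fill_row; "seq" means each call runs sequentially on one thread.
   Without OpenACC the pragmas are ignored. */
#pragma acc routine seq
static void fill_row(int *row, int n, int value)
{
    for (int j = 0; j < n; ++j)
        row[j] = value;
}

void fill_all(int *array, int s)
{
    #pragma acc parallel loop copyout(array[0:s*s])
    for (int i = 0; i < s; ++i)
        fill_row(&array[(long)i * s], s, 2);
}
```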


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened for the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and it is available for the lifetime of the procedure. To make data visible, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: host; D: device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ or p, kept for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach
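A small C illustration of choosing clauses from the access pattern (hypothetical sketch, not a course file; serially the pragma is ignored):

```c
/* x is only read on the device: copyin. y is read and then written
   back to the host: copy. Without OpenACC the pragma is ignored and
   the loop runs serially. */
void saxpy(int n, double alpha, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = alpha * x[i] + y[i];
}
```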


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you give the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1).
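The C restriction above can be sketched as follows (hypothetical example; for a malloc'ed array the compiler cannot deduce the extent, so the element count must appear in the clause; serially the pragma is ignored):

```c
#include <stdlib.h>

/* v is dynamically allocated, so the shape v[0:n] is mandatory in the
   data clause. Without OpenACC the pragma is ignored. */
double *init_vector(int n)
{
    double *v = malloc((size_t)n * sizeof *v);
    if (!v) return NULL;
    #pragma acc parallel loop copyout(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = 0.5 * i;
    return v;
}
```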


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in an implicit data region.]


Data on device - Parallel regions

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1, 100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
!$ACC parallel
!$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No: you still have to move the beginning of the data region before the first loop.


Data on device - Optimizing transfers

[Figure: two successive parallel regions, each using arrays A (copyin), B (copyout), C (copy) and D (create).]

• Naive version, with data clauses repeated on each parallel region: 8 transfers, 2 allocations.
• With an enclosing data region, A, B and C are transferred once, at entry and exit of the data region: 4 transfers, 1 allocation.
• The clauses check for data presence, so to make the code clearer you can use the present clause inside the region. To make sure the updates are done, use the update directive (update device, update self): 6 transfers, 1 allocation.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))   (directive commented out in this version)
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

With update: the same code, with the update directive active.

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

exemples/update_dir.f90

The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of the enter data and exit data directives is that data transfers can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1, s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
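A C sketch of the same pattern (hypothetical names; in real C++ code, setup/teardown would typically live in a constructor and a destructor; serially the pragmas are ignored):

```c
#include <stdlib.h>

static double *vec;   /* device copy managed by enter/exit data */
static int    s;

void setup(int n)     /* allocate on the host, then on the device */
{
    s   = n;
    vec = malloc((size_t)s * sizeof *vec);
    #pragma acc enter data create(vec[0:s])
}

void compute(void)    /* data already present: no transfer here */
{
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 1.2;
}

void teardown(void)   /* free the device copy, then the host copy */
{
    #pragma acc exit data delete(vec[0:s])
    free(vec);
}
```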


Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of the host and device copies for a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D and b and d are allocated on the device. During execution the host and device copies diverge until an update self(b) copies b D→H. When the region ends, only d is copied back D→H (copyout); later modifications of the device copies are lost.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
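In C the directive looks the same; a hypothetical sketch (compiled without OpenACC, the pragmas are no-ops and the host copy is authoritative):

```c
static double a[100];

/* Fill a on the device, then make the host copy current with
   update self before the data region ends. Serially the pragmas
   are ignored and the loop just runs on the host. */
void offload_and_sync(void)
{
    #pragma acc data copyout(a)
    {
        #pragma acc parallel loop
        for (int i = 0; i < 100; ++i)
            a[i] = 1.0;
        /* bring the device copy back to the host inside the region */
        #pragma acc update self(a)
    }
}
```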


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
   c(i) = 1
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90
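The same dependency expressed in C (hypothetical sketch, not a course file; serially the pragmas are ignored and the statements simply execute in order):

```c
/* Stream 1: independent kernel on a. Stream 2: transfer of b. The
   second kernel reads b, so it waits for stream 2. The final wait
   joins every execution thread before returning. */
void pipeline(double *a, double *b, int n)
{
    #pragma acc parallel loop async(1)
    for (int i = 0; i < n; ++i)
        a[i] = 42.0;

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop wait(2) async(3)
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 1.0;

    #pragma acc wait
}
```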



Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example use case: a code that has a race condition.


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
      ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


OpenACC 2.7 / OpenMP 5.0 translation table

GPU offloading:
• OpenACC PARALLEL (GRWSVS mode; each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads, which execute redundantly).
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only).
• OpenACC KERNELS (offloading + automatic parallelization): does not exist in OpenMP.

Implicit transfers H→D or D→H: scalar and non-scalar variables, in both models.

Explicit transfers H→D or D→H inside explicit parallel regions: clauses of PARALLEL/SERIAL/KERNELS in OpenACC, clauses of TARGET in OpenMP (data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()

Data region:
• OpenACC declare() (add data to the global data region).
• OpenACC data / end data (inside the same programming unit) ↔ OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit).
• OpenACC enter data / exit data (opening and closing anywhere in the code) ↔ OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code).
• OpenACC update self() (update D→H) ↔ OpenMP UPDATE FROM (copy D→H inside a data region).
• OpenACC update device() (update H→D) ↔ OpenMP UPDATE TO (copy H→D inside a data region).

Loop parallelization:
• OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops): does not exist in OpenMP.
• OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ OpenMP distribute (share iterations among TEAMS).

Routines:
• OpenACC routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism.

Asynchronism:
• OpenACC async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ OpenMP nowait/depend (manages asynchronism and task dependencies).

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 45: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Work sharing - Loops

Reminder: there is no thread synchronization at gang level, in particular at the end of a loop directive, so there is a risk of race conditions. This risk does not exist for loops with worker and/or vector parallelism, since the threads of a gang wait until the end of the iterations they execute before starting the portion of code after the loop. Inside a parallel region, if several parallel loops are present and there are dependencies between them, you must not use gang parallelism.

!$acc parallel
!$acc loop gang
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop gang reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync.f90

• Result: sum = 97845213 (wrong value)

!$acc parallel
!$acc loop worker vector
do i=1,nx
   A(i) = 1.0_8
end do
!$acc loop worker vector reduction(+:somme)
do i=nx,1,-1
   somme = somme + A(i)
end do
!$acc end parallel

exemples/loop_pb_sync_corr.f90

• Result: sum = 100000000 (right value)

Pierre-François Lavallée, Thibaut Véry 41 / 77

Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop construct. The clauses available for this construct are those of both constructs.

For example, with the kernels directive:

program fused
   integer :: a(1000), b(1000), i
!$ACC kernels <clauses>
!$ACC loop <clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
!$ACC end kernels
!  The construct above is equivalent to:
!$ACC kernels loop <kernels and loop clauses>
   do i=1,1000
      a(i) = b(i)*2
   enddo
end program fused

exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations listed below is applied to obtain the final value of the variable.

Reduction operators:

Operation  Effect       Language(s)
+          Sum          Fortran, C/C++
*          Product      Fortran, C/C++
max        Maximum      Fortran, C/C++
min        Minimum      Fortran, C/C++
&          Bitwise and  C/C++
|          Bitwise or   C/C++
&&         Logical and  C/C++
||         Logical or   C/C++
iand       Bitwise and  Fortran
ior        Bitwise or   Fortran
ieor       Bitwise xor  Fortran
.and.      Logical and  Fortran
.or.       Logical or   Fortran

Restrictions
The variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).

Parallelism - Reductions example

!$ACC parallel
do g=1,generations
!$ACC loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
!$ACC loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous memory accesses by the threads of a worker can be merged. This optimizes the use of the memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» — example with «no coalescing»

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
!$acc loop vector
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
!$acc loop seq
   do j=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)  439                  16

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

[Figure: element-by-element access patterns of threads T0..T7 on array A for the three variants above — gang(i)/vector(j) (without coalescing), and gang(j)/vector(i) and gang-vector(i)/seq(j) (with coalescing). Only in the coalesced mappings do consecutive threads touch consecutive memory locations.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array, s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array, s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may see clauses prefixed with present_or_ (or p) kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU (matrix a).
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 (included) back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy array a, giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198 (included)
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the lower bound of the language (C/C++: 0, Fortran: 1).
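For instance, assuming a is declared as a(1000,1000) (an illustrative assumption, not one of the deck's examples), the two clauses below name the same slice; the second relies on the Fortran default bounds for the first dimension:

```fortran
! assuming: integer :: a(1000,1000)
!$ACC parallel loop copy(a(1:1000,100:199))   ! bounds fully specified
do i = 1, 1000
   do j = 100, 199
      a(i,j) = 42
   enddo
enddo

!$ACC parallel loop copy(a(:,100:199))        ! first dimension defaults to 1:1000
do i = 1, 1000
   do j = 100, 199
      a(i,j) = 42
   enddo
enddo
```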


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel construct opens both a parallel region and its associated data region.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Data on device - Local data regions

(Same code as on the previous slide.)

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version
[Diagram: two successive parallel regions (each with its own data region); at each region the variables A, B, C and D are handled by copyin, copyout, copy and create clauses respectively.]
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers
[Diagram: the same two parallel regions, each still carrying its own copyin, copyout, copy and create clauses.]
Let's add a data region.

Data on device

Optimize data transfers
[Diagram: a data region now encloses both parallel regions and carries the copyin, copyout, copy and create clauses.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers
[Diagram: inside the data region the parallel regions use present clauses; update device and update self are used where the host and device copies must be synchronized.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocations: 1
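The pattern described above can be sketched as follows (a minimal sketch, not the slides' exact code; array names and loop bodies are illustrative):

```fortran
!$ACC data copyin(a) copy(c) create(b)
!$ACC parallel loop present(a, b, c)
do i = 1, n
   b(i) = a(i) + c(i)     ! data already on the device: no transfer here
enddo
a = a * 2                 ! host-side modification of a
!$ACC update device(a)    ! resynchronize the device copy H->D
!$ACC parallel loop present(a, b, c)
do i = 1, n
   c(i) = a(i) + b(i)
enddo
!$ACC update self(c)      ! fetch c D->H while still inside the region
!$ACC end data
```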

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
!!$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
The variables included within a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.
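No code example accompanies this slide in the transcript; a minimal sketch of the device clause, mirroring the update self example (variable names are illustrative):

```fortran
!$ACC data create(b(1:n))
b = 1                     ! b is (re)initialized on the host
!$ACC update device(b)    ! push the new host values H->D
!$ACC parallel loop
do i = 1, n
   b(i) = b(i) + 1        ! the device works on the freshly updated copy
enddo
!$ACC end data
```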

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that data transfers to the device can be managed at a different place from where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Diagram: evolution of a, b, c, d on the host and on the device across a data region opened with copyin(a,c), create(b), copyout(d). a and c are copied H→D at entry and b is allocated without transfer; update self(b) copies b D→H during the region; d is copied D→H (copyout) at exit. Modifications made on one side in between are not visible on the other until an update or the end of the region.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number makes it possible to create several execution threads.

[Diagram: three time lines for Kernel 1, a data transfer and Kernel 2 — fully synchronous; Kernel 1 and the transfer asynchronous; everything asynchronous.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the data transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition. [Screenshot not reproduced in the transcript.]

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)                +oworld(r+1,c)+   &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000×5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Feature                    | OpenACC 2.7 (directives/clauses, notes)              | OpenMP 5.0 (directives/clauses, notes)
GPU offloading             | PARALLEL: GR/WS/VS mode; each team executes          | TEAMS: create teams of threads; execute
                           | instructions redundantly                             | redundantly
                           | SERIAL: only 1 thread executes the code              | TARGET: GPU offloading, initial thread only
                           | KERNELS: offloading + automatic parallelization      | X: does not exist in OpenMP
Implicit transfer          | scalar variables, non-scalar variables               | scalar variables, non-scalar variables
H→D or D→H                 |                                                      |
Explicit transfer H→D or   | clauses on PARALLEL/SERIAL/KERNELS                   | clauses on TARGET; data movement defined
D→H inside explicit        |                                                      | from the host perspective
parallel regions           |                                                      |
                           | copyin(): copy H→D when entering the construct       | map(to:)
                           | copyout(): copy D→H when leaving the construct       | map(from:)
                           | copy(): copy H→D when entering and D→H when leaving  | map(tofrom:)
                           | create(): created on device, no transfers            | map(alloc:)
                           | present(), delete()                                  |

Feature                    | OpenACC 2.7 (directives/clauses, notes)              | OpenMP 5.0 (directives/clauses, notes)
Data region                | declare(): add data to the global data region        |
                           | data / end data: inside the same programming unit    | TARGET DATA MAP(tofrom:) / END TARGET DATA:
                           |                                                      | inside the same programming unit
                           | enter data / exit data: opening and closing          | TARGET ENTER DATA / TARGET EXIT DATA:
                           | anywhere in the code                                 | opening and closing anywhere in the code
                           | update self(): update D→H                            | UPDATE FROM(): copy D→H inside a data region
                           | update device(): update H→D                          | UPDATE TO(): copy H→D inside a data region
Loop parallelization       | kernels: offloading + automatic parallelization of   | X: does not exist in OpenMP
                           | suitable loops; sync at the end of loops             |
                           | loop gang/worker/vector: share iterations among      | distribute: share iterations among TEAMS
                           | gangs, workers and vectors                           |

Feature                    | OpenACC 2.7 (directives/clauses, notes)              | OpenMP 5.0 (directives/clauses, notes)
Routines                   | routine [seq|vector|worker|gang]: compile a device   |
                           | routine with the specified level of parallelism      |
Asynchronism               | async()/wait: execute kernels/transfers              | nowait/depend: manages asynchronism and
                           | asynchronously; explicit sync                        | task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

    !$acc parallel
    !$acc loop gang worker vector
    do i=1,nx
       A(i)=1.14_8*i
    end do
    !$acc end parallel

    exemples/loop1D.f90

    !$acc parallel
    !$acc loop gang worker
    do j=1,nx
       !$acc loop vector
       do i=1,nx
          A(i,j)=1.14_8
       end do
    end do
    !$acc end parallel

    exemples/loop2D.f90

OpenMP:

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i=1,nx
       A(i)=1.14_8*i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

    exemples/loop1DOMP.f90

    !$OMP TARGET TEAMS DISTRIBUTE
    do j=1,nx
       !$OMP PARALLEL DO SIMD
       do i=1,nx
          A(i,j)=1.14_8
       end do
       !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

    exemples/loop2DOMP.f90

OpenACC:

    !$acc parallel
    !$acc loop gang
    do k=1,nx
       !$acc loop worker
       do j=1,nx
          !$acc loop vector
          do i=1,nx
             A(i,j,k)=1.14_8
          end do
       end do
    end do
    !$acc end parallel

    exemples/loop3D.f90

OpenMP:

    !$OMP TARGET TEAMS DISTRIBUTE
    do k=1,nx
       !$OMP PARALLEL DO
       do j=1,nx
          !$OMP SIMD
          do i=1,nx
             A(i,j,k)=1.14_8
          end do
          !$OMP END SIMD
       end do
       !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

    exemples/loop3DOMP.f90

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Parallelism - Merging kernels/parallel directives and loop

It is quite common to open a parallel region just before a loop. In this case it is possible to merge both directives into a kernels loop or parallel loop directive. The clauses available for the combined construct are those of both constructs.

For example, with the kernels directive:

    program fused
       integer :: a(1000), b(1000), i
       !$ACC kernels <clauses>
       !$ACC loop <clauses>
       do i=1,1000
          a(i) = b(i)*2
       enddo
       !$ACC end kernels
       ! The construct above is equivalent to:
       !$ACC kernels loop <kernels and loop clauses>
       do i=1,1000
          a(i) = b(i)*2
       enddo
    end program fused

    exemples/fused.f90

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, one of the operations described in the table below is applied to obtain the final value of the variable.

Reduction operators:

| Operation | Effect | Language(s) |
|---|---|---|
| + | Sum | Fortran/C(++) |
| * | Product | Fortran/C(++) |
| max | Maximum | Fortran/C(++) |
| min | Minimum | Fortran/C(++) |
| & | Bitwise and | C(++) |
| \| | Bitwise or | C(++) |
| && | Logical and | C(++) |
| \|\| | Logical or | C(++) |
| iand | Bitwise and | Fortran |
| ior | Bitwise or | Fortran |
| ieor | Bitwise xor | Fortran |
| .and. | Logical and | Fortran |
| .or. | Logical or | Fortran |

Restrictions: the variable must be a scalar with a numerical type: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex). The OpenACC 2.7 specification allows reductions on arrays, but the implementation is still missing in PGI (for the moment).

Parallelism - Reductions example

    !$ACC parallel
    do g=1,generations
       !$ACC loop
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )              +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             end if
          enddo
       enddo
       cells = 0
       !$ACC loop reduction(+:cells)
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation ", g, cells
    enddo
    !$ACC end parallel

    exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles:
• Contiguous memory accesses by the threads of a worker can be merged (coalesced). This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should access memory contiguously.

[Figure: memory access patterns with and without coalescing.]

Parallelism - Memory accesses and coalescing

    !$acc parallel
    !$acc loop gang
    do i=1,nx
       !$acc loop vector
       do j=1,nx
          A(i,j)=1.14_8
       end do
    end do
    !$acc end parallel

    exemples/loop_nocoalescing.f90

    !$acc parallel
    !$acc loop gang
    do j=1,nx
       !$acc loop vector
       do i=1,nx
          A(i,j)=1.14_8
       end do
    end do
    !$acc end parallel

    exemples/loop_coalescing.f90

    !$acc parallel
    !$acc loop gang vector
    do i=1,nx
       !$acc loop seq
       do j=1,nx
          A(i,j)=1.14_8
       end do
    end do
    !$acc end parallel

    exemples/loop_coalescing.f90

| | Without coalescing | With coalescing |
|---|---|---|
| Time (ms) | 43.9 | 1.6 |

The memory-coalescing version is 27 times faster.


Parallelism - Memory accesses and coalescing

[Figure, built up over several slides: thread-to-memory mapping for the three variants. Without coalescing, gang(i)/vector(j): consecutive threads T0..T7 of a vector access memory locations that are far apart, so their accesses cannot be merged. With coalescing, gang(j)/vector(i): consecutive threads T0..T7 access consecutive memory locations, which are merged into a few transactions. gang-vector(i)/seq(j): each thread walks its own row sequentially.]

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without `!$ACC routine`:

    program routine
       integer :: s = 10000
       integer, allocatable :: array(:,:)
       integer :: i
       allocate(array(s,s))
       !$ACC parallel
       !$ACC loop
       do i=1,s
          call fill(array(:,:), s, i)
       enddo
       !$ACC end parallel
       print *, array(1,10)
    contains
       subroutine fill(array, s, i)
          integer, intent(out) :: array(:,:)
          integer, intent(in)  :: s, i
          integer :: j
          do j=1,s
             array(i,j) = 2
          enddo
       end subroutine fill
    end program routine

    exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

    program routine
       integer :: s = 10000
       integer, allocatable :: array(:,:)
       integer :: i
       allocate(array(s,s))
       !$ACC parallel copyout(array)
       !$ACC loop
       do i=1,s
          call fill(array(:,:), s, i)
       enddo
       !$ACC end parallel
       print *, array(1,10)
    contains
       subroutine fill(array, s, i)
       !$ACC routine seq
          integer, intent(out) :: array(:,:)
          integer, intent(in)  :: s, i
          integer :: j
          do j=1,s
             array(i,j) = 2
          enddo
       end subroutine fill
    end program routine

    exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions:

• Computation offloading: offloaded regions (serial, parallel, kernels) are associated with a local data region.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Global region: an implicit data region is open during the lifetime of the program. It is managed with the enter data and exit data directives.
• Data region associated with a programming-unit lifetime: a data region is created when a procedure (function or subroutine) is called and lives for the duration of that procedure. To make data visible there, use the declare directive.

Note: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

(H: Host, D: Device; variable: scalar or array.)

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement:
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default, the clauses check whether the variable is already present on the device; if so, no action is taken. You may still see clauses prefixed with present_or_ (or p), kept for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you give the first and last index.

    ! We copy a 2d array on the GPU for matrix a.
    ! In Fortran we can omit the shape: copy(a)
    !$ACC parallel loop copy(a(1:1000,1:1000))
    do i=1,1000
       do j=1,1000
          a(i,j) = 0
       enddo
    enddo
    ! We copy columns 100 to 199 included
    !$ACC parallel loop copy(a(:,100:199))
    do i=1,1000
       do j=100,199
          a(i,j) = 42
       enddo
    enddo

    exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you give the first index and the number of elements.

    // We copy the array a by giving the first element and the size of the array
    #pragma acc parallel loop copy(a[0:1000][0:1000])
    for (int i=0; i<1000; ++i)
        for (int j=0; j<1000; ++j)
            a[i][j] = 0;
    // Copy columns 99 to 198
    #pragma acc parallel loop copy(a[0:1000][99:100])
    for (int i=0; i<1000; ++i)
        for (int j=99; j<199; ++j)
            a[i][j] = 42;

    exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last dimension of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

    program para
       integer :: a(1000)
       integer :: i, l

       !$ACC parallel copyout(a(1:1000))
       !$ACC loop
       do i=1,1000
          a(i) = i
       enddo
       !$ACC end parallel
    end program para

    exemples/parallel_data.f90

(The parallel region implicitly opens a data region around it.)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

    program para
       integer :: a(1000)
       integer :: i, l

       !$ACC parallel copyout(a(1:1000))
       !$ACC loop
       do i=1,1000
          a(i) = i
       enddo
       !$ACC end parallel

       do l=1,100
          !$ACC parallel copy(a(1:1000))
          !$ACC loop
          do i=1,1000
             a(i) = a(i) + 1
          enddo
          !$ACC end parallel
       enddo
    end program para

    exemples/parallel_data_multi.f90

Notes:
• The compute region is reached 100 times.
• The data region is reached 100 times (entry and exit, so potentially 200 data transfers).

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in its clauses visible on the device. Use the data directive:

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1,1000
       a(i) = i
    enddo
    !$ACC end parallel

    !$ACC data copy(a(1:1000))
    do l=1,100
       !$ACC parallel
       !$ACC loop
       do i=1,1000
          a(i) = a(i) + 1
       enddo
       !$ACC end parallel
    enddo
    !$ACC end data

    exemples/parallel_data_single.f90

Notes:
• The compute region is reached 100 times.
• The data region is reached 1 time (entry and exit, so potentially 2 data transfers).

Optimal? No: you have to move the beginning of the data region before the first loop as well.

Data on device - Counting transfers

[Figure, built up over several slides: two successive parallel regions, each using arrays A (copyin), B (copyout), C (copy) and D (create).]

• Naive version: each parallel region carries its own data clauses, so A, B and C are moved by both regions and D is allocated twice. Transfers: 8, allocations: 2.
• Let's add an enclosing data region: A, B and C are now transferred only at entry and exit of the data region, and D is allocated once. Transfers: 4, allocations: 1.
• The data clauses check for presence, so the inner clauses become no-ops; to make the code clearer you can replace them with present clauses. To make sure the values are refreshed between the two regions, use the update directive (update device for A, update self for C). Transfers: 6, allocations: 1.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1,1000
       a(i) = 0
    enddo
    do j=1,42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1,1000
          a(i) = a(i) + rng
       enddo
    enddo
    ! print *, "before update self: ", a(42)
    ! !$ACC update self(a(42:42))
    ! print *, "after update self: ", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data: ", a(42)
    !$ACC end data
    print *, "after end data: ", a(42)

    exemples/update_dir_comment.f90

Without update: the a array is not up to date on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1,1000
       a(i) = 0
    enddo
    do j=1,42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1,1000
          a(i) = a(i) + rng
       enddo
    enddo
    print *, "before update self: ", a(42)
    !$ACC update self(a(42:42))
    print *, "after update self: ", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data: ", a(42)
    !$ACC end data
    print *, "after end data: ", a(42)

    exemples/update_dir.f90

With update: the a array is up to date on the host right after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes: one benefit of the enter data and exit data directives is that data transfers can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

    program enterdata
       integer :: s = 10000
       real*8, allocatable, dimension(:) :: vec
       allocate(vec(s))
       !$ACC enter data create(vec(1:s))
       call compute
       !$ACC exit data delete(vec(1:s))
    contains
       subroutine compute
          integer :: i
          !$ACC parallel loop
          do i=1,s
             vec(i) = 1.2
          enddo
       end subroutine
    end program enterdata

    exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where the directive appears. For example:

| Zone | Scope |
|---|---|
| module | the program |
| function | the function |
| subroutine | the subroutine |

    module declare
       integer :: rows = 1000
       integer :: cols = 1000
       integer :: generations = 100
       !$ACC declare copyin(rows, cols, generations)
    end module

    exemples/declare.f90

Data on device - Time line

[Figure: host/device time line for a data region opened with copyin(a,c) create(b) copyout(d). Entering the region copies a and c H→D and allocates b and d on the device; kernels then modify the device copies; an update self(b) makes the device value of b visible on the host in the middle of the region; d is copied D→H (copyout) only when the region ends.]

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since data clauses can also appear on kernels and parallel constructs, if an enclosing data region is opened with data, use update to avoid unexpected behaviors. The same applies to data regions opened with enter data.

The update directive is used to update data either on the host or on the device:

    ! Update the value of a located on host with the value on device
    !$ACC update self(a)
    ! Update the value of b located on device with the value on host
    !$ACC update device(b)

    exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernels, if they are independent.

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Figure: three time lines. Synchronous: kernel 1, then the data transfer, then kernel 2. Kernel 1 and the transfer async: they overlap. Everything async: kernel 1, the transfer and kernel 2 all overlap.]

    ! b is initialized on host
    do i=1,s
       b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
       a(i) = 42
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop async(3)
    do i=1,s
       c(i) = 1
    enddo

    exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution-thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important note: the argument of wait is a list of integers identifying execution threads. For example, wait(1,2) waits until the completion of execution threads 1 and 2.

[Figure: kernel 2 waits for the transfer running on execution thread 2.]

    ! b is initialized on host
    do i=1,s
       b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
       a(i) = 42
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop wait(2)
    do i=1,s
       b(i) = b(i) + 1
    enddo

    exemples/async_wait_slides.f90

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Debugging - PGI autocompare

• Automatic detection of result differences between GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the target options at compilation (e.g. -ta=tesla:cc60,autocompare).
• Example of code that has a race condition: [screenshot omitted]

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the bare parallel directive:

    do g=1,generations
       cells = 0
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )              +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             end if
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             ...

    exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased because the code is executed sequentially and redundantly by all the gangs.

Optimization - Game of Life

Let's share the work among the active gangs:

    do g=1,generations
       cells = 0
       !$ACC parallel loop
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )              +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             end if
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation ", g, cells
    enddo

    exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

The loop iterations are now distributed among the gangs.

Optimization - Game of Life

Let's optimize the data transfers:

    cells = 0
    !$ACC data copy(oworld, world)
    do g=1,generations
       cells = 0
       !$ACC parallel loop
       do c=1,cols
          do r=1,rows
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do c=1,cols
          do r=1,rows
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                     oworld(r-1,c  )              +oworld(r+1,c  )+ &
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             end if
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do c=1,cols
          do r=1,rows
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation ", g, cells
    enddo
    !$ACC end data

    exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table
6 Contacts


  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 47: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Reductions

A variable assigned to a reduction is privatized for each element of the parallelism level of the loop. At the end of the region, an operation is executed among those described in table 1 to get the final value of the variable.

Operations

Operation   Effect        Language(s)
+           Sum           Fortran, C(++)
*           Product       Fortran, C(++)
max         Maximum       Fortran, C(++)
min         Minimum       Fortran, C(++)
&           Bitwise and   C(++)
|           Bitwise or    C(++)
&&          Logical and   C(++)
||          Logical or    C(++)
iand        Bitwise and   Fortran
ior         Bitwise or    Fortran
ieor        Bitwise xor   Fortran
.and.       Logical and   Fortran
.or.        Logical or    Fortran

Table 1: Reduction operators

Restrictions
The variable must be a scalar with numerical value: C (char, int, float, double, _Complex), C++ (char, wchar_t, int, float, double), Fortran (integer, real, double precision, complex).
In the OpenACC 2.7 specification, reductions are possible on arrays, but the implementation is lacking in PGI (for the moment).
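The table above can be exercised with a small, hedged C sketch (my own illustration, not from the slides; the function name is invented). The reduction clause gives each parallel element a private copy of `cells` and combines the partial results with `+` at the end of the loop; compiled without OpenACC support, the pragma is simply ignored and the loop runs serially with the same result.

```c
#include <assert.h>

/* Count live cells with a "+" reduction: each gang/worker/vector element
 * works on a private copy of "cells"; the partial sums are combined when
 * the loop ends. */
long count_alive(const int *world, int n)
{
    long cells = 0;
    #pragma acc parallel loop reduction(+:cells)
    for (int i = 0; i < n; ++i)
        cells += world[i];
    return cells;
}
```

With a PGI/NVHPC compiler this would be built with -acc; any other C compiler ignores the pragma and still produces the correct serial result.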

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 43 77

Parallelism - Reductions example

!$ACC parallel
do g=1, generations
  !$ACC loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  cells = 0
  !$ACC loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Parallelism - Memory accesses and coalescing

Principles
• Contiguous accesses to memory by the threads of a worker can be merged. This optimizes the use of memory bandwidth.
• This happens if thread i reaches memory location n and thread i+1 reaches memory location n+1, and so on.
• For loop nests, the loop which has vector parallelism should have contiguous access to memory.

Example with «coalescing» / Example without coalescing (figures)
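To translate the principle into C (a hedged sketch of mine, not from the slides): C arrays are row-major, so the contiguous index is the last one, the opposite of the Fortran examples that follow, where the first index is contiguous. The vector loop should therefore run over j here.

```c
#include <assert.h>

#define NX 64

/* Coalesced version: the threads of a vector handle consecutive j values,
 * i.e. consecutive addresses a[i][j], a[i][j+1], ... in row-major C. */
void init_coalesced(double a[NX][NX])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < NX; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < NX; ++j)
            a[i][j] = 1.14;
    }
}
```

Swapping the two loops (vector over i) would make neighboring threads touch addresses NX doubles apart, which is the non-coalesced pattern the slide measures.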

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
  !$acc loop vector
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
  !$acc loop seq
  do j=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

            Without coalescing   With coalescing
Time (ms)   43.9                 1.6

The memory-coalescing version is 27 times faster.

Parallelism - Memory accesses and coalescing

(Figure: the memory-access pattern of threads T0-T7 on array A for the three mappings. Without coalescing, gang(i)/vector(j): consecutive threads access elements a whole row apart. With coalescing, gang(j)/vector(i): consecutive threads access consecutive elements. gang-vector(i)/seq(j): each thread walks its own row sequentially.)

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as routine. This gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1, s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1, s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region.
• copyout: the variable is copied D→H; the memory is allocated when entering the region.
• copy: copyin + copyout.

No data movement
• create: the memory is allocated when entering the region.
• present: the variable is already on the device.
• delete: frees the memory allocated on the device for this variable.

By default the clauses check whether the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
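A hedged C sketch of mine combining two of the clauses above (the function name is invented): b is only read on the device, so copyin suffices, and a is only produced there, so copyout avoids a useless H→D transfer. Without an OpenACC compiler the pragma is ignored and the loop runs on the host.

```c
#include <assert.h>

/* copyin(b): allocate b on the device and copy H->D on entry.
 * copyout(a): allocate a on the device and copy D->H on exit. */
void add_offset(double *a, const double *b, int n)
{
    #pragma acc parallel loop copyin(b[0:n]) copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = b[i] + 1.0;
}
```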

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and the last index.

! We copy a 2D array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
  do j=1, 1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
  do j=100, 199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
    for (int j = 0; j < 1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
    for (int j = 99; j < 199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered to be the default of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

(The parallel construct implicitly opens a data region of the same extent.)

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region handling A (copyin), B (copyout), C (copy) and D (create) separately.
• Transfers: 8
• Allocations: 2

Optimizing data transfers: let's add an enclosing data region carrying the same clauses. A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Since the clauses check for data presence, to make the code clearer you can use the present clause on the inner parallel regions. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo

do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

Without the update directive (exemples/update_dir_comment.f90, the same code with the update line commented out), the a array is not initialized on the host before the end of the data region. With it, the a array is initialized on the host right after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.
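A hedged C sketch (mine, not the slides'): b is created on the device but initialized on the host inside the data region, so the device copy must be refreshed with update device before a kernel reads it. With the pragmas ignored, everything runs on the host and the result is unchanged.

```c
#include <assert.h>

/* create(b): device-only allocation, no transfer.
 * copyout(a): the result comes back D->H when the region ends. */
void compute_with_host_init(double *a, double *b, int n)
{
    #pragma acc data create(b[0:n]) copyout(a[0:n])
    {
        for (int i = 0; i < n; ++i)        /* host-side initialization */
            b[i] = 2.0 * i;
        #pragma acc update device(b[0:n])  /* refresh the device copy H->D */
        #pragma acc parallel loop present(a[0:n], b[0:n])
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + 1.0;
    }
}
```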

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1, s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

(Figure: time line of a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied to the device and b and d are allocated there. Kernels then modify the device copies while the host values stay stale; update self(b) copies b back D→H in the middle of the region; when the region ends, only d is copied out D→H.)

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Figure: three schedules of kernel 1, the data transfer and kernel 2: fully synchronous; kernel 1 and the transfer asynchronous; everything asynchronous, with increasing overlap.)

! b is initialized on the host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
  c(i) = 1
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or wait directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

(Figure: kernel 2 waits for the transfer to complete.)

! b is initialized on the host
do i=1, s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:
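The slide's own code is not reproduced in this transcript, so here is a hedged C illustration (mine, with an invented function name) of the kind of defect autocompare catches: every iteration writes the same scalar, so on the GPU the final value depends on which thread wrote last, while the serial CPU run deterministically keeps v[n-1]. The two executions diverge and autocompare stops the program.

```c
#include <assert.h>

/* Race condition: all iterations write "last" concurrently on the GPU.
 * Compiled without OpenACC the loop is serial and returns v[n-1]. */
double last_written(const double *v, int n)
{
    double last = 0.0;
    #pragma acc parallel loop copy(last)
    for (int i = 0; i < n; ++i)
        last = v[i];
    return last;
}
```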


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      ! ... (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )              +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

The H→D and D→H transfers have been optimized.


OpenACC 2.7 ↔ OpenMP 5.0 translation table

GPU offloading
• PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions: clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()

Data regions
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(...) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
  A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
  A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
  !$OMP PARALLEL DO SIMD
  do i=1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
  !$acc loop worker
  do j=1, nx
    !$acc loop vector
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
  !$OMP PARALLEL DO
  do j=1, nx
    !$OMP SIMD
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 48: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Reductions example

!$ACC parallel
do g=1, generations
   !$ACC loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)   + oworld(r+1,c) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   cells = 0
   !$ACC loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end parallel

exemples/gol_parallel_loop_reduction.f90

Pierre-François Lavallée, Thibaut Véry 44/77

Parallelism - Memory accesses and coalescing

Principles
- Contiguous accesses to memory by the threads of a worker can be merged ("coalesced"). This optimizes the use of memory bandwidth.
- This happens if thread i reaches memory location n, thread i+1 reaches memory location n+1, and so on.
- For loop nests, the loop which has vector parallelism should have contiguous access to memory.

[Figure: example with coalescing vs. example without coalescing]

Pierre-François Lavallée, Thibaut Véry 45/77

Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1, nx
   !$acc loop vector
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1, nx
   !$acc loop seq
   do j=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

           Without coalescing   With coalescing
Time (ms)        43.9                1.6

The memory coalescing version is 27 times faster.

Pierre-François Lavallée, Thibaut Véry 46/77

Parallelism - Memory accesses and coalescing

[Diagram: threads T0 to T7 accessing elements of a 2D array under three mappings.
Without coalescing: gang(i) / vector(j), the threads of a vector access non-contiguous memory locations.
With coalescing: gang(j) / vector(i), or gang-vector(i) / seq(j), the threads of a vector access contiguous memory locations.]

Pierre-François Lavallée, Thibaut Véry 47/77


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1, s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1, s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
- serial
- parallel
- kernels

Global region
An implicit data region is opened during the lifetime of the program. This region is managed with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with a programming unit's lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement
- copyin: the variable is copied H→D; the memory is allocated when entering the region.
- copyout: the variable is copied D→H; the memory is allocated when entering the region.
- copy: copyin + copyout.

No data movement
- create: the memory is allocated when entering the region.
- present: the variable is already on the device.
- delete: frees the memory allocated on the device for this variable.

By default, the clauses check whether the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Pierre-François Lavallée, Thibaut Véry 50/77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1, 1000
   do j=1, 1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1, 1000
   do j=100, 199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions
- In Fortran, the last index of an assumed-size dummy array must be specified.
- In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
- The shape must be specified when using a slice.
- If the first index is omitted, it defaults to the language's base index (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel construct opens both a data region and a parallel region.]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1, 1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1, 100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1, 1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
- Compute region reached 100 times.
- Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
   !$ACC parallel
   !$ACC loop
   do i=1, 1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
- Compute region reached 100 times.
- Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device

Naive version
[Diagram: two successive parallel regions, each with its own implicit data region; the variables A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of each region.]
- Transfers: 8
- Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers
[Diagram: the same two parallel regions; A (copyin), B (copyout), C (copy) and D (create) are still handled separately by each region.]

Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers
[Diagram: a data region now encloses both parallel regions; A (copyin), B (copyout), C (copy) and D (create) are handled once, by the enclosing data region.]

A, B and C are now transferred at the entry and exit of the data region.
- Transfers: 4
- Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers
[Diagram: with the enclosing data region, the parallel regions use present for A, B, C and D; update device and update self refresh A and C where needed.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
- Transfers: 6
- Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
   a(i) = 0
enddo

do j=1, 42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1, 1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1, s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Timeline diagram: host and device copies of the variables a, b, c and d during a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D while b and d are allocated on the device; an update self(b) copies b back D→H in the middle of the region; when the region ends, d is copied D→H.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

- Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
- Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)

! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
- computation and data transfers;
- kernel and kernel, if they are independent.

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer, then Kernel 2 one after the other; with Kernel 1 and the transfer async, they overlap; with everything async, all three overlap.]

! b is initialized on the host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1, s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution-thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
- The argument of wait (clause or directive) is a list of integers representing execution-thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer before starting.]

! b is initialized on the host
do i=1, s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1, s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1, s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

- Automatic detection of result differences when comparing GPU and CPU execution.
- How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

- To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
- Example of code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)   + oworld(r+1,c) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ! ... (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
- Size: 10000 x 5000
- Generations: 100
- Serial time: 26 s
- Time: 34 min 27 s

The time to solution has increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)   + oworld(r+1,c) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info
- Size: 10000 x 5000
- Generations: 100
- Serial time: 26 s
- Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)   + oworld(r+1,c) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
- Size: 10000 x 5000
- Generations: 100
- Serial time: 26 s
- Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Feature: GPU offloading
  OpenACC PARALLEL: GR/WS/VS mode; each gang executes the instructions redundantly.
  OpenMP TEAMS: creates teams of threads that execute redundantly.

  OpenACC SERIAL: only 1 thread executes the code.
  OpenMP TARGET: GPU offloading, initial thread only.

  OpenACC KERNELS: offloading + automatic parallelization.
  OpenMP: does not exist in OpenMP.

Feature: Implicit transfer H→D or D→H
  OpenACC: scalar variables, non-scalar variables.
  OpenMP: scalar variables, non-scalar variables.

Feature: Explicit transfer H→D or D→H inside explicit parallel regions
  OpenACC: clauses on PARALLEL/SERIAL/KERNELS.
  OpenMP: clauses on TARGET; data movement is defined from the host perspective.

  OpenACC copyin()  (copy H→D when entering the construct)    <-> OpenMP map(to:)
  OpenACC copyout() (copy D→H when leaving the construct)     <-> OpenMP map(from:)
  OpenACC copy()    (copy H→D at entry and D→H at exit)       <-> OpenMP map(tofrom:)
  OpenACC create()  (created on the device, no transfers)     <-> OpenMP map(alloc:)
  OpenACC present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Feature: Data region
  OpenACC declare(): adds data to the global data region.

  OpenACC data / end data (inside the same programming unit)
  <-> OpenMP TARGET DATA MAP(to/from:) / END TARGET DATA (inside the same programming unit)

  OpenACC enter data / exit data (opening and closing anywhere in the code)
  <-> OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)

  OpenACC update self() (update D→H)
  <-> OpenMP UPDATE FROM (copy D→H inside a data region)

  OpenACC update device() (update H→D)
  <-> OpenMP UPDATE TO (copy H→D inside a data region)

Feature: Loop parallelization
  OpenACC kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops.
  OpenMP: does not exist in OpenMP.

  OpenACC loop gang/worker/vector: share iterations among gangs, workers and vectors.
  OpenMP distribute: share iterations among TEAMS.

Pierre-François Lavallée, Thibaut Véry 72/77

Feature: Routines
  OpenACC routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism.

Feature: Asynchronism
  OpenACC async()/wait: execute kernels and transfers asynchronously, with explicit synchronization.
  OpenMP nowait/depend: manages asynchronism and task dependencies.

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

- Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
- Thibaut Véry: thibaut.very@idris.fr
- Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 49: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Memory accesses and coalescing

Principlesbull Contiguous access to memory by the threads of a worker can be merged This optimizes the

use of memory bandwidthbull This happens if thread i reaches memory location n and thread i + 1 reaches memory

location n + 1 and so onbull For loop nests the loop which has vector parallelism should have contiguous access to

memory

Example with laquoCoalescingraquo Example with laquonoCoalescingraquo

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 45 77

Parallelism - Memory accesses and coalescing

$acc p a r a l l e l10 $acc loop gang

do i =1 nx $acc loop vec to rdo j =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop_nocoalescingf90

$acc p a r a l l e l20 $acc loop gang

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_825 end do

end do $acc end p a r a l l e l

exemplesloop_coalescingf90

$acc p a r a l l e l30 $acc loop gang vec to r

do i =1 nx $acc loop seqdo j =1 nx

A( i j )=114_835 end do

end do $acc end p a r a l l e l

exemplesloop_coalescingf90

Without Coalescing With CoalescingTps (ms) 439 16

The memory coalescing version is 27 times faster

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

With coalescing

gang(j)vector(i)T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

T1

T1

X

X

T2

T2

X

X

T3

T3

X

X

T4

T4

X

X

T5

T5

X

X

T6

T6

X

X

T7

T7

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

X

X

T1

T1

X

X

X

X

T2

T2

X

X

X

X

T3

T3

X

X

X

X

T4

T4

X

X

X

X

T5

T5

X

X

X

X

T6

T6

X

X

X

X

T7

T7

X

X

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array, s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array, s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime
A data region is created when a procedure is called (function or subroutine). It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken.
It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a; in Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i = 0; i < 1000; ++i)
  for (int j = 0; j < 1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i = 0; i < 1000; ++i)
  for (int j = 99; j < 199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it is considered as the default of the language (C/C++: 0, Fortran: 1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

[Figure: two successive parallel regions (each with its own implicit data region) operating on four arrays A, B, C and D with the clauses copyin, copyout, copy and create.]

Naive version: each parallel region performs its own transfers and allocations.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

The clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are done, use the update directive (update device, update self).
• Transfers: 6
• Allocation: 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where they are used. It is especially useful when using C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: Scope
Module: the program
Function: the function
Subroutine: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d): on entry, a and c are copied H→D while b and d are only allocated on the device; the kernels then modify the device copies while the host copies keep their old values; update self(b) copies b D→H mid-region; on exit of the data region only d is copied back D→H.]

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: time lines for Kernel 1, the data transfer and Kernel 2 — first executed synchronously, then with Kernel 1 and the transfer async, then with everything async.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of a code that has a race condition:

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading
• PARALLEL (GRWSVS mode - each gang executes instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Data region
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manages asynchronism and task dependencies)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77


Parallelism - Memory accesses and coalescing

!$acc parallel
!$acc loop gang
do i=1,nx
  !$acc loop vector
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_nocoalescing.f90

!$acc parallel
!$acc loop gang
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

!$acc parallel
!$acc loop gang vector
do i=1,nx
  !$acc loop seq
  do j=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop_coalescing.f90

Time (ms): without coalescing 439, with coalescing 16.

The memory coalescing version is 27 times faster.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 46 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)

With coalescing

gang(j)vector(i)

gang-vector(i)seq(j)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

With coalescing

gang(j)vector(i)T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

T1

T1

X

X

T2

T2

X

X

T3

T3

X

X

T4

T4

X

X

T5

T5

X

X

T6

T6

X

X

T7

T7

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

X

X

T1

T1

X

X

X

X

T2

T2

X

X

X

X

T3

T3

X

X

X

X

T4

T4

X

X

X

X

T5

T5

X

X

X

X

T6

T6

X

X

X

X

T7

T7

X

X

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Without ACC routine

program r o u t i n ei n t e g e r s = 10000

3 in teger a l l o c a t a b l e a r ray ( )a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l$ACC loopdo i =1 s

8 c a l l f i l l ( a r ray ( ) s i )enddo$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta ins

13 subrou t ine f i l l ( array s i )i n teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

18 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutine_wrongf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Correct version

program r o u t i n ei n t e g e r s = 10000in teger a l l o c a t a b l e a r ray ( )

4 a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l copyout ( a r ray )$ACC loopdo i =1 s

c a l l f i l l ( a r ray ( ) s i )9 enddo

$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta inssubrou t ine f i l l ( array s i )

14 $ACC r o u t i n e seqin teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

19 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutinef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices by opening different kinds of dataregions

Computation offloadingOffloading regions are associated with a localdata regionbull serialbull parallelbull kernels

Global regionAn implicit data region is opened during thelifetime of the program The management of thisregion is done with enter data and exit datadirectives

Local regionsTo open a data region inside a programming unit(function subroutine) use data directive inside acode block

Data region associated toprogramming unit lifetimeA data region is created when a procedure iscalled (function or subroutine) It is availableduring the lifetime of the procedure To makedata visible use declare directive

NotesThe actions taken for the data inside these regions depend on the clause in which they appear

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H Host D Device variable scalar or array

Data movementbull copyin The variable is copied HrarrD the

memory is allocated when entering theregion

bull copyout The variable is copied DrarrH thememory is allocated when entering theregion

bull copy copyin + copyout

No data movementbull create The memory is allocated when

entering the regionbull present The variable is already on the

devicebull delete Frees the memory allocated on the

device for this variable

By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility

Other clausesbull no_createbull deviceptr

bull attachbull detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51 / 77

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language's default (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51 / 77
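A minimal C sketch of the dynamic-allocation rule above (the function name is illustrative): the compiler cannot deduce the size of a malloc'd array, so the element count must appear in the clause.

```c
#include <stdlib.h>

/* Fill a dynamically allocated array on the device; the shape v[0:n]
   (first index 0, n elements) is mandatory because v is just a pointer. */
double *iota(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc parallel loop copyout(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = (double)i;
    return v;
}
```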

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))   ! opens both the data region and the parallel region
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Pierre-François Lavallée, Thibaut Véry 53 / 77

Data on device - Local data regions

(Same code as on the previous slide.)

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop, so that one data region covers every kernel.

Pierre-François Lavallée, Thibaut Véry 53 / 77
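The recommended layout, one data region opened before the first kernel, can be sketched in C (hypothetical function and sizes; with the pragmas ignored it runs serially and produces the same values):

```c
/* One data region keeps `a` resident on the device across every kernel:
   allocation at entry, a single copy D->H at exit. */
void iterate(int *a, int n, int gens)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop present(a)
        for (int i = 0; i < n; ++i)
            a[i] = i;                 /* initialization kernel */

        for (int g = 0; g < gens; ++g) {
            #pragma acc parallel loop present(a)
            for (int i = 0; i < n; ++i)
                a[i] += 1;            /* 100x: no transfer per iteration */
        }
    }
}
```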

Data on device

Naive version: two successive compute regions, each with its own implicit data region, use the same four variables A (copyin), B (copyout), C (copy) and D (create). Every clause acts again at each region:
• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: the same two compute regions still perform every transfer twice (copyin, copyout, copy and create on each region). Let's add a data region.

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: with an enclosing data region carrying copyin(A), copyout(B), copy(C) and create(D), the variables A, B and C are now transferred at entry and exit of the data region and the compute regions find them already present:
• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device

Optimize data transfers: the clauses check for data presence, so to make the code clearer you can use the present clause on the compute regions. To make sure the updates are done, use the update directive (update device before a kernel needs a host-side change, update self when the host needs device results):
• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77
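Both update directions can be sketched in one C function (hypothetical example; serially the pragmas are no-ops and the host array is the only copy, which yields the same final values as the GPU run):

```c
/* update self refreshes the host copy mid-region; update device pushes a
   host-side modification back before the region's copyout takes effect. */
void demo(double *a, int n)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = 1.0;

        #pragma acc update self(a[0:n])    /* host now sees the ones */
        a[0] = 42.0;                       /* host-side modification */
        #pragma acc update device(a[0:n])  /* push it back H->D      */
    }
}
```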

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77
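The same create/compute/delete split can be sketched in C, which is how the constructor/destructor idea mentioned above looks outside C++ (function names are illustrative):

```c
#include <stdlib.h>

/* Device allocation happens where the array is created, release where it
   is freed, far away from the kernels that use it. */
double *vec_create(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])
    return v;
}

void vec_fill(double *v, int n, double x)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; ++i)
        v[i] = x;
}

void vec_destroy(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])
    free(v);
}
```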

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
Module       the program
Function     the function
Subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

[Figure: a host/device time line for a data region opened with copyin(a,c) create(b) copyout(d). Entering the region copies a and c H→D and allocates b and d on the device; kernels then modify the device copies while the host copies go stale; an update self(b) refreshes b on the host in the middle of the region; d is copied D→H only when the region ends (copyout).]

Pierre-François Lavallée, Thibaut Véry 59 / 77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three time lines: fully synchronous (Kernel 1, then the data transfer, then Kernel 2), "Kernel 1 and transfer async" (transfer overlapped with Kernel 1), and "everything async" (kernels and transfer all overlapped).]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add wait(execution thread number) to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 runs concurrently with the data transfer; Kernel 2 waits for the transfer before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77
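A C sketch of the same queue/wait pattern (hypothetical function; the data region is added so the update has something present on the device; a compiler that ignores the pragmas runs it sequentially, which is also a valid non-overlapped execution):

```c
/* Queue 1: kernel on a. Queue 2: push b H->D. Queue 3: a kernel that
   reads b, so it must wait(2). The final wait joins all queues. */
void pipeline(double *a, double *b, int n)
{
    #pragma acc data copyout(a[0:n]) copy(b[0:n])
    {
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] = 42.0;

        #pragma acc update device(b[0:n]) async(2)

        #pragma acc parallel loop wait(2) async(3)
        for (int i = 0; i < n; ++i)
            b[i] += 11.0;

        #pragma acc wait   /* join all execution threads */
    }
}
```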

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63 / 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64 / 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition

Pierre-François Lavallée, Thibaut Véry 65 / 77
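The slide's race-condition listing did not survive extraction, so here is a minimal hypothetical kernel with the kind of race autocompare flags: every iteration may write the same scalar, so the GPU result depends on gang scheduling while the CPU result is deterministic.

```c
/* Find the last index holding a nonzero value. Serially this is well
   defined; offloaded, many gangs race on `last`, so GPU and CPU runs can
   disagree, which is exactly what autocompare detects. */
int last_index(const int *v, int n)
{
    int last = -1;
    #pragma acc parallel loop copyin(v[0:n]) copy(last)
    for (int i = 0; i < n; ++i) {
        if (v[i] != 0)
            last = i;   /* race: concurrent unsynchronized writes */
    }
    return last;
}
```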

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66 / 77

Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
!$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1, rows
      do c=1, cols
      ! (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
(clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET; data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

Data region
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to:/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72 / 77

Features: OpenACC 2.7 ↔ OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async() / wait (execute kernels/transfers asynchronously, explicit synchronization) ↔ nowait / depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73 / 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76 / 77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77


Parallelism - Memory accesses and coalescing

Memory accesses are coalesced when consecutive threads of a vector access contiguous memory words: the accesses of the group are then served by a single memory transaction. For a Fortran array a(i,j), whose first index i is contiguous in memory:

Without coalescing
• gang(i) / vector(j): consecutive threads T0..T7 run along j, so each thread touches words that are a whole column apart; most of every memory transaction is wasted.

With coalescing
• gang(j) / vector(i): consecutive threads run along i and touch contiguous words; each memory transaction is fully used.
• gang-vector(i) / seq(j): the threads are distributed over i and each thread iterates sequentially over j; the accesses within a transaction are again contiguous.

[The original slides illustrated this with successive builds of a diagram: threads T0-T7 mapped onto the array, with crosses marking the memory words fetched but not used.]

Pierre-François Lavallée, Thibaut Véry 47 / 77
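A C analogue of the coalesced mapping above (hypothetical function; note that C is row-major, so the contiguous index is the LAST one and the vector loop should therefore run over j):

```c
#define N 64

/* gang over rows, vector over the contiguous column index j:
   consecutive threads touch consecutive elements of a[i][...]. */
void init(double a[N][N])
{
    #pragma acc parallel loop gang
    for (int i = 0; i < N; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < N; ++j)
            a[i][j] = 1.14 * (i + j);
    }
}
```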

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it with the routine directive. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48 / 77

Data on device - Opening a data region

There are several ways of making data visible on devices by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming-unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49 / 77

Data on device - Data clauses

H Host D Device variable scalar or array

Data movementbull copyin The variable is copied HrarrD the

memory is allocated when entering theregion

bull copyout The variable is copied DrarrH thememory is allocated when entering theregion

bull copy copyin + copyout

No data movementbull create The memory is allocated when

entering the regionbull present The variable is already on the

devicebull delete Frees the memory allocated on the

device for this variable

By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility

Other clausesbull no_createbull deviceptr

bull attachbull detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++

FortranThe array shape is to be specified inparentheses You must specify the first and lastindex

5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000

do j =1 100010 a ( i j ) = 0

enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )

15do i =1 1000do j =100199

a ( i j ) = 42enddo

enddo

exemplesarrayshapef90

CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements

5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )

f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0

Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )

f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42

exemplesarrayshapec

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified

Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0

Fortran1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)i n t e g e r i l

5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = i10 enddo

$ACC end p a r a l l e lend program para

exemplesparallel_dataf90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of using the enter data and exit data directives is to manage the data transfers on the device at another place than where the data are used. It is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90


Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
module     | the program
function   | the function
subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Diagram: host and device states over the lifetime of a data region opened with copyin(a,c) create(b) copyout(d)]
• Entering the region, the host holds a=10, b=-4, c=42, d=0; the device receives a=10 and c=42 (copyin), while b and d are allocated without transfer (create, copyout).
• The device computes b=26 and d=158; update self(b) copies b=26 back to the host, where d is still 0.
• Exiting the region, only d=158 is copied back to the host (copyout); the later device values (a=42, b=42) are discarded.


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: three time lines — fully synchronous (Kernel 1, data transfer and Kernel 2 run one after the other); Kernel 1 and transfer async (Kernel 1 overlaps the data transfer); everything async (Kernel 1, the data transfer and Kernel 2 all overlap)]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Diagram: Kernel 1, data transfer and Kernel 2 time lines — Kernel 2 waits for the transfer]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target flag (-ta=tesla:cc60,autocompare).
• Example of a code that has a race condition:

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
      ! ... (listing truncated on the slide)

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
!$ACC end parallel
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)  +oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses with notes)

GPU offloading:
• PARALLEL (GR/WS/VS mode, each gang executes instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfer H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data region:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(to/from:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async() / wait (execute kernels/transfers asynchronously, explicit sync) ↔ nowait / depend (manages asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Parallelism - Memory accesses and coalescing

[Figure: thread-to-element mapping for a 2D Fortran array]
• Without coalescing: gang(i), vector(j) — consecutive vector threads access memory with a stride
• With coalescing: gang(j), vector(i) — consecutive vector threads (T0..T7) access contiguous elements
• gang-vector(i), seq(j)


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. It gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. It gives the compiler the information that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
!$ACC parallel copyout(array)
!$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
!$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
!$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in) :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading: offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime: a data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host; D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses:
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it is considered to be the default of the language (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

exemples/parallel_data.f90

The parallel region carries its own implicit data region.

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary for execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal.

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.


do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 53: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Memory accesses and coalescing

[Diagram, shown in three builds: access patterns of threads T0–T7 over a 2D array. "Without coalescing", with gang(i)/vector(j) or gang-vector(i)/seq(j), consecutive threads access elements of the same row, i.e. strided memory locations. "With coalescing", with gang(j)/vector(i), consecutive threads access consecutive elements of a column, i.e. contiguous memory locations (in Fortran the first index is contiguous), which allows the hardware to coalesce the memory accesses.]

Pierre-François Lavallée, Thibaut Véry 47/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in)  :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77
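The same mechanism can be sketched in C (an illustrative sketch, not taken from the slides; `fill_row` and `fill_matrix` are made-up names). With a compiler that does not understand OpenACC the pragmas are ignored and the code simply runs sequentially:

```c
#include <assert.h>

/* Device routine: its parallelism level (here seq) must be declared
 * so the compiler generates a device version of the function. */
#pragma acc routine seq
void fill_row(int *row, int n, int value)
{
    for (int j = 0; j < n; ++j)
        row[j] = value;
}

/* Each iteration of the offloaded loop calls the device routine. */
void fill_matrix(int n, int a[n][n])
{
    #pragma acc parallel loop copyout(a[0:n][0:n])
    for (int i = 0; i < n; ++i)
        fill_row(a[i], n, 2);
}
```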

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

• Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.
• Global region: an implicit data region is opened during the lifetime of the program; it is managed with the enter data and exit data directives.
• Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive around a code block.
• Data region associated with a programming unit's lifetime: a data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure; to make data visible there, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clauses in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77
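The difference between a structured local region and the unstructured global (enter data / exit data) form can be sketched in C (illustrative; the function names are invented for the example):

```c
#include <assert.h>

/* Structured region: the data's device lifetime is the enclosing block. */
void structured(double *v, int n)
{
    #pragma acc data copy(v[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            v[i] = 2.0 * i;
    }
}

/* Unstructured region: lifetime controlled explicitly, possibly from
 * different functions (useful with C++ constructors/destructors). */
void open_region(double *v, int n)
{
    #pragma acc enter data create(v[0:n])
}

void close_region(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])
}
```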

Data on device - Data clauses

H: Host, D: Device; variables may be scalars or arrays.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may also see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach

Pierre-François Lavallée, Thibaut Véry 50/77
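A minimal C sketch of the three kinds of clauses (the `scale` function is hypothetical; with a non-OpenACC compiler the pragma is ignored and the loop runs on the host):

```c
#include <assert.h>

/* b is only read on the device (copyin); c is only produced there
 * (copyout); scratch is allocated on the device but never
 * transferred in either direction (create). */
void scale(const float *b, float *c, float *scratch, int n, float alpha)
{
    #pragma acc parallel loop copyin(b[0:n]) copyout(c[0:n]) create(scratch[0:n])
    for (int i = 0; i < n; ++i) {
        scratch[i] = alpha * b[i];   /* device-only intermediate */
        c[i]       = scratch[i];
    }
}
```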

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses; you must give the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy out columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets; you must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the base index of the language (C/C++: 0, Fortran: 1)

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables needed by their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

[In the slide, braces mark the parallel construct as opening both a parallel region and its associated data region.]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region for the variables needed by their execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive:

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53/77
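That fix can be sketched in C (illustrative `iterate` function, not from the slides): the data region is opened before the first kernel, so the initialization loop and all 100 update kernels share a single device residency and the only transfer is the final copyout.

```c
#include <assert.h>

/* Hoisted data region: a stays resident on the device across the
 * initialization kernel and all update kernels, so host<->device
 * traffic happens only at the region boundaries. */
void iterate(int *a, int n, int steps)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; ++i)
            a[i] = i;

        for (int l = 0; l < steps; ++l) {
            #pragma acc parallel loop present(a[0:n])
            for (int i = 0; i < n; ++i)
                a[i] += 1;
        }
    }
}
```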

Data on device

[Diagram, shown in four builds: two successive parallel regions (each with its implicit data region) working on four arrays A, B, C, D with the clauses copyin(A), copyout(B), copy(C), create(D).]

• Naive version: each parallel region performs its own transfers and allocations. Transfers: 8, allocations: 2.
• Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region. Transfers: 4, allocation: 1.
• Since the clauses check for data presence, you can use the present clause on the inner regions to make the code clearer. To make sure the updates are done, use the update directive (update device at the first region, update self at the second). Transfers: 6, allocation: 1.

Pierre-François Lavallée, Thibaut Véry 54/77
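A C sketch of that final layout (the `pipeline` function is hypothetical): the outer data region owns the transfers, the inner kernels assert presence with present, and update is used only where the host really needs fresh values.

```c
#include <assert.h>

/* Outer data region owns all transfers; inner kernels only assert
 * presence. update self refreshes the host copy mid-region. */
void pipeline(double *a, double *b, int n)
{
    #pragma acc data copyin(a[0:n]) copyout(b[0:n])
    {
        #pragma acc parallel loop present(a[0:n], b[0:n])
        for (int i = 0; i < n; ++i)
            b[i] = 2.0 * a[i];

        /* The host wants an intermediate look at b: D -> H. */
        #pragma acc update self(b[0:n])

        #pragma acc parallel loop present(b[0:n])
        for (int i = 0; i < n; ++i)
            b[i] += 1.0;
    }
}
```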

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in the device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77
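A sketch of update device in C (the `refresh` function is illustrative, using the unstructured enter data / exit data form): the host initializes b while the region is open, then pushes the values H→D before the kernel uses them.

```c
#include <assert.h>

/* The host writes b inside an open data region; update device
 * pushes the new host values H -> D before the kernel reads them. */
void refresh(int *b, int n)
{
    #pragma acc enter data create(b[0:n])

    for (int i = 0; i < n; ++i)   /* host-side initialization */
        b[i] = i;

    #pragma acc update device(b[0:n])   /* H -> D */

    #pragma acc parallel loop present(b[0:n])
    for (int i = 0; i < n; ++i)
        b[i] += 10;

    #pragma acc exit data copyout(b[0:n])
}
```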

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of the enter data and exit data directives is that the data transfers to the device can be managed at a different place than where the data is used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone: module → scope: the program
Zone: function → scope: the function
Zone: subroutine → scope: the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Diagram: host and device time lines for a region opened with copyin(a,c) create(b) and closed with copyout(d). Host values before entry: a=10, b=-4, c=42, d=0. On entry, a and c are copied H→D, while b and d are allocated on the device without being initialized. The device computes b=26 and d=15.8. After update self(b), the host sees b=26 while its d is still 0. Later host-side writes (a=42, b=42) never reach the device. On exiting the region, copyout(d) brings d=15.8 back to the host.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number inside the clause makes it possible to create several execution threads.

[Diagram: synchronous execution (Kernel 1, then the data transfer, then Kernel 2) versus "Kernel 1 and transfer async" and "Everything is async", where the kernels and the transfer overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait (clause or directive) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition: [shown as a code screenshot in the original slides]

Pierre-François Lavallée, Thibaut Véry 65/77
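The racy code itself is a screenshot in the original slides. As an illustrative stand-in (not the deck's actual example), a sum accumulated without a reduction clause shows exactly the kind of race autocompare is meant to catch:

```c
#include <assert.h>

/* Race: every iteration accumulates into the same scalar. On the
 * GPU the gangs/vectors update sum concurrently, so the result can
 * differ from the sequential CPU run, which autocompare detects.
 * The fix is to add a reduction(+:sum) clause to the loop. */
int racy_sum(const int *v, int n)
{
    int sum = 0;
    #pragma acc parallel loop /* missing: reduction(+:sum) */
    for (int i = 0; i < n; ++i)
        sum += v[i];
    return sum;
}
```

Compiled without OpenACC, the pragma is ignored and the loop runs sequentially, giving the correct reference result.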

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive:

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ...

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs:

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers:

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
• OpenACC PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ X (does not exist in OpenMP)

Implicit transfer H→D or D→H:
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions:
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 54: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Memory accesses and coalescing

[Figure: access patterns of threads T0..T7 on a 2D Fortran array. Without coalescing (gang(i)/vector(j)), consecutive threads of a vector access elements that are far apart in memory. With coalescing (gang(j)/vector(i), or gang-vector(i)/seq(j)), consecutive threads access contiguous elements of a column, so the accesses of a whole vector are served by a single memory transaction.]

Pierre-François Lavallée, Thibaut Véry 47/77
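The idea in the figure can be sketched in C (illustrative code, not from the slides): with a row-major C array, mapping the vector level to the innermost, contiguous index makes consecutive threads touch adjacent memory. Without an OpenACC compiler the pragmas are simply ignored and the loops run sequentially on the host.

```c
/* Coalesced accesses: the vector loop runs over j, the contiguous index of
 * a row-major C array, so the threads of a vector touch adjacent elements. */
void fill_coalesced(double *a, int n)
{
    #pragma acc parallel loop gang
    for (int i = 0; i < n; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < n; ++j)
            a[i * n + j] = 1.14 * i;   /* a[i][j], contiguous in j */
    }
}
```

Note that the mapping is transposed with respect to the Fortran slides: in Fortran the first index is the contiguous one, in C it is the last.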


Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel
    !$ACC loop
    do i=1,s
        call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
contains
    subroutine fill(array, s, i)
        integer, intent(out) :: array(:,:)
        integer, intent(in)  :: s, i
        integer :: j
        do j=1,s
            array(i,j) = 2
        enddo
    end subroutine fill
end program routine

exemples/routine_wrong.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
    integer :: s = 10000
    integer, allocatable :: array(:,:)
    allocate(array(s,s))
    !$ACC parallel copyout(array)
    !$ACC loop
    do i=1,s
        call fill(array(:,:), s, i)
    enddo
    !$ACC end parallel
    print *, array(1,1:10)
contains
    subroutine fill(array, s, i)
        !$ACC routine seq
        integer, intent(out) :: array(:,:)
        integer, intent(in)  :: s, i
        integer :: j
        do j=1,s
            array(i,j) = 2
        enddo
    end subroutine fill
end program routine

exemples/routine.f90

Pierre-François Lavallée, Thibaut Véry 48/77

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions:

Computation offloading: offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region: an implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions: to open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated to programming unit lifetime: a data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes: the actions taken for the data inside these regions depend on the clause in which they appear.

Pierre-François Lavallée, Thibaut Véry 49/77
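A minimal C sketch of the "local region" case above (function and variable names are illustrative, not from the slides); compiled without OpenACC, the pragmas are ignored and the loop runs on the host:

```c
/* Local data region: v[0:n] is copied to the device when entering the
 * braces and copied back to the host when leaving them; the kernel inside
 * only asserts its presence. */
void scale(double *v, int n)
{
    #pragma acc data copy(v[0:n])
    {
        #pragma acc parallel loop present(v[0:n])
        for (int i = 0; i < n; ++i)
            v[i] *= 2.0;
    }
}
```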

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array.

Data movement:
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement:
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. It is possible to see clauses prefixed with present_or_ or p, for OpenACC 2.0 compatibility.

Other clauses: no_create, deviceptr, attach, detach.

Pierre-François Lavallée, Thibaut Véry 50/77
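A small C sketch of these clauses (illustrative, not from the slides): x and y are only read by the kernel, so copyin is enough, and the scalar result is handled with a reduction rather than a copy clause.

```c
/* copyin: x and y are copied H->D on entry and freed on exit, with no
 * copy back; the scalar s is combined across threads by the reduction. */
double dot(const double *x, const double *y, int n)
{
    double s = 0.0;
    #pragma acc parallel loop copyin(x[0:n], y[0:n]) reduction(+:s)
    for (int i = 0; i < n; ++i)
        s += x[i] * y[i];
    return s;
}
```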

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran: the array shape is specified in parentheses. You must give the first and last index.

! We copy a 2D array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
    do j=1,1000
        a(i,j) = 0
    enddo
enddo
! We copy columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
    do j=100,199
        a(i,j) = 42
    enddo
enddo

exemples/arrayshape.f90

C/C++: the array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

exemples/arrayshape.c

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes:
• The shape must be specified when using a slice.
• If the first index is omitted, it is taken as the language default (C/C++: 0, Fortran: 1).

Pierre-François Lavallée, Thibaut Véry 51/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1,1000
        a(i) = i
    enddo
    !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the parallel region is enclosed in its own implicit data region.]

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
    integer :: a(1000)
    integer :: i, l

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1,1000
        a(i) = i
    enddo
    !$ACC end parallel

    do l=1,100
        !$ACC parallel copy(a(1:1000))
        !$ACC loop
        do i=1,1000
            a(i) = a(i) + 1
        enddo
        !$ACC end parallel
    enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52/77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
    a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
    !$ACC parallel
    !$ACC loop
    do i=1,1000
        a(i) = a(i) + 1
    enddo
    !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal!

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device - Local data regions

(Same code as the previous slide.)

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop, otherwise a is still transferred once more by the copyout of the first parallel region.

Pierre-François Lavallée, Thibaut Véry 53/77

Data on device

[Diagram: two successive parallel regions (each with its implicit data region) using four arrays: A (copyin), B (copyout), C (copy), D (create).]

Naive version: each parallel region transfers its own data.
• Transfers: 8
• Allocations: 2

Optimize data transfers: let's add a data region enclosing both parallel regions. A, B and C are now transferred at entry and exit of the data region only.
• Transfers: 4
• Allocations: 1

Since the clauses check for data presence, you can use the present clause inside the parallel regions to make the code clearer. To make sure the updates are done, use the update directive (update device for A, update self for C).
• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
    a(i) = 0
enddo
do j=1,42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1,1000
        a(i) = a(i) + rng
    enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not up to date on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
    a(i) = 0
enddo
do j=1,42
    call random_number(test)
    rng = floor(test*100)
    !$ACC parallel loop copyin(rng) &
    !$ACC& copyout(a)
    do i=1,1000
        a(i) = a(i) + rng
    enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is up to date on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77
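A C sketch of update device (illustrative, not from the slides): a scalar modified on the host inside a data region has to be pushed H→D before each kernel that reads it.

```c
/* The scalar off lives on the device for the whole data region; each time
 * the host changes it, update device pushes the new value H->D before the
 * kernel runs. Without OpenACC the pragmas are ignored and the result is
 * the same, computed sequentially. */
void add_offsets(double *a, int n, int iters)
{
    double off = 0.0;
    #pragma acc data copy(a[0:n]) copyin(off)
    {
        for (int k = 0; k < iters; ++k) {
            off = (double)k;                /* modified on the host... */
            #pragma acc update device(off)  /* ...so push it H->D */
            #pragma acc parallel loop present(a[0:n], off)
            for (int i = 0; i < n; ++i)
                a[i] += off;
        }
    }
}
```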

Data on device - Global data regions: enter data/exit data

Important notes: one benefit of the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
    integer :: s = 10000
    real*8, allocatable, dimension(:) :: vec
    allocate(vec(s))
    !$ACC enter data create(vec(1:s))
    call compute
    !$ACC exit data delete(vec(1:s))
contains
    subroutine compute
        integer :: i
        !$ACC parallel loop
        do i=1,s
            vec(i) = 1.2
        enddo
    end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77
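The same pattern can be sketched in C (function names are illustrative): enter data/exit data let the device mapping live in the allocation and deallocation routines, away from the compute code, which only asserts presence.

```c
#include <stdlib.h>

/* Unstructured data lifetime: the device copy is created with the host
 * allocation and deleted with the host free. */
double *make_vec(int n)
{
    double *vec = malloc(n * sizeof *vec);
    #pragma acc enter data create(vec[0:n])
    return vec;
}

void free_vec(double *vec, int n)
{
    #pragma acc exit data delete(vec[0:n])
    free(vec);
}

void fill_vec(double *vec, int n)
{
    #pragma acc parallel loop present(vec[0:n])
    for (int i = 0; i < n; ++i)
        vec[i] = 1.2;
}
```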

Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
    integer :: rows = 1000
    integer :: cols = 1000
    integer :: generations = 100
    !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Diagram: time line of host and device copies for a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are transferred H→D and b is allocated on the device. Kernels then modify the device copies while the host copies keep their old values. An update self(b) copies the current device value of b back to the host. When the region is left, only d is copied D→H (copyout); the other host copies are left unchanged.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator, however, is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Diagram: synchronously, kernel 1, the data transfer and kernel 2 run one after the other; with async on kernel 1 and the transfer, those two overlap; with async on all three, kernel 1, the transfer and kernel 2 all run concurrently.]

! b is initialized on host
do i=1,s
    b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
    a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
    c(i) = 1
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Diagram: kernel 2 waits for the transfer on execution thread 2 before starting.]

! b is initialized on host
do i=1,s
    b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
    a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
    b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77
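The same async/wait pattern in C (illustrative, not from the slides): two kernels on different execution threads, where the second depends on the first. Without an OpenACC compiler the pragmas are ignored and everything runs sequentially, which gives the same result.

```c
/* Stream 1 fills a; stream 2 must see the finished a, so it carries
 * wait(1). The final bare wait synchronizes the host before b is copied
 * back at the end of the data region. */
void pipeline(double *a, double *b, int n)
{
    #pragma acc data create(a[0:n]) copyout(b[0:n])
    {
        #pragma acc parallel loop async(1)
        for (int i = 0; i < n; ++i)
            a[i] = 42.0;
        #pragma acc parallel loop async(2) wait(1)
        for (int i = 0; i < n; ++i)
            b[i] = a[i] + 1.0;
        #pragma acc wait
    }
}
```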

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation target (-ta=tesla:cc60,autocompare).
• Example: code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65/77
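The slide's actual example is not reproduced in this transcript; here is an illustrative C kernel with the kind of race that autocompare is designed to catch. On the GPU the final value depends on which gang writes last, while the sequential CPU run is deterministic, so the two executions disagree.

```c
/* Every iteration writes the same element out[0]: a write race on the GPU.
 * Run sequentially (pragmas ignored without OpenACC), the last iteration
 * simply wins. */
void racy(const double *in, double *out, int n)
{
    #pragma acc parallel loop copyin(in[0:n]) copyout(out[0:1])
    for (int i = 0; i < n; ++i)
        out[0] = in[i];          /* all gangs race on the same element */
}
```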


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
    cells = 0
    !$ACC parallel
    do r=1,rows
        do c=1,cols
            oworld(r,c) = world(r,c)
        enddo
    enddo
    !$ACC end parallel
    !$ACC parallel
    do r=1,rows
        do c=1,cols
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c)                   + oworld(r+1,c)   + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
            else if (neigh == 3) then
                world(r,c) = 1
            end if
        enddo
    enddo
    !$ACC end parallel
    !$ACC parallel
    do r=1,rows
        do c=1,cols
            ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
    cells = 0
    !$ACC parallel loop
    do r=1,rows
        do c=1,cols
            oworld(r,c) = world(r,c)
        enddo
    enddo
    !$ACC parallel loop
    do r=1,rows
        do c=1,cols
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c)                   + oworld(r+1,c)   + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
            else if (neigh == 3) then
                world(r,c) = 1
            end if
        enddo
    enddo
    !$ACC parallel loop reduction(+:cells)
    do r=1,rows
        do c=1,cols
            cells = cells + world(r,c)
        enddo
    enddo
    !$ACC end parallel
    print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
    cells = 0
    !$ACC parallel loop
    do c=1,cols
        do r=1,rows
            oworld(r,c) = world(r,c)
        enddo
    enddo
    !$ACC parallel loop
    do c=1,cols
        do r=1,rows
            neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                    oworld(r-1,c)                   + oworld(r+1,c)   + &
                    oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
            if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
            else if (neigh == 3) then
                world(r,c) = 1
            end if
        enddo
    enddo
    !$ACC parallel loop reduction(+:cells)
    do c=1,cols
        do r=1,rows
            cells = cells + world(r,c)
        enddo
    enddo
    print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77


Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

GPU offloading:
• PARALLEL: GRWSVS mode, each team executes the instructions redundantly ↔ TEAMS: create teams of threads, execute redundantly
• SERIAL: only 1 thread executes the code ↔ TARGET: GPU offloading, initial thread only
• KERNELS: offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfer H→D or D→H: scalar variables, non-scalar variables (both models).

Explicit transfer H→D or D→H inside explicit parallel regions: clauses on PARALLEL/SERIAL/KERNELS (OpenACC) ↔ clauses on TARGET (OpenMP); data movement is defined from the host perspective.
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on device, no transfers ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

Data region:
• declare(): add data to the global data region
• data/end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:)/END TARGET DATA: inside the same programming unit
• enter data/exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA/TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
• update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization:
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72/77

Features | OpenACC 2.7 (directives/clauses, notes) | OpenMP 5.0 (directives/clauses, notes)

Routines:
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism:
• async()/wait: execute kernels/transfers asynchronously, with explicit sync ↔ nowait/depend: manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
    A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
    !$acc loop vector
    do i=1,nx
        A(i,j) = 1.14_8
    end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
    A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
    !$OMP PARALLEL DO SIMD
    do i=1,nx
        A(i,j) = 1.14_8
    end do
    !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
    !$acc loop worker
    do j=1,nx
        !$acc loop vector
        do i=1,nx
            A(i,j,k) = 1.14_8
        end do
    end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
    !$OMP PARALLEL DO
    do j=1,nx
        !$OMP SIMD
        do i=1,nx
            A(i,j,k) = 1.14_8
        end do
        !$OMP END SIMD
    end do
    !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
    • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 55: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Memory accesses and coalescing

Without coalescing

gang(i)vector(j)T0

T0

X

X

X

X

T1

T1

X

X

X

X

T2

T2

X

X

X

X

T3

T3

X

X

X

X

T4

T4

X

X

X

X

T5

T5

X

X

X

X

T6

T6

X

X

X

X

T7

T7

X

X

X

X

With coalescing

gang(j)vector(i)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T1

T2

T3

T4

T5

T6

T7

T0

T1

T2

T3

T4

T5

T6

T7

gang-vector(i)seq(j)X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

T0

T0

T1

T1

T2

T2

T3

T3

T4

T4

T5

T5

T6

T6

T7

T7

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

X

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 47 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Without !$ACC routine:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine_wrong.f90

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This tells the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
   integer :: s = 10000
   integer, allocatable :: array(:,:)
   allocate(array(s,s))
   !$ACC parallel copyout(array)
   !$ACC loop
   do i=1,s
      call fill(array(:,:), s, i)
   enddo
   !$ACC end parallel
   print *, array(1,1:10)
contains
   subroutine fill(array, s, i)
      !$ACC routine seq
      integer, intent(out) :: array(:,:)
      integer, intent(in)  :: s, i
      integer :: j
      do j=1,s
         array(i,j) = 2
      enddo
   end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive inside a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called. It is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is to be specified in parentheses. You must specify the first and last index.

! We copy a 2d array on the GPU for matrix a. In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is to be specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to that of the language (C/C++: 0, Fortran: 1)


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Figure: the data region opened by the parallel construct encloses the parallel region itself]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?

Data on device - Local data regions

Optimal? No: you have to move the beginning of the data region before the first loop, so that the initialization loop is covered by it as well.

Data on device

Naive version

[Figure: two successive parallel regions, each with its own implicit data region using copyin(A), copyout(B), copy(C) and create(D), so every array is transferred or allocated twice]

• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers

[Figure: the two parallel regions still carry copyin(A), copyout(B), copy(C) and create(D) independently]

Let's add a data region!

Data on device

Optimize data transfers

[Figure: a data region now encloses the two parallel regions and takes over the clauses copyin(A), copyout(B), copy(C) and create(D)]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Data on device

Optimize data transfers

[Figure: inside the data region, the two parallel regions now only assert presence of A, B, C and D with present clauses; explicit update device and update self directives synchronize the copies where the host touches the data in between]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocation: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update
The a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.


Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. This is especially useful when using C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: host/device time line for a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D while b and d are only allocated on the device; kernels then update the values on the device while the host copies keep their old values; update self(b) copies b back D→H in the middle of the region; on exit of the region, only d is copied back to the host (copyout(d)).]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: execution time lines. Fully synchronous: Kernel 1, then the data transfer, then Kernel 2. "Kernel 1 and transfer async": Kernel 1 and the transfer overlap. "Everything is async": Kernel 1, the transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  ) +                 oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 (directives/clauses, notes) ↔ OpenMP 5.0 (directives/clauses, notes)

GPU offloading
• PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads, execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ X (does not exist in OpenMP)

Implicit transfer H→D or D→H
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Features: OpenACC 2.7 (directives/clauses, notes) ↔ OpenMP 5.0 (directives/clauses, notes)

Data region
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ X (does not exist in OpenMP)
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Features: OpenACC 2.7 (directives/clauses, notes) ↔ OpenMP 5.0 (directives/clauses, notes)

Routines
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• async() / wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait / depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 56: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Without ACC routine

program r o u t i n ei n t e g e r s = 10000

3 in teger a l l o c a t a b l e a r ray ( )a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l$ACC loopdo i =1 s

8 c a l l f i l l ( a r ray ( ) s i )enddo$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta ins

13 subrou t ine f i l l ( array s i )i n teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

18 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutine_wrongf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine it is necessary to declare it asroutine It gives the information to the compiler that a device version of the functionsubroutine hasto be generated It is mandatory to set the parallelism level inside the function (seq gang workervector)Correct version

program r o u t i n ei n t e g e r s = 10000in teger a l l o c a t a b l e a r ray ( )

4 a l l o c a t e ( ar ray ( s s ) )$ACC p a r a l l e l copyout ( a r ray )$ACC loopdo i =1 s

c a l l f i l l ( a r ray ( ) s i )9 enddo

$ACC end p a r a l l e lp r i n t lowast a r ray (1 10)conta inssubrou t ine f i l l ( array s i )

14 $ACC r o u t i n e seqin teger i n t e n t ( out ) a r ray ( )i n teger i n t e n t ( i n ) s ii n t e g e r jdo j =1 s

19 ar ray ( i j ) = 2enddo

end subrou t ine f i l lend program r o u t i n e

exemplesroutinef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 48 77

Data on device - Opening a data region

There are several ways of making data visible on devices by opening different kinds of dataregions

Computation offloadingOffloading regions are associated with a localdata regionbull serialbull parallelbull kernels

Global regionAn implicit data region is opened during thelifetime of the program The management of thisregion is done with enter data and exit datadirectives

Local regionsTo open a data region inside a programming unit(function subroutine) use data directive inside acode block

Data region associated toprogramming unit lifetimeA data region is created when a procedure iscalled (function or subroutine) It is availableduring the lifetime of the procedure To makedata visible use declare directive

NotesThe actions taken for the data inside these regions depend on the clause in which they appear

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 49 77

Data on device - Data clauses

H Host D Device variable scalar or array

Data movementbull copyin The variable is copied HrarrD the

memory is allocated when entering theregion

bull copyout The variable is copied DrarrH thememory is allocated when entering theregion

bull copy copyin + copyout

No data movementbull create The memory is allocated when

entering the regionbull present The variable is already on the

devicebull delete Frees the memory allocated on the

device for this variable

By default the clauses check if the variable is already on the device If so no action is takenIt is possible to see clauses prefixed with present_or_ or p for OpenACC 20 compatibility

Other clausesbull no_createbull deviceptr

bull attachbull detach

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 50 77

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++

FortranThe array shape is to be specified inparentheses You must specify the first and lastindex

5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000

do j =1 100010 a ( i j ) = 0

enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )

15do i =1 1000do j =100199

a ( i j ) = 42enddo

enddo

exemplesarrayshapef90

CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements

5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )

f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0

Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )

f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42

exemplesarrayshapec

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified

Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0

Fortran1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)i n t e g e r i l

5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = i10 enddo

$ACC end p a r a l l e lend program para

exemplesparallel_dataf90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No — you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions, each with its own implicit data region. In each region, A is copyin, B is copyout, C is copy and D is create.
• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: the two parallel regions still carry the same clauses (A copyin, B copyout, C copy, D create). Let's add a data region enclosing both.

Data on device

Optimize data transfers: a data region now encloses both parallel regions and carries the clauses (A copyin, B copyout, C copy, D create). A, B and C are now transferred only at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers: inside the data region, the data clauses on the parallel regions behave as presence checks, so to make the code clearer you can use the present clause. To make sure the updates are actually done, use the update directive (update device for A before the first region, update self for C after the second).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self:", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self:", a(42)
!$ACC update self(a(42:42))
print *, "after update self:", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data:", a(42)
!$ACC end data
print *, "after end data:", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers to the device at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s=10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 12
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone       | Scope
Module     | the program
Function   | the function
Subroutine | the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Timeline diagram: the host initializes a=10, b=-4, c=42, d=0. Entering the data region with copyin(a,c), create(b), copyout(d), a and c are copied H→D while b and d are merely allocated on the device. The device computes b=26 and d=158; update self(b) copies only b back D→H (host d is still 0). At exit of the data region, only d is copied back to the host (copyout(d)).]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional; specifying a number inside the clause makes it possible to create several execution threads.

[Diagram: synchronous execution runs Kernel 1, the data transfer and Kernel 2 back to back; with async on Kernel 1 and the transfer, they overlap; with everything async, all three run concurrently.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 1, the data transfer and Kernel 2 run asynchronously, but Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel

  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo

  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo

  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  !$ACC end parallel
  print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo

  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
              oworld(r-1,c)                +oworld(r+1,c)+&
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo

  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 → OpenMP 5.0

GPU offloading
• PARALLEL (GR/WS/VS mode: each gang executes instructions redundantly) → TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) → TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) → no OpenMP equivalent

Implicit transfers H→D or D→H
• scalar variables, non-scalar variables → scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (clauses on PARALLEL/SERIAL/KERNELS → clauses on TARGET; data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) → map(to:)
• copyout() (copy D→H when leaving the construct) → map(from:)
• copy() (copy H→D when entering and D→H when leaving) → map(tofrom:)
• create() (allocated on device, no transfers) → map(alloc:)
• present(), delete()

Features: OpenACC 2.7 → OpenMP 5.0

Data regions
• declare (adds data to the global data region)
• data / end data (inside the same programming unit) → TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) → TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) → UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) → UPDATE TO (copy H→D inside a data region)

Loop parallelization
• kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of loops) → no OpenMP equivalent
• loop gang/worker/vector (shares iterations among gangs, workers and vectors) → distribute (shares iterations among TEAMS)

Features: OpenACC 2.7 → OpenMP 5.0

Routines
• routine [seq|vector|worker|gang] (compiles a device routine with the specified level of parallelism)

Asynchronism
• async() / wait (execute kernels and transfers asynchronously, explicit synchronization) → nowait / depend (manage asynchronism and task dependencies)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j)=1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j)=1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k)=1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Parallelism - Routines

If an offloaded region contains a call to a function or a subroutine, it is necessary to declare it as a routine. This informs the compiler that a device version of the function/subroutine has to be generated. It is mandatory to set the parallelism level inside the function (seq, gang, worker, vector).

Correct version:

program routine
  integer :: s = 10000
  integer, allocatable :: array(:,:)
  allocate(array(s,s))
  !$ACC parallel copyout(array)
  !$ACC loop
  do i=1,s
    call fill(array(:,:), s, i)
  enddo
  !$ACC end parallel
  print *, array(1,1:10)
contains
  subroutine fill(array, s, i)
    !$ACC routine seq
    integer, intent(out) :: array(:,:)
    integer, intent(in) :: s, i
    integer :: j
    do j=1,s
      array(i,j) = 2
    enddo
  end subroutine fill
end program routine

exemples/routine.f90

Data on device - Opening a data region

There are several ways of making data visible on the device, by opening different kinds of data regions.

Computation offloading
Offloading regions (serial, parallel, kernels) are associated with a local data region.

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called, and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clauses in which they appear.

Data on device - Data clauses

H: Host, D: Device, variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default, the clauses check if the variable is already on the device. If so, no action is taken. It is possible to see clauses prefixed with present_or_ or p for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must specify the first and last index.

! We copy a 2D array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
  do j=1,1000
    a(i,j) = 0
  enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
  do j=100,199
    a(i,j) = 42
  enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must specify the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
  for (int j=0; j<1000; ++j)
    a[i][j] = 0;
// Copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
  for (int j=99; j<199; ++j)
    a[i][j] = 42;

exemples/arrayshape.c

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified.
• In C/C++, the number of elements of a dynamically allocated array must be specified.

Notes
• The shape must be specified when using a slice.
• If the first index is omitted, it defaults to the language's base index (C/C++: 0, Fortran: 1).

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)i n t e g e r i l

5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = i10 enddo

$ACC end p a r a l l e lend program para

exemplesparallel_dataf90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Timeline diagram: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

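The slide's pattern can be sketched in C with hypothetical names of ours (under a non-OpenACC compiler the pragmas are ignored and the operations simply run in order, so the result is the same as the synchronized GPU execution):

```c
/* Queue 1 fills a, queue 2 uploads b, and the kernel that reads b
   waits on queue 2 before running. */
void async_wait_demo(float *a, float *b, int n) {
    #pragma acc enter data create(a[0:n]) copyin(b[0:n])

    #pragma acc parallel loop async(1)      /* independent of b */
    for (int i = 0; i < n; ++i)
        a[i] = 42.0f;

    #pragma acc update device(b[0:n]) async(2)

    #pragma acc parallel loop wait(2)       /* needs the fresh b */
    for (int i = 0; i < n; ++i)
        b[i] = b[i] + 11.0f;

    #pragma acc wait                        /* drain all queues */
    #pragma acc exit data copyout(a[0:n], b[0:n])
}
```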


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped


• To activate this feature, just add autocompare to the compilation target options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition


Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         ...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

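The reduction used in the cell-count loop can be sketched in C (our own helper, not one of the course files; under a non-OpenACC compiler the pragma is ignored and the loop runs sequentially with the same result):

```c
/* Count live cells with a gang-parallel reduction.
   'world' is a flattened rows x cols grid of 0/1 values. */
int count_alive(const int *world, int rows, int cols) {
    int cells = 0;
    #pragma acc parallel loop reduction(+:cells) copyin(world[0:rows*cols])
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            cells += world[r * cols + c];
    return cells;
}
```

Each gang accumulates a private copy of cells; the runtime combines them with + at the end of the region.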

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

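The same transfer optimization applies to any iterative kernel: open one data region around the whole iteration loop instead of paying the implicit transfers of each compute construct. A C sketch with hypothetical names (the pragmas are ignored by a non-OpenACC compiler, so the numerical result is unchanged):

```c
/* One data region around all iterations: 'a' is transferred twice in
   total instead of twice per iteration; 'tmp' never leaves the device. */
void smooth(float *a, float *tmp, int n, int iters) {
    #pragma acc data copy(a[0:n]) create(tmp[0:n])
    for (int it = 0; it < iters; ++it) {
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; ++i)
            tmp[i] = 0.5f * (a[i - 1] + a[i + 1]);
        #pragma acc parallel loop
        for (int i = 1; i < n - 1; ++i)
            a[i] = tmp[i];
    }
}
```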


OpenACC 2.7 ↔ OpenMP 5.0 translation table

GPU offloading
• OpenACC PARALLEL: GR/WS/VS mode, each gang executes the instructions redundantly ↔ OpenMP TEAMS: creates teams of threads that execute redundantly
• OpenACC SERIAL: only 1 thread executes the code ↔ OpenMP TARGET: GPU offloading, initial thread only
• OpenACC KERNELS: offloading + automatic parallelization ↔ does not exist in OpenMP

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables ↔ OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (data movement defined from the host perspective)
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS ↔ OpenMP: clauses on TARGET
• copyin(): copy H→D when entering the construct ↔ map(to:)
• copyout(): copy D→H when leaving the construct ↔ map(from:)
• copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
• create(): created on device, no transfers ↔ map(alloc:)
• present(), delete()

OpenACC 2.7 ↔ OpenMP 5.0 translation table (continued)

Data region
• declare(): add data to the global data region
• data / end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
• update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ does not exist in OpenMP
• loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS

OpenACC 2.7 ↔ OpenMP 5.0 translation table (continued)

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait/depend: manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90
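In C, the same three-level nest can also be mapped in a single directive with the collapse clause (a sketch of ours, not one of the course files; collapse(3) merges the three loops into one iteration space before distributing it):

```c
/* Fill a flattened nx*nx*nx array; collapse(3) exposes all
   nx^3 iterations to the scheduler at once. */
void fill3d(double *A, int nx) {
    #pragma acc parallel loop collapse(3)
    for (int k = 0; k < nx; ++k)
        for (int j = 0; j < nx; ++j)
            for (int i = 0; i < nx; ++i)
                A[(k * nx + j) * nx + i] = 1.14;
}
```

Collapsing is an alternative to the explicit gang/worker/vector mapping shown above when the loop bounds are rectangular.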


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


Data on device - Opening a data region

There are several ways of making data visible on devices, by opening different kinds of data regions.

Computation offloading
Offloading regions are associated with a local data region:
• serial
• parallel
• kernels

Global region
An implicit data region is opened during the lifetime of the program. The management of this region is done with the enter data and exit data directives.

Local regions
To open a data region inside a programming unit (function, subroutine), use the data directive around a code block.

Data region associated with programming unit lifetime
A data region is created when a procedure (function or subroutine) is called and is available during the lifetime of the procedure. To make data visible, use the declare directive.

Notes
The actions taken for the data inside these regions depend on the clause in which they appear.


Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check if the variable is already on the device. If so, no action is taken. You may see clauses prefixed with present_or_ (or p) for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach
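A minimal C illustration of the movement clauses (a hypothetical helper of ours; with a non-OpenACC compiler the pragma is ignored and the loop runs on the host with the same result):

```c
/* x is only read by the kernel (copyin);
   y is only written by it (copyout). */
void add_one(const float *x, float *y, int n) {
    #pragma acc parallel loop copyin(x[0:n]) copyout(y[0:n])
    for (int i = 0; i < n; ++i)
        y[i] = x[i] + 1.0f;
}
```

Choosing copyin/copyout over copy avoids one useless transfer in each direction.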


Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copyout columns 100 to 199 included to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

exemples/arrayshape.f90

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
   for (int j=0; j<1000; ++j)
      a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
   for (int j=99; j<199; ++j)
      a[i][j] = 42;

exemples/arrayshape.c


Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, the language default is used (C/C++: 0, Fortran: 1)
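The C/C++ restriction can be illustrated with a small sketch of ours (hypothetical function name; without an OpenACC compiler the pragmas are ignored and the code runs sequentially):

```c
#include <stdlib.h>

/* For a malloc'ed array the compiler cannot know the size:
   the explicit shape [0:n] is mandatory in the data clauses. */
double fill_and_sum(int n) {
    double *a = malloc(n * sizeof *a);
    double s = 0.0;

    #pragma acc parallel loop copyout(a[0:n])
    for (int i = 0; i < n; ++i)
        a[i] = (double)i;

    #pragma acc parallel loop reduction(+:s) copyin(a[0:n])
    for (int i = 0; i < n; ++i)
        s += a[i];

    free(a);
    return s;
}
```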


Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel
end program para

exemples/parallel_data.f90

[Diagram: the parallel construct opens a parallel region enclosed in an implicit data region.]


program para
   integer :: a(1000)
   integer :: i, l

   !$ACC parallel copyout(a(1:1000))
   !$ACC loop
   do i=1,1000
      a(i) = i
   enddo
   !$ACC end parallel

   do l=1,100
      !$ACC parallel copy(a(1:1000))
      !$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
      !$ACC end parallel
   enddo
end program para

exemples/parallel_data_multi.f90

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
   !$ACC parallel
   !$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
   !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?



No! You have to move the beginning of the data region before the first loop.


Data on device

[Diagram: naive version. Two successive parallel regions, each with its own implicit data region; the variables A (copyin), B (copyout), C (copy) and D (create) are handled again by each region.]
• Transfers: 8
• Allocations: 2

Data on device

[Diagram: to optimize the data transfers, the same two parallel regions with clauses copyin (A), copyout (B), copy (C) and create (D).] Let's add a data region enclosing both parallel regions.

Data on device

[Diagram: with the enclosing data region, the clauses are moved to the data directive.] A, B and C are now transferred at entry and exit of the data region only.
• Transfers: 4
• Allocations: 1

Data on device

[Diagram: inside the data region the clauses of the parallel regions only find the data already present.] The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host)
Variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
! !$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update
The a array is not initialized on the host before the end of the data region.

With update (exemples/update_dir.f90), the same code with the update directive active:

print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)

The a array is initialized on the host right after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device
The variables included inside a device clause are updated H→D.

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s=10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90
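A C sketch of the same unstructured lifetime, with hypothetical names of ours (this is the pattern C++ constructors/destructors would use; a non-OpenACC compiler ignores the pragmas and runs everything on the host):

```c
#include <stdlib.h>

/* Allocation site and use site are decoupled. */
static double *vec;
static int s;

void vec_create(int n) {
    s = n;
    vec = malloc(n * sizeof *vec);
    #pragma acc enter data create(vec[0:s])   /* device-side allocation */
}

void vec_compute(void) {
    #pragma acc parallel loop present(vec[0:s])
    for (int i = 0; i < s; ++i)
        vec[i] = 12.0;
}

void vec_destroy(void) {
    #pragma acc update self(vec[0:s])         /* fetch results before delete */
    #pragma acc exit data delete(vec[0:s])    /* caller frees the host array */
}
```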


Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone         Scope
module       the program
function     the function
subroutine   the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Timeline diagram: the host initializes a=0, b=0, c=0, d=0 and then sets a=10, b=-4, c=42, d=0. On entry of a region opened with copyin(a,c) create(b) copyout(d), a and c are copied H→D while b and d are only allocated on the device. During the region, update self(b) copies the device value of b (26) back to the host. On exit, only d (158) is copied back D→H: host changes made after entry (e.g. a=42) are never seen by the device, and device results other than b and d never reach the host.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Page 59: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Data clauses

H: Host, D: Device; variable: scalar or array

Data movement
• copyin: the variable is copied H→D; the memory is allocated when entering the region
• copyout: the variable is copied D→H; the memory is allocated when entering the region
• copy: copyin + copyout

No data movement
• create: the memory is allocated when entering the region
• present: the variable is already on the device
• delete: frees the memory allocated on the device for this variable

By default the clauses check whether the variable is already on the device; if so, no action is taken. You may also see clauses prefixed with present_or_ (or p_), kept for OpenACC 2.0 compatibility.

Other clauses
• no_create
• deviceptr
• attach
• detach

Data on device - Shape of arrays

To transfer an array, it might be necessary to specify its shape. The syntax differs between Fortran and C/C++.

Fortran
The array shape is specified in parentheses. You must give the first and last index.

! We copy a 2d array on the GPU for matrix a.
! In Fortran we can omit the shape: copy(a)
!$ACC parallel loop copy(a(1:1000,1:1000))
do i=1,1000
   do j=1,1000
      a(i,j) = 0
   enddo
enddo
! We copy columns 100 to 199 included back to the host
!$ACC parallel loop copy(a(:,100:199))
do i=1,1000
   do j=100,199
      a(i,j) = 42
   enddo
enddo

(exemples/arrayshape.f90)

C/C++
The array shape is specified in square brackets. You must give the first index and the number of elements.

// We copy the array a by giving the first element and the size of the array
#pragma acc parallel loop copy(a[0:1000][0:1000])
for (int i=0; i<1000; ++i)
    for (int j=0; j<1000; ++j)
        a[i][j] = 0;
// Copy: copy columns 99 to 198
#pragma acc parallel loop copy(a[0:1000][99:100])
for (int i=0; i<1000; ++i)
    for (int j=99; j<199; ++j)
        a[i][j] = 42;

(exemples/arrayshape.c)

Data on device - Shape of arrays

Restrictions
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1)

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel
end program para

(exemples/parallel_data.f90)

[Figure: the parallel construct opens a parallel region and, around it, an associated data region.]

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have a data region associated with the variables necessary to the execution.

program para
   integer :: a(1000)
   integer :: i, l

!$ACC parallel copyout(a(1:1000))
!$ACC loop
   do i=1,1000
      a(i) = i
   enddo
!$ACC end parallel

   do l=1,100
!$ACC parallel copy(a(1:1000))
!$ACC loop
      do i=1,1000
         a(i) = a(i) + 1
      enddo
!$ACC end parallel
   enddo
end program para

(exemples/parallel_data_multi.f90)

Notes
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
   a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
!$ACC parallel
!$ACC loop
   do i=1,1000
      a(i) = a(i) + 1
   enddo
!$ACC end parallel
enddo
!$ACC end data

(exemples/parallel_data_single.f90)

Notes
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.

Data on device

Naive version: two successive parallel regions (each with its own associated data region) use the arrays A (copyin), B (copyout), C (copy) and D (create).
• Transfers: 8
• Allocations: 2


Data on device

Optimize data transfers: let's add a data region enclosing both parallel regions. A (copyin), B (copyout) and C (copy) are now transferred only at entry and exit of the data region, and D (create) is allocated once.
• Transfers: 4
• Allocations: 1

Data on device

Optimize data transfers: the data clauses check for data presence, so to make the code clearer you can use the present clause on the inner parallel regions. To make sure the host and device copies are synchronized where needed, use the update directive (update device for H→D, update self for D→H).
• Transfers: 6
• Allocations: 1

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
! print *, "before update self", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

(exemples/update_dir_comment.f90)

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0
enddo

do j=1,42
   call random_number(test)
   rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

(exemples/update_dir.f90)

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update in a parallel region.

device: the variables listed in a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is that the data transfers can be managed at a different place than where the data is used. It is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

(exemples/enter_data.f90)

Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
module      the program
function    the function
subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

(exemples/declare.f90)

Data on device - Time line

[Figure: time line of host and device copies across a data region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D while b and d are only allocated on the device (undefined content). Device kernels then compute b=26 and d=158 while the host copies keep their old values; an update self(b) copies b back D→H. When the region ends, only d is copied back (copyout); the other device values are discarded.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

(exemples/update.f90)

Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: synchronously, Kernel 1, the data transfer and Kernel 2 run back to back; with async on Kernel 1 and the transfer, those two overlap; with async everywhere, all three operations can overlap.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

(exemples/async_slides.f90)

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case, you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer to complete.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

(exemples/async_wait_slides.f90)

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped
• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of use: a code that has a race condition


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols

(exemples/opti/gol_opti1.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

(exemples/opti/gol_opti2.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )              +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

(exemples/opti/gol_opti3.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features — OpenACC 2.7 / OpenMP 5.0

GPU offloading
• PARALLEL (OpenACC): gang/worker/vector mode; each gang executes the instructions redundantly — TEAMS (OpenMP): creates teams of threads that execute redundantly
• SERIAL (OpenACC): only 1 thread executes the code — TARGET (OpenMP): GPU offloading, initial thread only
• KERNELS (OpenACC): offloading + automatic parallelization — no OpenMP equivalent

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables — OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS — OpenMP: clauses on TARGET; data movement is defined from the host perspective
• copyin(): copy H→D when entering the construct — map(to:)
• copyout(): copy D→H when leaving the construct — map(from:)
• copy(): copy H→D when entering and D→H when leaving — map(tofrom:)
• create(): created on the device, no transfers — map(alloc:)
• present(), delete()

Features — OpenACC 2.7 / OpenMP 5.0

Data regions
• declare: adds data to the global data region
• data / end data: inside the same programming unit — TARGET DATA MAP(...) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code — TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H — UPDATE FROM(): copy D→H inside a data region
• update device(): update H→D — UPDATE TO(): copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops — no OpenMP equivalent
• loop gang/worker/vector: share iterations among gangs, workers and vectors — distribute: share iterations among TEAMS

Features — OpenACC 2.7 / OpenMP 5.0

Routines
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism
• async()/wait: execute kernels/transfers asynchronously, with explicit synchronization — nowait/depend: manage asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop2D.f90)

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1DOMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2DOMP.f90)

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

(exemples/loop3D.f90)

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3DOMP.f90)


Contacts

• Pierre-François Lavallée : pierre-francois.lavallee@idris.fr
• Thibaut Véry : thibaut.very@idris.fr
• Rémi Lacroix : remi.lacroix@idris.fr

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 60: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Shape of arrays

To transfer an array it might be necessary to specify its shape The syntax differs between Fortranand CC++

FortranThe array shape is to be specified inparentheses You must specify the first and lastindex

5 We copy a 2d ar ray on the GPU f o r mat r i x a In f o r t r a n we can omit the shape copy ( a )$ACC p a r a l l e l loop copy ( a (1 1000 1 1000) )do i =1 1000

do j =1 100010 a ( i j ) = 0

enddoenddoWe copyout columns 100 to 199 inc luded to the host$ACC p a r a l l e l loop copy ( a ( 1 0 0 1 9 9 ) )

15do i =1 1000do j =100199

a ( i j ) = 42enddo

enddo

exemplesarrayshapef90

CC++The array shape is to be specified in squarebrackets You must specify the first index and thenumber of elements

5 We copy the ar ray a by g i v i n g f i r s t element and the s ize o f the ar raypragma acc p a r a l l e l loop copy ( a [ 0 1 0 0 0 ] [ 0 1 0 0 0 ] )f o r ( i n t i =0 i lt 1000 ++ i )

f o r ( i n t j =0 j lt1000 ++ j )10 a [ i ] [ j ] = 0

Copy copy columns 99 to 198pragma acc p a r a l l e l loop copy ( a [ 1 0 0 0 ] [ 9 9 1 0 0 ] )f o r ( i n t i =0 i lt1000 ++ i )

f o r ( i n t j =99 j lt199 ++ j )15 a [ i ] [ j ] = 42

exemplesarrayshapec

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Shape of arrays

Restrictionsbull In Fortran the last index of an assumed-size dummy array must be specifiedbull In CC++ the number of elements of a dynamically allocated array must be specified

Notesbull The shape must be specified when using a slicebull If the first index is omitted it is considered as the default of the language (CC++ 0

Fortran1)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 51 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)i n t e g e r i l

5$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = i10 enddo

$ACC end p a r a l l e lend program para

exemplesparallel_dataf90

Data region

Parallel region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notes: in the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

[Time-line figure: the host initializes a=10, b=-4, c=42, d=0. A data region opens with copyin(a,c) create(b) copyout(d); the device kernels then produce b=26 and d=158. update self(b) copies b D→H, so the host sees a=10, b=26, c=42 while d is still 0 there. When the data region ends, copyout(d) copies d=158 back to the host.]


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90


Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is however able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. The argument of async is optional in all cases. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: three time lines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. "Kernel 1 and transfer async": Kernel 1 overlaps with the data transfer, then Kernel 2 runs. "Everything is async": Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on the host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1, s
  c(i) = 11
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of the wait clause (or of the wait directive) is a list of integers representing the execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

[Diagram: Kernel 1 and the data transfer run asynchronously; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on the host
do i=1, s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1, s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1, s
  b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped


Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)

• Example of code that has a race condition:


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
  cells = 0
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1, rows
    do c=1, cols
      ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
  cells = 0
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1, rows
    do c=1, cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1, rows
    do c=1, cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
  cells = 0
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1, cols
    do r=1, rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)+oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1, cols
    do r=1, rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

GPU offloading:
• PARALLEL (GR/WS/VS mode, each gang executes the instructions redundantly) ↔ TEAMS (create teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions (OpenACC: clauses on PARALLEL/SERIAL/KERNELS; OpenMP: clauses on TARGET, data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

Data region:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)


Features: OpenACC 2.7 ↔ OpenMP 5.0 (directives/clauses and notes)

Routines:
• routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism:
• async()/wait (execute kernels/transfers asynchronously, with explicit sync) ↔ nowait/depend (manage asynchronism and task dependencies)


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
  !$acc loop vector
  do i=1, nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
  !$OMP PARALLEL DO SIMD
  do i=1, nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
  !$acc loop worker
  do j=1, nx
    !$acc loop vector
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
  !$OMP PARALLEL DO
  do j=1, nx
    !$OMP SIMD
    do i=1, nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts


Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex Optimization example
• Annex OpenACC 2.7/OpenMP 5.0 translation table
• Contacts

Data on device - Shape of arrays

Restrictions:
• In Fortran, the last index of an assumed-size dummy array must be specified
• In C/C++, the number of elements of a dynamically allocated array must be specified

Notes:
• The shape must be specified when using a slice
• If the first index is omitted, it defaults to the language convention (C/C++: 0, Fortran: 1)



Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region with the variables necessary for execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1, 1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1, 100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1, 1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times
• Data region reached 100 times (enter and exit, so potentially 200 data transfers)

Not optimal!


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1, 1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1, 100
  !$ACC parallel
  !$ACC loop
  do i=1, 1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal?


Optimal? No! You have to move the beginning of the data region to before the first loop.


Data on device

Naive version: two successive parallel regions, each with its own implicit data region. The arrays A (copyin), B (copyout), C (copy) and D (create) are handled at the entry and exit of each parallel region.

• Transfers: 8
• Allocations: 2


Data on device

Optimize the data transfers: the same two parallel regions, each still with its implicit data region handling A (copyin), B (copyout), C (copy) and D (create). Let's add a data region.


Data on device

Optimize the data transfers: a data region now encloses the two parallel regions. A, B and C are transferred at the entry and exit of the data region.

• Transfers: 4
• Allocation: 1


Data on device

Optimize the data transfers: inside the enclosing data region, the data clauses of the parallel regions only check for data presence. To make the code clearer you can use the present clause, and to make sure the updates are done, use the update directive (update device for H→D, update self for D→H).

• Transfers: 6
• Allocation: 1


Data on device - update self

Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within self are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1, 1000
  a(i) = 0
enddo
do j=1, 42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC& copyout(a)
  do i=1, 1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!!$ACC update self(a(42:42))   ! update directive disabled in this version
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region containing the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel
end program para

exemples/parallel_data.f90

(The slide highlights that the parallel region implicitly opens a data region around it.)

Pierre-François Lavallée, Thibaut Véry 52 / 77

Data on device - Parallel regions

Compute constructs (serial, parallel, kernels) have an associated data region containing the variables necessary for the execution.

program para
  integer :: a(1000)
  integer :: i, l

  !$ACC parallel copyout(a(1:1000))
  !$ACC loop
  do i=1,1000
    a(i) = i
  enddo
  !$ACC end parallel

  do l=1,100
    !$ACC parallel copy(a(1:1000))
    !$ACC loop
    do i=1,1000
      a(i) = a(i) + 1
    enddo
    !$ACC end parallel
  enddo
end program para

exemples/parallel_data_multi.f90

Notes:
• Compute region reached 100 times.
• Data region reached 100 times (enter and exit, so potentially 200 data transfers).

Not optimal!

Pierre-François Lavallée, Thibaut Véry 52 / 77


Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables listed in the clauses visible on the device. You have to use the data directive.

!$ACC parallel copyout(a(1:1000))
!$ACC loop
do i=1,1000
  a(i) = i
enddo
!$ACC end parallel

!$ACC data copy(a(1:1000))
do l=1,100
  !$ACC parallel
  !$ACC loop
  do i=1,1000
    a(i) = a(i) + 1
  enddo
  !$ACC end parallel
enddo
!$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times.
• Data region reached 1 time (enter and exit, so potentially 2 data transfers).

Optimal? No! You have to move the beginning of the data region before the first loop.

Pierre-François Lavallée, Thibaut Véry 53 / 77
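The fully optimal layout suggested above can be sketched as follows, transcribed here to C for a quick host-side check (the function name and the C translation are ours, not from the course; when compiled without OpenACC support, the pragmas are simply ignored and the loops run sequentially on the host with the same results):

```c
#define N 1000

/* Sketch: the data region opens BEFORE the first kernel, so 'a' is
   allocated on the device once and copied back once at the closing
   brace of the region ("end data" in the Fortran version). */
void init_and_iterate(int a[N])
{
    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = i + 1;              /* first kernel: a(i) = i */

        for (int l = 0; l < 100; ++l) {
            #pragma acc parallel loop  /* 'a' is already present: no copy */
            for (int i = 0; i < N; ++i)
                a[i] += 1;
        }
    } /* single D->H transfer of 'a' happens here */
}
```

Compared with exemples/parallel_data_single.f90, the only change is that the first parallel region now sits inside the data region, which removes its separate copyout transfer.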

Data on device - Optimizing transfers

(The original slides show a diagram of four arrays A, B, C and D used by two successive parallel regions, each array with its transfer clause: copyin, copyout, copy and create respectively.)

• Naive version: each parallel region carries its own data clauses, so A, B, C and D are handled once per region: 8 transfers, 2 allocations.

• Let's add a data region enclosing the two parallel regions: A, B and C are now transferred at entry and exit of the data region only: 4 transfers, 1 allocation.

• The data clauses check for data presence, so to make the code clearer you can use the present clause inside the parallel regions. To make sure the updates are actually done, use the update directive (update device before the second region, update self after it): 6 transfers, 1 allocation.

Pierre-François Lavallée, Thibaut Véry 54 / 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC&              copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
! print *, "before update self: ", a(42)
! !$ACC update self(a(42:42))
! print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55 / 77

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0
enddo

do j=1,42
  call random_number(test)
  rng = floor(test*100)
  !$ACC parallel loop copyin(rng) &
  !$ACC&              copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55 / 77
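The effect of update self can be condensed into a small C sketch (our own illustration, not from the course; the random step is replaced by a fixed increment so the result is checkable, and with the pragmas ignored host and device memory coincide, so the function still returns the computed value):

```c
#define N 1000

/* Sketch: inside a data region the host copy of 'a' is stale until an
   explicit "update self" brings the device values back. */
int fill_and_read(int a[N])
{
    int seen;
    #pragma acc data copyout(a[0:N])
    {
        #pragma acc parallel loop
        for (int i = 0; i < N; ++i)
            a[i] = 0;

        for (int j = 0; j < 42; ++j) {
            int rng = 2;                  /* stand-in for the random step */
            #pragma acc parallel loop copyin(rng)
            for (int i = 0; i < N; ++i)
                a[i] += rng;
        }

        /* bring the device value back before reading it on the host */
        #pragma acc update self(a[42:1])
        seen = a[42];                     /* 42 iterations of +2 */
    }
    return seen;
}
```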

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables listed in the device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56 / 77

Data on device - Global data regions: enter data, exit data

Important notes: one benefit of the enter data and exit data directives is that the data transfers to the device can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
  integer :: s = 10000
  real*8, allocatable, dimension(:) :: vec
  allocate(vec(s))
  !$ACC enter data create(vec(1:s))
  call compute
  !$ACC exit data delete(vec(1:s))
contains
  subroutine compute
    integer :: i
    !$ACC parallel loop
    do i=1,s
      vec(i) = 1.2
    enddo
  end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57 / 77
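A C transcription of the same pattern (our own sketch: the names are ours, and we use "exit data copyout" instead of "delete" so the host can check the result; with the pragmas ignored, compute() writes host memory directly and the data flow is unchanged):

```c
#include <stdlib.h>

#define S 10000

/* Sketch mirroring exemples/enter_data.f90: the device allocation is
   managed away from the compute routine that uses the data. */
static double *vec;

static void compute(void)
{
    #pragma acc parallel loop present(vec[0:S])
    for (int i = 0; i < S; ++i)
        vec[i] = 1.2;
}

double run_enter_data(void)
{
    vec = malloc(S * sizeof *vec);
    #pragma acc enter data create(vec[0:S])   /* device-side allocation */
    compute();
    #pragma acc exit data copyout(vec[0:S])   /* release + final D->H copy */
    double last = vec[S - 1];
    free(vec);
    return last;
}
```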

Data on device - Global data regions: declare

Important notes: with the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Lifetime
module      the program
function    the function
subroutine  the subroutine

module declare
  integer :: rows = 1000
  integer :: cols = 1000
  integer :: generations = 100
  !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58 / 77

Data on device - Time line

(The original slide is a host/device timeline for a region opened with copyin(a,c) create(b) copyout(d).)

• At entry of the data region, a and c are copied H→D, while b and d are only allocated on the device: their host values are not transferred.
• Kernels then update b and d on the device; the host copies keep their old values.
• update self(b) copies the device value of b back to the host while the region is still open.
• At the end of the data region, d is copied D→H; a, b and c are simply released on the device.

Pierre-François Lavallée, Thibaut Véry 59 / 77
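The timeline can be reproduced with a minimal C sketch (the numeric values are invented for the demo, not taken from the slide; with the pragmas ignored, everything runs on the host but the intended data flow is the same):

```c
/* Sketch of the timeline: a and c are copied in, b exists only on the
   device until "update self", d is copied out at region exit. */
double timeline_demo(void)
{
    double a = 1.0, b = 0.0, c = 4.2, d = 0.0;
    #pragma acc data copyin(a, c) create(b) copyout(d)
    {
        #pragma acc serial
        {
            b = a + c;        /* device-side value of b */
            d = a + b + c;    /* device-side value of d */
        }
        #pragma acc update self(b)   /* host now sees the device's b */
    } /* d is copied D->H here; a, b and c are simply released */
    return b + d;
}
```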

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This applies equally to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)

! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60 / 77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently.
To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers;
• kernel/kernel, if they are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers inside the clause creates several execution threads.

(The slide diagram compares three schedules of "Kernel 1 / Data transfer / Kernel 2": fully synchronous; Kernel 1 overlapped with the transfer; everything asynchronous.)

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61 / 77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. You then have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until the completion of execution threads 1 and 2.

(Slide diagram: Kernel 2 waits for the transfer to complete.)

! b is initialized on host
do i=1,s
  b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62 / 77
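The async/wait pattern above can be transcribed to C for a host-side check (our own sketch; the queue numbers are those of the slide, and with the pragmas ignored the code runs sequentially and produces the same values):

```c
#define S 1000

/* Sketch mirroring exemples/async_wait_slides.f90: the last kernel reads
   'b', so it must wait on queue 2 (the asynchronous update of b), while
   the kernel filling 'a' on queue 1 can proceed independently. */
void async_wait_demo(int a[S], int b[S])
{
    for (int i = 0; i < S; ++i)   /* b is initialized on the host */
        b[i] = i + 1;

    #pragma acc parallel loop async(1)
    for (int i = 0; i < S; ++i)
        a[i] = 42;

    #pragma acc update device(b[0:S]) async(2)

    #pragma acc parallel loop wait(2)   /* needs the transfer of b */
    for (int i = 0; i < S; ++i)
        b[i] = b[i] + 1;

    #pragma acc wait   /* synchronize the remaining queues */
}
```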

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the target accelerator flags at compilation (-ta=tesla:cc60:autocompare).

• Example of code that has a race condition (shown as a screenshot on the original slide).

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )             +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC end parallel
  !$ACC parallel
  do r=1,rows
    do c=1,cols
      ... (listing truncated on the slide)

exemples/opti/gol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67 / 77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1,generations
  cells = 0
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )             +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo

exemples/opti/gol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68 / 77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
  !$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c  )             +oworld(r+1,c  )+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      end if
    enddo
  enddo
  !$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

GPU offloading:
• OpenACC PARALLEL (GRWSVS mode: each gang executes the instructions redundantly) ↔ OpenMP TEAMS (creates teams of threads that execute redundantly)
• OpenACC SERIAL (only 1 thread executes the code) ↔ OpenMP TARGET (GPU offloading, initial thread only)
• OpenACC KERNELS (offloading + automatic parallelization) ↔ does not exist in OpenMP

Implicit transfers H→D or D→H: scalar and non-scalar variables (both standards).

Explicit transfers H→D or D→H inside explicit parallel regions: clauses of PARALLEL/SERIAL/KERNELS ↔ clauses of TARGET (data movement defined from the host perspective):
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on the device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71 / 77

Data regions:
• declare (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM() (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO() (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ does not exist in OpenMP
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72 / 77

Routines:
• routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism:
• async() / wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait / depend (manage asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73 / 77
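The first row of the table can be made concrete with the same loop written twice in C (our own transcription of the slide's Fortran loop1D examples; either compiler ignores the other standard's pragmas, and without offload support both variants run as plain sequential loops):

```c
#define N 1000

/* OpenACC version: one combined construct sharing iterations over
   gangs, workers and vector lanes. */
void fill_acc(double A[N])
{
    #pragma acc parallel loop gang worker vector
    for (int i = 0; i < N; ++i)
        A[i] = 1.14 * (i + 1);
}

/* OpenMP 5.0 counterpart from the table: TARGET offloads, TEAMS
   DISTRIBUTE PARALLEL FOR SIMD shares the iterations. */
void fill_omp(double A[N])
{
    #pragma omp target teams distribute parallel for simd map(from: A[0:N])
    for (int i = 0; i < N; ++i)
        A[i] = 1.14 * (i + 1);
}
```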

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
  !$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
  !$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
  !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74 / 77

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
  !$acc loop worker
  do j=1,nx
    !$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
  !$OMP PARALLEL DO
  do j=1,nx
    !$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
    !$OMP END SIMD
  end do
  !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75 / 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77 / 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 63: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Parallel regions

Compute constructs serial parallel kernels have a data region associated with variablesnecessary to execution

program parai n t e g e r a (1000)

3 i n t e g e r i l

$ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

8 a ( i ) = ienddo$ACC end p a r a l l e l

do l =1 10013 $ACC p a r a l l e l copy ( a ( 1 0 0 0 ) )

$ACC loopdo i =1 1000

a ( i ) = a ( i ) + 1enddo

18 $ACC end p a r a l l e lenddo

end program para

exemplesparallel_data_multif90

Notesbull Compute region reached 100 timesbull Data region reached 100 times (enter and

exit so potentially 200 data transfers)

Not optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 52 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

Optimal

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life.
Our starting point is the version using the parallel directive.

    do g=1, generations
       cells = 0
       !$ACC parallel
       do r=1, rows
          do c=1, cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1, rows
          do c=1, cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                     oworld(r-1,c  )              +oworld(r+1,c  )+&
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1, rows
          do c=1, cols
             ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.
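The pitfall above can be shown compactly in C syntax (function names are ours, not from the course examples): under a bare parallel construct every gang executes the whole loop redundantly, while adding loop distributes the iterations.

```c
/* Sketch: bare "parallel" runs the loop redundantly on every gang
 * (correct result, no speedup); "parallel loop" shares iterations. */
void copy_grid_redundant(double *dst, const double *src, int n)
{
    #pragma acc parallel copyin(src[0:n]) copyout(dst[0:n])
    {
        /* gang-redundant execution: each gang does all n iterations */
        for (int i = 0; i < n; i++)
            dst[i] = src[i];
    }
}

void copy_grid_shared(double *dst, const double *src, int n)
{
    /* iterations distributed among gangs/workers/vectors */
    #pragma acc parallel loop copyin(src[0:n]) copyout(dst[0:n])
    for (int i = 0; i < n; i++)
        dst[i] = src[i];
}
```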

Optimization - Game of Life
Let's share the work among the active gangs.

    do g=1, generations
       cells = 0
       !$ACC parallel loop
       do r=1, rows
          do c=1, cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do r=1, rows
          do c=1, cols
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                     oworld(r-1,c  )              +oworld(r+1,c  )+&
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do r=1, rows
          do c=1, cols
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life
Let's optimize data transfers.

    cells = 0
    !$ACC data copy(oworld, world)
    do g=1, generations
       cells = 0
       !$ACC parallel loop
       do c=1, cols
          do r=1, rows
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do c=1, cols
          do r=1, rows
             neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+&
                     oworld(r-1,c  )              +oworld(r+1,c  )+&
                     oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do c=1, cols
          do r=1, rows
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo
    !$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
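The serial core of the update rule above can be sketched in C (our transliteration of the loops in exemples/optigol_opti3.f90, with an indexing helper of our own): the grid carries a one-cell halo of zeros so the neighbour sum needs no boundary tests.

```c
/* Row-major indexing into a (rows+2) x (cols+2) grid with a halo. */
#define IDX(r, c, cols) ((r) * ((cols) + 2) + (c))

/* One generation: copy world into oworld, apply the rule, count cells. */
int life_step(int *world, int *oworld, int rows, int cols)
{
    int cells = 0;
    for (int r = 1; r <= rows; r++)
        for (int c = 1; c <= cols; c++)
            oworld[IDX(r, c, cols)] = world[IDX(r, c, cols)];

    for (int r = 1; r <= rows; r++)
        for (int c = 1; c <= cols; c++) {
            int neigh = oworld[IDX(r-1,c-1,cols)] + oworld[IDX(r,c-1,cols)]
                      + oworld[IDX(r+1,c-1,cols)] + oworld[IDX(r-1,c,cols)]
                      + oworld[IDX(r+1,c,cols)]   + oworld[IDX(r-1,c+1,cols)]
                      + oworld[IDX(r,c+1,cols)]   + oworld[IDX(r+1,c+1,cols)];
            if (oworld[IDX(r,c,cols)] == 1 && (neigh < 2 || neigh > 3))
                world[IDX(r,c,cols)] = 0;
            else if (neigh == 3)
                world[IDX(r,c,cols)] = 1;
        }

    for (int r = 1; r <= rows; r++)
        for (int c = 1; c <= cols; c++)
            cells += world[IDX(r,c,cols)];
    return cells;
}
```

A horizontal blinker on a 3x3 grid turns vertical after one step, which makes a quick correctness check.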


Features: OpenACC 2.7 → OpenMP 5.0

GPU offloading:
• PARALLEL [GR/WS/VS mode: each gang executes the instructions redundantly] → TEAMS [creates teams of threads that execute redundantly]
• SERIAL [only 1 thread executes the code] → TARGET [GPU offloading, initial thread only]
• KERNELS [offloading + automatic parallelization] → does not exist in OpenMP

Implicit transfers H→D or D→H:
• scalar variables, non-scalar variables → scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions:
• clauses on PARALLEL/SERIAL/KERNELS → clauses on TARGET [data movement defined from the host perspective]
• copyin() [copy H→D when entering the construct] → map(to:)
• copyout() [copy D→H when leaving the construct] → map(from:)
• copy() [copy H→D when entering and D→H when leaving] → map(tofrom:)
• create() [created on the device, no transfers] → map(alloc:)
• present(), delete()
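The clause pairs in the table can be seen side by side on one kernel. This is our own minimal sketch in C; the OpenMP variant is shown in a comment, since only one model would be active in a real build.

```c
/* The same explicit H->D / D->H movement written with an OpenACC clause
 * pair and its OpenMP 5.0 map() equivalents from the table above. */
void scale(double *x, const double *y, int n)
{
    #pragma acc parallel loop copyin(y[0:n]) copyout(x[0:n])
    /* OpenMP 5.0 equivalent:
       #pragma omp target teams distribute parallel for \
               map(to: y[0:n]) map(from: x[0:n])
    */
    for (int i = 0; i < n; i++)
        x[i] = 2.0 * y[i];
}
```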

Features: OpenACC 2.7 → OpenMP 5.0

Data region:
• declare() [adds data to the global data region]
• data / end data [inside the same programming unit] → TARGET DATA MAP(tofrom:) / END TARGET DATA [inside the same programming unit]
• enter data / exit data [opening and closing anywhere in the code] → TARGET ENTER DATA / TARGET EXIT DATA [opening and closing anywhere in the code]
• update self() [update D→H] → UPDATE FROM [copy D→H inside a data region]
• update device() [update H→D] → UPDATE TO [copy H→D inside a data region]

Loop parallelization:
• kernels [offloading + automatic parallelization of suitable loops, sync at the end of loops] → does not exist in OpenMP
• loop gang/worker/vector [share iterations among gangs, workers and vectors] → distribute [share iterations among TEAMS]

Features: OpenACC 2.7 → OpenMP 5.0

Routines:
• routine [seq|vector|worker|gang] [compile a device routine with the specified level of parallelism]

Asynchronism:
• async()/wait [execute kernels and transfers asynchronously, with explicit sync] → nowait/depend [manage asynchronism and task dependencies]

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC:

    !$acc parallel
    !$acc loop gang worker vector
    do i=1, nx
       A(i) = 1.14_8*i
    end do
    !$acc end parallel

exemples/loop1D.f90

    !$acc parallel
    !$acc loop gang worker
    do j=1, nx
       !$acc loop vector
       do i=1, nx
          A(i,j) = 1.14_8
       end do
    end do
    !$acc end parallel

exemples/loop2D.f90

OpenMP:

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i=1, nx
       A(i) = 1.14_8*i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

exemples/loop1DOMP.f90

    !$OMP TARGET TEAMS DISTRIBUTE
    do j=1, nx
       !$OMP PARALLEL DO SIMD
       do i=1, nx
          A(i,j) = 1.14_8
       end do
       !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC:

    !$acc parallel
    !$acc loop gang
    do k=1, nx
       !$acc loop worker
       do j=1, nx
          !$acc loop vector
          do i=1, nx
             A(i,j,k) = 1.14_8
          end do
       end do
    end do
    !$acc end parallel

exemples/loop3D.f90

OpenMP:

    !$OMP TARGET TEAMS DISTRIBUTE
    do k=1, nx
       !$OMP PARALLEL DO
       do j=1, nx
          !$OMP SIMD
          do i=1, nx
             A(i,j,k) = 1.14_8
          end do
          !$OMP END SIMD
       end do
       !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr



Data on device - Local data regions

It is possible to open a data region inside a procedure. By doing this, you make the variables inside the clauses visible on the device. You have to use the data directive.

    !$ACC parallel copyout(a(1:1000))
    !$ACC loop
    do i=1, 1000
       a(i) = i
    enddo
    !$ACC end parallel

    !$ACC data copy(a(1:1000))
    do l=1, 100
       !$ACC parallel
       !$ACC loop
       do i=1, 1000
          a(i) = a(i) + 1
       enddo
       !$ACC end parallel
    enddo
    !$ACC end data

exemples/parallel_data_single.f90

Notes:
• Compute region reached 100 times
• Data region reached 1 time (enter and exit, so potentially 2 data transfers)

Optimal? No: you have to move the beginning of the data region before the first loop.
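The suggested fix can be sketched in C (our variant of exemples/parallel_data_single.f90, with a function name of our own): opening the data region before the first kernel keeps the array on the device across all compute regions.

```c
/* Open the data region before the first kernel so 'a' lives on the
 * device for the whole sequence of compute regions. */
void fill_and_increment(int *a, int n, int iters)
{
    #pragma acc data copyout(a[0:n])
    {
        #pragma acc parallel loop       /* first kernel now inside */
        for (int i = 0; i < n; i++)
            a[i] = i + 1;

        for (int l = 0; l < iters; l++) {
            #pragma acc parallel loop   /* no H<->D traffic per iteration */
            for (int i = 0; i < n; i++)
                a[i] += 1;
        }
    }   /* single D->H transfer here */
}
```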

Data on device

Naive version: [Diagram: two successive parallel regions, each with its own implicit data region, operate on arrays A, B, C and D through copyin, copyout, copy and create clauses; every clause acts at each region boundary.]

• Transfers: 8
• Allocations: 2

Data on device

Optimize data transfers: [Diagram: the same two parallel regions (each with its implicit data region) still carry their copyin, copyout, copy and create clauses on A, B, C and D.] Let's add a data region.

Data on device

Optimize data transfers: [Diagram: an enclosing data region now carries the copyin, copyout, copy and create clauses.] A, B and C are now transferred only at entry and exit of the data region.

• Transfers: 4
• Allocation: 1

Data on device

Optimize data transfers: [Diagram: inside the enclosing data region, the clauses on the parallel regions only check for presence; update device and update self refresh the copies where needed.] The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocation: 1

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1, 1000
       a(i) = 0
    enddo

    do j=1, 42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1, 1000
          a(i) = a(i) + rng
       enddo
    enddo
    ! print *, "before update self", a(42)
    ! !$ACC update self(a(42:42))
    ! print *, "after update self", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data", a(42)
    !$ACC end data
    print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update: the a array is not initialized on the host before the end of the data region.

Data on device - update self

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): variables included within self are updated D→H.

    !$ACC data copyout(a)
    !$ACC parallel loop
    do i=1, 1000
       a(i) = 0
    enddo

    do j=1, 42
       call random_number(test)
       rng = floor(test*100)
       !$ACC parallel loop copyin(rng) &
       !$ACC& copyout(a)
       do i=1, 1000
          a(i) = a(i) + rng
       enddo
    enddo
    print *, "before update self", a(42)
    !$ACC update self(a(42:42))
    print *, "after update self", a(42)
    !$ACC serial
    a(42) = 42
    !$ACC end serial
    print *, "before end data", a(42)
    !$ACC end data
    print *, "after end data", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

device: the variables included inside a device clause are updated H→D.
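As a small sketch of the device clause (our example, not from the course): after the host modifies an array inside a data region, the new values must be pushed H→D before the next kernel uses them.

```c
/* Host modifies 'b' inside a data region; update device refreshes the
 * device copy so the following kernel sees the new values. */
void bump_then_double(double *b, int n)
{
    #pragma acc data copy(b[0:n])
    {
        for (int i = 0; i < n; i++)        /* host-side modification */
            b[i] += 1.0;
        #pragma acc update device(b[0:n])  /* push the change H->D */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)        /* kernel uses updated values */
            b[i] *= 2.0;
    }
}
```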

Data on device - Global data regions: enter data, exit data

Important notes:
One benefit of using the enter data and exit data directives is to manage the data transfers on the device at a different place than where the data is used. It is especially useful with C++ constructors and destructors.

    program enterdata
      integer :: s = 10000
      real*8, allocatable, dimension(:) :: vec
      allocate(vec(s))
      !$ACC enter data create(vec(1:s))
      call compute
      !$ACC exit data delete(vec(1:s))
    contains
      subroutine compute
        integer :: i
        !$ACC parallel loop
        do i=1, s
          vec(i) = 1.2
        enddo
      end subroutine
    end program enterdata

exemples/enter_data.f90
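The create/use/delete lifecycle above maps naturally onto a constructor/destructor split. Here is our own C sketch of that pattern (function names are ours, not from the course examples).

```c
#include <stdlib.h>

/* Device allocation decoupled from use, mirroring exemples/enter_data.f90. */
double *vec_create(int n)
{
    double *v = malloc(n * sizeof *v);
    #pragma acc enter data create(v[0:n])   /* device allocation, no copy */
    return v;
}

void vec_fill(double *v, int n, double x)
{
    #pragma acc parallel loop present(v[0:n])
    for (int i = 0; i < n; i++)
        v[i] = x;
    #pragma acc update self(v[0:n])         /* D->H so the host can check */
}

void vec_destroy(double *v, int n)
{
    #pragma acc exit data delete(v[0:n])    /* free the device copy first */
    free(v);
}
```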

Data on device - Global data regions: declare

Important notes:
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone → Scope
• module → the program
• function → the function
• subroutine → the subroutine

    module declare
      integer :: rows = 1000
      integer :: cols = 1000
      integer :: generations = 100
      !$ACC declare copyin(rows, cols, generations)
    end module

exemples/declare.f90

Data on device - Time line

[Diagram: host and device variable states over time for a data region opened with copyin(a,c) create(b) copyout(d)]
• On entry, a and c are copied H→D; b and d are allocated on the device without transfer.
• Inside the region, host-side changes are not seen by the device and device-side changes are not seen by the host.
• update self(b) copies the device value of b D→H during the region.
• On exit of the region, only d (copyout) is transferred D→H.

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update:
The update directive is used to update data either on the host or on the device.

    ! Update the value of a located on host with the value on device
    !$ACC update self(a)
    ! Update the value of b located on device with the value on host
    !$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Diagram: fully synchronous execution vs "Kernel 1 and transfer async" vs "everything async"]

    ! b is initialized on host
    do i=1, s
       b(i) = i
    enddo

    !$ACC parallel loop async(1)
    do i=1, s
       a(i) = 42
    enddo

    !$ACC update device(b) async(2)

    !$ACC parallel loop async(3)
    do i=1, s
       c(i) = 1.1
    enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A wait directive is also available.

Important notes:
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Diagram: Kernel 2 waits for the transfer]

    ! b is initialized on host
    do i=1, s
       b(i) = i
    enddo

    !$ACC parallel loop async(1)
    do i=1, s
       a(i) = 42
    enddo

    !$ACC update device(b) async(2)

    !$ACC parallel loop wait(2)
    do i=1, s
       b(i) = b(i) + 1.1
    enddo

exemples/async_wait_slides.f90
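The dependency pattern above can be sketched in C (our transliteration of exemples/async_wait_slides.f90; sizes and structured data management are simplified): the kernel that increments b waits only on queue 2, where the H→D transfer of b was posted, while the kernel on queue 1 runs independently.

```c
/* Kernel on queue 1 is independent; the last kernel depends only on the
 * async transfer posted on queue 2, so it carries wait(2). */
void async_wait_pattern(double *a, double *b, int s)
{
    for (int i = 0; i < s; i++)         /* b is initialized on host */
        b[i] = i;

    #pragma acc parallel loop async(1)  /* queue 1: independent kernel */
    for (int i = 0; i < s; i++)
        a[i] = 42.0;

    #pragma acc update device(b[0:s]) async(2)  /* queue 2: H->D transfer */

    #pragma acc parallel loop wait(2)   /* waits for the transfer only */
    for (int i = 0; i < s; i++)
        b[i] = b[i] + 1.1;
}
```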



Data on device - Local data regions

It is possible to open a data region inside a procedure By doing this you make the variables insidethe clauses visible on the deviceYou have to use the data directive

5 $ACC p a r a l l e l copyout ( a ( 1 0 0 0 ) )$ACC loopdo i =1 1000

a ( i ) = ienddo

10 $ACC end p a r a l l e l

$ACC data copy ( a ( 1 0 0 0 ) )do l =1 100

$ACC p a r a l l e l15 $ACC loop

do i =1 1000a ( i ) = a ( i ) + 1

enddo$ACC end p a r a l l e l

20 enddo$ACC end data

exemplesparallel_data_singlef90

Notesbull Compute region reached 100 timesbull Data region reached 1 time (enter and exit

so potentially 2 data transfers)

OptimalNo You have to move the beginning of the dataregion before the first loop

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 53 77

Data on device

Naive version Parallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin

copyout

copy

create

copyin

copyout

copy

create

bull Transfers 8bull Allocations 2

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

[Timeline figure, host vs device, for a data region opened with copyin(a,c) create(b) copyout(d): at entry, a and c are copied H→D and b and d are allocated on the device with undefined content; during execution the device computes new values of b and d; update self(b) copies b back D→H while the host copy of d is still unchanged; at exit of the data region only d is copied back D→H (copyout).]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.

• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update: the update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

[Figure: three execution timelines. Synchronous: Kernel 1, the data transfer and Kernel 2 run one after the other. "Kernel 1 and transfer async": Kernel 1 overlaps with the data transfer. "Everything is async": Kernel 1, the data transfer and Kernel 2 all overlap.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
  c(i) = 1.1
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes:
• The argument of the wait clause or directive is a list of integers representing the execution thread numbers. For example wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: Kernel 1 and the data transfer run concurrently; Kernel 2 waits for the transfer to complete before starting.]

! b is initialized on host
do i=1,s
  b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
  a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
  b(i) = b(i) + 1.1
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both GPU and CPU; if the results are different, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)

• Example of code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life
From now on we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
  cells = 0
!$ACC parallel
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC end parallel
!$ACC parallel
  do r=1,rows
    do c=1,cols

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
  cells = 0
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do r=1,rows
    do c=1,cols
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do r=1,rows
    do c=1,cols
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
  cells = 0
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      oworld(r,c) = world(r,c)
    enddo
  enddo
!$ACC parallel loop
  do c=1,cols
    do r=1,rows
      neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
              oworld(r-1,c)                +oworld(r+1,c)+ &
              oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
      if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
        world(r,c) = 0
      else if (neigh == 3) then
        world(r,c) = 1
      endif
    enddo
  enddo
!$ACC parallel loop reduction(+:cells)
  do c=1,cols
    do r=1,rows
      cells = cells + world(r,c)
    enddo
  enddo
  print *, "Cells alive at generation", g, ":", cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

OpenACC 2.7 / OpenMP 5.0 translation table (directives/clauses, with notes)

GPU offloading:
• PARALLEL (G/W/V/S mode; each gang executes the instructions redundantly) ↔ TEAMS (creates teams of threads that execute redundantly)
• SERIAL (only 1 thread executes the code) ↔ TARGET (GPU offloading, initial thread only)
• KERNELS (offloading + automatic parallelization) ↔ no OpenMP equivalent

Implicit transfers H→D or D→H: scalar variables, non-scalar variables (same in both models)

Explicit transfers H→D or D→H inside explicit parallel regions: clauses on PARALLEL/SERIAL/KERNELS ↔ clauses on TARGET (data movement defined from the host perspective)
• copyin() (copy H→D when entering the construct) ↔ map(to:)
• copyout() (copy D→H when leaving the construct) ↔ map(from:)
• copy() (copy H→D when entering and D→H when leaving) ↔ map(tofrom:)
• create() (created on device, no transfers) ↔ map(alloc:)
• present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

OpenACC 2.7 / OpenMP 5.0 translation table (continued)

Data regions:
• declare() (add data to the global data region)
• data / end data (inside the same programming unit) ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• enter data / exit data (opening and closing anywhere in the code) ↔ TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• update self() (update D→H) ↔ UPDATE FROM (copy D→H inside a data region)
• update device() (update H→D) ↔ UPDATE TO (copy H→D inside a data region)

Loop parallelization:
• kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) ↔ no OpenMP equivalent
• loop gang/worker/vector (share iterations among gangs, workers and vectors) ↔ distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

OpenACC 2.7 / OpenMP 5.0 translation table (continued)

• Routines: routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)
• Asynchronism: async()/wait (execute kernels/transfers asynchronously, with explicit synchronization) ↔ nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
  A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
  do i=1,nx
    A(i,j) = 1.14_8
  end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
  A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
  do i=1,nx
    A(i,j) = 1.14_8
  end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
  do j=1,nx
!$acc loop vector
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
  end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
  do j=1,nx
!$OMP SIMD
    do i=1,nx
      A(i,j,k) = 1.14_8
    end do
!$OMP END SIMD
  end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7/OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 76/77

Contacts

Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

Data on device

Naive version: [Figure: two successive parallel regions, each with its own data region; in each one the arrays A, B, C and D appear in copyin, copyout, copy and create clauses respectively.]
• Transfers: 8
• Allocations: 2

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: [Figure: the same two parallel regions, each still carrying its own copyin, copyout, copy and create clauses for A, B, C and D.]
Let's add a data region!

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: [Figure: a data region now encloses both parallel regions; the copyin, copyout, copy and create clauses are hoisted to the data region.]
A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers: [Figure: with the enclosing data region, the clauses on the parallel regions become present; update device and update self are used where fresh values are needed.]
The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.
• Transfers: 6
• Allocation: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0.
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
!print *, "before update self", a(42)
!!$ACC update self(a(42:42))
!print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir_comment.f90

Without update

The a array is not initialized on the host before the end of the data region.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is however impossible to use update inside a parallel region.

self (or host): the variables included within a self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
  a(i) = 0.
enddo
do j=1,42
  call random_number(test)
  rng = floor(test*100)
!$ACC parallel loop copyin(rng) &
!$ACC& copyout(a)
  do i=1,1000
    a(i) = a(i) + rng
  enddo
enddo
print *, "before update self", a(42)
!$ACC update self(a(42:42))
print *, "after update self", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data", a(42)
!$ACC end data
print *, "after end data", a(42)

exemples/update_dir.f90

With update

The a array is initialized on the host after the update directive.

Pierre-François Lavallée, Thibaut Véry 55/77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 67: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)

A

B

C

D

copyin copyin

copyout copyout

copy copy

create create

Lets add a data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin copyincopyin

copyout copyout copyout

copy copy copy

create create create

AB and C are now transferred at entry and exit of the data regionbull Transfers 4bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region, you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

device
The variables included in a device clause are updated H→D.

Pierre-François Lavallée, Thibaut Véry 56/77

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of using the enter data and exit data directives is to manage the data transfers at another place than where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
!$ACC enter data create(vec(1:s))
   call compute
!$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
!$ACC parallel loop
      do i=1,s
         vec(i) = 1.2
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Pierre-François Lavallée, Thibaut Véry 57/77

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
!$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Pierre-François Lavallée, Thibaut Véry 58/77

Data on device - Time line

[Figure: host/device timeline for a region opened with copyin(a,c) create(b) copyout(d). At entry, a and c are copied H→D, b is allocated on the device without any transfer, and d is allocated for the final copyout. While the region is open, the host and device copies evolve independently; an update self(b) copies the device value of b back to the host. At exit of the data region only d is copied D→H; the other host values keep their last host-side state.]

Pierre-François Lavallée, Thibaut Véry 59/77

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region opened with data encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. The argument of async is optional; specifying different numbers creates several execution threads.

[Figure: synchronous execution runs Kernel 1, the data transfer and Kernel 2 one after the other; with async on Kernel 1 and the transfer, they overlap; with everything async, all three may run concurrently.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42.
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1.
enddo

exemples/async_slides.f90

Pierre-François Lavallée, Thibaut Véry 61/77

Parallelism - Asynchronism - wait

To get correct results, it is likely that some asynchronous kernels need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait is a list of integers representing execution thread numbers. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: Kernel 1 runs on thread 1 while the data transfer runs on thread 2; Kernel 2 waits for the transfer on thread 2 before starting.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42.
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1.
enddo

exemples/async_wait_slides.f90

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction
2 Parallelism
3 GPU Debugging
4 Annex: Optimization example
5 Annex: OpenACC 2.7/OpenMP 5.0 translation table
6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare).
• Example of code that has a race condition.

Pierre-François Lavallée, Thibaut Véry 65/77


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
!$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC end parallel
!$ACC parallel
   do r=1,rows
      do c=1,cols
...

exemples/opti/gol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

exemples/opti/gol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
!$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )             +oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
!$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

exemples/opti/gol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77


Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
• OpenACC: PARALLEL (GR/WS/VS mode; each gang executes the instructions redundantly) | OpenMP: TEAMS (creates teams of threads that execute redundantly)
• OpenACC: SERIAL (only 1 thread executes the code) | OpenMP: TARGET (GPU offloading, initial thread only)
• OpenACC: KERNELS (offloading + automatic parallelization) | OpenMP: does not exist

Implicit transfers H→D or D→H
• OpenACC: scalar variables, non-scalar variables | OpenMP: scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS | OpenMP: clauses on TARGET (data movement defined from the host perspective)
• OpenACC: copyin() (copy H→D when entering the construct) | OpenMP: map(to:)
• OpenACC: copyout() (copy D→H when leaving the construct) | OpenMP: map(from:)
• OpenACC: copy() (copy H→D when entering and D→H when leaving) | OpenMP: map(tofrom:)
• OpenACC: create() (created on device, no transfers) | OpenMP: map(alloc:)
• OpenACC: present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features: OpenACC 2.7 vs OpenMP 5.0

Data region
• OpenACC: declare() (add data to the global data region)
• OpenACC: data / end data (inside the same programming unit) | OpenMP: TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC: enter data / exit data (opening and closing anywhere in the code) | OpenMP: TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC: update self() (update D→H) | OpenMP: UPDATE FROM (copy D→H inside a data region)
• OpenACC: update device() (update H→D) | OpenMP: UPDATE TO (copy H→D inside a data region)

Loop parallelization
• OpenACC: kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops) | OpenMP: does not exist
• OpenACC: loop gang/worker/vector (share iterations among gangs, workers and vectors) | OpenMP: distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

Features: OpenACC 2.7 vs OpenMP 5.0

Routines
• OpenACC: routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Asynchronism
• OpenACC: async()/wait (execute kernels/transfers asynchronously, with explicit sync) | OpenMP: nowait/depend (manages asynchronism and task dependencies)

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i)=1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
!$acc loop vector
   do i=1,nx
      A(i,j)=1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i)=1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1D_OMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
!$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j)=1.14_8
   end do
!$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2D_OMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
!$acc loop worker
   do j=1,nx
!$acc loop vector
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
!$OMP PARALLEL DO
   do j=1,nx
!$OMP SIMD
      do i=1,nx
         A(i,j,k)=1.14_8
      end do
!$OMP END SIMD
   end do
!$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3D_OMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77


Contacts

• Pierre-François Lavallée, pierre-francois.lavallee@idris.fr
• Thibaut Véry, thibaut.very@idris.fr
• Rémi Lacroix, remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7/OpenMP 5.0 translation table
• Contacts
Page 68: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device

Optimize data transfers
[Figure: two parallel regions, each acting as a data region through its clauses, enclosed in a single data region. The clauses for the arrays A (copyin), B (copyout), C (copy) and D (create) are moved from the two parallel regions to the enclosing data region.]

A, B and C are now transferred at entry and exit of the data region.
• Transfers: 4
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device

Optimize data transfers
[Figure: the same two parallel regions inside a data region, now with explicit presence: the data region performs copyin(A), copyout(B), copy(C) and create(D); the parallel regions use present clauses; an update device refreshes A and an update self refreshes C where needed.]

The clauses check for data presence, so to make the code clearer you can use the present clause. To make sure the updates are done, use the update directive.

• Transfers: 6
• Allocations: 1

Pierre-François Lavallée, Thibaut Véry 54/77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 69: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device

Optimize data transfersParallel region(+ Data region)

Parallel region(+ Data region)Data region

A

B

C

D

copyin present presentupdate device

copyout present present

copy present presentupdate self

create present present

The clauses check for data presence so to make the code clearer you can use the present clause To make sure the updates are done use the updatedirective

bull Transfers 6bull Allocation 1

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 54 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update self
Inside a data region you have to use the update directive to transfer data. It is, however, impossible to use update inside a parallel region.

self (or host)
Variables listed in the self clause are updated D→H.

!$ACC data copyout(a)
!$ACC parallel loop
do i=1,1000
   a(i) = 0.
enddo
do j=1,42
   call random_number(test)
   rng = floor(test*100)
   !$ACC parallel loop copyin(rng) &
   !$ACC& copyout(a)
   do i=1,1000
      a(i) = a(i) + rng
   enddo
enddo
print *, "before update self: ", a(42)
!$ACC update self(a(42:42))
print *, "after update self: ", a(42)
!$ACC serial
a(42) = 42
!$ACC end serial
print *, "before end data: ", a(42)
!$ACC end data
print *, "after end data: ", a(42)

exemples/update_dir.f90

With update: the a array is initialized on the host after the update directive.

Data on device - update device

Inside a data region you have to use the update directive to transfer data. It is, however, impossible to use update in a parallel region.

device
The variables listed in a device clause are updated H→D.

Data on device - Global data regions: enter data, exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers to the device can be managed at a different place from where the data are used. This is especially useful with C++ constructors and destructors.

program enterdata
   integer :: s = 10000
   real*8, allocatable, dimension(:) :: vec
   allocate(vec(s))
   !$ACC enter data create(vec(1:s))
   call compute
   !$ACC exit data delete(vec(1:s))
contains
   subroutine compute
      integer :: i
      !$ACC parallel loop
      do i=1,s
         vec(i) = 12
      enddo
   end subroutine
end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data on the device equals the scope of the program unit where it appears. For example:

Zone          Scope
module        the program
function      the function
subroutine    the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90

Data on device - Time line

[Figure: timeline of the host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d) (data transfer to the device), refreshed with update self(b) (D→H), and closed with the D→H copy of d (copyout) at the exit of the region.]

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses on kernels and parallel, if a data region opened with data encloses those parallel regions, you should use update to avoid unexpected behaviors. The same applies to data regions opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on host with the value on device
!$ACC update self(a)
! Update the value of b located on device with the value on host
!$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlaps between:
• computation and data transfers
• kernel/kernel, if they are independent

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying a number inside the clause makes it possible to use several execution threads.

[Figure: three timelines for Kernel 1, a data transfer and Kernel 2 — fully synchronous; Kernel 1 and the transfer asynchronous; everything asynchronous.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 11
enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait (directive or clause) is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Figure: timeline where Kernel 2 waits for the data transfer issued on execution thread 2.]

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 11
enddo

exemples/async_wait_slides.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped.

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60:autocompare).
• Example of code that has a race condition:

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1,generations
   cells = 0
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1,rows
      do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1,generations
   cells = 0
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1,rows
      do c=1,cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1,rows
      do c=1,cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life
Let's optimize the data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1,generations
   cells = 0
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1,cols
      do r=1,rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c)+oworld(r+1,c)+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         endif
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1,cols
      do r=1,rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Features: OpenACC 2.7 vs OpenMP 5.0

GPU offloading
  OpenACC: PARALLEL   (GRWSVS mode: each gang executes the instructions redundantly)
  OpenMP:  TEAMS      (creates teams of threads that execute redundantly)

  OpenACC: SERIAL     (only 1 thread executes the code)
  OpenMP:  TARGET     (GPU offloading, initial thread only)

  OpenACC: KERNELS    (offloading + automatic parallelization)
  OpenMP:  none       (does not exist in OpenMP)

Implicit transfers H→D or D→H
  OpenACC: scalar variables, non-scalar variables
  OpenMP:  scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions
  OpenACC: clauses on PARALLEL/SERIAL/KERNELS
  OpenMP:  clauses on TARGET (data movement defined from the host perspective)

  copyin()    copy H→D when entering the construct        map(to:)
  copyout()   copy D→H when leaving the construct         map(from:)
  copy()      copy H→D when entering, D→H when leaving    map(tofrom:)
  create()    created on the device, no transfers         map(alloc:)
  present(), delete()

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Data regions
  declare()                add data to the global data region
  data / end data          <-> TARGET DATA MAP() / END TARGET DATA       (both inside the same programming unit)
  enter data / exit data   <-> TARGET ENTER DATA / TARGET EXIT DATA      (opening and closing anywhere in the code)
  update self()            <-> UPDATE FROM                               (update/copy D→H inside a data region)
  update device()          <-> UPDATE TO                                 (update/copy H→D inside a data region)

Loop parallelization
  kernels                  offloading + automatic parallelization of suitable loops, sync at the end of loops; does not exist in OpenMP
  loop gang/worker/vector  share iterations among gangs, workers and vectors
  distribute               share iterations among TEAMS (OpenMP)

Features: OpenACC 2.7 vs OpenMP 5.0 (continued)

Routines
  routine [seq|vector|worker|gang]   compile a device routine with the specified level of parallelism

Asynchronism
  async() / wait    execute kernels/transfers asynchronously, with explicit synchronization (OpenACC)
  nowait / depend   manage asynchronism and task dependencies (OpenMP)

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 70: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddo p r i n t lowast before update s e l f a (42) $ACC update s e l f ( a ( 4 2 4 2 ) ) p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dir_commentf90

Without update

The a array is not initialized on the host beforethe end of the data region

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 71: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - update selfInside a data region you have to use update directive to transfer data It is however impossible touse update inside a parallel region

self (or host)Variables included within self are updated DrarrH

$ACC data copyout ( a )$ACC p a r a l l e l loopdo i =1 1000

a ( i ) = 010 enddo

do j =1 42c a l l random_number ( t e s t )rng = f l o o r ( t e s t lowast100)

15 $ACC p a r a l l e l loop copyin ( rng ) amp$ACCamp copyout ( a )do i =1 1000

a ( i ) = a ( i ) + rngenddo

20 enddop r i n t lowast before update s e l f a (42)$ACC update s e l f ( a ( 4 2 4 2 ) )p r i n t lowast a f t e r update s e l f a (42)$ACC s e r i a l

25 a (42) = 42$ACC end s e r i a lp r i n t lowast before end data a (42)$ACC end datap r i n t lowast a f t e r end data a (42)

exemplesupdate_dirf90

With update

The a array is initialized on the host after theupdate directive

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 55 77

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions: enter data / exit data

Important notes
One benefit of the enter data and exit data directives is that the data transfers can be managed at another place than where the data are used. This is especially useful with C++ constructors and destructors.

    program enterdata
       integer :: s = 10000
       real*8, allocatable, dimension(:) :: vec
       allocate(vec(s))
       !$ACC enter data create(vec(1:s))
       call compute
       !$ACC exit data delete(vec(1:s))
    contains
       subroutine compute
          integer :: i
          !$ACC parallel loop
          do i=1,s
             vec(i) = 1.2
          enddo
       end subroutine
    end program enterdata

exemples/enter_data.f90

Data on device - Global data regions: declare

Important notes
With the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

    Zone        Scope
    Module      the program
    Function    the function
    Subroutine  the subroutine

    module declare
       integer :: rows = 1000
       integer :: cols = 1000
       integer :: generations = 100
       !$ACC declare copyin(rows, cols, generations)
    end module

exemples/declare.f90

Data on device - Time line

[Figure: time line of the host and device copies of a, b, c and d across a data region opened with copyin(a,c) create(b) copyout(d). Entering the region copies a and c H→D and creates b on the device; the kernels then modify the device copies while the host copies keep their old values; an update self(b) refreshes the host copy of b D→H during the region; leaving the region copies d back D→H through copyout(d).]

Data on device - Important notes

• Data transfers between the host and the device are costly. Minimizing these transfers is mandatory to achieve good performance.
• Since it is possible to use data clauses directly on kernels and parallel constructs, if a data region opened with data encloses those parallel regions you should use update to avoid unexpected behaviors. The same applies to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

    ! Update the value of a located on the host with the value on the device
    !$ACC update self(a)
    ! Update the value of b located on the device with the value on the host
    !$ACC update device(b)

exemples/update.f90

Parallelism - Asynchronism

By default, only one execution thread is created and the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers;
• kernel/kernel execution, if the kernels are independent.

Asynchronism is activated by adding the async(execution thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases the argument of async is optional; specifying different numbers creates several execution threads.

[Figure: without async, Kernel 1, the data transfer and Kernel 2 run one after the other; with "Kernel 1 and transfer async", the first kernel overlaps the transfer; with everything async, all three overlap.]

    ! b is initialized on host
    do i=1,s
       b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
       a(i) = 42.
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop async(3)
    do i=1,s
       c(i) = 11.
    enddo

exemples/async_slides.f90

Parallelism - Asynchronism - wait

To get correct results, some asynchronous kernels may need the completion of others. In that case, add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of the wait clause or directive is a list of integers identifying execution threads. For example, wait(1,2) waits until completion of execution threads 1 and 2.

[Figure: Kernel 2 waits for the transfer to complete before starting.]

    ! b is initialized on host
    do i=1,s
       b(i) = i
    enddo
    !$ACC parallel loop async(1)
    do i=1,s
       a(i) = 42.
    enddo
    !$ACC update device(b) async(2)
    !$ACC parallel loop wait(2)
    do i=1,s
       b(i) = b(i) + 11.
    enddo

exemples/async_wait_slides.f90


Debugging - PGI autocompare

• Automatic detection of result differences by comparing GPU and CPU execution.
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, execution is stopped.
• To activate this feature, just add autocompare to the target accelerator option at compilation (-ta=tesla:cc60,autocompare).
• Example: a code that has a race condition.


Optimization - Game of Life

From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

    do g=1,generations
       cells = 0
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                     oworld(r-1,c)                   + oworld(r+1,c)   + &
                     oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC end parallel
       !$ACC parallel
       do r=1,rows
          do c=1,cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 min 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Optimization - Game of Life

Let's share the work among the active gangs.

    do g=1,generations
       cells = 0
       !$ACC parallel loop
       do r=1,rows
          do c=1,cols
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do r=1,rows
          do c=1,cols
             neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                     oworld(r-1,c)                   + oworld(r+1,c)   + &
                     oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do r=1,rows
          do c=1,cols
             cells = cells + world(r,c)
          enddo
       enddo
       !$ACC end parallel loop
       print *, "Cells alive at generation", g, cells
    enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Optimization - Game of Life

Let's optimize the data transfers.

    cells = 0
    !$ACC data copy(oworld, world)
    do g=1,generations
       cells = 0
       !$ACC parallel loop
       do c=1,cols
          do r=1,rows
             oworld(r,c) = world(r,c)
          enddo
       enddo
       !$ACC parallel loop
       do c=1,cols
          do r=1,rows
             neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                     oworld(r-1,c)                   + oworld(r+1,c)   + &
                     oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
             if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
                world(r,c) = 0
             else if (neigh == 3) then
                world(r,c) = 1
             endif
          enddo
       enddo
       !$ACC parallel loop reduction(+:cells)
       do c=1,cols
          do r=1,rows
             cells = cells + world(r,c)
          enddo
       enddo
       print *, "Cells alive at generation", g, cells
    enddo
    !$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.


Features: OpenACC 2.7 | OpenMP 5.0

GPU offloading
• OpenACC PARALLEL: GRWSVS mode, each gang executes the instructions redundantly | OpenMP TEAMS: creates teams of threads that execute redundantly
• OpenACC SERIAL: only 1 thread executes the code | OpenMP TARGET: GPU offloading, initial thread only
• OpenACC KERNELS: offloading + automatic parallelization | OpenMP: does not exist

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables | OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS | OpenMP: clauses on TARGET, data movement defined from the host perspective
• copyin(): copy H→D when entering the construct | map(to:)
• copyout(): copy D→H when leaving the construct | map(from:)
• copy(): copy H→D when entering and D→H when leaving | map(tofrom:)
• create(): created on device, no transfers | map(alloc:)
• present(), delete()

Features: OpenACC 2.7 | OpenMP 5.0

Data region
• declare(): adds data to the global data region
• data / end data: inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
• enter data / exit data: opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
• update self(): update D→H | UPDATE FROM: copy D→H inside a data region
• update device(): update H→D | UPDATE TO: copy H→D inside a data region

Loop parallelization
• kernels: offloading + automatic parallelization of suitable loops, with a sync at the end of the loops | OpenMP: does not exist
• loop gang/worker/vector: shares iterations among gangs, workers and vectors | distribute: shares iterations among TEAMS

Features: OpenACC 2.7 | OpenMP 5.0

Routines
• routine [seq|vector|worker|gang]: compiles a device routine with the specified level of parallelism

Asynchronism
• async()/wait: executes kernels/transfers asynchronously, with explicit synchronization | nowait/depend: manages asynchronism and task dependencies

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

    !$acc parallel
    !$acc loop gang worker vector
    do i=1,nx
       A(i) = 1.14_8*i
    end do
    !$acc end parallel

exemples/loop1D.f90

    !$acc parallel
    !$acc loop gang worker
    do j=1,nx
       !$acc loop vector
       do i=1,nx
          A(i,j) = 1.14_8
       end do
    end do
    !$acc end parallel

exemples/loop2D.f90

OpenMP

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i=1,nx
       A(i) = 1.14_8*i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

exemples/loop1DOMP.f90

    !$OMP TARGET TEAMS DISTRIBUTE
    do j=1,nx
       !$OMP PARALLEL DO SIMD
       do i=1,nx
          A(i,j) = 1.14_8
       end do
       !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

OpenACC

    !$acc parallel
    !$acc loop gang
    do k=1,nx
       !$acc loop worker
       do j=1,nx
          !$acc loop vector
          do i=1,nx
             A(i,j,k) = 1.14_8
          end do
       end do
    end do
    !$acc end parallel

exemples/loop3D.f90

OpenMP

    !$OMP TARGET TEAMS DISTRIBUTE
    do k=1,nx
       !$OMP PARALLEL DO
       do j=1,nx
          !$OMP SIMD
          do i=1,nx
             A(i,j,k) = 1.14_8
          end do
          !$OMP END SIMD
       end do
       !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Page 72: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - update device

Inside a data region you have to use the update directive to transfer data It is however impossibleto use update in a parallel region

deviceThe variables included inside a device clause areupdated HrarrD

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 56 77

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 73: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Global data regions enter data exit data

Important notesOne benefit of using the enter data and exit datadirectives is to manage the data transfer on thedevice at another place than where they areused It is especially useful when using C++constructors and destructors

1program enterdatai n t e g e r s=10000r e a llowast8 a l l o c a t a b l e dimension ( ) veca l l o c a t e ( vec ( s ) )$ACC enter data create ( vec ( 1 s ) )

6 c a l l compute$ACC e x i t data de le te ( vec ( 1 s ) )conta inssubrou t ine compute

i n t e g e r i11 $ACC p a r a l l e l loop

do i =1 svec ( i ) = 12

enddoend subrou t ine

16end program enterdata

exemplesenter_dataf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 57 77

Data on device - Global data regions declare

Important notesIn the case of the declare directive the lifetime ofthe data equals the scope of the code regionwhere it is usedFor example

Zone ScopeModule the programfunction the functionsubroutine the subroutine

module dec larei n t e g e r rows = 1000i n t e g e r co ls = 1000

4 i n t e g e r generat ions = 100$ACC dec lare copyin ( rows cols generat ions )

end module

exemplesdeclaref90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 58 77

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols

exemples/optigol_opti1.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased, since the code is executed sequentially and redundantly by all gangs.


Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.


Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c)                   + oworld(r+1,c)   + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.
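The final structure, one data region around the whole generation loop so the grids stay resident on the device, can be sketched in C (the course code is Fortran; this flattened-array version is ours, it updates interior cells only and the game_of_life name is hypothetical). Without offload support the pragmas are ignored and it runs serially on the host.

```c
/* C sketch of the optimized pattern: world/oworld live on the device for
 * all generations; only the reduction result returns each iteration. */
int game_of_life(int rows, int cols, int gens,
                 unsigned char *world, unsigned char *oworld) {
    int cells = 0;
    #pragma acc data copy(world[0:rows*cols]) create(oworld[0:rows*cols])
    for (int g = 0; g < gens; g++) {
        cells = 0;
        #pragma acc parallel loop
        for (int i = 0; i < rows * cols; i++)   /* snapshot previous state */
            oworld[i] = world[i];

        #pragma acc parallel loop collapse(2)
        for (int r = 1; r < rows - 1; r++)
            for (int c = 1; c < cols - 1; c++) {
                int idx = r * cols + c;
                int neigh = oworld[idx-cols-1] + oworld[idx-1] + oworld[idx+cols-1]
                          + oworld[idx-cols]                   + oworld[idx+cols]
                          + oworld[idx-cols+1] + oworld[idx+1] + oworld[idx+cols+1];
                if (oworld[idx] == 1 && (neigh < 2 || neigh > 3))
                    world[idx] = 0;
                else if (neigh == 3)
                    world[idx] = 1;
            }

        #pragma acc parallel loop reduction(+:cells)
        for (int i = 0; i < rows * cols; i++)
            cells += world[i];
    }
    return cells;
}
```

A classic sanity check is the "blinker" oscillator: a vertical bar of three cells becomes a horizontal bar after one generation, so the population stays at 3.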



Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses, with notes)

Feature: GPU offloading
  OpenACC PARALLEL (GR-WS-VS mode: each gang executes the instructions redundantly)  ->  OpenMP TEAMS (creates teams of threads that execute redundantly)
  OpenACC SERIAL (only 1 thread executes the code)  ->  OpenMP TARGET (GPU offloading, initial thread only)
  OpenACC KERNELS (offloading + automatic parallelization)  ->  no OpenMP equivalent

Feature: implicit transfers H→D or D→H
  OpenACC: scalar variables, non-scalar variables  ->  OpenMP: scalar variables, non-scalar variables

Feature: explicit transfers H→D or D→H inside explicit parallel regions
  OpenACC: clauses on PARALLEL/SERIAL/KERNELS  ->  OpenMP: clauses on TARGET (data movement defined from the host perspective)
  copyin() (copy H→D when entering the construct)  ->  map(to:)
  copyout() (copy D→H when leaving the construct)  ->  map(from:)
  copy() (copy H→D when entering and D→H when leaving)  ->  map(tofrom:)
  create() (allocated on device, no transfers)  ->  map(alloc:)
  present(), delete()
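The copyin/copyout/copy to map(to/from/tofrom) correspondence can be made concrete with a small SAXPY kernel written twice. This C sketch is ours (the course examples are Fortran); x[0:n] is the array-section syntax both standards use. Without offload support both pragmas are ignored and the loops run identically on the host.

```c
/* Same computation, two directive spellings: OpenACC data clauses on the
 * left, the equivalent OpenMP map() clauses on the right. */
void saxpy_acc(int n, float alpha, const float *x, float *y) {
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}

void saxpy_omp(int n, float alpha, const float *x, float *y) {
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n]) map(tofrom: y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = alpha * x[i] + y[i];
}
```

Note the directionality: x is only read by the kernel (copyin / map(to:)), while y is read and written (copy / map(tofrom:)).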


Feature: data regions
  OpenACC declare() (adds data to the global data region)
  OpenACC data / end data (inside the same programming unit)  ->  OpenMP TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
  OpenACC enter data / exit data (opening and closing anywhere in the code)  ->  OpenMP TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
  OpenACC update self() (update D→H)  ->  OpenMP UPDATE FROM (copy D→H inside a data region)
  OpenACC update device() (update H→D)  ->  OpenMP UPDATE TO (copy H→D inside a data region)

Feature: loop parallelization
  OpenACC kernels (offloading + automatic parallelization of suitable loops, sync at the end of loops)  ->  no OpenMP equivalent
  OpenACC loop gang/worker/vector (share iterations among gangs, workers and vectors)  ->  OpenMP distribute (share iterations among TEAMS)


Feature: routines
  OpenACC routine [seq|vector|worker|gang] (compile a device routine with the specified level of parallelism)

Feature: asynchronism
  OpenACC async() / wait (execute kernels and transfers asynchronously, explicit sync)  ->  OpenMP nowait / depend (manage asynchronism and task dependencies)
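The "routines" row says that any function called from offloaded code must itself be compiled for the device, with a declared level of parallelism. A minimal C sketch (ours, not from the course; seq means each device thread calls the function sequentially):

```c
/* A function callable from device code must carry "acc routine" so the
 * compiler also generates a device version of it. */
#pragma acc routine seq
static float squared(float x) { return x * x; }

void squares(int n, float *a) {
    #pragma acc parallel loop   /* each iteration calls the device routine */
    for (int i = 0; i < n; i++)
        a[i] = squared((float)i);
}
```

Without the routine directive, an OpenACC compiler would reject (or fail to inline) the call inside the offloaded loop; without offload support the pragmas are ignored and the code runs on the host.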


Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90


OpenACC

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90



Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr


• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Data on device - Global data regions: declare

Important notes
In the case of the declare directive, the lifetime of the data equals the scope of the code region where it is used. For example:

Zone        Scope
Module      the program
Function    the function
Subroutine  the subroutine

module declare
   integer :: rows = 1000
   integer :: cols = 1000
   integer :: generations = 100
   !$ACC declare copyin(rows, cols, generations)
end module

exemples/declare.f90


Data on device - Time line

(Figure: timeline of the host and device copies of a, b, c, d across a data region opened with copyin(a,c) create(b) copyout(d). On entry, a and c are copied H→D; b and d are only allocated on the device, without transfer. Kernels then modify the device copies, so host and device values diverge. update self(b) copies the device value of b back to the host inside the region. On exit, copyout(d) copies d D→H; the other device copies are released without any transfer.)


Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable to a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a located on the host with the value on the device
!$ACC update self(a)
! Update the value of b located on the device with the value on the host
!$ACC update device(b)

exemples/update.f90
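A C sketch of the same idea (ours, not the course's Fortran file): inside a structured data region, update self refreshes the stale host copy after a kernel, and update device pushes a host-side modification back before the next kernel. With the pragmas ignored on a host-only build, the three steps simply run in order.

```c
/* Host/device synchronization with the update directive inside one
 * structured data region: double on device, add one on host, square
 * on device, then fetch the final values. */
void update_demo(int n, float *a) {
    #pragma acc data copyin(a[0:n])
    {
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            a[i] = 2.0f * a[i];

        #pragma acc update self(a[0:n])    /* host copy is stale: fetch D->H */

        for (int i = 0; i < n; i++)        /* host-side modification */
            a[i] += 1.0f;

        #pragma acc update device(a[0:n])  /* push H->D before next kernel */

        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            a[i] = a[i] * a[i];

        #pragma acc update self(a[0:n])    /* copyin has no copy-back: fetch
                                              the final values explicitly */
    }
}
```

Note the last update self: because the region was opened with copyin, leaving it would otherwise discard the device values without copying them back.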


Parallelism - Asynchronism

By default, only one execution thread is created: the kernels are executed synchronously, i.e. one after the other. The accelerator is, however, able to manage several execution threads running concurrently. To get better performance, it is recommended to maximize the overlap between:
• computation and data transfers
• kernels, if they are independent

Asynchronism is activated by adding the async(execution-thread number) clause to one of these directives: parallel, kernels, serial, enter data, exit data, update and wait. In all cases, the argument of async is optional. It is possible to specify a number inside the clause to create several execution threads.

(Figure: three timelines. Fully synchronous: Kernel 1, then the data transfer, then Kernel 2. "Kernel 1 and transfer async": Kernel 1 and the data transfer overlap. "Everything is async": Kernel 1, the data transfer and Kernel 2 all overlap.)

! b is initialized on host
do i=1,s
   b(i) = i
enddo
!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo
!$ACC update device(b) async(2)
!$ACC parallel loop async(3)
do i=1,s
   c(i) = 1
enddo

exemples/async_slides.f90


Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 75: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Data on device - Time line

Host

Device

Data transfer to the device Update data DrarrH Exit data region

a=0b=0c=0d=0

a=10b=-4c=42d=0

a=10b=c=42d=

copyin(ac) create(b) copyout(d)

a=10b=26c=42

d=158

a=10b=26c=42d=0

update self(b)

a=42b=42c=42

d=158

a=10b=26c=42

d=158

copyout(d)

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 59 77

Data on device - Important notes

bull Data transfers between the host and the device are costly It is mandatory to minimize thesetransfers to achieve good performance

bull Since it is possible to use data clauses within kernels and parallel if a data region is openedwith data and it encloses the parallel regions you should use update to avoid unexpectedbehaviorsThis is equally applicable for a data region opened with enter data

Directive updateThe update directive is used to update data either on the host or on the device

Update the value o f a loca ted on host w i th the value on device$ACC update s e l f ( a )

4 Update the value o f a loca ted on device wi th the value on host$ACC update device ( b )

exemplesupdatef90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 60 77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
    • Serial
    • Kernels/Teams
    • Parallel
  • Work sharing
    • Parallel loop
    • Reductions
    • Coalescing
  • Routines
  • Data Management
  • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Data on device - Important notes

• Data transfers between the host and the device are costly. It is mandatory to minimize these transfers to achieve good performance.
• Since it is possible to use data clauses within kernels and parallel, if a data region is opened with data and it encloses the parallel regions, you should use update to avoid unexpected behaviors. This is equally applicable for a data region opened with enter data.

Directive update
The update directive is used to update data either on the host or on the device.

! Update the value of a, located on the host, with the value on the device
!$ACC update self(a)
! Update the value of b, located on the device, with the value on the host
!$ACC update device(b)

(exemples/update.f90)

Pierre-François Lavallée, Thibaut Véry 60/77

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results, it is likely that some asynchronous kernels need the completion of others. Then you have to add a wait(execution thread number) clause to the directive that needs to wait. A standalone wait directive is also available.

Important notes
• The argument of wait or wait() is a list of integers representing execution thread numbers. For example, wait(1,2) says to wait until completion of execution threads 1 and 2.

[Timeline figure: kernel 2 waits for the data transfer to complete]

! b is initialized on host
do i=1,s
   b(i) = i
enddo

!$ACC parallel loop async(1)
do i=1,s
   a(i) = 42
enddo

!$ACC update device(b) async(2)

!$ACC parallel loop wait(2)
do i=1,s
   b(i) = b(i) + 1
enddo

(exemples/async_wait_slides.f90)

Pierre-François Lavallée, Thibaut Véry 62/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results are different, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation flags (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 66/77

Optimization - Game of Life
From now on, we are going to optimize Conway's Game of Life. Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )+oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ...

(exemples/opti/gol_opti1.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34m27s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life
Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )+oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo

(exemples/opti/gol_opti2.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 9.5 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life
Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1)+oworld(r,c-1)+oworld(r+1,c-1)+ &
                 oworld(r-1,c  )+oworld(r+1,c  )+ &
                 oworld(r-1,c+1)+oworld(r,c+1)+oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation", g, cells
enddo
!$ACC end data

(exemples/opti/gol_opti3.f90)

Info
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 70/77

Features | OpenACC 2.7 | OpenMP 5.0

GPU offloading:
- PARALLEL: GR/WS/VS mode; each gang executes instructions redundantly ↔ TEAMS: creates teams of threads that execute redundantly
- SERIAL: only 1 thread executes the code ↔ TARGET: GPU offloading, initial thread only
- KERNELS: offloading + automatic parallelization ↔ X: does not exist in OpenMP

Implicit transfers H→D or D→H:
- scalar variables, non-scalar variables ↔ scalar variables, non-scalar variables

Explicit transfers H→D or D→H inside explicit parallel regions:
- clauses on PARALLEL/SERIAL/KERNELS ↔ clause on TARGET (data movement defined from the host perspective)
- copyin(): copy H→D when entering the construct ↔ map(to:)
- copyout(): copy D→H when leaving the construct ↔ map(from:)
- copy(): copy H→D when entering and D→H when leaving ↔ map(tofrom:)
- create(): created on device, no transfers ↔ map(alloc:)
- present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Features | OpenACC 2.7 | OpenMP 5.0

Data region:
- declare(): add data to the global data region
- data / end data: inside the same programming unit ↔ TARGET DATA MAP(tofrom:) / END TARGET DATA: inside the same programming unit
- enter data / exit data: opening and closing anywhere in the code ↔ TARGET ENTER DATA / TARGET EXIT DATA: opening and closing anywhere in the code
- update self(): update D→H ↔ UPDATE FROM: copy D→H inside a data region
- update device(): update H→D ↔ UPDATE TO: copy H→D inside a data region

Loop parallelization:
- kernels: offloading + automatic parallelization of suitable loops, sync at the end of loops ↔ X: does not exist in OpenMP
- loop gang/worker/vector: share iterations among gangs, workers and vectors ↔ distribute: share iterations among TEAMS

Pierre-François Lavallée, Thibaut Véry 72/77

Features | OpenACC 2.7 | OpenMP 5.0

Routines:
- routine [seq|vector|worker|gang]: compile a device routine with the specified level of parallelism

Asynchronism:
- async() / wait: execute kernels/transfers asynchronously, with explicit synchronization ↔ nowait / depend: manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 77: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Asynchronism

By default only one execution thread is createdThe kernels are executed synchronously ieone after the other The accelerator is able tomanage several execution threads runningconcurrentlyTo get better performance it is recommended tomaximize the overlaps betweenbull Computation and data transfersbull kernelkernel if they are independant

Asynchronism is activated by adding theasync(execution thread number) clause to one ofthese directives parallel kernels serial enterdata exit data update and waitIn all cases async is optionalIt is possible to specify a number inside theclause to create several execution threads

Kernel 1

Data transfer

Kernel 2

Kernel 1

Data transfer

Kernel 2

Kernel 1 and transfer async

Kernel 1

Data transfer

Kernel 2

Everything isasync

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop async ( 3 )do i =1 s

c ( i ) = 114 enddo

exemplesasync_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 61 77

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 78: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Parallelism - Asynchronism - wait

To have correct results it is likely that someasynchronous kernels need the completion ofother Then you have to add a wait(executionthread number) to the directive that needs towaitA wait is also available

Important notesbull The argument of wait or wait is a list of

integers representing the execution threadnumber For example wait(12) says to waituntil completion of execution threads 1 and2

Kernel 1

Data transfer

Kernel 2

Kernel 2 waitsfor transfer

b i s i n i t i a l i z e d on hostdo i =1 s

b ( i ) = i4 enddo

$ACC p a r a l l e l loop async ( 1 )do i =1 s

a ( i ) = 42enddo

9 $ACC update device ( b ) async ( 2 )

$ACC p a r a l l e l loop wa i t ( 2 )do i =1 s

b ( i ) = b ( i ) + 114 enddo

exemplesasync_wait_slidesf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 62 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 63 77

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA


1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex: Optimization example

5 Annex: OpenACC 2.7 / OpenMP 5.0 translation table

6 Contacts

Pierre-François Lavallée, Thibaut Véry 63/77

Debugging - PGI autocompare

• Automatic detection of result differences when comparing GPU and CPU execution
• How it works: the kernels are run on both the GPU and the CPU; if the results differ, the execution is stopped

Pierre-François Lavallée, Thibaut Véry 64/77

Debugging - PGI autocompare

• To activate this feature, just add autocompare to the compilation options (-ta=tesla:cc60,autocompare)
• Example of code that has a race condition:

Pierre-François Lavallée, Thibaut Véry 65/77


Optimization - Game of Life

From now on we are going to optimize Conway's Game of Life.
Our starting point is the version using the parallel directive.

do g=1, generations
   cells = 0
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC end parallel
   !$ACC parallel
   do r=1, rows
      do c=1, cols
         ...

exemples/optigol_opti1.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 34 m 27 s

The time to solution is increased since the code is executed sequentially and redundantly by all gangs.

Pierre-François Lavallée, Thibaut Véry 67/77

Optimization - Game of Life

Let's share the work among the active gangs.

do g=1, generations
   cells = 0
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do r=1, rows
      do c=1, cols
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do r=1, rows
      do c=1, cols
         cells = cells + world(r,c)
      enddo
   enddo
   !$ACC end parallel loop
   print *, "Cells alive at generation ", g, cells
enddo

exemples/optigol_opti2.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 95 s

Loops are distributed among gangs.

Pierre-François Lavallée, Thibaut Véry 68/77

Optimization - Game of Life

Let's optimize data transfers.

cells = 0
!$ACC data copy(oworld, world)
do g=1, generations
   cells = 0
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         oworld(r,c) = world(r,c)
      enddo
   enddo
   !$ACC parallel loop
   do c=1, cols
      do r=1, rows
         neigh = oworld(r-1,c-1) + oworld(r,c-1) + oworld(r+1,c-1) + &
                 oworld(r-1,c  )                 + oworld(r+1,c  ) + &
                 oworld(r-1,c+1) + oworld(r,c+1) + oworld(r+1,c+1)
         if (oworld(r,c) == 1 .and. (neigh < 2 .or. neigh > 3)) then
            world(r,c) = 0
         else if (neigh == 3) then
            world(r,c) = 1
         end if
      enddo
   enddo
   !$ACC parallel loop reduction(+:cells)
   do c=1, cols
      do r=1, rows
         cells = cells + world(r,c)
      enddo
   enddo
   print *, "Cells alive at generation ", g, cells
enddo
!$ACC end data

exemples/optigol_opti3.f90

Info:
• Size: 10000x5000
• Generations: 100
• Serial time: 26 s
• Time: 5 s

Transfers H→D and D→H have been optimized.

Pierre-François Lavallée, Thibaut Véry 69/77


Features: OpenACC 2.7 vs OpenMP 5.0 (directives/clauses and notes)

GPU offloading
• OpenACC: PARALLEL (GR/WS/VS mode: each gang executes the instructions redundantly)
• OpenMP: TEAMS (creates teams of threads that execute redundantly)

• OpenACC: SERIAL (only 1 thread executes the code)
• OpenMP: TARGET (GPU offloading, initial thread only)

• OpenACC: KERNELS (offloading + automatic parallelization)
• OpenMP: X (does not exist in OpenMP)

Implicit transfer H→D or D→H
• OpenACC: scalar variables, non-scalar variables
• OpenMP: scalar variables, non-scalar variables

Explicit transfer H→D or D→H inside explicit parallel regions
• OpenACC: clauses on PARALLEL/SERIAL/KERNELS
• OpenMP: clauses on TARGET (data movement defined from the host perspective)

• OpenACC: copyin(), copy H→D when entering the construct; OpenMP: map(to:)
• OpenACC: copyout(), copy D→H when leaving the construct; OpenMP: map(from:)
• OpenACC: copy(), copy H→D when entering and D→H when leaving; OpenMP: map(tofrom:)
• OpenACC: create(), created on the device, no transfers; OpenMP: map(alloc:)
• OpenACC: present(), delete()

Pierre-François Lavallée, Thibaut Véry 71/77

Data region
• OpenACC: declare(), add data to the global data region
• OpenACC: data / end data (inside the same programming unit); OpenMP: TARGET DATA MAP(tofrom:) / END TARGET DATA (inside the same programming unit)
• OpenACC: enter data / exit data (opening and closing anywhere in the code); OpenMP: TARGET ENTER DATA / TARGET EXIT DATA (opening and closing anywhere in the code)
• OpenACC: update self(), update D→H; OpenMP: UPDATE FROM, copy D→H inside a data region
• OpenACC: update device(), update H→D; OpenMP: UPDATE TO, copy H→D inside a data region

Loop parallelization
• OpenACC: kernels (offloading + automatic parallelization of suitable loops, synchronization at the end of the loops); OpenMP: X (does not exist in OpenMP)
• OpenACC: loop gang/worker/vector (share iterations among gangs, workers and vectors); OpenMP: distribute (share iterations among TEAMS)

Pierre-François Lavallée, Thibaut Véry 72/77

Routines
• OpenACC: routine [seq|vector|worker|gang], compile a device routine with the specified level of parallelism

Asynchronism
• OpenACC: async() / wait, execute kernels and transfers asynchronously, with explicit synchronization
• OpenMP: nowait / depend, manages asynchronism and task dependencies

Pierre-François Lavallée, Thibaut Véry 73/77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

!$acc parallel
!$acc loop gang worker vector
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$acc end parallel

exemples/loop1D.f90

!$acc parallel
!$acc loop gang worker
do j=1, nx
   !$acc loop vector
   do i=1, nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

exemples/loop2D.f90

OpenMP

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1, nx
   A(i) = 1.14_8 * i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

exemples/loop1DOMP.f90

!$OMP TARGET TEAMS DISTRIBUTE
do j=1, nx
   !$OMP PARALLEL DO SIMD
   do i=1, nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop2DOMP.f90

Pierre-François Lavallée, Thibaut Véry 74/77

OpenACC

!$acc parallel
!$acc loop gang
do k=1, nx
   !$acc loop worker
   do j=1, nx
      !$acc loop vector
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

exemples/loop3D.f90

OpenMP

!$OMP TARGET TEAMS DISTRIBUTE
do k=1, nx
   !$OMP PARALLEL DO
   do j=1, nx
      !$OMP SIMD
      do i=1, nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

exemples/loop3DOMP.f90

Pierre-François Lavallée, Thibaut Véry 75/77


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 80: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Debugging - PGI autocompare

bull Automatic detection of result differences when comparing GPU and CPU executionbull How it works The kernels are run on GPU and CPU if the results are different the execution

is stopped

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 64 77

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 81: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Debugging - PGI autocompare

bull To activate this feature just add autocompare to the compilation(-ta=teslacc60autocompare)

bull Example of code that has a race condition

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 65 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 82: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 66 77

Optimization - Game of LifeFrom now on we are going to optimize Conwayrsquos Game of LifeOur starting point is the version using the parallel directive

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e ldo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC end p a r a l l e l

40 $ACC p a r a l l e ldo r =1 rows

do c=1 co lsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

oworld ( rminus1c ) +oworld ( r +1 c)+amp45 oworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

50 end i fenddo

enddo$ACC end p a r a l l e l$ACC p a r a l l e l

55do r =1 rowsdo c=1 co ls

exemplesoptigol_opti1f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 34m27 s

The time to solution is increased since the codeis executed sequentially and redundantly by allgangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 67 77

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

| Feature | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
|---|---|---|---|---|
| GPU offloading | PARALLEL | GR/WS/VS mode (gang-redundant, worker-single, vector-single): each gang executes the instructions redundantly | TEAMS | creates teams of threads; they execute redundantly |
| | SERIAL | only 1 thread executes the code | TARGET | GPU offloading, initial thread only |
| | KERNELS | offloading + automatic parallelization | X | does not exist in OpenMP |
| Implicit transfer H→D or D→H | | scalar variables, non-scalar variables | | scalar variables, non-scalar variables |
| Explicit transfer H→D or D→H inside explicit parallel regions | clauses on PARALLEL, SERIAL, KERNELS | | clauses on TARGET | data movement defined from the host perspective |
| | copyin() | copy H→D when entering the construct | map(to:) | |
| | copyout() | copy D→H when leaving the construct | map(from:) | |
| | copy() | copy H→D when entering and D→H when leaving | map(tofrom:) | |
| | create() | created on device, no transfers | map(alloc:) | |
| | present(), delete() | | | |

| Feature | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
|---|---|---|---|---|
| Data region | declare() | add data to the global data region | | |
| | data / end data | inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA | inside the same programming unit |
| | enter data / exit data | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code |
| | update self() | update D→H | UPDATE FROM | copy D→H inside a data region |
| | update device() | update H→D | UPDATE TO | copy H→D inside a data region |
| Loop parallelization | kernels | offloading + automatic parallelization of suitable loops; sync at the end of loops | X | does not exist in OpenMP |
| | loop gang/worker/vector | share iterations among gangs, workers and vectors | distribute | share iterations among TEAMS |

| Feature | OpenACC 2.7 | Notes | OpenMP 5.0 | Notes |
|---|---|---|---|---|
| Routines | routine [seq / vector / worker / gang] | compile a device routine with the specified level of parallelism | | |
| Asynchronism | async() / wait | execute kernels and transfers asynchronously; explicit synchronization | nowait / depend | manage asynchronism and task dependencies |

Some examples of loop nests parallelized with OpenACC and OpenMP.

OpenACC:

!$acc parallel
!$acc loop gang worker vector
do i=1,nx
   A(i) = 1.14_8*i
end do
!$acc end parallel

(exemples/loop1D.f90)

!$acc parallel
!$acc loop gang worker
do j=1,nx
   !$acc loop vector
   do i=1,nx
      A(i,j) = 1.14_8
   end do
end do
!$acc end parallel

(exemples/loop2D.f90)

OpenMP:

!$OMP TARGET
!$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
do i=1,nx
   A(i) = 1.14_8*i
end do
!$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
!$OMP END TARGET

(exemples/loop1DOMP.f90)

!$OMP TARGET TEAMS DISTRIBUTE
do j=1,nx
   !$OMP PARALLEL DO SIMD
   do i=1,nx
      A(i,j) = 1.14_8
   end do
   !$OMP END PARALLEL DO SIMD
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop2DOMP.f90)

OpenACC:

!$acc parallel
!$acc loop gang
do k=1,nx
   !$acc loop worker
   do j=1,nx
      !$acc loop vector
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
   end do
end do
!$acc end parallel

(exemples/loop3D.f90)

OpenMP:

!$OMP TARGET TEAMS DISTRIBUTE
do k=1,nx
   !$OMP PARALLEL DO
   do j=1,nx
      !$OMP SIMD
      do i=1,nx
         A(i,j,k) = 1.14_8
      end do
      !$OMP END SIMD
   end do
   !$OMP END PARALLEL DO
end do
!$OMP END TARGET TEAMS DISTRIBUTE

(exemples/loop3DOMP.f90)


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

• Introduction
  • Definitions
  • Short history
  • Architecture
  • Execution models
  • Porting strategy
  • PGI compiler
  • Profiling the code
• Parallelism
  • Offloaded Execution
    • Add directives
    • Compute constructs
      • Serial
      • Kernels/Teams
      • Parallel
    • Work sharing
      • Parallel loop
      • Reductions
      • Coalescing
    • Routines
    • Data Management
    • Asynchronism
• GPU Debugging
• Annex: Optimization example
• Annex: OpenACC 2.7 / OpenMP 5.0 translation table
• Contacts

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 84: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Optimization - Game of LifeLets share the work among the active gangs

do g=1 generat ionsc e l l s = 0$ACC p a r a l l e l loopdo r =1 rows

35 do c=1 co lsoworld ( r c ) = wor ld ( r c )

enddoenddo$ACC p a r a l l e l loop

40do r =1 rowsdo c=1 co ls

neigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+ampoworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

45 i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) thenworld ( r c ) = 1

end i f50 enddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )do r =1 rows

do c=1 co ls55 c e l l s = c e l l s + wor ld ( r c )

enddoenddo$ACC end p a r a l l e lp r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

60enddo

exemplesoptigol_opti2f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 95 s

Loops are distributed among gangs

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 68 77

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 85: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

Optimization - Game of LifeLets optimize data transfers

c e l l s = 0$ACC data copy ( oworld wor ld )do g=1 generat ionsc e l l s =0

5 $ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsoworld ( r c ) = wor ld ( r c )

enddo10enddo

$ACC p a r a l l e l loopdo c=1 co ls

do r =1 rowsneigh = oworld ( rminus1cminus1)+oworld ( r cminus1)+oworld ( r +1 cminus1)+amp

15 oworld ( rminus1c ) +oworld ( r +1 c)+ampoworld ( rminus1c+1)+ oworld ( r c+1)+ oworld ( r +1 c+1)

i f ( oworld ( r c ) == 1 and ( neigh lt2 or neigh gt3) ) thenworld ( r c ) = 0

else i f ( neigh == 3) then20 world ( r c ) = 1

end i fenddo

enddo$ACC p a r a l l e l loop reduc t ion ( + c e l l s )

25 do c=1 co lsdo r =1 rows

c e l l s = c e l l s + wor ld ( r c )enddo

enddo30 p r i n t lowast Ce l l s a l i v e a t generat ion g c e l l s

enddo$ACC end data

exemplesoptigol_opti3f90

Infobull Size 10000x5000bull Generations 100bull Serial Time 26 sbull Time 5 s

Transfers HrarrD and DrarrH have been optimized

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 69 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 86: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 70 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

GPU offloadingPARALLEL GRWSVS mode

- each team exe-cutes instructionsredundantly

TEAMS create teams ofthreads executeredundantly

SERIAL Only 1 thread exe-cutes the code

TARGET GPU offloadinginitial thread only

KERNELS Offloading + au-tomatic paralleliza-tion

X Does not exist inOpenMP

Implicit transferHrarrD or DrarrH

scalar variables non-scalar vari-ables

scalar variablesnon-scalar vari-ables

Explicit transferHrarrD od DrarrHinside explicitparallel regions

Clauses PARAL-LELSERIALKER-NEL

Clause TARGET Data movementdefined from Hostperspective

copyin() copy HrarrD whenentering construct

map(to)

copyout() copy DrarrH whenleaving construct

map(from)

copy() copy HrarrD whenentering and DrarrHwhen leaving

map(tofrom)

create() created on device no transfers

map(alloc)

present()delete()

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 71 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

Data region

declare() Add data to theglobal data region

dataend data inside the sameprogramming unit

TARGET DATAMAP(tofrom)ENDTARGET DATA

inside the sameprogramming unit

enter dataexit data opening and clos-ing anywhere in thecode

TARGET ENTERDATATARGETEXIT DATA

opening and clos-ing anywhere in thecode

update self () update DrarrH UPDATE FROM Copy DrarrH insidea data region

update device () update HrarrD UPDATE TO Copy HrarrD insidea data region

Loopparallelization

kernel offloading + auto-matic paralleliza-tion of suitableloops sync at theend of loops

X Does not exist inOpenMP

loop gangwork-ervector

Share iterationsamong gangsworkers and vec-tors

distribute Share iterationsamong TEAMS

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 72 77

Features OpenACC 27 OpenMP 50DirectivesClauses Notes DirectivesClauses Notes

routines routine [seqvec-torworkergang]

compile a deviceroutine with speci-fied level of paral-lelism

asynchronism async()wait execute kernel-transfers asyn-chronously andexplicit sync

nowaitdepend manages asyn-chronism et taskdependancies

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 73 77

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC

$acc p a r a l l e l10 $acc loop gang worker vec to r

do i =1 nxA( i )=114_8lowast i

end do $acc end p a r a l l e l

exemplesloop1Df90

$acc p a r a l l e l10 $acc loop gang worker

do j =1 nx $acc loop vec to rdo i =1 nx

A( i j )=114_815 end do

end do $acc end p a r a l l e l

exemplesloop2Df90

OpenMP

$OMP TARGET10 $OMP TEAMS DISTRIBUTE PARALLEL FOR SIMD

do i =1 nxA( i )=114_8lowast i

end do$OMP END TEAMS DISTRIBUTE PARALLEL FOR

15 $OMP END TARGET

exemplesloop1DOMPf90

$OMP TARGET TEAMS DISTRIBUTE10 do j =1 nx

$OMP PARALLEL DO SIMDdo i =1 nx

A( i j )=114_8end do

15 $OMP END PARALLEL DO SIMDend do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop2DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 74 77

OpenACC

$acc p a r a l l e l10 $acc loop gang

do k=1 nx $acc loop workerdo j =1 nx

$acc loop vec to r15 do i =1 nx

A( i j )=114_8end do

end doend do

20 $acc end p a r a l l e l

exemplesloop3Df90

OpenMP

$OMP TARGET TEAMS DISTRIBUTE10 do k=1 nx

$OMP PARALLEL DOdo j =1 nx

$OMP SIMDdo i =1 nx

15 A( i j k )=114_8end do$OMP END SIMD

end do$OMP END PARALLEL DO

20 end do$OMP END TARGET TEAMS DISTRIBUTE

exemplesloop3DOMPf90

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 75 77

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 76 77

Contacts

Contacts

bull Pierre-Franccedilois Lavalleacutee pierre-francoislavalleeidrisfrbull Thibaut Veacutery thibautveryidrisfrbull Reacutemi Lacroix remilacroixidrisfr

Pierre-Franccedilois Lavalleacutee Thibaut Veacutery 77 77

  • Introduction
    • Definitions
    • Short history
    • Architecture
    • Execution models
    • Porting strategy
    • PGI compiler
    • Profiling the code
      • Parallelism
        • Offloaded Execution
          • Add directives
          • Compute constructs
          • Serial
          • KernelsTeams
          • Parallel
            • Work sharing
              • Parallel loop
              • Reductions
              • Coalescing
                • Routines
                  • Routines
                    • Data Management
                    • Asynchronism
                      • GPU Debugging
                      • Annex Optimization example
                      • Annex OpenACC 27OpenMP 50 translation table
                      • Contacts
Page 87: Introduction to OpenACC and OpenMP GPU - IDRIS1 Introduction Definitions Short history Architecture Execution models Porting strategy PGI compiler Profiling the code 2 Parallelism

| Feature | OpenACC 2.7 directive/clause | Notes | OpenMP 5.0 directive/clause | Notes |
|---|---|---|---|---|
| GPU offloading | PARALLEL | GR/WS/VS mode; each gang executes the instructions redundantly | TEAMS | creates teams of threads that execute redundantly |
| | SERIAL | only 1 thread executes the code | TARGET | GPU offloading, initial thread only |
| | KERNELS | offloading + automatic parallelization | X | does not exist in OpenMP |
| Implicit transfers H→D or D→H | | scalar variables, non-scalar variables | | scalar variables, non-scalar variables |
| Explicit transfers H→D or D→H inside explicit parallel regions | clauses of PARALLEL / SERIAL / KERNELS | | clauses of TARGET | data movement defined from the host perspective |
| | copyin() | copy H→D when entering the construct | map(to:) | |
| | copyout() | copy D→H when leaving the construct | map(from:) | |
| | copy() | copy H→D when entering and D→H when leaving | map(tofrom:) | |
| | create() | created on the device, no transfers | map(alloc:) | |
| | present() / delete() | | | |

| Feature | OpenACC 2.7 directive/clause | Notes | OpenMP 5.0 directive/clause | Notes |
|---|---|---|---|---|
| Data region | declare() | adds data to the global data region | | |
| | data / end data | inside the same programming unit | TARGET DATA MAP(tofrom:) / END TARGET DATA | inside the same programming unit |
| | enter data / exit data | opening and closing anywhere in the code | TARGET ENTER DATA / TARGET EXIT DATA | opening and closing anywhere in the code |
| | update self() | update D→H | UPDATE FROM | copy D→H inside a data region |
| | update device() | update H→D | UPDATE TO | copy H→D inside a data region |

| Feature | OpenACC 2.7 directive/clause | Notes | OpenMP 5.0 directive/clause | Notes |
|---|---|---|---|---|
| Loop parallelization | kernels | offloading + automatic parallelization of suitable loops; synchronization at the end of the loops | X | does not exist in OpenMP |
| | loop gang/worker/vector | shares the iterations among gangs, workers and vectors | distribute | shares the iterations among TEAMS |

| Feature | OpenACC 2.7 directive/clause | Notes | OpenMP 5.0 directive/clause | Notes |
|---|---|---|---|---|
| Routines | routine [seq / vector / worker / gang] | compiles a device routine with the specified level of parallelism | | |
| Asynchronism | async() / wait | executes kernels/transfers asynchronously, with explicit synchronization | nowait / depend | manages asynchronism and task dependencies |

Some examples of loop nests parallelized with OpenACC and OpenMP

OpenACC — exemples/loop1D.f90:

    !$acc parallel
    !$acc loop gang worker vector
    do i = 1, nx
       A(i) = 114_8 * i
    end do
    !$acc end parallel

OpenACC — exemples/loop2D.f90:

    !$acc parallel
    !$acc loop gang worker
    do j = 1, nx
       !$acc loop vector
       do i = 1, nx
          A(i,j) = 114_8
       end do
    end do
    !$acc end parallel

OpenMP — exemples/loop1DOMP.f90:

    !$OMP TARGET
    !$OMP TEAMS DISTRIBUTE PARALLEL DO SIMD
    do i = 1, nx
       A(i) = 114_8 * i
    end do
    !$OMP END TEAMS DISTRIBUTE PARALLEL DO SIMD
    !$OMP END TARGET

OpenMP — exemples/loop2DOMP.f90:

    !$OMP TARGET TEAMS DISTRIBUTE
    do j = 1, nx
       !$OMP PARALLEL DO SIMD
       do i = 1, nx
          A(i,j) = 114_8
       end do
       !$OMP END PARALLEL DO SIMD
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

OpenACC — exemples/loop3D.f90:

    !$acc parallel
    !$acc loop gang
    do k = 1, nx
       !$acc loop worker
       do j = 1, nx
          !$acc loop vector
          do i = 1, nx
             A(i,j,k) = 114_8
          end do
       end do
    end do
    !$acc end parallel

OpenMP — exemples/loop3DOMP.f90:

    !$OMP TARGET TEAMS DISTRIBUTE
    do k = 1, nx
       !$OMP PARALLEL DO
       do j = 1, nx
          !$OMP SIMD
          do i = 1, nx
             A(i,j,k) = 114_8
          end do
          !$OMP END SIMD
       end do
       !$OMP END PARALLEL DO
    end do
    !$OMP END TARGET TEAMS DISTRIBUTE

1 Introduction

2 Parallelism

3 GPU Debugging

4 Annex Optimization example

5 Annex OpenACC 27OpenMP 50 translation table

6 Contacts


Contacts

• Pierre-François Lavallée: pierre-francois.lavallee@idris.fr
• Thibaut Véry: thibaut.very@idris.fr
• Rémi Lacroix: remi.lacroix@idris.fr

Pierre-François Lavallée, Thibaut Véry 77/77
