GPU Hackathon 2017- OpenACC
Programming GPUs with OpenACC
Saber Feki, Computational Scientist Lead
Supercomputing Core Laboratory, KAUST
GPU architecture
CPU-GPU memory model
PCIe interconnect x16: 8 GB/s (gen 2) and 15.75 GB/s (gen 3), very thin pipe!
Kepler K40: 2,880 CUDA cores, 1.48 Tflop/s
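To put the "thin pipe" in numbers, here is a minimal sketch that estimates how long a host-to-GPU copy takes at the bandwidths quoted above. The function name is hypothetical, and the model assumes the link sustains its peak rate with no latency or protocol overhead, so real transfers will be slower.

```c
/* Rough PCIe transfer-time estimate (latency and protocol overhead ignored).
   bytes    : payload size in bytes
   gb_per_s : sustained link bandwidth in GB/s, with 1 GB = 1e9 bytes */
double transfer_seconds(double bytes, double gb_per_s)
{
    return bytes / (gb_per_s * 1.0e9);
}
```

For example, an array of 100 million doubles (800 MB) needs about 0.1 s at 8 GB/s and about 0.051 s at 15.75 GB/s, each way, every time it crosses the bus.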
GPU programming
OpenACC, the standard
• By NVIDIA, CRAY, PGI and CAPS
• The standard was announced in Nov 2011 at the SC11 conference
• http://www.openacc-standard.org
• OpenACC 2.0 released in summer 2013
• Now, 20+ partners from academia and industry
OpenACC advantages
• Easy: Directives are the easy path to accelerate compute intensive applications
• Open: OpenACC is an open GPU directives standard, making GPU programming straightforward and portable across parallel and multi-core processors
• Powerful: GPU Directives allow complete access to the massive parallel power of a GPU
PGI and CAPS compilers study (I)
S. Feki, A. Al-Jarro, H. Bağcı. Porting an Explicit Time Domain Volume Integral Equation Solver on GPUs with OpenACC, IEEE Antennas and Propagation Magazine, July 2014

CAPS

#pragma acc kernels
{
  for (l = 0; l < nt; ++l) { // time loop
    #pragma acc loop independent collapse(3)
    for (int i = 0; i < n; ++i) {
      for (int j = 0; j < n; ++j) {
        for (int k = 0; k < n; ++k) {
          B[i][j][k] = B[i][j][k] + ....
        }
      }
    }
    #pragma acc loop independent collapse(3)
    for (int i = 0; i < n; ++i) {
      for (int j = 0; j < n; ++j) {
        for (int k = 0; k < n; ++k) {
          B[i][j][k] = B[i][j][k] + ....
        }
      }
    }
  } // end time loop
}

PGI

#pragma acc data
for (l = 0; l < nt; ++l) { // time loop
  #pragma acc kernels
  #pragma acc loop independent gang
  for (int i = 0; i < n; ++i) {
    #pragma acc loop independent gang,vector
    for (int j = 0; j < n; ++j) {
      #pragma acc loop independent gang,vector
      for (int k = 0; k < n; ++k) {
        B[i][j][k] = B[i][j][k] + ....
      }
    }
  }
  #pragma acc kernels
  #pragma acc loop independent gang
  for (int i = 0; i < n; ++i) {
    #pragma acc loop independent gang,vector
    for (int j = 0; j < n; ++j) {
      #pragma acc loop independent gang,vector
      for (int k = 0; k < n; ++k) {
        B[i][j][k] = B[i][j][k] + ....
      }
    }
  }
} // end time loop
PGI and CAPS compilers study (II)
[Figure: speedup (axis 0 to 35) versus number of degrees of freedom (x1000: 6, 11, 25, 32, 41, 56, 77, 113, 176), for the CAPS and PGI compilers]
S. Feki, A. Al-Jarro, H. Bağcı. Porting an Explicit Time Domain Volume Integral Equation Solver on GPUs with OpenACC, IEEE Antennas and Propagation Magazine, July 2014
Directive syntax
• Fortran
!$acc directive [clause [[,] clause]…]
… often paired with a matching end directive
!$acc end directive
• C
#pragma acc directive [clause [[,] clause]…]
Often followed by a structured code block
kernels: Your first OpenACC Directive
• Each loop executed as a separate kernel (a parallel function that runs on the GPU)
!$acc kernels
do i=1,n
  a(i) = 0.0
  b(i) = 1.0
  c(i) = 2.0
end do
do i=1,n
  a(i) = b(i) + c(i)
end do
!$acc end kernels
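For C programmers, the same two-loop region might look like the sketch below. The function name and the explicit copyout clause are additions for illustration (all three arrays are only written before being read, so nothing needs to be copied in). Without an OpenACC compiler the pragma is ignored and the loops simply run on the host.

```c
/* C counterpart of the Fortran kernels example (hypothetical function name).
   Each of the two loops may become its own GPU kernel. */
void init_and_add(int n, float *restrict a, float *restrict b, float *restrict c)
{
    #pragma acc kernels copyout(a[0:n], b[0:n], c[0:n])
    {
        for (int i = 0; i < n; ++i) {
            a[i] = 0.0f;
            b[i] = 1.0f;
            c[i] = 2.0f;
        }
        for (int i = 0; i < n; ++i) {
            a[i] = b[i] + c[i];
        }
    }
}
```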
Compile and run
• C: pgcc -acc [-Minfo=accel] -o saxpy_acc saxpy.c
• Fortran: pgf90 -acc [-Minfo=accel] -o saxpy_acc saxpy.f90
• Compiler output:
[sfeki@c4hdn saxpy]$ pgcc -acc -ta=nvidia -Minfo=accel -o saxpy saxpy.c
saxpy:
     5, Generating present_or_copyin(x[0:n])
        Generating present_or_copy(y[0:n])
        Generating compute capability 1.0 binary
        Generating compute capability 2.0 binary
     6, Loop is parallelizable
        Accelerator kernel generated
        6, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
           CC 1.0: 8 registers; 48 shared, 0 constant, 0 local memory bytes
           CC 2.0: 12 registers; 0 shared, 64 constant, 0 local memory bytes
SAXPY example, revisited
Jacobi Iteration: C code
Jacobi Iteration: OpenACC code
PGI Accelerator Compiler output
What went wrong ?
Excessive data transfer
Another way of detecting it: NVIDIA Profiler
• Use nvprof for profiling the GPU application:
• Use NVVP GUI: NVIDIA Visual Profiler:
Data construct
• Fortran
!$acc data [clause …]
structured block
!$acc end data
• C
#pragma acc data [clause …]
{ structured block }
• Manage data movement. Data regions may be nested
• General clauses
if( condition )
async( expression )
Data clauses
• copy (list) Allocates memory on GPU and copies data from host to GPU when entering region and copies data to the host when exiting region.
• copyin (list) Allocates memory on GPU and copies data from host to GPU when entering region.
• copyout (list) Allocates memory on GPU and copies data to the host when exiting region.
• create (list) Allocates memory on GPU but does not copy.
• present (list) Data is already present on GPU from another containing data region.
• and present_or_copy[in|out], present_or_create, deviceptr.
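A minimal sketch of how these clauses combine inside one data region. The function and array names are hypothetical: the input is only read (copyin), the result is only written (copyout), and the scratch array lives on the GPU without ever crossing the bus (create). Both kernels reuse the device copies, so no transfer happens between them.

```c
/* Two kernels sharing one data region (hypothetical example).
   in  : read-only input, copied host -> GPU once
   out : result, copied GPU -> host once
   tmp : scratch space, allocated on the GPU but never transferred */
void scale_then_shift(int n, const float *restrict in,
                      float *restrict out, float *restrict tmp)
{
    #pragma acc data copyin(in[0:n]) copyout(out[0:n]) create(tmp[0:n])
    {
        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            tmp[i] = 2.0f * in[i];

        #pragma acc kernels
        for (int i = 0; i < n; ++i)
            out[i] = tmp[i] + 1.0f;
    }
}
```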
Array shaping
• Compiler sometimes cannot determine size of arrays
• Must specify explicitly using data clauses and array “shape”
• C
#pragma acc data copyin(a[0:size]), copyout(b[s/4:3*s/4])
• Fortran
!$acc data copyin(a(1:size)), copyout(b(s/4:3*s/4))
• Note: data clauses can be used on data, kernels or parallel
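One detail worth keeping in mind when shaping arrays: in C the shape is written [start:length], while in Fortran it is (lower:upper). The sketch below, with a hypothetical function name, transfers and updates only a sub-range of an array using the C form.

```c
/* Double `count` elements of a, beginning at index `first`.
   Only that slice is copied to the GPU and back: the C shape
   a[first:count] means "count elements starting at first". */
void double_slice(float *restrict a, int first, int count)
{
    #pragma acc kernels copy(a[first:count])
    for (int i = first; i < first + count; ++i)
        a[i] *= 2.0f;
}
```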
Jacobi Iteration: OpenACC C Code, Revisited
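The code on this slide did not survive in the transcript. As a stand-in, here is a minimal sketch of a Jacobi sweep wrapped in a data region, in the style of the widely used Laplace tutorial example. It is not the slide's code: the grid size GRID, the function name, and the explicit reduction clauses are assumptions made for illustration.

```c
#define GRID 16  /* hypothetical grid size, kept small for illustration */

/* Jacobi relaxation on a GRID x GRID array.
   Returns the number of sweeps performed. */
int jacobi(float A[GRID][GRID], float Anew[GRID][GRID], float tol, int iter_max)
{
    float err = tol + 1.0f;
    int iter = 0;

    /* The data region encloses the whole convergence loop: A is copied
       in once and out once, Anew is created on the GPU and never moved. */
    #pragma acc data copy(A[0:GRID][0:GRID]) create(Anew[0:GRID][0:GRID])
    while (err > tol && iter < iter_max) {
        err = 0.0f;

        #pragma acc kernels
        {
            #pragma acc loop reduction(max:err)
            for (int j = 1; j < GRID - 1; ++j) {
                #pragma acc loop reduction(max:err)
                for (int i = 1; i < GRID - 1; ++i) {
                    Anew[j][i] = 0.25f * (A[j][i + 1] + A[j][i - 1]
                                        + A[j - 1][i] + A[j + 1][i]);
                    float d = Anew[j][i] - A[j][i];
                    if (d < 0.0f) d = -d;
                    if (d > err) err = d;   /* largest change this sweep */
                }
            }

            /* Copy the interior back for the next sweep. */
            for (int j = 1; j < GRID - 1; ++j)
                for (int i = 1; i < GRID - 1; ++i)
                    A[j][i] = Anew[j][i];
        }
        ++iter;
    }
    return iter;
}
```

Without the data region, each kernels region would move A and Anew across PCIe on every sweep, which is the excessive data transfer diagnosed earlier in the deck.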
Performance numbers
New NVIDIA profiles
CUDA Kernels
• Threads are grouped into blocks
• Blocks are grouped into a grid
• A kernel is executed as a grid of blocks of threads
Thread blocks
• Thread blocks allow cooperation
– Cooperatively load/store blocks of memory that they all use
– Share results with each other or cooperate to produce a single result
– Synchronize with each other
• Thread blocks allow scalability
– Blocks can execute in any order, concurrently or sequentially
– This independence between blocks gives scalability:
• A kernel scales across any number of SMs
Mapping OpenACC to CUDA I
• The OpenACC execution model has three levels: gang, worker, and vector
• Allows mapping to an architecture that is a collection of Processing Elements (PEs)
• One or more PEs per node
• Each PE is multi-threaded
• Each thread can execute vector instructions
• Tile pragma in OpenACC 2.0
Mapping OpenACC to CUDA II
• For GPUs, the mapping is implementation-dependent. Some possibilities:
– gang==block, worker==warp, and vector==threads of a warp
– omit “worker” and just have gang==block, vector==threads of a block
• Depends on what the compiler thinks is the best mapping for the problem
• ... But explicitly specifying that a given loop should map to gangs, workers, and/or vectors is optional anyway
– Further specifying the number of gangs/workers/vectors is also optional
– So why do it? To tune the code to fit a particular target architecture in a straightforward and easily re-tuned way.
OpenACC loop directive and clauses
#pragma acc kernels loop
for (int i = 0; i < n; ++i)
  y[i] += a*x[i];
Uses whatever mapping to threads and blocks the compiler chooses. Perhaps 16 blocks, 256 threads each

#pragma acc kernels loop gang(100), vector(128)
for (int i = 0; i < n; ++i)
  y[i] += a*x[i];
100 thread blocks, each with 128 threads, each thread executes one iteration of the loop, using kernels

#pragma acc parallel num_gangs(100), vector_length(128)
{
  #pragma acc loop gang, vector
  for (int i = 0; i < n; ++i)
    y[i] += a*x[i];
}
100 thread blocks, each with 128 threads, each thread executes one iteration of the loop, using parallel
Mapping OpenACC to CUDA threads and blocks
• Nested loops generate multi-dimensional blocks and grids:
#pragma acc kernels loop gang(100), vector(16)
for (…)
  #pragma acc loop gang(200), vector(32)
  for (…)
100 blocks tall (row/Y direction), 16 thread tall block
200 blocks wide (column/X direction), and 32 thread wide
Other clauses for loop directive
#pragma acc loop [clauses]
• independent: for independent loops
• seq: for sequential execution of the loop
• reduction: for reduction operations such as min, max, etc.
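As an illustration of the reduction clause, the sketch below computes the largest absolute difference between two arrays. The function name is hypothetical. Each thread keeps a private copy of err, and the runtime combines the copies with max when the loop ends.

```c
/* Largest |a[i] - b[i]| over n elements, using a max reduction. */
float max_abs_diff(int n, const float *restrict a, const float *restrict b)
{
    float err = 0.0f;
    #pragma acc parallel loop reduction(max:err) copyin(a[0:n], b[0:n])
    for (int i = 0; i < n; ++i) {
        float d = a[i] - b[i];
        if (d < 0.0f) d = -d;     /* absolute value without math.h */
        if (d > err) err = d;     /* running maximum */
    }
    return err;
}
```

Without the clause, every iteration would update the same err variable and the loop could not safely run in parallel.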
Jacobi example … again
With kernels and data directives
Jacobi example … again
After adding loop directive with gang and vector clauses
An opportunity for Auto-tuning
• Gang and vector values can be auto-tuned for the application, targeting the available accelerator device
[Figure: performance speedup (axis 1.00 to 2.60) versus problem size; bar values shown on the chart: 2.37, 1.68, 1.83, 1.44, 1.15, 1.49, 1.67, 2.54, 1.17, 1.22, 1.32, 1.24, 1.20, 1.10, 1.33, 1.25, 1.27, 1.26]
S. Siddiqui, F. Al-Zayer, S. Feki. Historic Learning Approach for Auto-tuning OpenACC Accelerated Scientific Applications, iWAPT 2014, Eugene, Oregon, USA
An opportunity for Auto-tuning
Input code annotated with OpenACC

#pragma acc kernels
#pragma acc loop independent
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

Accelerator Specification

Automatic code generator

#pragma acc kernels
#pragma acc loop independent gang(a),vector(b)
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent gang(c)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent vector(d)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

#pragma acc kernels
#pragma acc loop independent gang(a)
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent gang(b),vector(c)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent vector(d)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

#pragma acc kernels
#pragma acc loop independent gang(a),vector(b)
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent vector(c)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent gang(d),vector(e)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

#pragma acc kernels
#pragma acc loop independent
for (x = 4; x < nx-4; x++) {
  #pragma acc loop independent gang(a),vector(b)
  for (y = 4; y < ny-4; y++) {
    #pragma acc loop independent gang(c),vector(d)
    for (z = 4; z < nz-4; z++) {
      U[x][y][z] = c1*V[x][y][z] + ....
    }
  }
}

Runtime evaluation and selection

Database
Jacobi example … again
• Which other optimizations can we still do?
– Restructuring the code will enhance both the CPU and GPU versions
– Hint: reduce memory operations
OpenACC Runtime Library
• In C: #include "openacc.h"
• In Fortran: #include 'openacc_lib.h' or use openacc
• Contains:
– Prototypes of all routines
– Definition of data types used in these routines including enumeration type describing types of accelerators
OpenACC Runtime Library Definitions
• openacc_version with a value yyyymm (year and month of the OpenACC version)
• acc_device_t: type of accelerator device
– acc_device_none
– acc_device_default
– acc_device_host
– acc_device_not_host
OpenACC Runtime Library Routines I
• acc_get_num_devices: returns the number of devices of the given type attached to the host
• acc_set_device_type: tells which type of device to use when executing an accelerator parallel or kernels region
• acc_get_device_type: tells which type of device will be used for the next accelerated region
• acc_set_device_num: specifies which device to use
• acc_get_device_num: returns the device number of the specified device type that will be used to run the next accelerator parallel or kernels region
OpenACC Runtime Library Routines II
• acc_init: initialize the runtime, can be used to isolate the initialization cost from the computation cost
• acc_shutdown: shut down the connection to the device and free any allocated resources
• acc_malloc: allocate memory on the accelerator device
• acc_free: frees memory on the accelerator device
OpenACC Runtime Library Routines: use case
• Porting an MPI code to multiple GPUs.
• Example: running on 8 nodes, with 4 GPUs each, i.e. 32 MPI processes
• acc_init()
• acc_set_device_num(rank%4)
• Each node runs 4 MPI processes, each of them offloading compute kernels to a separate GPU
S. Feki, A. Al-Jarro, H. Bağcı. Multiple GPUs Electromagnetics Simulations using MPI and OpenACC, Poster in GPU Technology Conference, San Jose, California, USA, March 24-27, 2014
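The rank-to-GPU binding from this use case can be sketched as follows. The helper names are hypothetical, and rank is assumed to come from MPI_Comm_rank in the calling code. The modulo mapping also assumes that consecutive ranks are placed on the same node (block placement); with round-robin placement across nodes, rank % 4 would send all processes of a node to the same GPU. The _OPENACC guard lets the file compile without an OpenACC compiler.

```c
#ifdef _OPENACC
#include <openacc.h>
#endif

/* GPU index for an MPI rank, assuming ranks fill each node in
   consecutive blocks of gpus_per_node. */
int device_for_rank(int rank, int gpus_per_node)
{
    return rank % gpus_per_node;
}

/* Initialize the runtime and bind the calling process to its GPU.
   Returns the selected device number, or -1 if no accelerator is usable. */
int bind_rank_to_gpu(int rank)
{
#ifdef _OPENACC
    int ndev = acc_get_num_devices(acc_device_not_host);
    if (ndev < 1)
        return -1;
    acc_init(acc_device_not_host);
    int dev = device_for_rank(rank, ndev);
    acc_set_device_num(dev, acc_device_not_host);
    return dev;
#else
    (void)rank;   /* built without OpenACC: nothing to bind */
    return -1;
#endif
}
```

On the 8-node, 4-GPU configuration above, ranks 4 to 7 would land on GPUs 0 to 3 of the second node.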
OpenACC and CUDA libraries
GPU accelerated libraries
Sharing data with libraries
• CUDA libraries and OpenACC both operate on device arrays
• OpenACC provides mechanisms for interoperability with library calls
– deviceptr data clause
– host_data construct
• Note: same mechanisms useful for interoperability with custom CUDA C/C++/Fortran code
deviceptr Data Clause
deviceptr(list) Declares that the pointers in list refer to device pointers that need not be allocated or moved between the host and device for this pointer.
Example:
• C
#pragma acc data deviceptr(d_input)
• Fortran
!$acc data deviceptr(d_input)
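A minimal sketch of deviceptr in use, combined with acc_malloc and acc_free from the runtime library. All function names are hypothetical. The buffer is allocated directly on the device, so the loops must be told it is already a device address. The _OPENACC guard falls back to plain malloc so the example also builds without an OpenACC compiler.

```c
#include <stdlib.h>
#ifdef _OPENACC
#include <openacc.h>
#endif

/* Allocate n floats on the accelerator (host memory when built without OpenACC). */
float *device_alloc(size_t n)
{
#ifdef _OPENACC
    return (float *)acc_malloc(n * sizeof(float));
#else
    return (float *)malloc(n * sizeof(float));
#endif
}

/* Release a buffer obtained from device_alloc. */
void device_release(float *d_ptr)
{
#ifdef _OPENACC
    acc_free(d_ptr);
#else
    free(d_ptr);
#endif
}

/* d_input is already a device address: no allocation, no copies. */
void fill_on_device(float *d_input, int n, float value)
{
    #pragma acc parallel loop deviceptr(d_input)
    for (int i = 0; i < n; ++i)
        d_input[i] = value;
}

/* Sum the device buffer; only the scalar result returns to the host. */
float sum_on_device(const float *d_input, int n)
{
    float sum = 0.0f;
    #pragma acc parallel loop deviceptr(d_input) reduction(+:sum)
    for (int i = 0; i < n; ++i)
        sum += d_input[i];
    return sum;
}
```

The host never dereferences the pointer itself; it only passes it to compute regions that carry the deviceptr clause.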
host_data Construct
• Makes the address of device data available on the host.
• use_device(list) Tells the compiler to use the device address for any variable in list. Variables in the list must be present in device memory due to data regions that contain this construct
• Example
• C
#pragma acc host_data use_device(d_input)
• Fortran
!$acc host_data use_device(d_input)
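A sketch of host_data handing a device address to a routine that expects one. Here library_scale is a hypothetical stand-in for a CUDA library call; it is written with OpenACC and deviceptr only so that the example stays self-contained and also runs on the host when the pragmas are ignored.

```c
/* Hypothetical stand-in for a library routine that takes a device pointer. */
static void library_scale(float *d_x, int n, float alpha)
{
    #pragma acc parallel loop deviceptr(d_x)
    for (int i = 0; i < n; ++i)
        d_x[i] *= alpha;
}

/* x is a host array; the data region creates its device copy, and
   host_data exposes that copy's address for the library call. */
void scale_with_library(float *x, int n, float alpha)
{
    #pragma acc data copy(x[0:n])
    {
        #pragma acc host_data use_device(x)
        {
            library_scale(x, n, alpha);   /* x is the device address here */
        }
    }
}
```

Outside the host_data block, x refers to the host array again, and the copy clause brings the scaled values back when the data region ends.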
Summary on device pointers
• Use deviceptr data clause to pass pre-allocated device data to OpenACC regions and loops
• Use host_data to get device address for pointers inside acc data regions
• The same techniques shown here can be used to share device data between OpenACC loops and
– Your custom CUDA C/C++/Fortran/etc. device code
– Any CUDA Library that uses CUDA device pointers
Thanks !