IBM XL Compiler: OpenMP offloading support for GPU (PDF transcript, 2016-11-30)
IBM XL Compiler: OpenMP offloading support for GPU
Ettore Tiotto, Kelvin Li, IBM Canada Laboratory
Some slides courtesy of Alex Eichenberg (IBM)
§ Modern high-performance scientific applications must exploit heterogeneous resources in a performance-portable manner
§ What programming models should we use? What's the right level of abstraction?

[Figure: a heterogeneous node pairing IBM POWER8 processors (CPU) with an NVIDIA Tesla P100 GPU]
Programming Heterogeneous Systems
§ Extracting maximum performance:
• to program a GPU: you have to use CUDA, OpenCL, OpenGL, DirectX, intrinsics, C++ AMP, or OpenACC
• to program a host SIMD unit: you have to use intrinsics, OpenCL, or auto-vectorization (possibly aided by compiler hints)
• to program the CPU threads, you might use C11, C++11, OpenMP, TBB, Cilk, MS Async/then continuations, Apple GCD, Google executors, …
§ With OpenMP 4.0/4.5:
• you can use the same standard to program the GPU, the SIMD units, and the CPU threads
• better yet: you can do so in a portable way
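To make the "one standard for threads and SIMD" point concrete, here is a minimal sketch (not from the slides; the function name `saxpy` is ours): a single OpenMP directive requests both CPU threading and SIMD vectorization of the same loop. With a non-OpenMP compiler the pragma is simply ignored and the loop runs serially, which is the portability argument in a nutshell.

```c
#include <stddef.h>

/* Hypothetical sketch: one directive covers both CPU threads and SIMD lanes.
 * Without -qsmp=omp (or -fopenmp elsewhere) the pragma is ignored and the
 * loop still computes the same result serially. */
void saxpy(size_t n, float a, const float *x, const float *y, float *z)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}
```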
OpenMP 4.5
§ OpenMP is an industry standard for directive-based parallel programming
• OpenMP has been (and is) widely used to program CPUs
• In OpenMP 4.0/4.5, new features have been added to provide support for offloading computation to accelerators
• Industry-wide acceptance: IBM, Intel, PathScale, Cray, PGI, Oracle, MS → application portability
A Quick Introduction to OpenMP 4.5

§ How do we exploit an accelerator in OpenMP?
§ Simply add a target construct around the computation to be offloaded to the accelerator
§ map clauses are used to copy data

#pragma omp target map(to: A, B) map(from: C)
#pragma omp parallel for
for (i=0; i<N; i++) {
  for (j=0; j<N; j++)
    for (k=0; k<N; k++)
      C[i][j] += A[i][k] * B[k][j];
}
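A self-contained variant of the pattern above, sketched with heap arrays (the function name `matmul` and flat row-major indexing are our additions, not from the slides). Pointer arguments need explicit array sections in the map clauses so the runtime knows how much to transfer; if no device is available, or the code is built without offloading, the region runs on the host with the same result.

```c
#include <stddef.h>

/* Hypothetical sketch of the slide's matrix multiply with pointers: array
 * sections A[0:n*n] etc. tell the map clauses how many elements to copy.
 * C is assumed to receive the full product (sum accumulated locally). */
void matmul(size_t n, const float *A, const float *B, float *C)
{
    #pragma omp target map(to: A[0:n*n], B[0:n*n]) map(from: C[0:n*n])
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}
```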
A Quick Introduction to OpenMP 4.5

#pragma omp target map(to: A, B) map(from: C)
{ … }

• target transfers control of execution to a SINGLE device thread
• the compiler packages the target region into a function
• the OpenMP runtime transfers execution of the function to the device

[Figure: execution timeline. The host thread copies* A and B to the GPU and sits inactive while the target region runs; a device master thread opens the device parallel region, device worker threads execute it, a device sync barrier closes it, and C is copied* back to the CPU.]
*copies optional with unified memory
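A small runtime sketch of the execution-transfer described above (the function name `target_ran_on_host` is ours): `omp_is_initial_device()` returns nonzero on the host, so capturing its value inside a target region reveals whether execution was really handed to a device or fell back to the host. The `#ifdef _OPENMP` guard keeps the sketch compilable without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical sketch: returns 1 if the target region executed on the host
 * (no device, or offloading disabled), 0 if it ran on an accelerator. */
int target_ran_on_host(void)
{
    int on_host = 1;   /* default when OpenMP is unavailable */
#ifdef _OPENMP
    #pragma omp target map(from: on_host)
    on_host = (omp_is_initial_device() != 0);
#endif
    return on_host;
}
```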
OpenMP 4.5: Device Execution

§ The "distribute" directive can be used to assign loop iterations to teams
• the target region is executed by several teams; each team gets a subset of the iteration space for the i-loop
• the j-loop iterations are distributed amongst the threads in a team
• the distribute schedule controls the number of iterations per team; there is no synchronization between teams

#pragma omp target teams map(to: a, b) map(from: c)
{
  #pragma omp distribute
  for (int i=0; i<n; i++) {
    #pragma omp parallel for
    for (int j=0; j<n; j++)
      for (int k=0; k<n; k++)
        c[i][j] += a[i][k] * b[k][j];
  }
}

[Figure: device initial threads, one per team]
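The distribute schedule mentioned above can be written explicitly. A minimal sketch (the function name `scale_rows` and the chunk size of 8 are our choices, not from the slides): `dist_schedule(static, 8)` hands each team 8 consecutive i-iterations, and the inner parallel for splits the j-loop across that team's threads. Built without offloading, the loops run on the host unchanged.

```c
#include <stddef.h>

/* Hypothetical sketch: teams + distribute with an explicit dist_schedule.
 * Each team receives chunks of 8 rows; threads within the team share the
 * columns of each row. Scales an n x n row-major matrix in place. */
void scale_rows(size_t n, float s, float *m)
{
    #pragma omp target teams map(tofrom: m[0:n*n])
    #pragma omp distribute dist_schedule(static, 8)
    for (size_t i = 0; i < n; i++) {
        #pragma omp parallel for
        for (size_t j = 0; j < n; j++)
            m[i*n + j] *= s;
    }
}
```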
Optimization: omp distribute parallel for

§ Programming model: OpenMP vs CUDA
• OpenMP uses a fork-join abstraction: team regions start with one thread, and parallel threads are created as needed when a parallel region is found
• CUDA kernels are launched using a grid of blocks/threads (SPMD model)

#pragma omp target map(from: z) map(to: x, y)
#pragma omp teams
#pragma omp distribute parallel for
for (i=0; i<N; i++)
  z[i] = a*x[i] + y[i];

§ Orchestrating CUDA threads to fit the OpenMP programming model can have significant overhead (the runtime manages state transitions)
§ However, OpenMP provides "SPMD-like" directives
• the distribute parallel for directive can be used to distribute loop iterations amongst teams and then execute those iterations in parallel using the threads in each team
• the compiler can generate efficient GPU code for this construct (state transitions not required → bypasses the OpenMP runtime system)
• the default schedule is recommended to maximize performance portability
  § HW coalescing on GPU, good cache locality on CPU
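The same saxpy can be written with the fully combined directive in one line, which is the SPMD-style form the text says maps most directly onto efficient GPU code. A minimal sketch (the function name `saxpy_offload` is ours); the default schedule is left in place for portability, and pointer arguments again need explicit array sections in the map clauses.

```c
#include <stddef.h>

/* Hypothetical sketch: the combined target teams distribute parallel for
 * directive launches teams and threads in one step, SPMD-style. Without a
 * device the loop runs on the host with the same result. */
void saxpy_offload(size_t n, float a, const float *x, const float *y, float *z)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n], y[0:n]) map(from: z[0:n])
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}
```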
XL C/C++ and XL Fortran Compilers

§ XL C/C++ and XL Fortran are full-featured compilers that have been targeting the POWER platform since 1990
• aggressively tuned to take maximum advantage of IBM processor technology as it becomes available
• industry-leading customer support & service
§ The XL compiler products use common optimizer and back-end technology
• leverage a mature compiler optimization infrastructure for both CPU and GPU exploitation, across source languages
OpenMP 4.5 support in XL C/C++ and XL Fortran

[Figure: compiler pipeline. The XL C/C++ and XL Fortran frontends lower C/C++ and Fortran source to W-Code (XL IR). The High-Level Optimizer applies data-flow, loop, and other optimizations, and the CPU/GPU W-Code Partitioner splits the program into CPU W-Code and GPU W-Code. CPU W-Code flows through the POWER Low-level Optimizer (low-level optimizations, POWER code generation, register allocation + scheduling). GPU W-Code flows through the libNVVM W-Code-to-LLVM-IR translator into NVVM IR (NVVM = LLVM with NVIDIA enhancements); the LLVM Optimizer and PTX CodeGen emit PTX, which the PTX Assembler and nvlink link against the XL and CUDA device libraries. The system linker combines both halves with the CUDA runtime and CUDA driver libraries into an executable for a POWER/GPU system.]

• CPU code is aggressively optimized for POWER
• XL's optimizer sees both host and device code
• the CUDA Toolkit optimizes device code
OpenMP 4.0 & 4.5 offloading features

OpenMP 3.1 directives (supported in target regions):
• Parallel constructs: omp parallel, omp sections, parallel workshare
• Worksharing: parallel do/for, omp ordered, omp single
• Synchronization: omp master, omp critical, omp barrier, omp atomic, omp flush

OpenMP 4.0:
• Device constructs: omp target data, omp target, omp target update, omp declare target, omp teams, omp distribute, omp distribute parallel for, combined constructs
• SIMD constructs: omp loop simd, omp distribute parallel do simd, omp simd, omp declare simd, omp distribute simd
• Offloading enhancements: firstprivate, private, defaultmap

OpenMP 4.5:
• map changes (4.5 semantics)
• if clause for combined directives
• implicit firstprivate (4.5)
• omp target enter data
• omp target exit data
• omp target parallel
• target nowait & depend
• omp target simd

[italic means in progress]
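Among the OpenMP 4.5 additions listed, the unstructured data directives are worth a quick illustration. A minimal sketch (the function name `double_twice` is ours): `target enter data` keeps an array resident on the device across several target regions, so the second kernel reuses the mapping instead of re-transferring, and `target exit data` copies the result back. Without a device the maps are no-ops and everything runs on the host.

```c
#include <stddef.h>

/* Hypothetical sketch of OpenMP 4.5 unstructured data mapping:
 * the array v stays mapped on the device between the two target regions. */
void double_twice(size_t n, float *v)
{
    #pragma omp target enter data map(to: v[0:n])

    /* first kernel: uses the already-mapped v */
    #pragma omp target
    for (size_t i = 0; i < n; i++)
        v[i] *= 2.0f;

    /* second kernel: no re-transfer needed */
    #pragma omp target
    for (size_t i = 0; i < n; i++)
        v[i] *= 2.0f;

    #pragma omp target exit data map(from: v[0:n])
}
```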
XL C/C++ for Linux V13.1.5, XL Fortran for Linux V15.1.5
GA:
§ Initial support for OpenMP V4.5 features for GPU offloading
§ Supports S822LC systems (POWER8 + P100 via NVLink)
§ Support for NVIDIA K40, K80, and P100 GPUs
§ Support for CUDA Toolkit 8.0
§ Supported operating systems: Ubuntu 16.04, RHEL 7.3, ...
Compiling OpenMP programs

§ -qsmp=omp option: enables OpenMP compilation in the compiler
§ -qoffload option: enables target constructs to be offloaded to the GPU (if available)
• without the -qoffload option, target regions are executed on the host CPU
§ for example:
$ xlf90 -qsmp=omp -qoffload test1.f
$ xlc -qsmp=omp -qoffload -qhot -O3 test2.c