IBM XL Compiler: OpenMP offloading support for GPU (PDF transcript, 2016-11-30)
IBM XL Compiler: OpenMP offloading support for GPU
Ettore Tiotto, Kelvin Li, IBM Canada Laboratory
Some slides courtesy of Alex Eichenberg (IBM)
§ Modern high-performance scientific applications must exploit heterogeneous resources in a performance-portable manner
§ What programming models should we use? What's the right level of abstraction?

[Figure: a heterogeneous node pairing IBM POWER8 processors (CPU) with an NVIDIA Tesla P100 GPU]
Programming Heterogeneous Systems
§ Extracting maximum performance:
• to program a GPU: you have to use CUDA, OpenCL, OpenGL, DirectX, intrinsics, C++ AMP, or OpenACC
• to program a host SIMD unit: you have to use intrinsics, OpenCL, or auto-vectorization (possibly aided by compiler hints)
• to program the CPU threads, you might use C11, C++11, OpenMP, TBB, Cilk, MS Async/then continuations, Apple GCD, Google executors, …
§ With OpenMP 4.0/4.5:
• you can use the same standard to program the GPU, the SIMD units, and the CPU threads
• better yet: you can do so in a portable way
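To make the "one standard for threads and SIMD" point concrete, here is a minimal sketch (not from the slides; the function name `saxpy` is ours): a single OpenMP directive requests both CPU threading and SIMD vectorization of the same loop. With a non-OpenMP compiler the pragma is simply ignored and the loop runs serially, which is the portability argument in a nutshell.

```c
#include <stddef.h>

/* Hypothetical sketch: one directive covers both CPU threads and SIMD lanes.
 * Without -qsmp=omp (or -fopenmp elsewhere) the pragma is ignored and the
 * loop still computes the same result serially. */
void saxpy(size_t n, float a, const float *x, const float *y, float *z)
{
    #pragma omp parallel for simd
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}
```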
OpenMP 4.5
§ OpenMP is an industry standard for directive-based parallel programming
• OpenMP has been (and is) widely used to program CPUs
• In OpenMP 4.0/4.5, new features have been added to provide support for offloading computation to accelerators
• Industry-wide acceptance: IBM, Intel, PathScale, Cray, PGI, Oracle, MS → application portability
A Quick Introduction to OpenMP 4.5

§ How do we exploit an accelerator in OpenMP?
§ Simply add a target construct around the computation to be offloaded to the accelerator
§ map clauses are used to copy data

#pragma omp target map(to: A, B) map(from: C)
#pragma omp parallel for
for (i=0; i<N; i++) {
  for (j=0; j<N; j++)
    for (k=0; k<N; k++)
      C[i][j] += A[i][k] * B[k][j];
}
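A self-contained variant of the pattern above, sketched with heap arrays (the function name `matmul` and flat row-major indexing are our additions, not from the slides). Pointer arguments need explicit array sections in the map clauses so the runtime knows how much to transfer; if no device is available, or the code is built without offloading, the region runs on the host with the same result.

```c
#include <stddef.h>

/* Hypothetical sketch of the slide's matrix multiply with pointers: array
 * sections A[0:n*n] etc. tell the map clauses how many elements to copy.
 * C is assumed to receive the full product (sum accumulated locally). */
void matmul(size_t n, const float *A, const float *B, float *C)
{
    #pragma omp target map(to: A[0:n*n], B[0:n*n]) map(from: C[0:n*n])
    #pragma omp parallel for
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i*n + k] * B[k*n + j];
            C[i*n + j] = sum;
        }
}
```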
A Quick Introduction to OpenMP 4.5

#pragma omp target map(to: A, B) map(from: C)
{ … }

• target transfers control of execution to a SINGLE device thread
• the compiler packages the target region into a function
• the OpenMP runtime transfers execution of the function to the device

[Figure: execution timeline. The host thread copies* A and B to the GPU and sits inactive while the target region runs; a device master thread opens the device parallel region, device worker threads execute it, a device sync barrier closes it, and C is copied* back to the CPU.]
*copies optional with unified memory
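A small runtime sketch of the execution-transfer described above (the function name `target_ran_on_host` is ours): `omp_is_initial_device()` returns nonzero on the host, so capturing its value inside a target region reveals whether execution was really handed to a device or fell back to the host. The `#ifdef _OPENMP` guard keeps the sketch compilable without OpenMP support.

```c
#ifdef _OPENMP
#include <omp.h>
#endif

/* Hypothetical sketch: returns 1 if the target region executed on the host
 * (no device, or offloading disabled), 0 if it ran on an accelerator. */
int target_ran_on_host(void)
{
    int on_host = 1;   /* default when OpenMP is unavailable */
#ifdef _OPENMP
    #pragma omp target map(from: on_host)
    on_host = (omp_is_initial_device() != 0);
#endif
    return on_host;
}
```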
OpenMP 4.5: Device Execution

§ The "distribute" directive can be used to assign loop iterations to teams
• the target region is executed by several teams; each team gets a subset of the iteration space for the i-loop
• the j-loop iterations are distributed amongst the threads in a team
• the distribute schedule controls the number of iterations per team; there is no synchronization between teams

#pragma omp target teams map(to: a, b) map(from: c)
{
  #pragma omp distribute
  for (int i=0; i<n; i++) {
    #pragma omp parallel for
    for (int j=0; j<n; j++)
      for (int k=0; k<n; k++)
        c[i][j] += a[i][k] * b[k][j];
  }
}

[Figure: device initial threads, one per team]
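The distribute schedule mentioned above can be written explicitly. A minimal sketch (the function name `scale_rows` and the chunk size of 8 are our choices, not from the slides): `dist_schedule(static, 8)` hands each team 8 consecutive i-iterations, and the inner parallel for splits the j-loop across that team's threads. Built without offloading, the loops run on the host unchanged.

```c
#include <stddef.h>

/* Hypothetical sketch: teams + distribute with an explicit dist_schedule.
 * Each team receives chunks of 8 rows; threads within the team share the
 * columns of each row. Scales an n x n row-major matrix in place. */
void scale_rows(size_t n, float s, float *m)
{
    #pragma omp target teams map(tofrom: m[0:n*n])
    #pragma omp distribute dist_schedule(static, 8)
    for (size_t i = 0; i < n; i++) {
        #pragma omp parallel for
        for (size_t j = 0; j < n; j++)
            m[i*n + j] *= s;
    }
}
```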
Optimization: omp distribute parallel for

§ Programming model: OpenMP vs CUDA
• OpenMP uses a fork-join abstraction: team regions start with one thread, and parallel threads are created as needed when a parallel region is found
• CUDA kernels are launched using a grid of blocks/threads (SPMD model)

#pragma omp target map(from: z) map(to: x, y)
#pragma omp teams
#pragma omp distribute parallel for
for (i=0; i<N; i++)
  z[i] = a*x[i] + y[i];

§ Orchestrating CUDA threads to fit the OpenMP programming model can have significant overhead (the runtime manages state transitions)
§ However, OpenMP provides "SPMD-like" directives
• the distribute parallel for directive can be used to distribute loop iterations amongst teams and then execute those iterations in parallel using the threads in each team
• the compiler can generate efficient GPU code for this construct (state transitions not required → bypasses the OpenMP runtime system)
• the default schedule is recommended to maximize performance portability
  § HW coalescing on GPU, good cache locality on CPU
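The same saxpy can be written with the fully combined directive in one line, which is the SPMD-style form the text says maps most directly onto efficient GPU code. A minimal sketch (the function name `saxpy_offload` is ours); the default schedule is left in place for portability, and pointer arguments again need explicit array sections in the map clauses.

```c
#include <stddef.h>

/* Hypothetical sketch: the combined target teams distribute parallel for
 * directive launches teams and threads in one step, SPMD-style. Without a
 * device the loop runs on the host with the same result. */
void saxpy_offload(size_t n, float a, const float *x, const float *y, float *z)
{
    #pragma omp target teams distribute parallel for \
            map(to: x[0:n], y[0:n]) map(from: z[0:n])
    for (size_t i = 0; i < n; i++)
        z[i] = a * x[i] + y[i];
}
```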
XL C/C++ and XL Fortran Compilers

§ XL C/C++ and XL Fortran are full-featured compilers that have been targeting the POWER platform since 1990
• aggressively tuned to take maximum advantage of IBM processor technology as it becomes available
• industry-leading customer support & service
§ The XL compiler products use common optimizer and back-end technology
• leverage a mature compiler optimization infrastructure for both CPU and GPU exploitation, across source languages
OpenMP 4.5 support in XL C/C++ and XL Fortran

[Figure: compiler pipeline. The XL C/C++ and XL Fortran frontends lower C/C++ and Fortran source to W-Code (XL IR). The High-Level Optimizer applies data-flow, loop, and other optimizations, and the CPU/GPU W-Code Partitioner splits the program into CPU W-Code and GPU W-Code. CPU W-Code flows through the POWER Low-level Optimizer (low-level optimizations, POWER code generation, register allocation + scheduling). GPU W-Code flows through the libNVVM W-Code-to-LLVM-IR translator into NVVM IR (NVVM = LLVM with NVIDIA enhancements); the LLVM Optimizer and PTX CodeGen emit PTX, which the PTX Assembler and nvlink link against the XL and CUDA device libraries. The system linker combines both halves with the CUDA runtime and CUDA driver libraries into an executable for a POWER/GPU system.]

• CPU code is aggressively optimized for POWER
• XL's optimizer sees both host and device code
• the CUDA Toolkit optimizes device code
OpenMP 4.0 & 4.5 offloading features

OpenMP 3.1 directives (supported in target regions):
• Parallel constructs: omp parallel, omp sections, parallel workshare
• Worksharing: parallel do/for, omp ordered, omp single
• Synchronization: omp master, omp critical, omp barrier, omp atomic, omp flush

OpenMP 4.0:
• Device constructs: omp target data, omp target, omp target update, omp declare target, omp teams, omp distribute, omp distribute parallel for, combined constructs
• SIMD constructs: omp loop simd, omp distribute parallel do simd, omp simd, omp declare simd, omp distribute simd
• Offloading enhancements: firstprivate, private, defaultmap

OpenMP 4.5:
• map changes (4.5 semantics)
• if clause for combined directives
• implicit firstprivate (4.5)
• omp target enter data
• omp target exit data
• omp target parallel
• target nowait & depend
• omp target simd

[italic means in progress]
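Among the OpenMP 4.5 additions listed, the unstructured data directives are worth a quick illustration. A minimal sketch (the function name `double_twice` is ours): `target enter data` keeps an array resident on the device across several target regions, so the second kernel reuses the mapping instead of re-transferring, and `target exit data` copies the result back. Without a device the maps are no-ops and everything runs on the host.

```c
#include <stddef.h>

/* Hypothetical sketch of OpenMP 4.5 unstructured data mapping:
 * the array v stays mapped on the device between the two target regions. */
void double_twice(size_t n, float *v)
{
    #pragma omp target enter data map(to: v[0:n])

    /* first kernel: uses the already-mapped v */
    #pragma omp target
    for (size_t i = 0; i < n; i++)
        v[i] *= 2.0f;

    /* second kernel: no re-transfer needed */
    #pragma omp target
    for (size_t i = 0; i < n; i++)
        v[i] *= 2.0f;

    #pragma omp target exit data map(from: v[0:n])
}
```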
XL C/C++ for Linux V13.1.5, XL Fortran for Linux V15.1.5
GA:
§ Initial support for OpenMP V4.5 features for GPU offloading
§ Supports S822LC systems (POWER8 + P100 via NVLink)
§ Support for NVIDIA K40, K80, and P100 GPUs
§ Support for CUDA Toolkit 8.0
§ Supported operating systems: Ubuntu 16.04, RHEL 7.3, ...
Compiling OpenMP programs

§ -qsmp=omp option: enables OpenMP compilation in the compiler
§ -qoffload option: enables target constructs to be offloaded to the GPU (if available)
• without the -qoffload option, target regions are executed on the host CPU
§ for example:
$ xlf90 -qsmp=omp -qoffload test1.f
$ xlc -qsmp=omp -qoffload -qhot -O3 test2.c