Performance Analysis and Optimization of OpenACC...

PERFORMANCE ANALYSIS AND OPTIMIZATION WITH OPENACC Michael Wolfe, PGI Compilers & Tools

Page 1: Performance Analysis and Optimization of OpenACC …on-demand.gputechconf.com/gtc/2014/presentations/...openacc-apps.pdf · OPTIMIZATION WITH OPENACC Michael Wolfe, PGI Compilers

PERFORMANCE ANALYSIS AND OPTIMIZATION WITH OPENACC

Michael Wolfe, PGI Compilers & Tools

Performance Measurement

§ PGI_ACC_NOTIFY

§ PGI_ACC_TIME

§ pgcollect / pgprof

§ CUDA compute profiler

§ nvprof

§ Others (TAU, Vampir, ...)

§ Write Your Own OpenACC Profiler (!)

PGI_ACC_NOTIFY environment variable (Bit Mask)

1 – launch
    launch CUDA kernel file=smooth4.c function=smooth_acc line=17 device=0 num_gangs=98 num_workers=1 vector_length=128 grid=1x98 block=128

2 – data upload/download
    upload CUDA data file=smooth4.c function=smooth_acc line=12 device=0 variable=a bytes=40000
    download CUDA data file=smooth4.c function=smooth_acc line=23 device=0 variable=a bytes=40000

4 – wait (explicit or implicit) for device
    Implicit wait file=smooth4.c function=smooth_acc line=17 device=0
    Implicit wait file=smooth4.c function=smooth_acc line=23 device=0

8 – data/compute region enter/leave
    Enter data region file=smooth4.c function=smooth_acc line=12 device=0
    Enter compute region file=smooth4.c function=smooth_acc line=14 device=0
    Leave compute region file=smooth4.c function=smooth_acc line=17 device=0

16 – data create/allocate/delete/free
    create CUDA data bytes=40000 file=smooth4.c function=smooth_acc line=12 device=0
    alloc CUDA data bytes=40000 file=smooth4.c function=smooth_acc line=12 device=0
    delete CUDA data bytes=40448 file=smooth4.c function=smooth_acc line=23 device=0
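Because PGI_ACC_NOTIFY is described as a bit mask, the values listed above can presumably be OR-ed together to enable several event classes at once. The following is a minimal Python sketch of that arithmetic only. The member names (LAUNCH, TRANSFER, and so on) are made up for illustration; only the numeric values come from the slide.

```python
from enum import IntFlag


class AccNotify(IntFlag):
    """Bit values from the slide; the member names are illustrative only."""
    LAUNCH = 1     # kernel launches
    TRANSFER = 2   # data upload/download
    WAIT = 4       # explicit or implicit waits for the device
    REGION = 8     # data/compute region enter/leave
    ALLOC = 16     # data create/allocate/delete/free


def notify_mask(*flags: AccNotify) -> int:
    """Combine event classes into the integer to assign to PGI_ACC_NOTIFY."""
    mask = 0
    for flag in flags:
        mask |= flag
    return int(mask)


if __name__ == "__main__":
    # Kernel launches plus data transfers.
    print(notify_mask(AccNotify.LAUNCH, AccNotify.TRANSFER))
    # Every event class listed on the slide.
    print(notify_mask(*AccNotify))
```

Under this reading, a value of 3 would report launches and data transfers, and 31 would report every class shown on the slide.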

PGI_ACC_TIME environment variable

Accelerator Kernel Timing data
/proj/scratch/mwolfe/test/openacc/src/smooth4.c
  smooth_acc  NVIDIA  devicenum=0
    time(us): 317
    12: data region reached 5 times
        12: data copyin reached 10 times
             device time(us): total=121 max=19 min=11 avg=12
        23: data copyout reached 5 times
             device time(us): total=63 max=14 min=12 avg=12
    14: compute region reached 5 times
        17: kernel launched 5 times
            grid: [1x98]  block: [128]
             device time(us): total=133 max=90 min=9 avg=26
            elapsed time(us): total=176 max=99 min=17 avg=35
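In this sample, the three device-time totals add up to the function-level time(us) figure (121 + 63 + 133 = 317). The slide does not state whether that relationship always holds, so the Python sketch below only checks it for this one report. The report text is copied from the output above, and the helper name device_totals is hypothetical.

```python
import re

# Timing lines copied from the smooth4.c report above.
REPORT = """\
12: data copyin reached 10 times
     device time(us): total=121 max=19 min=11 avg=12
23: data copyout reached 5 times
     device time(us): total=63 max=14 min=12 avg=12
17: kernel launched 5 times
     device time(us): total=133 max=90 min=9 avg=26
    elapsed time(us): total=176 max=99 min=17 avg=35
"""

# Matches only "device time" lines, not the "elapsed time" line.
DEVICE_TOTAL = re.compile(r"device time\(us\): total=(\d+)")


def device_totals(report: str) -> list[int]:
    """Return the total= value of every 'device time(us)' line, in order."""
    return [int(m.group(1)) for m in DEVICE_TOTAL.finditer(report)]


if __name__ == "__main__":
    totals = device_totals(REPORT)
    # Copy-in, copy-out, and kernel device time, followed by their sum.
    print(totals, sum(totals))
```

Splitting the total this way makes it easy to see that data movement (121 + 63 us) costs more device time here than the kernel itself (133 us).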

PGI_ACC_TIME environment variable

§ Data collected per host thread and summed across threads

§ Not valid with async; set PGI_ACC_SYNCHRONOUS=1

Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
/proj/scratch/mwolfe/test/openacc/src/async1.f90
  testasync  NVIDIA  devicenum=0
    time(us): 304
    19: compute region reached 1 time
        21: kernel launched 1 time
            grid: [977]  block: [256]
             device time(us): total=84 max=84 min=84 avg=84
            elapsed time(us): total=94 max=94 min=94 avg=94

pgcollect / pgprof

§ pgcollect [-cuda] a.out
— without -cuda, uses PGI_ACC_TIME data collection

— with -cuda, uses compute profile data collection

§ pgprof [-exe a.out] [pgprof.out]

Compute Profiling in CUDA Driver

§ COMPUTE_PROFILE=1

§ COMPUTE_PROFILE_LOG=outputfile

§ COMPUTE_PROFILE_CSV=1

§ COMPUTE_PROFILE_CONFIG=configfile

§ nvprof --query-events

COMPUTE_PROFILE=1

method,gputime,cputime,occupancy
method=[ memcpyHtoDasync ] gputime=[ 98.304 ] cputime=[ 13.284 ]
method=[ testasync_21_gpu ] gputime=[ 44.192 ] cputime=[ 71.610 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.816 ] cputime=[ 4.076 ]
method=[ memcpyHtoDasync ] gputime=[ 98.304 ] cputime=[ 12.817 ]
method=[ testasync_31_gpu ] gputime=[ 59.712 ] cputime=[ 11.876 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.624 ] cputime=[ 3.873 ]
method=[ memcpyHtoDasync ] gputime=[ 98.400 ] cputime=[ 12.821 ]
method=[ testasync_41_gpu ] gputime=[ 60.352 ] cputime=[ 12.124 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.688 ] cputime=[ 3.894 ]
method=[ memcpyHtoDasync ] gputime=[ 98.432 ] cputime=[ 13.090 ]
method=[ testasync_51_gpu ] gputime=[ 61.376 ] cputime=[ 11.492 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 97.760 ] cputime=[ 3.660 ]
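The key=[ value ] records in this log are regular enough to post-process with a short script. The Python sketch below parses the first six records shown above and sums gputime per method. The function names and the regular expression are my own and assume every record follows exactly the layout on the slide.

```python
import re
from collections import defaultdict

# First six records copied from the log above.
LOG = """\
method=[ memcpyHtoDasync ] gputime=[ 98.304 ] cputime=[ 13.284 ]
method=[ testasync_21_gpu ] gputime=[ 44.192 ] cputime=[ 71.610 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.816 ] cputime=[ 4.076 ]
method=[ memcpyHtoDasync ] gputime=[ 98.304 ] cputime=[ 12.817 ]
method=[ testasync_31_gpu ] gputime=[ 59.712 ] cputime=[ 11.876 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.624 ] cputime=[ 3.873 ]
"""

# One "name=[ value ]" field; values contain no whitespace in this log.
FIELD = re.compile(r"(\w+)=\[\s*(\S+)\s*\]")


def parse_record(line: str) -> dict[str, str]:
    """Turn one log line into a {field name: raw string value} mapping."""
    return dict(FIELD.findall(line))


def gputime_by_method(log: str) -> dict[str, float]:
    """Sum the gputime field over all records that share a method name."""
    totals: dict[str, float] = defaultdict(float)
    for line in log.splitlines():
        record = parse_record(line)
        if "method" in record and "gputime" in record:
            totals[record["method"]] += float(record["gputime"])
    return dict(totals)


if __name__ == "__main__":
    for method, total in gputime_by_method(LOG).items():
        print(f"{method:18s} {total:9.3f}")
```

Lines that carry no method field, such as the header line listing the column names, are skipped by the check inside the loop.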

COMPUTE_PROFILE_CSV=1

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla K40c
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR 1351e3d72b398b2b
method            gputime  cputime  occupancy
memcpyHtoDasync   98.336   13.642
testasync_21_gpu  44.48    71.91    1
memcpyDtoHasync   98.848   4.097
memcpyHtoDasync   98.304   13.922
testasync_31_gpu  61.28    11.867   1
memcpyDtoHasync   98.72    3.792
memcpyHtoDasync   98.304   13.615
testasync_41_gpu  60.288   11.505   1
memcpyDtoHasync   98.688   3.928
memcpyHtoDasync   98.432   13.916
testasync_51_gpu  60.736   11.287   1
memcpyDtoHasync   97.632   3.898

CUDA_PROFILE_CONFIG=configfile

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla K40c
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR 1351e3d6c025a4be
method            gputime  cputime  occupancy  warps_launched  inst_issued1  inst_issued2  inst_executed  active_cycles
memcpyHtoDasync   98.336   13.455
testasync_21_gpu  44.16    125.335  1          520             52331         21002         86978          31120
memcpyDtoHasync   99.136   4.197
memcpyHtoDasync   98.272   13.321
testasync_31_gpu  61.184   83.689   1          528             73033         28720         119856         43605
memcpyDtoHasync   97.504   4.282
memcpyHtoDasync   98.56    12.866
testasync_41_gpu  59.776   83.276   1          528             72388         28733         119856         42936
memcpyDtoHasync   97.6     4.145
memcpyHtoDasync   98.72    13.22
testasync_51_gpu  61.344   84.597   1          528             73930         28718         119856         43931
memcpyDtoHasync   97.344   4.117

nvprof a.out

==30736== NVPROF is profiling process 30736, command: a.out
==30736== Profiling application: a.out
==30736== Profiling result:
Time(%)  Time      Calls  Avg       Min       Max       Name
38.89%   393.41us  4      98.352us  98.176us  98.496us  [CUDA memcpy HtoD]
38.88%   393.25us  4      98.312us  97.760us  98.688us  [CUDA memcpy DtoH]
6.05%    61.184us  1      61.184us  61.184us  61.184us  testasync_31_gpu
5.92%    59.872us  1      59.872us  59.872us  59.872us  testasync_41_gpu
5.88%    59.520us  1      59.520us  59.520us  59.520us  testasync_51_gpu
4.38%    44.320us  1      44.320us  44.320us  44.320us  testasync_21_gpu

nvprof --print-gpu-trace --csv a.out

Start(ms)  Duration(us)  Grid X  Grid Y  Grid Z  Block X  Block Y  Block Z  Registers Per Thread  Static SMem(B)  Dynamic SMem(B)  Size(MB)  Throughput(GB/s)  Device      Context  Stream  Name
249.4308   98.304        -       -       -       -        -        -        -                     -               -                1         10.17253          Tesla K40c  1        8       [CUDA memcpy HtoD]
250.1725   44.064        977     1       1       256      1        1        19                    0               0                -         -                 Tesla K40c  1        8       testasync_21_gpu [22]
250.22     98.272        -       -       -       -        -        -        -                     -               -                1         10.17584          Tesla K40c  1        8       [CUDA memcpy DtoH]
269.0212   98.528        -       -       -       -        -        -        -                     -               -                1         10.1494           Tesla K40c  1        9       [CUDA memcpy HtoD]
269.6442   60.032        977     1       1       256      1        1        19                    0               0                -         -                 Tesla K40c  1        9       testasync_31_gpu [36]
269.7065   98.272        -       -       -       -        -        -        -                     -               -                1         10.17584          Tesla K40c  1        9       [CUDA memcpy DtoH]
288.3552   98.304        -       -       -       -        -        -        -                     -               -                1         10.17253          Tesla K40c  1        10      [CUDA memcpy HtoD]
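The Throughput values in this trace can be cross-checked against the Size and Duration columns. The Python sketch below assumes the Size value of 1 is in MB and that decimal units are used (1 MB = 10^6 bytes, 1 GB = 10^9 bytes). That unit convention is my assumption, chosen because it reproduces the figures in the trace; the slide does not state it.

```python
def throughput_gb_per_s(size_mb: float, duration_us: float) -> float:
    """Transfer rate as size / duration.

    Assumes decimal units (1 MB = 1e6 bytes, 1 GB = 1e9 bytes); this is an
    assumption that happens to match the trace, not something the slide states.
    """
    size_bytes = size_mb * 1e6
    duration_s = duration_us * 1e-6
    return size_bytes / duration_s / 1e9


if __name__ == "__main__":
    # (Size in MB, Duration in us, Throughput as reported in the trace)
    rows = [
        (1, 98.304, 10.17253),
        (1, 98.272, 10.17584),
        (1, 98.528, 10.1494),
    ]
    for size_mb, duration_us, reported in rows:
        computed = throughput_gb_per_s(size_mb, duration_us)
        print(f"{duration_us:8.3f} us  computed={computed:.5f}  reported={reported}")
```

At roughly 10 GB/s, each 1 MB copy takes about 98 us, which is longer than any of the kernels it feeds (44 to 60 us) in this trace.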

Write Your Own OpenACC Profiler

§ Defined set of ~30 OpenACC events
— enqueue_launch_[start|end]

— enqueue_upload_[start|end]

—  construct_[start|end]

§ Defined callback routine interface
— register callback routines for events of interest

—  collect data during runtime

§ Save / display information when done

§ Static link or dynamic shared object loading

Performance Optimization

§ Profile the CPU application
— ensure you are accelerating the right part of your code

§ Look at data movement
— look for hidden data transfers

§ Look at expensive kernels

Analyzing Kernels

§ Launch configuration
— block size > 1

— grid size > 1

§ Occupancy
— registers per thread

§ Memory operations
— too many memory loads/stores (no caching)

— stride-1 in vector index
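To illustrate why registers per thread matter for occupancy, here is a deliberately simplified Python estimate. The default limits (65,536 registers and 2,048 resident threads per multiprocessor) are my assumption for a Kepler-class GPU such as the Tesla K40c seen in the earlier traces. The sketch ignores register allocation granularity, shared memory, and the cap on resident blocks, so treat it as a rough guide and not as the profiler's formula.

```python
def estimate_occupancy(regs_per_thread: int, block_size: int,
                       regs_per_sm: int = 65536,
                       max_threads_per_sm: int = 2048) -> float:
    """Rough fraction of the thread slots on one multiprocessor that can be used.

    The default hardware limits are assumed values for a Kepler-class GPU.
    Only the register file and the thread limit are modelled.
    """
    # Threads whose registers fit in the register file at the same time.
    threads_by_regs = regs_per_sm // regs_per_thread
    # Only whole thread blocks can be resident.
    resident_blocks = min(threads_by_regs, max_threads_per_sm) // block_size
    return resident_blocks * block_size / max_threads_per_sm


if __name__ == "__main__":
    # 19 registers per thread and 256-thread blocks, as in the nvprof trace.
    print(estimate_occupancy(19, 256))
    # A hypothetical heavier kernel using 64 registers per thread.
    print(estimate_occupancy(64, 256))
```

With 19 registers per thread the estimate is 1.0, consistent with the occupancy reported for the testasync kernels, while the hypothetical 64-register kernel drops to 0.5 under the same assumptions.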

Analysis and Optimization of OpenACC

§ Performance Measurement
— PGI_ACC_TIME

— pgcollect / pgprof

— CUDA driver profiler

— nvprof

— other profilers

§ Tuning
— data movement

— launch configuration

— kernel code performance