PERFORMANCE ANALYSIS AND OPTIMIZATION WITH OPENACC
Michael Wolfe, PGI Compilers & Tools
Performance Measurement
§ PGI_ACC_NOTIFY
§ PGI_ACC_TIME
§ pgcollect / pgprof
§ CUDA compute profiler
§ nvprof
§ Others (TAU, Vampir, ...)
§ Write Your Own OpenACC Profiler (!)
PGI_ACC_NOTIFY environment variable (Bit Mask)
1 – launch
    launch CUDA kernel file=smooth4.c function=smooth_acc line=17 device=0 num_gangs=98 num_workers=1 vector_length=128 grid=1x98 block=128
2 – data upload/download
    upload CUDA data file=smooth4.c function=smooth_acc line=12 device=0 variable=a bytes=40000
    download CUDA data file=smooth4.c function=smooth_acc line=23 device=0 variable=a bytes=40000
4 – wait (explicit or implicit) for device
    Implicit wait file=smooth4.c function=smooth_acc line=17 device=0
    Implicit wait file=smooth4.c function=smooth_acc line=23 device=0
8 – data/compute region enter/leave
    Enter data region file=smooth4.c function=smooth_acc line=12 device=0
    Enter compute region file=smooth4.c function=smooth_acc line=14 device=0
    Leave compute region file=smooth4.c function=smooth_acc line=17 device=0
16 – data create/allocate/delete/free
    create CUDA data bytes=40000 file=smooth4.c function=smooth_acc line=12 device=0
    alloc CUDA data bytes=40000 file=smooth4.c function=smooth_acc line=12 device=0
    delete CUDA data bytes=40448 file=smooth4.c function=smooth_acc line=23 device=0
PGI_ACC_TIME environment variable
Accelerator Kernel Timing data
/proj/scratch/mwolfe/test/openacc/src/smooth4.c
  smooth_acc  NVIDIA  devicenum=0
    time(us): 317
    12: data region reached 5 times
        12: data copyin reached 10 times
            device time(us): total=121 max=19 min=11 avg=12
        23: data copyout reached 5 times
            device time(us): total=63 max=14 min=12 avg=12
    14: compute region reached 5 times
        17: kernel launched 5 times
            grid: [1x98]  block: [128]
            device time(us): total=133 max=90 min=9 avg=26
            elapsed time(us): total=176 max=99 min=17 avg=35
PGI_ACC_TIME environment variable (continued)
§ Data collected per host thread and summed across threads
§ Not valid with async; set PGI_ACC_SYNCHRONOUS=1

Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
/proj/scratch/mwolfe/test/openacc/src/async1.f90
  testasync  NVIDIA  devicenum=0
    time(us): 304
    19: compute region reached 1 time
        21: kernel launched 1 time
            grid: [977]  block: [256]
            device time(us): total=84 max=84 min=84 avg=84
            elapsed time(us): total=94 max=94 min=94 avg=94
pgcollect / pgprof
§ pgcollect [-cuda] a.out
— without -cuda, uses PGI_ACC_TIME data collection
— with -cuda, uses compute profile data collection
§ pgprof [-exe a.out] [pgprof.out]
Compute Profiling in CUDA Driver
§ COMPUTE_PROFILE=1
§ COMPUTE_PROFILE_LOG=outputfile
§ COMPUTE_PROFILE_CSV=1
§ COMPUTE_PROFILE_CONFIG=configfile
§ nvprof --query-events
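A profiler config file is just a plain-text list of counter names, one per line. For example, the counters collected later in this deck (warps_launched, inst_issued1, and so on) would be requested with a file like the one below; counter names vary by GPU, so list the valid ones with nvprof --query-events first.

```
warps_launched
inst_issued1
inst_issued2
inst_executed
active_cycles
```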
COMPUTE_PROFILE=1
method,gputime,cputime,occupancy
method=[ memcpyHtoDasync ] gputime=[ 98.304 ] cputime=[ 13.284 ]
method=[ testasync_21_gpu ] gputime=[ 44.192 ] cputime=[ 71.610 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.816 ] cputime=[ 4.076 ]
method=[ memcpyHtoDasync ] gputime=[ 98.304 ] cputime=[ 12.817 ]
method=[ testasync_31_gpu ] gputime=[ 59.712 ] cputime=[ 11.876 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.624 ] cputime=[ 3.873 ]
method=[ memcpyHtoDasync ] gputime=[ 98.400 ] cputime=[ 12.821 ]
method=[ testasync_41_gpu ] gputime=[ 60.352 ] cputime=[ 12.124 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.688 ] cputime=[ 3.894 ]
method=[ memcpyHtoDasync ] gputime=[ 98.432 ] cputime=[ 13.090 ]
method=[ testasync_51_gpu ] gputime=[ 61.376 ] cputime=[ 11.492 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 97.760 ] cputime=[ 3.660 ]
COMPUTE_PROFILE_CSV=1

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla K40c
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR 1351e3d72b398b2b
method,gputime,cputime,occupancy
memcpyHtoDasync,98.336,13.642
testasync_21_gpu,44.48,71.91,1
memcpyDtoHasync,98.848,4.097
memcpyHtoDasync,98.304,13.922
testasync_31_gpu,61.28,11.867,1
memcpyDtoHasync,98.72,3.792
memcpyHtoDasync,98.304,13.615
testasync_41_gpu,60.288,11.505,1
memcpyDtoHasync,98.688,3.928
memcpyHtoDasync,98.432,13.916
testasync_51_gpu,60.736,11.287,1
memcpyDtoHasync,97.632,3.898
COMPUTE_PROFILE_CONFIG=configfile

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla K40c
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR 1351e3d6c025a4be
method,gputime,cputime,occupancy,warps_launched,inst_issued1,inst_issued2,inst_executed,active_cycles
memcpyHtoDasync,98.336,13.455
testasync_21_gpu,44.16,125.335,1,520,52331,21002,86978,31120
memcpyDtoHasync,99.136,4.197
memcpyHtoDasync,98.272,13.321
testasync_31_gpu,61.184,83.689,1,528,73033,28720,119856,43605
memcpyDtoHasync,97.504,4.282
memcpyHtoDasync,98.56,12.866
testasync_41_gpu,59.776,83.276,1,528,72388,28733,119856,42936
memcpyDtoHasync,97.6,4.145
memcpyHtoDasync,98.72,13.22
testasync_51_gpu,61.344,84.597,1,528,73930,28718,119856,43931
memcpyDtoHasync,97.344,4.117
nvprof a.out
==30736== NVPROF is profiling process 30736, command: a.out
==30736== Profiling application: a.out
==30736== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 38.89%  393.41us      4  98.352us  98.176us  98.496us  [CUDA memcpy HtoD]
 38.88%  393.25us      4  98.312us  97.760us  98.688us  [CUDA memcpy DtoH]
  6.05%  61.184us      1  61.184us  61.184us  61.184us  testasync_31_gpu
  5.92%  59.872us      1  59.872us  59.872us  59.872us  testasync_41_gpu
  5.88%  59.520us      1  59.520us  59.520us  59.520us  testasync_51_gpu
  4.38%  44.320us      1  44.320us  44.320us  44.320us  testasync_21_gpu
nvprof --print-gpu-trace --csv a.out

Start,Duration,Grid X,Grid Y,Grid Z,Block X,Block Y,Block Z,Registers Per Thread,Static SMem,Dynamic SMem,Size,Throughput,Device,Context,Stream,Name
ms,us,,,,,,,,B,B,MB,GB/s,,,,
249.4308,98.304,,,,,,,,,,1,10.17253,Tesla K40c,1,8,[CUDA memcpy HtoD]
250.1725,44.064,977,1,1,256,1,1,19,0,0,,,Tesla K40c,1,8,testasync_21_gpu [22]
250.22,98.272,,,,,,,,,,1,10.17584,Tesla K40c,1,8,[CUDA memcpy DtoH]
269.0212,98.528,,,,,,,,,,1,10.1494,Tesla K40c,1,9,[CUDA memcpy HtoD]
269.6442,60.032,977,1,1,256,1,1,19,0,0,,,Tesla K40c,1,9,testasync_31_gpu [36]
269.7065,98.272,,,,,,,,,,1,10.17584,Tesla K40c,1,9,[CUDA memcpy DtoH]
288.3552,98.304,,,,,,,,,,1,10.17253,Tesla K40c,1,10,[CUDA memcpy HtoD]
Write Your Own OpenACC Profiler
§ Defined set of ~30 OpenACC events
— enqueue_launch_[start|end]
— enqueue_upload_[start|end]
— construct_[start|end]
§ Defined callback routine interface
— register callback routines for events of interest
— collect data during runtime
§ Save / display information when done
§ Static link or dynamic shared object loading
Performance Optimization
§ Profile the CPU application
— ensure you are accelerating the right part of your code
§ Look at data movement
— look for hidden data transfers
§ Look at expensive kernels
Analyzing Kernels
§ Launch configuration
— block size > 1
— grid size > 1
§ Occupancy
— registers per thread
§ Memory operations
— too many memory loads/stores (no caching)
— stride-1 in vector index
Analysis and Optimization of OpenACC
§ Performance Measurement
— PGI_ACC_TIME
— pgcollect / pgprof
— CUDA driver profiler
— nvprof
— other profilers
§ Tuning
— data movement
— launch configuration
— kernel code performance