PERFORMANCE ANALYSIS AND OPTIMIZATION WITH OPENACC
Michael Wolfe, PGI Compilers & Tools
Performance Measurement
§ PGI_ACC_NOTIFY
§ PGI_ACC_TIME
§ pgcollect / pgprof
§ CUDA compute profiler
§ nvprof
§ Others (TAU, Vampir, ...)
§ Write Your Own OpenACC Profiler (!)
PGI_ACC_NOTIFY environment variable (Bit Mask)
1 – launch
    launch CUDA kernel file=smooth4.c function=smooth_acc line=17 device=0 num_gangs=98 num_workers=1 vector_length=128 grid=1x98 block=128
2 – data upload/download
    upload CUDA data file=smooth4.c function=smooth_acc line=12 device=0 variable=a bytes=40000
    download CUDA data file=smooth4.c function=smooth_acc line=23 device=0 variable=a bytes=40000
4 – wait (explicit or implicit) for device
    Implicit wait file=smooth4.c function=smooth_acc line=17 device=0
    Implicit wait file=smooth4.c function=smooth_acc line=23 device=0
8 – data/compute region enter/leave
    Enter data region file=smooth4.c function=smooth_acc line=12 device=0
    Enter compute region file=smooth4.c function=smooth_acc line=14 device=0
    Leave compute region file=smooth4.c function=smooth_acc line=17 device=0
16 – data create/allocate/delete/free
    create CUDA data bytes=40000 file=smooth4.c function=smooth_acc line=12 device=0
    alloc CUDA data bytes=40000 file=smooth4.c function=smooth_acc line=12 device=0
    delete CUDA data bytes=40448 file=smooth4.c function=smooth_acc line=23 device=0
PGI_ACC_TIME environment variable
Accelerator Kernel Timing data
/proj/scratch/mwolfe/test/openacc/src/smooth4.c
  smooth_acc  NVIDIA  devicenum=0
    time(us): 317
    12: data region reached 5 times
        12: data copyin reached 10 times
            device time(us): total=121 max=19 min=11 avg=12
        23: data copyout reached 5 times
            device time(us): total=63 max=14 min=12 avg=12
    14: compute region reached 5 times
        17: kernel launched 5 times
            grid: [1x98]  block: [128]
            device time(us): total=133 max=90 min=9 avg=26
            elapsed time(us): total=176 max=99 min=17 avg=35
PGI_ACC_TIME environment variable (continued)
§ Data collected per host thread and summed across threads
§ Not valid with async; set PGI_ACC_SYNCHRONOUS=1

Accelerator Kernel Timing data
Timing may be affected by asynchronous behavior
set PGI_ACC_SYNCHRONOUS to 1 to disable async() clauses
/proj/scratch/mwolfe/test/openacc/src/async1.f90
  testasync  NVIDIA  devicenum=0
    time(us): 304
    19: compute region reached 1 time
        21: kernel launched 1 time
            grid: [977]  block: [256]
            device time(us): total=84 max=84 min=84 avg=84
            elapsed time(us): total=94 max=94 min=94 avg=94
pgcollect / pgprof
§ pgcollect [-cuda] a.out
— without -cuda, uses PGI_ACC_TIME data collection
— with -cuda, uses compute profile data collection
§ pgprof [-exe a.out] [pgprof.out]
Compute Profiling in CUDA Driver
§ COMPUTE_PROFILE=1
§ COMPUTE_PROFILE_LOG=outputfile
§ COMPUTE_PROFILE_CSV=1
§ COMPUTE_PROFILE_CONFIG=configfile
§ nvprof --query-events
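A profiler config file is just a plain-text list of counter names, one per line. For example, the counters collected later in this deck (warps_launched, inst_issued1, and so on) would be requested with a file like the one below; counter names vary by GPU, so list the valid ones with nvprof --query-events first.

```
warps_launched
inst_issued1
inst_issued2
inst_executed
active_cycles
```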
COMPUTE_PROFILE=1
method,gputime,cputime,occupancy
method=[ memcpyHtoDasync ] gputime=[ 98.304 ] cputime=[ 13.284 ]
method=[ testasync_21_gpu ] gputime=[ 44.192 ] cputime=[ 71.610 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.816 ] cputime=[ 4.076 ]
method=[ memcpyHtoDasync ] gputime=[ 98.304 ] cputime=[ 12.817 ]
method=[ testasync_31_gpu ] gputime=[ 59.712 ] cputime=[ 11.876 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.624 ] cputime=[ 3.873 ]
method=[ memcpyHtoDasync ] gputime=[ 98.400 ] cputime=[ 12.821 ]
method=[ testasync_41_gpu ] gputime=[ 60.352 ] cputime=[ 12.124 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 98.688 ] cputime=[ 3.894 ]
method=[ memcpyHtoDasync ] gputime=[ 98.432 ] cputime=[ 13.090 ]
method=[ testasync_51_gpu ] gputime=[ 61.376 ] cputime=[ 11.492 ] occupancy=[ 1.000 ]
method=[ memcpyDtoHasync ] gputime=[ 97.760 ] cputime=[ 3.660 ]
COMPUTE_PROFILE_CSV=1

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla K40c
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR 1351e3d72b398b2b
method,gputime,cputime,occupancy
memcpyHtoDasync,98.336,13.642
testasync_21_gpu,44.48,71.91,1
memcpyDtoHasync,98.848,4.097
memcpyHtoDasync,98.304,13.922
testasync_31_gpu,61.28,11.867,1
memcpyDtoHasync,98.72,3.792
memcpyHtoDasync,98.304,13.615
testasync_41_gpu,60.288,11.505,1
memcpyDtoHasync,98.688,3.928
memcpyHtoDasync,98.432,13.916
testasync_51_gpu,60.736,11.287,1
memcpyDtoHasync,97.632,3.898
COMPUTE_PROFILE_CONFIG=configfile

# CUDA_PROFILE_LOG_VERSION 2.0
# CUDA_DEVICE 0 Tesla K40c
# CUDA_CONTEXT 1
# CUDA_PROFILE_CSV 1
# TIMESTAMPFACTOR 1351e3d6c025a4be
method,gputime,cputime,occupancy,warps_launched,inst_issued1,inst_issued2,inst_executed,active_cycles
memcpyHtoDasync,98.336,13.455
testasync_21_gpu,44.16,125.335,1,520,52331,21002,86978,31120
memcpyDtoHasync,99.136,4.197
memcpyHtoDasync,98.272,13.321
testasync_31_gpu,61.184,83.689,1,528,73033,28720,119856,43605
memcpyDtoHasync,97.504,4.282
memcpyHtoDasync,98.56,12.866
testasync_41_gpu,59.776,83.276,1,528,72388,28733,119856,42936
memcpyDtoHasync,97.6,4.145
memcpyHtoDasync,98.72,13.22
testasync_51_gpu,61.344,84.597,1,528,73930,28718,119856,43931
memcpyDtoHasync,97.344,4.117
nvprof a.out
==30736== NVPROF is profiling process 30736, command: a.out
==30736== Profiling application: a.out
==30736== Profiling result:
Time(%)      Time  Calls       Avg       Min       Max  Name
 38.89%  393.41us      4  98.352us  98.176us  98.496us  [CUDA memcpy HtoD]
 38.88%  393.25us      4  98.312us  97.760us  98.688us  [CUDA memcpy DtoH]
  6.05%  61.184us      1  61.184us  61.184us  61.184us  testasync_31_gpu
  5.92%  59.872us      1  59.872us  59.872us  59.872us  testasync_41_gpu
  5.88%  59.520us      1  59.520us  59.520us  59.520us  testasync_51_gpu
  4.38%  44.320us      1  44.320us  44.320us  44.320us  testasync_21_gpu
nvprof --print-gpu-trace --csv a.out

Start,Duration,Grid X,Grid Y,Grid Z,Block X,Block Y,Block Z,Registers Per Thread,Static SMem,Dynamic SMem,Size,Throughput,Device,Context,Stream,Name
ms,us,,,,,,,,B,B,MB,GB/s,,,,
249.4308,98.304,,,,,,,,,,1,10.17253,Tesla K40c,1,8,[CUDA memcpy HtoD]
250.1725,44.064,977,1,1,256,1,1,19,0,0,,,Tesla K40c,1,8,testasync_21_gpu [22]
250.22,98.272,,,,,,,,,,1,10.17584,Tesla K40c,1,8,[CUDA memcpy DtoH]
269.0212,98.528,,,,,,,,,,1,10.1494,Tesla K40c,1,9,[CUDA memcpy HtoD]
269.6442,60.032,977,1,1,256,1,1,19,0,0,,,Tesla K40c,1,9,testasync_31_gpu [36]
269.7065,98.272,,,,,,,,,,1,10.17584,Tesla K40c,1,9,[CUDA memcpy DtoH]
288.3552,98.304,,,,,,,,,,1,10.17253,Tesla K40c,1,10,[CUDA memcpy HtoD]
Write Your Own OpenACC Profiler
§ Defined set of ~30 OpenACC events
— enqueue_launch_[start|end]
— enqueue_upload_[start|end]
— construct_[start|end]
§ Defined callback routine interface
— register callback routines for events of interest
— collect data during runtime
§ Save / display information when done
§ Static link or dynamic shared object loading
Performance Optimization
§ Profile the CPU application
— ensure you are accelerating the right part of your code
§ Look at data movement
— look for hidden data transfers
§ Look at expensive kernels
Analyzing Kernels
§ Launch configuration
— block size > 1
— grid size > 1
§ Occupancy
— registers per thread
§ Memory operations
— too many memory loads/stores (no caching)
— stride-1 in vector index
Analysis and Optimization of OpenACC
§ Performance Measurement
— PGI_ACC_TIME
— pgcollect / pgprof
— CUDA driver profiler
— nvprof
— other profilers
§ Tuning
— data movement
— launch configuration
— kernel code performance