CUDA OPTIMIZATION WITH
NVIDIA NSIGHT™ VISUAL STUDIO EDITION
Julien Demouth, NVIDIA
WHAT WILL YOU LEARN? An iterative method to optimize your GPU code
A way to conduct that method with NVIDIA Nsight VSE
https://github.com/jdemouth/nsight-gtc2014
A WORD ABOUT THE APPLICATION Pipeline: Grayscale → Blur → Edges
A WORD ABOUT THE APPLICATION Grayscale Conversion
// r, g, b: red, green, blue components of the pixel p
foreach pixel p:
    p = 0.298839f*r + 0.586811f*g + 0.114350f*b;
A WORD ABOUT THE APPLICATION Blur: 7x7 Gaussian Filter
foreach pixel p: p = weighted sum of p and its 48 neighbors
[Figure: 7x7 Gaussian weight matrix, peaked at the center pixel. Image from Wikipedia]
A WORD ABOUT THE APPLICATION Edges: 3x3 Sobel Filters
foreach pixel p:
    Gx = weighted sum of p and its 8 neighbors
    Gy = weighted sum of p and its 8 neighbors
    p = sqrt(Gx*Gx + Gy*Gy)
Weights for Gx:
-1  0  1
-2  0  2
-1  0  1

Weights for Gy:
 1  2  1
 0  0  0
-1 -2 -1
OPTIMIZATION METHOD Trace the Application
Identify the Hot Spot and Profile it
Identify the Performance Limiter
— Memory Bandwidth
— Instruction Throughput
— Latency
Optimize the Code
Iterate
We focus on the Assess and Optimize steps of the APOD method. We do not talk about the Parallelize and Deploy steps.
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#assess-parallelize-optimize-deploy
ENVIRONMENT NVIDIA Tesla K20c (GK110, SM3.5) without ECC
Microsoft Windows 7 x64
Microsoft Visual Studio 2012
NVIDIA CUDA 6.0
NVIDIA Nsight Visual Studio Edition 4.0
BEFORE WE START Some slides are for background
Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013
http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4
http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf
CUDA Best Practices Guide
http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/
Chameleon from http://www.vectorportal.com, Creative Commons
BEFORE WE START Instructions are executed by warps of threads
— It is a hardware concept
— There are 32 threads per warp
ITERATION 1
TRACE THE APPLICATION
Select “Trace Application”
Activate CUDA
Launch
Verify Parameters
TIMELINE
CUDA LAUNCH SUMMARY
The Hotspot is gaussian_filter_7x7_v0
Kernel Time Speedup
Original version 6.265ms
PROFILE THE HOTSPOT
Select “Profile CUDA Application”
Select the Kernel Launch
Select the Experiments (All)
IDENTIFY THE MAIN LIMITER Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
MEMORY BANDWIDTH
[Diagram: each SM has registers and SMEM/L1$; SMs access global memory (framebuffer) through the L2$ at 208GB/s (K20)]
MEMORY BANDWIDTH Utilization of L2$ bandwidth (BW) low and DRAM BW < 2%
Not limited by memory bandwidth
INSTRUCTION THROUGHPUT
Each SM has 4 schedulers (Kepler)
Schedulers issue instructions to pipes
A scheduler issues up to 2 instructions/cycle — Sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8)
A scheduler issues inst. from a single warp
Cannot issue to a pipe if its issue slot is full
[Diagram: an SM with registers, SMEM/L1$, and 4 schedulers, each issuing to its pipes]
INSTRUCTION THROUGHPUT
[Figure: three scheduler/pipe utilization scenarios — schedulers saturated (issue utilization 90%), schedulers and pipe saturated (issue utilization 92%), pipe saturated (issue utilization 64%); pipes shown: Load/Store, Texture, Control Flow, ALU]
WARP ISSUE EFFICIENCY
Percentage of issue slots used (blue)
Aggregated over all the schedulers
PIPE UTILIZATION
Percentages of issue slots used per pipe
Accounts for pipe throughputs
Four groups of pipes:
— Load/Store
— Texture
— Control Flow
— Arithmetic (ALU)
INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated
Not limited by the instruction throughput
LATENCY GPUs cover latencies by having a lot of work in flight
[Figure: timeline of warps 0-9 — each warp issues, then waits (latency); with enough warps in flight the latency is fully covered, otherwise it is exposed and no warp issues]
LATENCY: LACK OF OCCUPANCY Not enough active warps
The schedulers cannot find eligible warps at every cycle
[Figure: with only warps 0-3 active, there are cycles where no warp issues]
LATENCY
50% of theoretical occupancy
31.2 active warps per cycle
3.57 warps eligible per cycle
Hard to tell. Let’s start with occupancy
OCCUPANCY Each SM has limited resources
64K Registers (32 bit) shared by threads
Up to 48KB of Shared memory
16 slots to execute Blocks
Full occupancy: 2048 threads per SM (64 warps)
Values for SM30/SM35. They vary with Compute Capability
OCCUPANCY (BLOCK DIMENSION) Limited by the number of blocks
Blocks are too small (64 threads/block)
OCCUPANCY (BLOCK DIMENSION)
Increase block size for (slightly) better occupancy
OCCUPANCY (BLOCK DIMENSION) Increase the block size to 128 threads (8x16)
It runs slightly faster: 6.074ms
Kernel Time Speedup
Original version 6.263ms
Larger blocks 6.074ms 1.03x
ITERATION 2
TRACE THE APPLICATION
The hotspot is still gaussian_filter_7x7_v0
Kernel Time Speedup
Original version 6.265ms
Larger blocks 6.075ms 1.03x
IDENTIFY THE MAIN LIMITER Is it limited by the memory bandwidth?
Is it limited by the instruction throughput?
Is it limited by latency?
MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 2%
Not limited by memory bandwidth
INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated
Not limited by the instruction throughput
LATENCY (OCCUPANCY) Limited occupancy (56%) but 4.62 eligible warps/cycle (> 4)
There is probably something else limiting our performance
MEMORY TRANSACTIONS Warp of threads (32 threads)
L1 transaction: 128B – Alignment: 128B (0, 128, 256, …)
L2 transaction: 32B – Alignment: 32B (0, 32, 64, 96, …)
MEMORY TRANSACTIONS A warp issues 32x4B aligned and consecutive loads/stores
Threads read different elements of the same 128B segment
1x L1 transaction: 128B needed / 128B transferred
4x L2 transactions: 128B needed / 128B transferred
1x 128B L1 transaction per warp
4x 32B L2 transactions per warp
MEMORY TRANSACTIONS Threads in a warp read/write 4B words, 128B between words
Each thread reads the first 4B of a 128B segment
32x L1 transactions: 128B needed / 32x 128B transferred
32x L2 transactions: 128B needed / 32x 32B transferred
1x 128B L1 transaction per thread
1x 32B L2 transaction per thread
TRANSACTIONS AND REPLAYS A warp reads from addresses spanning 3 lines of 128B
1 instr. executed and 2 replays = 1 request and 3 transactions
[Figure: the instruction is issued for threads 0-7/24-31, re-issued (1st replay) for threads 8-15, and re-issued again (2nd replay) for threads 16-23]
TRANSACTIONS AND REPLAYS With replays, requests take more time and use more resources
— More instructions issued
— More memory traffic
— Increased execution time
[Figure: timeline — instructions 0, 1 and 2 are each issued, then re-issued for threads 8-15 and for threads 16-23; the per-replay data transfers stretch out completion. Result: extra latency, extra work (SM), extra memory traffic]
TRANSACTIONS PER REQUEST Transactions per Request: 4.20 (Load) / 4.00 (Store)
Too many memory transactions (too much pressure on LSU)
TRANSACTIONS PER REQUEST Our blocks are 8x16
We should use blocks of size 32x4 (or 32x8)
[Figure: thread-index layout — with 8x16 blocks, each warp (threadIdx.x 0-31, 32-63, …) covers an 8-pixel-wide, 4-row footprint; with 32x4 blocks, each warp covers 32 consecutive pixels of a single row]
IMPROVED MEMORY ACCESSES Blocks of size 32x4
It runs faster: 3.605ms
Kernel Time Speedup
Original version 6.265ms
Larger blocks 6.075ms 1.03x
Better memory accesses 3.605ms 1.74x
ITERATION 3
TRACE THE APPLICATION
The hotspot is still gaussian_filter_7x7_v0
Kernel Time Speedup
Original version 6.265ms
Larger blocks 6.075ms 1.03x
Better memory accesses 3.605ms 1.74x
We use the same block size for the Sobel filter kernel. That’s why it also improves (2nd row of the Nsight table).
MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 2%
Not limited by memory bandwidth
INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated
Not limited by instruction throughput
LATENCY (OCCUPANCY) Limited occupancy and not enough eligible warps/cycle (2.0)
Need more active warps (i.e. occupancy)
LATENCY (OCCUPANCY) Limited by register usage
REDUCE REGISTER USAGE Use the __launch_bounds__ attribute
10 gives the best results (48 registers): 1.949ms
__global__ __launch_bounds__(128, 10)
void gaussian_filter_7x7_v1(int w, int h, const uchar *src, uchar *dst)

128 = number of threads per block, 10 = minimum number of blocks per SM
Kernel Time Speedup
Original version 6.265ms
Larger blocks 6.075ms 1.03x
Better memory accesses 3.605ms 1.74x
Fewer registers 1.949ms 3.21x
ITERATION 4
TRACE THE APPLICATION
The hotspot is gaussian_filter_7x7_v1
Kernel Time Speedup
Original version 6.265ms
Larger blocks 6.075ms 1.03x
Better memory accesses 3.605ms 1.74x
Fewer registers 1.949ms 3.21x
MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 2%
Not limited by memory bandwidth
INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated
Not limited by instruction throughput
LATENCY (OCCUPANCY) Enough active and eligible warps per cycle
Not limited by a lack of occupancy
BRANCH DIVERGENCE Threads of a warp take different branches of a conditional
if( threadIdx.x < 12 ) {}
else {}
[Timeline: threads taking the “if” branch execute while the others wait, then the “else” threads execute]
Execution time = “if” branch + “else” branch
BRANCH DIVERGENCE But no divergence in our code
OTHER IDEAS Shared memory bank conflicts
— Conflicts: two threads read different addresses in the same bank
Excessive usage of synchronizations (__syncthreads)
But those symptoms do not affect our case
MEMORY TRANSFERS Our image size: 2560x1600 = 4MB
We read 385MB from L2$: Too much traffic!
MEMORY TRANSFERS We do not saturate the issue slot of the Load-Store unit
But we saturate inside the Load-Store unit
Unfortunately, we cannot detect that with Nsight (yet)
Adjacent pixels have filter-window neighbors in common
We should use shared memory to store those common pixels
SHARED MEMORY
__shared__ unsigned char smem_pixels[10][64];
USE SHARED MEMORY Use shared memory to keep data on the SM: 1.211ms
Kernel Time Speedup
Original version 6.265ms
Larger blocks 6.075ms 1.03x
Better memory accesses 3.605ms 1.74x
Fewer registers 1.949ms 3.21x
Shared memory 1.211ms 5.17x
ITERATION 5
TRACE THE APPLICATION
The hotspot is gaussian_filter_7x7_v2
Kernel Time Speedup
Original version 6.265ms
Larger blocks 6.075ms 1.03x
Better memory accesses 3.605ms 1.74x
Fewer registers 1.949ms 3.21x
Shared memory 1.211ms 5.17x
MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 4%
Not limited by memory bandwidth
INSTRUCTION THROUGHPUT Not saturating the schedulers
But we use 73% of the Load-Store issue slot
LOAD-STORE INSTRUCTIONS LSU executes global and shared memory instructions
Change global loads to use the Read-Only path
READ-ONLY CACHE (TEXTURE UNITS)
[Diagram: read-only loads skip the LSU and go through the texture units, which cache them, reaching global memory (framebuffer) through the L2$]
READ-ONLY PATH Annotate our pointer with const __restrict
The compiler generates LDG instructions: 1.019ms
Kernel Time Speedup
Original version 6.265ms
Larger blocks 6.075ms 1.03x
Better memory accesses 3.605ms 1.74x
Fewer registers 1.949ms 3.21x
Shared memory 1.211ms 5.17x
Read-Only path 1.019ms 6.15x
__global__ void gaussian_filter_7x7_v3(int w, int h, const uchar *__restrict src, uchar *dst)
INSTRUCTION THROUGHPUT We are doing much better
Things to investigate next:
— Improve memory efficiency
— Reduce computational intensity (separable filter)
MORE IN OUR COMPANION CODE
Kernel Time Speedup
Original version 6.265ms
Larger blocks 6.075ms 1.03x
Better memory accesses 3.605ms 1.74x
Fewer registers 1.949ms 3.21x
Shared memory 1.211ms 5.17x
Read-Only path 1.019ms 6.15x
Separable filter 0.656ms 9.55x
Process two pixels per thread (improve memory efficiency + add ILP) 0.511ms 12.26x
Use 64-bit shared memory (remove bank conflicts) 0.499ms 12.56x
Use float instead of int (increase instruction throughput) 0.434ms 14.44x
Your next idea!!!
https://github.com/jdemouth/nsight-gtc2014
CONCLUSION
OPTIMIZATION METHOD Trace the Application
Identify the Hot Spot and Profile it
Identify the Performance Limiter
— Memory Bandwidth
— Instruction Throughput
— Latency
Optimize the Code
Iterate
https://github.com/jdemouth/nsight-gtc2014