CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012...

74
CUDA OPTIMIZATION WITH NVIDIA NSIGHT™ VISUAL STUDIO EDITION Julien Demouth, NVIDIA

Transcript of CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012...

Page 1: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

CUDA OPTIMIZATION WITH

NVIDIA NSIGHT™ VISUAL STUDIO EDITION

Julien Demouth, NVIDIA

Page 2: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

WHAT WILL YOU LEARN? An iterative method to optimize your GPU code

A way to conduct that method with NVIDIA Nsight VSE

https://github.com/jdemouth/nsight-gtc2014

Page 3: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

Blur

A WORD ABOUT THE APPLICATION

Grayscale

Edges

Page 4: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

A WORD ABOUT THE APPLICATION Grayscale Conversion

// r, g, b: Red, green, blue components of the pixel p foreach pixel p: p = 0.298839f*r + 0.586811f*g + 0.114350f*b;

Page 5: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

A WORD ABOUT THE APPLICATION Blur: 7x7 Gaussian Filter

foreach pixel p: p = weighted sum of p and its 48 neighbors

16 12 8 4

9 6 3

6 4 2

3 2 1

6 3

4 2

9

6

3 2 1

4 8 12

3 6 9

2 4 6

1 2 3

3 6 9

2 4 6

1 2 3

12

8

4

4

8

12

Image from Wikipedia

Page 6: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

A WORD ABOUT THE APPLICATION Edges: 3x3 Sobel Filters

foreach pixel p: Gx = weighted sum of p and its 8 neighbors Gy = weighted sum of p and its 8 neighbors p = sqrt(Gx + Gy)

-1 0 1

-2 0 2

-1 0 1

Weights for Gx:

1 2 1

0 0 0

-1 -2 -1

Weights for Gy:

Page 7: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

OPTIMIZATION METHOD Trace the Application

Identify the Hot Spot and Profile it

Identify the Performance Limiter

— Memory Bandwidth

— Instruction Throughput

— Latency

Optimize the Code

Iterate

We focus on the Assess and Optimize steps of the APOD method. We do not talk about the Parallelize and Deploy steps http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/#assess-parallelize-optimize-deploy

Page 8: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

ENVIRONMENT NVIDIA Tesla K20c (GK110, SM3.5) without ECC

Microsoft Windows 7 x64

Microsoft Visual Studio 2012

NVIDIA CUDA 6.0

NVIDIA Nsight Visual Studio Edition 4.0

Page 9: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

BEFORE WE START Some slides are for background

Performance Optimization: Programming Guidelines and GPU Architecture Details Behind Them, GTC 2013

http://on-demand.gputechconf.com/gtc/2013/video/S3466-Performance-Optimization-Guidelines-GPU-Architecture-Details.mp4

http://on-demand.gputechconf.com/gtc/2013/presentations/S3466-Programming-Guidelines-GPU-Architecture.pdf

CUDA Best Practices Guide

http://docs.nvidia.com/cuda/cuda-c-best-practices-guide/

Chameleon from http://www.vectorportal.com, Creative Commons

Page 10: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

BEFORE WE START Instructions are executed by warps of threads

— It is a hardware concept

— There are 32 threads per warp

Page 11: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

ITERATION 1

Page 12: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

TRACE THE APPLICATION

Select

Trace

Application

Activate

CUDA

Launch

Verify

Parameters

Page 13: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

TIMELINE

Page 14: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

CUDA LAUNCH SUMMARY

The Hotspot is gaussian_filter_7x7_v0

Kernel Time Speedup

Original version 6.265ms

Hotspot

Page 15: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

PROFILE THE HOTSPOT

Select

Profile CUDA

Application

Select the Kernel

Launch

Select the

Experiments (All)

Page 16: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

IDENTIFY THE MAIN LIMITER Is it limited by the memory bandwidth ?

Is it limited by the instruction throughput ?

Is it limited by latency ?

Page 17: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY BANDWIDTH

SMEM/L1$

Registers

SM

SMEM/L1$

Registers

SM

Global Memory (Framebuffer)

L2$

208GB/s (K20)

Page 18: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY BANDWIDTH Utilization of L2$ Bandwidth (BW) limited and DRAM BW < 2%

Not limited by memory bandwidth

Page 19: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

INSTRUCTION THROUGHPUT

Each SM has 4 schedulers (Kepler)

Schedulers issue instructions to pipes

A scheduler issues up to 2 instructions/cycle — Sustainable peak is 7 instructions/cycle per SM (not 4x2 = 8)

A scheduler issues inst. from a single warp

Cannot issue to a pipe if its issue slot is full

SMEM/L1$

Registers

SM

Pipes Pipes Pipes Pipes

Sched Sched Sched Sched

Page 20: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

INSTRUCTION THROUGHPUT

Sched Sched Sched Sched

Schedulers saturated

Utilization: 90%

Load

Store Texture

Control

Flow ALU

11% 8%

65%

6%

Sched Sched Sched Sched

Schedulers and pipe

saturated

4%

27%

Utilization: 92%

Load

Store Texture

Control

Flow ALU

90%

Sched Sched Sched Sched

Pipe saturated

78%

Utilization: 64%

Load

Store Texture

Control

Flow ALU

24%

4%

Page 21: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

WARP ISSUE EFFICIENCY

Percentage of issue slots used (blue)

Aggregated over all the schedulers

Page 22: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

PIPE UTILIZATION

Percentages of issue slots used per pipe

Accounts for pipe throughputs

Four groups of pipes:

— Load/Store

— Texture

— Control Flow

— Arithmetic (ALU)

Page 23: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by the instruction throughput

Page 24: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

LATENCY GPUs cover latencies by having a lot of work in flight

warp 0

warp 1

warp 2

warp 3

warp 4

warp 5

warp 6

warp 7

warp 8

warp 9

The warp issues

The warp waits (latency)

Fully covered latency Exposed latency

No warp issues

Page 25: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

LATENCY: LACK OF OCCUPANCY Not enough active warps

The schedulers cannot find eligible warps at every cycle

warp 0

warp 1

warp 2

warp 3

No warp issues

Page 26: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

LATENCY

50% of theoretical occupancy

31.2 active warps per cycle

3.57 warps eligible per cycle

Hard to tell. Let’s start with occupancy

Page 27: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

OCCUPANCY Each SM has limited resources

64K Registers (32 bit) shared by threads

Up to 48KB of Shared memory

16 slots to execute Blocks

Full occupancy: 2048 threads per SM (64 warps)

Values for SM30/SM35. They vary with Compute Capability

Page 28: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

OCCUPANCY (BLOCK DIMENSION) Limited by the number of blocks

Blocks are too small (64 threads/block)

Page 29: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

OCCUPANCY (BLOCK DIMENSION)

Increase Block Size

for (slightly) better

occupancy

Page 30: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

OCCUPANCY (BLOCK DIMENSION) Increase the block size to 128 threads (8x16)

It runs slightly faster: 6.074ms

Kernel Time Speedup

Original version 6.263ms

Larger blocks 6.074ms 1.03x

Page 31: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

ITERATION 2

Page 32: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

TRACE THE APPLICATION

The hotspot is still gaussian_filter_7x7_v0

Hotspot

Kernel Time Speedup

Original version 6.265ms

Larger blocks 6.075ms 1.03x

Page 33: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

IDENTIFY THE MAIN LIMITER Is it limited by the memory bandwidth ?

Is it limited by the instruction throughput ?

Is it limited by latency ?

Page 34: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY BANDWIDTH Utilization of L2$ BW limited and DRAM BW < 2%

Not limited by memory bandwidth

Page 35: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by the instruction throughput

Page 36: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

LATENCY (OCCUPANCY) Limited occupancy (56%) but 4.62 eligible warps/cycle (> 4)

There is probably something else limiting our performance

Page 37: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY TRANSACTIONS Warp of threads (32 threads)

L1 transaction: 128B – Alignment: 128B (0, 128, 256, …)

L2 transaction: 32B – Alignment: 32B (0, 32, 64, 96, …)

Page 38: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY TRANSACTIONS A warp issues 32x4B aligned and consecutive loads/stores

Threads read different elements of the same 128B segment

1x L1 transaction: 128B needed / 128B transferred

4x L2 transactions: 128B needed / 128B transferred

1x 128B L1 transaction per warp

4x 32B L2 transactions per warp

Page 39: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY TRANSACTIONS Threads in a warp read/write 4B words, 128B between words

Each thread reads the first 4B of a 128B segment

32x L1 transactions: 128B needed / 32x 128B transferred

32x L2 transactions: 128B needed / 32x 32B transferred

1x 128B L1 transaction per thread

1x 32B L2 transaction per thread

32x

Page 40: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

Threads 24-31 Threads 0-7

TRANSACTIONS AND REPLAYS A warp reads from addresses spanning 3 lines of 128B

1 instr. executed and 2 replays = 1 request and 3 transactions

Threads 8-15

Threads 16-23

Time

Instruction issued Instruction re-issued

1st replay

Threads

0-7/24-31

Threads

8-15

Instruction re-issued

2nd replay

Threads

16-23

Page 41: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

TRANSACTIONS AND REPLAYS With replays, requests take more time and use more resources

— More instructions issued

— More memory traffic

— Increased execution time

Inst. 0

Issued

Inst. 1

Issued

Inst. 2

Issued

Execution time

Threads

0-7/24-31

Threads

8-15

Threads

16-23

Inst. 0

Completed

Inst. 1

Completed

Inst. 2

Completed

Threads

0-7/24-31

Threads

8-15

Threads

16-23

Transfer data for inst. 0

Transfer data for inst. 1

Transfer data for inst. 2

Extra latency Extra work (SM)

Extra memory traffic

Page 42: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

TRANSACTIONS PER REQUEST Transactions per Request: 4.20 (Load) / 4.00 (Store)

Too many memory transactions (too much pressure on LSU)

Page 43: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

TRANSACTIONS PER REQUEST Our blocks are 8x16

We should use blocks of size 32x4 (or 32x8)

Warp 0

Warp 1

Warp 2

Warp 3

27 28 29 30

36 37 38

44 45 46

52 53 54

21 22

13 14

20

12

4 5 6

24 25 26

32 33 34

40 41 42

48 49 50

16 17 18

8 9 10

0 1 2

19

11

3

51

43

35

31

39

47

55

23

15

7

60 61 62 56 57 58 59 63

4 5 6 0 1 2 3 7 13 14 12 8 9 10 11 15 21 22 20 16 17 18 19 23 27 28 29 30 24 25 26 31

36 37 38 32 33 34 35 39 44 45 46 40 41 42 43 47 52 53 54 48 49 50 51 55 60 61 62 56 57 58 59 63

threadIdx.x

… 64 65 66

Page 44: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

IMPROVED MEMORY ACCESSES Improved memory accesses: Blocks of size 32x4

It runs faster: 3.605ms

Kernel Time Speedup

Original version 6.265ms

Larger blocks 6.075ms 1.03x

Better memory accesses 3.605ms 1.74x

Page 45: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

ITERATION 3

Page 46: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

TRACE THE APPLICATION

The hotspot is still gaussian_filter_7x7_v0

Hotspot

Kernel Time Speedup

Original version 6.265ms

Larger blocks 6.075ms 1.03x

Better memory accesses 3.605ms 1.74x

We use the same block size for the Sobel filter kernel. That’s the reason why it also improves (2nd row of the Nsight table).

Page 47: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 2%

Not limited by memory bandwidth

Page 48: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by instruction throughput

Page 49: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

LATENCY (OCCUPANCY) Limited occupancy and not enough eligible warps/cycle (2.0)

Need more active warps (i.e. occupancy)

Page 50: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

LATENCY (OCCUPANCY) Limited by register usage

Page 51: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

REDUCE REGISTER USAGE Use the __launch_bounds__ attribute

10 gives the best results (48 registers): 1.949ms

__global__ __launch_bounds__(128, 10) void gaussian_filter_7x7_v1(int w, int h, const uchar *src, uchar *dst)

Number of threads per block Minimum number of blocks

Kernel Time Speedup

Original version 6.265ms

Larger blocks 6.075ms 1.03x

Better memory accesses 3.605ms 1.74x

Fewer registers 1.949ms 3.21x

Page 52: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

ITERATION 4

Page 53: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

TRACE THE APPLICATION

The hotspot is gaussian_filter_7x7_v1

Hotspot

Kernel Time Speedup

Original version 6.265ms

Larger blocks 6.075ms 1.03x

Better memory accesses 3.605ms 1.74x

Fewer registers 1.949ms 3.21x

Page 54: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 2%

Not limited by memory bandwidth

Page 55: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

INSTRUCTION THROUGHPUT Neither schedulers nor pipes are saturated

Not limited by instruction throughput

Page 56: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

LATENCY (OCCUPANCY) Enough active and eligible warps per cycle

Not limited by a lack of occupancy

Page 57: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

BRANCH DIVERGENCE Threads of a warp take different branches of a conditional

if( threadIdx.x < 12 ) {}

else {}

Time

Threads execute the “if” branch Threads execute the “else” branch

Execution time = “if” branch + “else” branch

Page 58: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

BRANCH DIVERGENCE But no divergence in our code

Page 59: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

OTHER IDEAS Shared memory bank conflicts

— Conflicts: Two threads read different addresses from a same bank

Excessive usage of synchronizations (__syncthreads)

But those symptoms do not affect our case

Page 60: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY TRANSFERS Our image size: 2560x1600 = 4MB

We read 385MB from L2$: Too much traffic!

Page 61: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY TRANSFERS We do not saturate the issue slot of the Load-Store unit

But we saturate inside the Load-Store unit

Unfortunately, we cannot detect that with Nsight (yet)

Page 62: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

Adjacent pixels access have neighbors in common

We should use shared memory to store those common pixels

SHARED MEMORY

__shared__ unsigned char smem_pixels[10][64];

Page 63: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

USE SHARED MEMORY Use shared memory to keep data on the SM: 1.211ms

Kernel Time Speedup

Original version 6.265ms

Larger blocks 6.075ms 1.03x

Better memory accesses 3.605ms 1.74x

Fewer registers 1.949ms 3.21x

Shared memory 1.211ms 5.17x

Page 64: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

ITERATION 5

Page 65: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

TRACE THE APPLICATION

The hotspot is gaussian_filter_7x7_v2

Hotspot

Kernel Time Speedup

Original version 6.265ms

Larger blocks 6.075ms 1.03x

Better memory accesses 3.605ms 1.74x

Fewer registers 1.949ms 3.21x

Shared memory 1.211ms 5.17x

Page 66: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MEMORY BANDWIDTH Utilization of L2$ BW low and DRAM BW < 4%

Not limited by memory bandwidth

Page 67: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

INSTRUCTION THROUGHPUT Not saturating the schedulers

But we use 73% of the Load-Store issue slot

Page 68: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

LOAD-STORE INSTRUCTIONS LSU executes global and shared memory instructions

Change global loads to use the Read-Only path

Page 69: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

READ-ONLY CACHE (TEXTURE UNITS)

SMEM/L1$

Registers

SM

SMEM/L1$

Registers

SM

Global Memory (Framebuffer)

L2$

Texture Units Texture Units Skip LSU

Cache loads

Page 70: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

READ-ONLY PATH Annotate our pointer with const __restrict

The compiler generates LDG instructions: 1.019ms

Kernel Time Speedup

Original version 6.265ms

Larger blocks 6.075ms 1.03x

Better memory accesses 3.605ms 1.74x

Fewer registers 1.949ms 3.21x

Shared memory 1.211ms 5.17x

Read-Only path 1.019ms 6.15x

__global__ void gaussian_filter_7x7_v3(int w, int h, const uchar *__restrict src, uchar *dst)

Page 71: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

INSTRUCTION THROUGHPUT We are doing much better

Things to investigate next:

— Improve memory efficiency

— Reduce computational intensity (separable filter)

Page 72: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

MORE IN OUR COMPANION CODE

Kernel Time Speedup

Original version 6.265ms

Larger blocks 6.075ms 1.03x

Better memory accesses 3.605ms 1.74x

Fewer registers 1.949ms 3.21x

Shared memory 1.211ms 5.17x

Read-Only path 1.019ms 6.15x

Separable filter 0.656ms 9.55x

Process two pixels per thread (improve memory efficiency + add ILP) 0.511ms 12.26x

Use 64-bit shared memory (remove bank conflicts) 0.499ms 12.56x

Use float instead of int (increase instruction throughput) 0.434ms 14.44x

Your next idea!!!

https://github.com/jdemouth/nsight-gtc2014

Page 73: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

CONCLUSION

Page 74: CUDA Optimization with NVIDIA Nsight™ Visual Studio ...€¦ · Microsoft Visual Studio 2012 NVIDIA CUDA 6.0 NVIDIA Nsight Visual Studio Edition 4.0 . BEFORE WE START Some slides

OPTIMIZATION METHOD Trace the Application

Identify the Hot Spot and Profile it

Identify the Performance Limiter

— Memory Bandwidth

— Instruction Throughput

— Latency

Optimize the Code

Iterate

https://github.com/jdemouth/nsight-gtc2014