DEEP DIVE INTO DYNAMIC PARALLELISM

April 4-7, 2016 | Silicon Valley

SHANKARA RAO THEJASWI NANDITALE, NVIDIA
CHRISTOPH ANGERER, NVIDIA


OVERVIEW AND INTRODUCTION


WHAT IS DYNAMIC PARALLELISM?

The ability to launch new kernels from the GPU

Dynamically - based on run-time data

Simultaneously - from multiple threads at once

Independently - each thread can launch a different grid

Introduced with CUDA 5.0 and compute capability 3.5 and up

[Diagram: Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.]
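As a minimal, hedged sketch of what this looks like in code (kernel names are illustrative, not from the slides; building requires relocatable device code, e.g. nvcc -arch=sm_35 -rdc=true -lcudadevrt):

#include <cstdio>

__global__ void childKernel(int i) {
    printf("child grid launched for i = %d\n", i);
}

__global__ void parentKernel() {
    // Per-thread, asynchronous launch: each thread launches its own child grid,
    // with a configuration that can depend on run-time data.
    childKernel<<<1, 1>>>(threadIdx.x);
}

int main() {
    parentKernel<<<1, 4>>>();   // 4 parent threads, 4 independent child grids
    cudaDeviceSynchronize();    // host waits for the parent and all its children
    return 0;
}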


DYNAMIC PARALLELISM

[Diagram: CPU-GPU launch flow, with the GPU generating work for itself.]


AN EASY-TO-PARALLELIZE PROGRAM

for i = 1 to N
    for j = 1 to M
        convolution(i, j)
    next j
next i

[Illustration: a regular N x M iteration space.]


A DIFFICULT-TO-PARALLELIZE PROGRAM

for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i


A DIFFICULT-TO-PARALLELIZE PROGRAM

for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i

Bad alternative #1: Idle Threads. Pad every row to max(x[i]) columns and launch an N x max(x[i]) grid, leaving the threads with j > x[i] idle.

Bad alternative #2: Tail Effect. Launch the N variable-length rows as fixed-size work, so the longest rows dominate the end of the run.


DYNAMIC PARALLELISM

Serial Program:

for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i

CUDA Program with Dynamic Parallelism:

__global__ void convolution(int x[])
{
    for j = 1 to x[blockIdx]
        kernel<<< ... >>>(blockIdx, j)
}

void main()
{
    setup(x);
    convolution<<< N, 1 >>>(x);   // one parent block per row i
}
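For reference, a runnable version of this pattern might look as follows; the child kernel convolutionChild, the row-length data, and the launch configurations are illustrative assumptions, not the presenters' code (build with nvcc -arch=sm_35 -rdc=true -lcudadevrt):

#include <cstdio>

// Hypothetical child: processes element (i, j) of the ragged iteration space.
__global__ void convolutionChild(int i, int j) {
    printf("convolution(%d, %d)\n", i, j);
}

// One parent block per row i; thread 0 launches one child grid per column j.
__global__ void convolution(int* x) {
    if (threadIdx.x == 0) {
        for (int j = 1; j <= x[blockIdx.x]; ++j)
            convolutionChild<<<1, 1>>>(blockIdx.x, j);
    }
}

int main() {
    const int N = 4;
    int hostX[N] = {1, 3, 2, 4};          // made-up row lengths
    int* x;
    cudaMalloc(&x, N * sizeof(int));
    cudaMemcpy(x, hostX, N * sizeof(int), cudaMemcpyHostToDevice);
    convolution<<<N, 1>>>(x);
    cudaDeviceSynchronize();
    cudaFree(x);
    return 0;
}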


EXPERIMENT

* Device/SDK = K40m/v7.5
* K40m-CPU = E5-2690

[Chart: Time (ms), lower is better, vs. matrix size (512 to 16384), comparing dynpar, idleThreads, and tailEffect; y-axis 0 to 300 ms.]


LAUNCH EXAMPLE

[Diagram sequence: Grid A (block A0) runs on an SM; the Grid Scheduler dispatches grids; Task Tracking Structures in memory hold an A0 Tracking Structure for pending children.]

1. A thread in block A0 executes B<<<1,1>>>(), which becomes cudaLaunchDevice( B, 1, 1 );
2. Allocate a Task data structure.
3. Fill out the Task data structure for B.
4. Track Task B in block A0's tracking structure.
5. Launch Task B to the GPU: the Grid Scheduler makes Grid B (block B0) resident.
6. A thread in A0 executes C<<<1,1>>>(), i.e. cudaLaunchDevice( C, 1, 1 );
7. Allocate, fill out, and track Task C in block A0.
8. Task C is not yet runnable: track C to run after B.
9. Task B completes; SKED runs the scheduler kernel.
10. The scheduler kernel searches for work.
11. The scheduler completes B and identifies C as ready-to-run.
12. The scheduler frees B's task structure for re-use and launches C to the Grid Scheduler.
13. Task C now executes as Grid C (block C0).
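The B-before-C ordering in the walkthrough is exactly what same-block launches into the default stream give you. A minimal sketch (the kernel bodies are my own placeholders):

#include <cstdio>

__global__ void B() { printf("B runs first\n"); }
__global__ void C() { printf("C runs after B\n"); }

__global__ void A() {
    // Both launches go to this block's implicit NULL stream,
    // so C is tracked to run after B, as in the walkthrough.
    B<<<1, 1>>>();
    C<<<1, 1>>>();
}

int main() {
    A<<<1, 1>>>();            // parent grid A
    cudaDeviceSynchronize();  // wait for A and all its children
    return 0;
}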


BASIC RULES

Programming Model

Essentially the same as CUDA

Launch is per-thread and asynchronous

Sync is per-block

CUDA primitives are per-block (cannot pass streams/events to children)

cudaDeviceSynchronize() != __syncthreads()

Events allow inter-stream dependencies

Streams are shared within a block

Implicit NULL stream results in ordering within a block; use named streams

[Timeline: a CPU thread launches Grid A (parent); a Grid A thread launches Grid B (child); Grid B completes before Grid A completes.]

CUDA API available on the device: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#api-reference
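Per the rules above, a named per-thread stream avoids the implicit NULL-stream ordering. A hedged sketch (the child kernel is a placeholder; the device runtime requires streams to be created with the non-blocking flag):

__global__ void child() { /* ... */ }

__global__ void parent() {
    // A named device stream; on the device, streams must be created
    // with cudaStreamCreateWithFlags(..., cudaStreamNonBlocking).
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);

    // Children launched into s are not ordered against this block's
    // NULL stream, so independent launches can proceed concurrently.
    child<<<1, 32, 0, s>>>();

    cudaStreamDestroy(s);  // resources are released once the work completes
}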


MEMORY CONSISTENCY RULES

Memory Model

Launch implies membar (child sees parent state at time of launch)

Sync implies invalidate (parent sees child writes after sync)

Texture changes by child are visible to parent after sync (i.e. sync == tex cache invalidate)

Constants are immutable

Local & shared memory are private: cannot be passed as child kernel args

[Timeline: parent Grid A launches child Grid B; global memory is fully consistent at launch and again after synchronization.]
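The first two rules in code form; a hedged sketch using a made-up global counter:

__device__ int value;

__global__ void child() {
    value += 1;               // child sees the parent's write: launch implies membar
}

__global__ void parent() {
    value = 42;               // visible to the child at launch time
    child<<<1, 1>>>();
    cudaDeviceSynchronize();  // sync implies invalidate: parent now sees child writes
    // value == 43 from here on
}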


EXPERIMENTS


DIRECTED BENCHMARKS

Kernels written to measure specific aspects of dynamic parallelism

Launch throughput

Launch latency

As a function of different configurations

SDK Versions

Varying Clocks


RESULTS – LAUNCH THROUGHPUT


LAUNCH THROUGHPUT

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/875
* K40m-CPU = E5-2690
* Host launches are with 32 streams

[Chart: Grids/sec vs. number of child kernels launched (32 to 65536), comparing K40m device-side launches with K40m-CPU host-side launches; y-axis 0 to 1,800,000 grids/sec.]


LAUNCH THROUGHPUT: Observations

Device-side launch throughput is about an order of magnitude higher than host-side

Dynamic parallelism is very useful when there are a lot of child kernels

Two major limiters of launch throughput:

Pending Launch Count

Grid Scheduler Limit


PENDING LAUNCH COUNT

* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent different pending launch count limits

[Chart: Grids/sec vs. number of child kernels launched (32 to 65536), with curves for pending launch count limits of 1024, 4096, 16384, and 32768; y-axis 0 to 1,800,000 grids/sec.]


PENDING LAUNCH COUNT: Observations

A pre-allocated buffer in global memory stores launched kernels before they run

Default value: 2048 kernels

If the buffer overflows, it is resized on the go

Resizing causes a substantial reduction in launch throughput!

Know your number of pending child kernels, and size the buffer accordingly!


PENDING LAUNCH COUNT: CUDA APIs

Setting the limit:

cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, yourLimit);

Querying the limit:

cudaDeviceGetLimit(&yourLimit, cudaLimitDevRuntimePendingLaunchCount);
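In context, a hedged host-side sketch; the 16384 figure is an arbitrary example, not a recommendation from the talk:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    size_t limit = 0;
    cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
    printf("pending launch count limit: %zu\n", limit);   // 2048 by default

    // Assumption for illustration: the application keeps up to ~16K children
    // pending, so grow the buffer up front and avoid the resize penalty.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);
    return 0;
}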


GRID SCHEDULER LIMIT

* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent the total number of child kernels launched

[Chart: Grids/sec vs. number of device streams (8 to 256), with curves for 512 to 16384 total child kernels; y-axis 0 to 3,000,000 grids/sec.]


GRID SCHEDULER LIMIT: Observations

The limit reflects the grid scheduler's ability to track concurrent kernels

The limit is currently 32

Crossing this limit costs up to 50% of launch throughput


RESULTS – LAUNCH LATENCY


LAUNCH LATENCY

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875
* K40m-CPU = E5-2690
* Host launches are with 32 streams

[Chart: Time (ns) for initial and subsequent launches, K40m device-side vs. K40m-CPU host-side; y-axis 0 to 30,000 ns.]


LAUNCH LATENCY: Observations

Initial and subsequent device-side latencies are about 2-3x higher than host launches

Dynamic Parallelism may not be a good choice currently when:

Only a few child kernels are launched

Kernel launches are serial

We are working towards improving this**

** Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, Jin Wang and Sudhakar Yalamanchili, 2014 IEEE International Symposium on Workload Characterization (IISWC).


LAUNCH LATENCY - STREAMS

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875

[Charts: Time (ns) vs. number of streams (2 to 16). Host streams (left panel): y-axis 0 to 350,000 ns. Device streams (right panel): y-axis 0 to 18,000 ns.]


LAUNCH LATENCY - STREAMS: Observations

Host streams affect device-side launch latency

Prefer device streams for dynamic parallelism


RESULTS – DEVICE SYNCHRONIZE


DEVICE SYNCHRONIZE

cudaDeviceSynchronize is costly

Avoid it when possible; see the example below

__global__ void parent()
{
    doSomeInitialization();
    childKernel<<<grid, blk>>>();
    cudaDeviceSynchronize();   // Unnecessary: the implicit join enforced by the
                               // programming model already waits for the children
}


DEVICE SYNCHRONIZE - COST

* Device/SDK = K40/v7.5

[Chart: Time (ms) vs. amount of work per thread (2 to 32; the higher the number, the more the work), comparing sync and nosync; y-axis 0 to 7 ms.]


DEVICE SYNCHRONIZE DEPTH

The deepest recursion level at which cudaDeviceSynchronize still works

Controlled by the CUDA limit cudaLimitDevRuntimeSyncDepth

Default is level 2

Raising it costs extra global memory, reserved for storing parent blocks
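A hedged host-side sketch; the depth of 4 is an arbitrary example:

#include <cuda_runtime.h>

int main() {
    // Allow cudaDeviceSynchronize from kernels up to nesting depth 4.
    // This reserves extra global memory for storing suspended parent
    // blocks, so set it no deeper than the algorithm actually syncs.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);
    return 0;
}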


DEVICE SYNCHRONIZE DEPTH: Memory Usage

[Chart: Memory Reserved (MB) vs. Device Synchronize Depth (2 to 5); y-axis 0 to 800 MB.]


DEVICE SYNCHRONIZE DEPTH: Error Handling

cudaDeviceSynchronize fails silently beyond the set SyncDepth

Use cudaGetLastError on the device to inspect the error

[Diagram: kernels nested at depths 1 through 5 with SyncDepth=2; synchronizing from kernels deeper than the limit fails.]
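A hedged device-side sketch of that check (the kernel and message are illustrative):

#include <cstdio>

__global__ void deepKernel() {
    // At depths within cudaLimitDevRuntimeSyncDepth this succeeds; beyond
    // it, the sync fails silently and only cudaGetLastError reveals it.
    cudaDeviceSynchronize();
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess)
        printf("device-side sync failed: error %d\n", (int)err);
}

int main() {
    deepKernel<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}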


DYNAMIC PARALLELISM - LIMITS


DYNAMIC PARALLELISM: Limits

Recursion depth is currently 24

Maximum size of formal parameters to a child kernel is 4096 B

Violating the parameter-size limit causes a compile-time error

Runtime exceptions in child kernels are only visible from the host side
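One common way around the 4096 B parameter limit is to pass a pointer to global memory instead of a large by-value argument; a hedged sketch with made-up types (recall that local and shared memory cannot be passed to children, but global memory can):

// Hypothetical oversized argument block: 16 KB by value, well over the limit.
struct BigParams { float coeffs[4096]; };

__global__ void child(const BigParams* p) {   // a pointer: 8 bytes of formals
    float c0 = p->coeffs[0];
    // ... use the staged parameters ...
}

__global__ void parent(BigParams* p) {
    // p points to global memory (e.g. allocated with cudaMalloc on the host),
    // so it is legal and cheap to hand to a child grid.
    child<<<1, 32>>>(p);
}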


ERROR HANDLING

Runtime exceptions in child kernels

Visible only from the host side

Use nvcc -lineinfo together with cuda-memcheck to locate the error

__global__ void child(float* arr)
{
    arr[0] = 1.0f;                        // NULL dereference: runtime exception
}

__global__ void parent()
{
    child<<<1,1>>>(NULL);
    cudaDeviceSynchronize();
    printf("%d\n", cudaGetLastError());   // Control never reaches here!
}

// Host side:
parent<<<1,1>>>();
cudaError_t err = cudaDeviceSynchronize();   // Error caught here


SUCCESS STORIES


FMM: Fast Multipole Method

• Solving the N-body problem

• Computational complexity O(n)

• Tree-based approach

Image source: http://www.bu.edu/pasi/courses/12-steps-to-having-a-fast-multipole-method-on-gpus/


FMM (2)

• Dynamic 1: launch child grids for neighbors and children

• Dynamic 2: launch child grids for children only

• Dynamic 3: launch child grids for children only; start only p2 kernel threads; use shared GPU memory

Performance

[Performance chart: lower is better.]

From: FMM goes GPU - A smooth trip or bumpy ride?, B. Kohnke and I. Kabadshow, MPI BPC Göttingen & Jülich Supercomputing Centre, GTC 2015


PANDA: anti-Proton ANnihilation at DArmstadt

• State-of-the-art hadron particle physics experiment


PANDA (2)

Performance and Reasons for Improvements

• Avoiding extra PCI-e data transfers
  • Launch configurations have data dependencies
• Higher launch throughput
• Reducing false dependencies between kernel launches
  • Waiting on a stream prevents enqueuing work into other streams

Source: A CUDA Dynamic Parallelism Case Study: PANDA, Andrew Adinetz, http://devblogs.nvidia.com/parallelforall/a-cuda-dynamic-parallelism-case-study-panda/


SUMMARY


WHEN TO USE CUDA DYNAMIC PARALLELISM: Three Good Reasons

• Algorithmic: “Dynamically Formed Pockets of Structured Parallelism”*

• Unbalanced load (e.g., vertex expansion in graphs, compressed sparse row)

• Tree traversal (fat and shallow computation trees)

• Adaptive Mesh Refinement

• Performance:

• Improve launch throughput

• Reduce PCIe traffic and false dependencies

• Maintenance:

• Simplified, more natural program flow

*) from: Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, J. Wang and S. Yalamanchili, IISWC 2014


REFERENCES

CUDA C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-dynamic-parallelism

Adaptive Parallel Computation with CUDA Dynamic Parallelism, https://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/

FMM goes GPU, B. Kohnke and I. Kabadshow, GTC 2015, https://shar.es/1Y38Vf


April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join