April 4-7, 2016 | Silicon Valley
SHANKARA RAO THEJASWI NANDITALE, NVIDIA
CHRISTOPH ANGERER, NVIDIA
DEEP DIVE INTO DYNAMIC PARALLELISM
OVERVIEW AND INTRODUCTION
WHAT IS DYNAMIC PARALLELISM?
The ability to launch new kernels from the GPU
Dynamically - based on run-time data
Simultaneously - from multiple threads at once
Independently - each thread can launch a different grid
Introduced with CUDA 5.0 and compute capability 3.5 and up
[Figure: two CPU-GPU diagrams. Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.]
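The capability described above can be sketched as a minimal program (hypothetical kernel names; requires compute capability 3.5+ and compilation with nvcc -arch=sm_35 -rdc=true -lcudadevrt):

```cuda
#include <cstdio>

// Child grid: launched from the GPU, not the CPU.
__global__ void childKernel(int parent)
{
    printf("child launched by parent thread %d\n", parent);
}

// Parent grid: each thread can independently decide, at run time,
// whether and how to launch a child grid.
__global__ void parentKernel(const int *work)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (work[tid] > 0)                       // dynamic, data-dependent decision
        childKernel<<<1, work[tid]>>>(tid);  // each thread may launch a different grid
}

int main()
{
    int h_work[4] = {1, 0, 2, 3};
    int *d_work;
    cudaMalloc(&d_work, sizeof(h_work));
    cudaMemcpy(d_work, h_work, sizeof(h_work), cudaMemcpyHostToDevice);

    parentKernel<<<1, 4>>>(d_work);
    cudaDeviceSynchronize();
    cudaFree(d_work);
    return 0;
}
```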
DYNAMIC PARALLELISM
[Figure: CPU-GPU diagrams illustrating the GPU generating its own work.]
AN EASY TO PARALLELIZE PROGRAM
for i = 1 to N
  for j = 1 to M
    convolution(i, j)
  next j
next i

[Figure: an N x M grid of independent convolution tasks.]
A DIFFICULT TO PARALLELIZE PROGRAM

for i = 1 to N
  for j = 1 to x[i]
    convolution(i, j)
  next j
next i

[Figure: the iteration space is N rows of varying width x[i]. Bad alternative #1: Idle Threads (launch a grid of N x max(x[i]) threads and let the excess threads idle). Bad alternative #2: Tail Effect.]
DYNAMIC PARALLELISM

Serial Program:

for i = 1 to N
  for j = 1 to x[i]
    convolution(i, j)
  next j
next i

CUDA Program With Dynamic Parallelism:

__global__ void convolution(int x[])
{
    for j = 1 to x[blockIdx]
        kernel<<< ... >>>(blockIdx, j)
}

void main()
{
    setup(x);
    convolution<<< N, 1 >>>(x);
}
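The slide's pseudocode can be filled out into a compilable sketch; convolutionStep is a hypothetical stand-in for the real per-(i, j) computation:

```cuda
#include <cstdio>

// Hypothetical per-(i, j) work; a placeholder for the real computation.
__global__ void convolutionStep(int i, int j, float *out)
{
    atomicAdd(&out[i], (float)j);
}

// One parent block per row i; each block launches exactly x[i] child
// grids, so there are no idle threads and no tail effect.
__global__ void convolution(const int *x, float *out)
{
    int i = blockIdx.x;
    for (int j = 1; j <= x[i]; ++j)
        convolutionStep<<<1, 1>>>(i, j, out);
}

int main()
{
    const int N = 4;
    int h_x[N] = {3, 1, 4, 2};          // per-row work counts ("setup(x)")
    int *d_x;
    float *d_out;
    cudaMalloc(&d_x, sizeof(h_x));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemset(d_out, 0, N * sizeof(float));
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    convolution<<<N, 1>>>(d_x, d_out);  // one block per row, as on the slide
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_out);
    return 0;
}
```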
EXPERIMENT
* Device/SDK = K40m/v7.5
* K40m-CPU = E5-2690
[Chart: Time (ms, lower is better) vs. Matrix Size (512 to 16384), comparing dynpar, idleThreads, and tailEffect.]
LAUNCH EXAMPLE

[Diagram sequence: four SMs feeding a Grid Scheduler; block A0 of Grid A performs device-side launches, which are recorded in per-block Task Tracking Structures. The steps below follow the animation.]

1. Block A0 calls B<<<1,1>>>(), i.e. cudaLaunchDevice( B, 1, 1 ).
2. A Task data structure is allocated.
3. The Task data structure is filled out.
4. Task B is tracked in block A0's tracking structure.
5. Task B is launched to the GPU; Grid A and Grid B now run.
6. Block A0 calls C<<<1,1>>>(), i.e. cudaLaunchDevice( C, 1, 1 ).
7. Task C is allocated, filled out, and tracked in block A0.
8. Task C is not yet runnable; it is tracked to run after B.
9. Task B completes; SKED runs the scheduler kernel.
10. The scheduler searches for work.
11. The scheduler completes B and identifies C as ready-to-run.
12. The scheduler frees B's task structure for re-use and launches C to the Grid Scheduler.
13. Task C now executes as Grid C.
BASIC RULES
Programming Model
Essentially the same as CUDA
Launch is per-thread and asynchronous
Sync is per-block
CUDA primitives are per-block (cannot pass streams/events to children)
cudaDeviceSynchronize() != __syncthreads()
Events allow inter-stream dependencies
Streams are shared within a block
Implicit NULL stream results in ordering within a block; use named streams
[Figure: timeline showing a CPU thread launching parent Grid A, a Grid A thread launching child Grid B, and Grid B completing before Grid A completes.]
CUDA API available on the device: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#api-reference
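A minimal sketch of the named-streams rule, assuming hypothetical kernel names. Launches are per-thread and asynchronous; because the implicit NULL stream orders all launches within the block, each thread creates its own named stream (device streams must be created non-blocking):

```cuda
#include <cstdio>

__global__ void child(int parentThread)
{
    printf("child of parent thread %d\n", parentThread);
}

// Each parent thread launches its child into its own named stream,
// so the launches are not serialized on the implicit NULL stream.
__global__ void parent()
{
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);  // required flag for device streams
    child<<<1, 1, 0, s>>>(threadIdx.x);
    cudaStreamDestroy(s);
}

int main()
{
    parent<<<1, 4>>>();
    cudaDeviceSynchronize();  // children finish before the parent grid is complete
    return 0;
}
```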
MEMORY CONSISTENCY RULES
Memory Model
Launch implies membar (child sees parent state at time of launch)
Sync implies invalidate (parent sees child writes after sync)
Texture changes by child are visible to parent after sync (i.e. sync == tex cache invalidate)
Constants are immutable
Local & shared memory are private: cannot be passed as child kernel args
[Figure: the same parent/child timeline, annotated "Fully consistent" at the launch and sync points.]
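The memory rules above can be illustrated with a small sketch (hypothetical kernel names):

```cuda
__global__ void child(int *data)
{
    data[threadIdx.x] *= 2;
}

__global__ void parent(int *globalBuf)
{
    __shared__ int tile[32];
    tile[threadIdx.x] = globalBuf[threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0) {
        // OK: global memory. The launch implies a membar, so the child
        // sees the parent's writes made before the launch.
        child<<<1, 32>>>(globalBuf);

        // NOT allowed: shared (or local) memory is private to the block
        // and cannot be passed to a child kernel:
        // child<<<1, 32>>>(tile);

        // Sync implies invalidate: after this call, the parent sees
        // the child's writes.
        cudaDeviceSynchronize();
    }
    __syncthreads();
}

int main()
{
    int *buf;
    cudaMalloc(&buf, 32 * sizeof(int));
    cudaMemset(buf, 0, 32 * sizeof(int));
    parent<<<1, 32>>>(buf);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```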
EXPERIMENTS
DIRECTED BENCHMARKS
Kernels written to measure specific aspects of dynamic parallelism
Launch throughput
Launch latency
As a function of different configurations
SDK Versions
Varying Clocks
RESULTS – LAUNCH THROUGHPUT
LAUNCH THROUGHPUT
* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/875
* K40m-CPU = E5-2690
* Host launches are with 32 streams
[Chart: Grids/sec (0 to 1,800,000) vs. number of child kernels launched (32 to 65536), comparing K40m device-side launches with K40m-CPU host-side launches.]
LAUNCH THROUGHPUT
Observations
Device-side launch throughput is about an order of magnitude higher than from the host
Dynamic parallelism is very useful when there are many child kernels
Two major limiters of launch throughput:
Pending Launch Count
Grid Scheduler Limit
PENDING LAUNCH COUNT
* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent different pending launch count limits
[Chart: Grids/sec (0 to 1,800,000) vs. number of child kernels launched (32 to 65536), for pending launch count limits of 1024, 4096, 16384, and 32768.]
PENDING LAUNCH COUNT
Observations
A pre-allocated buffer in global memory stores kernels before their launch
Default size: 2048 kernels
If the buffer overflows, it is resized on the fly
This causes a substantial reduction in launch throughput!
Know the number of pending child kernels!
PENDING LAUNCH COUNT
CUDA APIs
Setting the limit:
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, yourLimit);
Querying the limit:
cudaDeviceGetLimit(&yourLimit, cudaLimitDevRuntimePendingLaunchCount);
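Putting the two APIs together, a host program can size the pending-launch buffer to match the expected number of in-flight children (hypothetical kernel names; the buffer must be configured before the parent kernel launches):

```cuda
#include <cstdio>

__global__ void child() {}

__global__ void parent()
{
    // Every parent thread enqueues one child; with many parent threads
    // the default buffer of 2048 pending launches can overflow.
    child<<<1, 1>>>();
}

int main()
{
    size_t limit = 0;
    cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
    printf("default pending launch count: %zu\n", limit);

    // Raise the limit to match the number of child kernels we expect.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);

    parent<<<64, 256>>>();   // 16384 child launches
    cudaDeviceSynchronize();
    return 0;
}
```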
GRID SCHEDULER LIMIT
* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent the total number of child kernels launched
[Chart: Grids/sec (0 to 3,000,000) vs. number of device streams (8 to 256), for 512 to 16384 total child kernels.]
GRID SCHEDULER LIMIT
Observations
The grid scheduler can track only a limited number of concurrent kernels
The limit is currently 32
If this limit is exceeded, launch throughput can drop by up to 50%
RESULTS – LAUNCH LATENCY
LAUNCH LATENCY
* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875
* K40m-CPU = E5-2690
* Host launches are with 32 streams
[Chart: Time (ns, 0 to 30,000) for initial and subsequent launches, comparing K40m device-side launches with K40m-CPU host-side launches.]
LAUNCH LATENCY
Observations
Initial and subsequent device-side launch latencies are about 2-3x higher than host-side launches
Dynamic parallelism may currently not be a good choice when there are:
Only a few child kernels
Serial kernel launches
We are working towards improving this**
** Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, Jin Wang and Sudhakar Yalamanchili, 2014 IEEE International Symposium on Workload Characterization (IISWC).
LAUNCH LATENCY - STREAMS
* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875
[Charts: Time (ns) vs. number of streams (2 to 16). Left: host streams (0 to 350,000 ns). Right: device streams (0 to 18,000 ns).]
LAUNCH LATENCY - STREAMS
Observations
Host streams affect device-side launch latency
Prefer device streams for dynamic parallelism
RESULTS – DEVICE SYNCHRONIZE
DEVICE SYNCHRONIZE
cudaDeviceSynchronize is costly
Avoid it when possible, as in the example below:

__global__ void parent() {
    doSomeInitialization();
    childKernel<<<grid, blk>>>();
    cudaDeviceSynchronize();   // Unnecessary: an implicit join is enforced by the programming model!
}
DEVICE SYNCHRONIZE - COST
* Device/SDK = K40/v7.5
[Chart: Time (ms, lower is better) vs. amount of work per thread (2 to 32; the higher the number, the more the work), comparing sync and nosync variants.]
DEVICE SYNCHRONIZE DEPTH
The deepest recursion level at which cudaDeviceSynchronize still works
Controlled by the CUDA limit cudaLimitDevRuntimeSyncDepth
Default is level 2
Deeper levels come at the cost of extra global memory reserved for storing parent blocks
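The limit is set from the host before the parent kernel launches; a minimal sketch with a hypothetical recursive kernel:

```cuda
__global__ void leaf() {}

// Each level launches a child and synchronizes on it, so every level
// counts against cudaLimitDevRuntimeSyncDepth.
__global__ void nested(int depth)
{
    if (depth == 0) leaf<<<1, 1>>>();
    else            nested<<<1, 1>>>(depth - 1);
    cudaDeviceSynchronize();   // fails silently beyond the configured depth
}

int main()
{
    // Default depth is 2. Each extra level reserves additional global
    // memory for saving parent block state.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);

    nested<<<1, 1>>>(3);       // nesting levels 1..4: within the new limit
    cudaDeviceSynchronize();
    return 0;
}
```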
DEVICE SYNCHRONIZE DEPTH
Memory Usage
[Chart: Memory Reserved (MB, 0 to 800) vs. Device Synchronize Depth (2 to 5).]
DEVICE SYNCHRONIZE DEPTH
Error Handling
cudaDeviceSynchronize fails silently beyond the set SyncDepth
Use cudaGetLastError on the device to inspect the error
[Figure: kernels nested at depths 1 through 5 with SyncDepth=2.]
DYNAMIC PARALLELISM - LIMITS
DYNAMIC PARALLELISM
Limits
Recursion depth is currently 24
Maximum size of the formal parameters of a child kernel is 4096 B
A violation causes a compile-time error
Runtime exceptions in a child kernel are visible only from the host side
ERROR HANDLING
Runtime exceptions in child kernels
Visible only from the host side
Use nvcc's -lineinfo along with cuda-memcheck to locate the error

__global__ void child(float* arr) {
    arr[0] = 1.0f;   // Control never reaches here!
}

__global__ void parent() {
    child<<<1,1>>>(NULL);
    cudaDeviceSynchronize();
    printf("%d\n", cudaGetLastError());
}

// On the host:
parent<<<1,1>>>();
cudaError_t err = cudaDeviceSynchronize();   // Error caught here
SUCCESS STORIES
FMM
Fast Multipole Method
• Solving the N-body problem
• Computational complexity O(n)
• Tree-based approach
Image source: http://www.bu.edu/pasi/courses/12-steps-to-having-a-fast-multipole-method-on-gpus/
FMM (2)
Performance
• Dynamic 1: launch child grids for neighbors and children
• Dynamic 2: launch child grids for children only
• Dynamic 3: launch child grids for children only; start only p2 kernel threads; use shared GPU memory
[Chart: performance of the three variants; lower is better.]
From: FMM goes GPU - A smooth trip or bumpy ride?, B. Kohnke and I. Kabadshow, MPI BPC Göttingen & Jülich Supercomputing Centre, GTC 2015
PANDA
anti-Proton ANnihilation at DArmstadt
• State-of-the-art hadron particle physics experiment
PANDA (2)
Performance and Reasons for Improvements
• Avoiding extra PCI-e data transfers
• Launch configuration data dependencies
• Higher launch throughput
• Reducing false dependencies between kernel launches
• Waiting on a stream prevents enqueuing of work into other streams
Source: A CUDA Dynamic Parallelism Case Study: PANDA, Andrew Adinetz, http://devblogs.nvidia.com/parallelforall/a-cuda-dynamic-parallelism-case-study-panda/
SUMMARY
WHEN TO USE CUDA DYNAMIC PARALLELISM
Three Good Reasons
• Algorithmic: "Dynamically Formed Pockets of Structured Parallelism"*
  • Unbalanced load (e.g., vertex expansion in graphs, compressed sparse row)
  • Tree traversal (fat and shallow computation trees)
  • Adaptive Mesh Refinement
• Performance:
  • Improve launch throughput
  • Reduce PCIe traffic and false dependencies
• Maintenance:
  • Simplified, more natural program flow
*) from: Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, J. Wang and S. Yalamanchili, IISWC 2014
REFERENCES
CUDA-C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-dynamic-parallelism
Adaptive Parallel Computation with CUDA Dynamic Parallelism https://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/
FMM goes GPU, B. Kohnke and I. Kabadshow, GTC 2015, https://shar.es/1Y38Vf
April 4-7, 2016 | Silicon Valley
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join