
April 4-7, 2016 | Silicon Valley

SHANKARA RAO THEJASWI NANDITALE, NVIDIA

CHRISTOPH ANGERER, NVIDIA

DEEP DIVE INTO DYNAMIC PARALLELISM


OVERVIEW AND INTRODUCTION


WHAT IS DYNAMIC PARALLELISM?

The ability to launch new kernels from the GPU

Dynamically - based on run-time data

Simultaneously - from multiple threads at once

Independently - each thread can launch a different grid

Introduced with CUDA 5.0 and compute capability 3.5 and up


Fermi: only the CPU can generate GPU work

Kepler: the GPU can generate work for itself
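As a minimal sketch of the idea (kernel names here are illustrative, not from the slides; device-side launches need compute capability 3.5+ and relocatable device code):

#include <cstdio>

__global__ void childKernel(int parent)
{
    printf("child grid launched by parent thread %d\n", parent);
}

__global__ void parentKernel()
{
    // Each thread can launch its own child grid, based on run-time data.
    childKernel<<<1, 4>>>(threadIdx.x);
}

int main()
{
    parentKernel<<<1, 2>>>();   // launched from the CPU as usual
    cudaDeviceSynchronize();    // a parent grid is not complete until its children are
    return 0;
}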


DYNAMIC PARALLELISM


AN EASY-TO-PARALLELIZE PROGRAM

for i = 1 to N
    for j = 1 to M
        convolution(i, j)
    next j
next i

[Diagram: an N x M grid of independent convolution(i, j) work items]


A DIFFICULT-TO-PARALLELIZE PROGRAM

for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i


A DIFFICULT-TO-PARALLELIZE PROGRAM

for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i

[Diagram: N rows of work, each row i containing x[i] items, padded out to max(x[i]) columns]

Bad alternative #1: Idle Threads (launch an N x max(x[i]) grid; threads with j > x[i] do no work)

Bad alternative #2: Tail Effect (the longest rows keep the GPU busy at the end while most threads have already finished)


DYNAMIC PARALLELISM

Serial Program:

for i = 1 to N
    for j = 1 to x[i]
        convolution(i, j)
    next j
next i

CUDA Program with Dynamic Parallelism:

__global__ void convolution(int x[])
{
    for (int j = 1; j <= x[blockIdx.x]; j++)
        kernel<<< ... >>>(blockIdx.x, j);   // each block launches its own child grids
}

void main()
{
    setup(x);
    convolution<<< N, 1 >>>(x);   // one parent block per outer iteration i
}
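For reference, dynamic parallelism requires relocatable device code and linking against the device runtime library; assuming the file above is saved as convolution.cu, a typical build line is:

nvcc -arch=sm_35 -rdc=true convolution.cu -o convolution -lcudadevrt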


EXPERIMENT

* Device/SDK = K40m/v7.5

* K40m-CPU = E5-2690

[Chart: time in ms (lower is better) vs. matrix size (512 to 16384) for the dynpar, idleThreads, and tailEffect implementations]


LAUNCH EXAMPLE

[Diagram: SMs fed by the Grid Scheduler, Grid A running block A0, and per-block task tracking structures in device memory; the same diagram is updated at each step below.]

1. Block A0 of Grid A calls B<<<1,1>>>(), which the device runtime translates into cudaLaunchDevice( B, 1, 1 ).
2. A Task data structure for B is allocated.
3. The Task data structure is filled out.
4. Task B is tracked in block A0's tracking structure.
5. Task B is launched to the Grid Scheduler; Grid A and Grid B now run on the SMs.
6. Block A0 calls C<<<1,1>>>(), i.e. cudaLaunchDevice( C, 1, 1 ).
7. Task C is allocated, filled out, and tracked in block A0.
8. Task C is not yet runnable: it is tracked to run after B.
9. Task B completes; SKED runs the scheduler kernel on the GPU.
10. The scheduler kernel searches the tracking structures for work.
11. The scheduler completes B and identifies C as ready to run.
12. The scheduler frees B's task structure for re-use and launches C to the Grid Scheduler.
13. Task C now executes.
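In the programming model, this machinery simply gives the usual stream ordering: two child grids launched from the same thread into the same (implicit) stream run in launch order. A minimal sketch (kernel names B and C as in the walk-through, bodies illustrative):

#include <cstdio>

__global__ void B() { printf("B runs first\n"); }
__global__ void C() { printf("C runs after B\n"); }

__global__ void A()
{
    // Both launches go into the block's implicit NULL stream,
    // so the device runtime tracks C to run only after B completes.
    B<<<1, 1>>>();
    C<<<1, 1>>>();
}

int main()
{
    A<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}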


BASIC RULES

Programming Model

Essentially the same as CUDA

Launch is per-thread and asynchronous

Sync is per-block

CUDA primitives are per-block (streams/events cannot be passed to children)

cudaDeviceSynchronize() != __syncthreads()

Events allow inter-stream dependencies

Streams are shared within a block

The implicit NULL stream results in ordering within a block; use named streams

[Timeline: a CPU thread launches Grid A (parent); a Grid A thread launches Grid B (child); Grid B completes before Grid A completes.]

CUDA API available on the device: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#api-reference
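A small sketch of these rules (the child kernel and the work split are illustrative): each thread launches its own child grid into a named, non-blocking device stream so that launches from different threads in the block are not ordered through the NULL stream.

__global__ void childKernel(int parentThread)
{
    // ... per-child work ...
}

__global__ void parentKernel()
{
    // Launch is per-thread: every thread may launch its own child grid.
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);   // named device stream
    childKernel<<<1, 32, 0, s>>>(threadIdx.x);

    // Sync is per-block: this waits for children launched by any thread in
    // the block, which is not the same thing as __syncthreads().
    cudaDeviceSynchronize();
    cudaStreamDestroy(s);
}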


MEMORY CONSISTENCY RULES

Memory Model

Launch implies a memory barrier (the child sees the parent's state at the time of launch)

Sync implies invalidate (the parent sees the child's writes after sync)

Texture changes by the child are visible to the parent after sync (i.e. sync == texture cache invalidate)

Constants are immutable

Local & shared memory are private: they cannot be passed as child kernel arguments

[Timeline: Grid A (parent) and Grid B (child); memory is fully consistent between parent and child at Grid B's launch and again after Grid B completes.]
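A sketch of the two consistency points (the global variable data and the values are illustrative): the child sees the value the parent wrote before the launch, and the parent may read the child's result only after synchronizing.

#include <cstdio>

__device__ int data;

__global__ void child()
{
    data += 1;                  // sees the parent's write: launch implies membar
}

__global__ void parent()
{
    data = 41;                  // visible to the child at launch time
    child<<<1, 1>>>();
    cudaDeviceSynchronize();    // sync implies invalidate: the child's write is now visible
    printf("%d\n", data);       // prints 42
}

int main()
{
    parent<<<1, 1>>>();
    cudaDeviceSynchronize();
    return 0;
}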


EXPERIMENTS


DIRECTED BENCHMARKS

Kernels written to measure specific aspects of dynamic parallelism

Launch throughput

Launch latency

As a function of different configurations

SDK Versions

Varying Clocks


RESULTS – LAUNCH THROUGHPUT


LAUNCH THROUGHPUT

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/875

* K40m-CPU = E5-2690

* Host launches are with 32 streams

[Chart: grids/sec vs. number of child kernels launched (32 to 65536), comparing device-side launches (K40m) with host-side launches (K40m-CPU)]


LAUNCH THROUGHPUT

Observations

Device-side launch throughput is about an order of magnitude higher than launching from the host

Dynamic parallelism is very useful when there are a lot of child kernels

Two major limiters of launch throughput

Pending Launch Count

Grid Scheduler Limit


PENDING LAUNCH COUNT

* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875

* Different curves represent different pending launch count limits

[Chart: grids/sec vs. number of child kernels launched (32 to 65536), with one curve per pending launch count limit (1024, 4096, 16384, 32768)]


PENDING LAUNCH COUNT

Observations

Pre-allocated buffer in Global Memory to store kernels before their launch

Default value – 2048 kernels

If the buffer overflows, it is resized on the fly, which substantially reduces launch throughput!

Know the number of pending child kernels!


PENDING LAUNCH COUNT

CUDA APIs

Setting the limit:
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, yourLimit);

Querying the limit:
cudaDeviceGetLimit(&yourLimit, cudaLimitDevRuntimePendingLaunchCount);
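A short host-side sketch (the value 16384 is just an example): raise the limit before launching the parent grid that will perform the device-side launches.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Make room for up to 16K pending child launches before the parent runs.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);

    size_t limit = 0;
    cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
    printf("pending launch count = %zu\n", limit);

    // ... launch the parent kernel that spawns the children here ...
    return 0;
}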


GRID SCHEDULER LIMIT

* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875

* Different curves represent the total number of child kernels launched

[Chart: grids/sec vs. number of device streams (8 to 256), with one curve per total child kernel count (512 to 16384)]


GRID SCHEDULER LIMIT

Observations

The grid scheduler can only track a limited number of concurrent kernels

That limit is currently 32

If the limit is exceeded, launch throughput drops by up to 50%
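One way to keep the number of concurrently tracked child grids bounded (a sketch, not from the slides; assumes blockDim.x >= NUM_DEV_STREAMS) is to fan launches out over a fixed pool of device streams instead of one stream per launch:

#define NUM_DEV_STREAMS 32   // stay at or below the grid scheduler's concurrency limit

__global__ void childKernel(int i)
{
    // ... per-child work ...
}

__global__ void parentKernel(int nChildren)
{
    __shared__ cudaStream_t streams[NUM_DEV_STREAMS];

    // A bounded pool of named device streams, created once per block.
    if (threadIdx.x < NUM_DEV_STREAMS)
        cudaStreamCreateWithFlags(&streams[threadIdx.x], cudaStreamNonBlocking);
    __syncthreads();

    // Spread launches round-robin over at most 32 streams.
    for (int i = threadIdx.x; i < nChildren; i += blockDim.x)
        childKernel<<<1, 64, 0, streams[i % NUM_DEV_STREAMS]>>>(i);

    __syncthreads();
    if (threadIdx.x < NUM_DEV_STREAMS)
        cudaStreamDestroy(streams[threadIdx.x]);   // cleaned up once pending work completes
}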


RESULTS – LAUNCH LATENCY


LAUNCH LATENCY

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875

* K40m-CPU = E5-2690

* Host launches are with 32 streams

[Chart: launch latency in ns for device-side (K40m) and host-side (K40m-CPU) launches, split into initial and subsequent launches]


LAUNCH LATENCY

Observations

Initial and subsequent device-side launch latencies are about 2-3x higher than host-side launches

Dynamic Parallelism may currently not be a good choice when:

there are only a few child kernels

kernel launches are serial

We are working towards improving this**

** Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, Jin Wang and Sudhakar Yalamanchili, 2014 IEEE International Symposium on Workload Characterization (IISWC)


LAUNCH LATENCY - STREAMS

* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875

[Charts: device-side launch latency (ns) vs. number of streams; one panel varies the number of host streams (2 to 16), the other the number of device streams (2 to 16)]


LAUNCH LATENCY - STREAMS

Observations

Host streams affect device-side launch latency

Prefer device streams for dynamic parallelism


RESULTS – DEVICE SYNCHRONIZE


DEVICE SYNCHRONIZE

cudaDeviceSynchronize is costly

Avoid it when possible, example below

__global__ void parent()
{
    doSomeInitialization();
    childKernel<<<grid, blk>>>();
    cudaDeviceSynchronize();   // unnecessary: the implicit join enforced by
                               // the programming model already waits for the child
}
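By contrast, a sketch of a case where the synchronize is needed (childKernel and the output buffer are illustrative): the parent consumes the child's result before it continues.

#include <cstdio>

__global__ void childKernel(float *out)
{
    out[0] = 42.0f;
}

__global__ void parent(float *out)
{
    childKernel<<<1, 1>>>(out);
    cudaDeviceSynchronize();                 // required: the parent reads what the child wrote
    printf("child wrote %f\n", out[0]);      // safe only after the sync
}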


DEVICE SYNCHRONIZE - COST

* Device/SDK = K40/v7.5

[Chart: time (ms) vs. amount of work per thread (2 to 32; a higher number means more work), comparing the sync and nosync variants]


DEVICE SYNCHRONIZE DEPTH

The deepest nesting level at which device-side cudaDeviceSynchronize still works

The CUDA limit cudaLimitDevRuntimeSyncDepth controls it

The default is level 2

Raising it costs extra global memory reserved for storing parent blocks
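A host-side sketch of raising the limit before the top-level launch (the depth of 4 is just an example value):

#include <cuda_runtime.h>

int main()
{
    // Allow device-side cudaDeviceSynchronize() down to nesting depth 4.
    // This reserves additional global memory for saving parent-block state.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);

    size_t depth = 0;
    cudaDeviceGetLimit(&depth, cudaLimitDevRuntimeSyncDepth);

    // ... launch the top-level kernel here ...
    return 0;
}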


DEVICE SYNCHRONIZE DEPTH

Memory Usage

[Chart: memory reserved (MB) vs. device synchronize depth (2 to 5)]


DEVICE SYNCHRONIZE DEPTH

Error Handling

cudaDeviceSynchronize fails silently when called deeper than the configured SyncDepth

Use cudaGetLastError on the device to inspect the error

[Diagram: kernels nested from depth=1 to depth=5, with SyncDepth=2]


DYNAMIC PARALLELISM - LIMITS


DYNAMIC PARALLELISM

Limits

Recursion depth is currently 24

The maximum size of the formal parameters of a child kernel is 4096 B

Violating this limit causes a compile-time error

Runtime exceptions in a child kernel are only visible from the host side
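A sketch of keeping a recursive launch inside the depth limit by passing the current depth explicitly (the kernel and cut-off handling are illustrative; 24 is the stated limit):

__global__ void recurse(int depth)
{
    // ... do this level's work ...

    if (depth + 1 >= 24)          // stop before exceeding the recursion depth limit
        return;
    if (threadIdx.x == 0)
        recurse<<<1, 32>>>(depth + 1);
}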


ERROR HANDLING

Runtime Exceptions in Child Kernels

Visible only from the host side

Use nvcc -lineinfo together with cuda-memcheck to locate the error

__global__ void child(float* arr)
{
    arr[0] = 1.0f;                        // writes through a NULL pointer: runtime exception
}

__global__ void parent()
{
    child<<<1,1>>>(NULL);
    cudaDeviceSynchronize();
    printf("%d\n", cudaGetLastError());   // control never reaches here!
}

// Host side:
parent<<<1,1>>>();
cudaError_t err = cudaDeviceSynchronize();   // error caught here


SUCCESS STORIES


FMM

Fast Multipole Method

• Solving the N-body problem

• Computational complexity O(n)

• Tree-based approach

Image source: http://www.bu.edu/pasi/courses/12-steps-to-having-a-fast-multipole-method-on-gpus/


FMM (2)

• Dynamic 1: launch child grids for neighbors and children

• Dynamic 2: launch child grids for children only

• Dynamic 3: launch child grids for children only; start only p2 kernel threads; use shared GPU memory

[Performance chart (lower is better). From: FMM goes GPU: A smooth trip or bumpy ride?, B. Kohnke, I. Kabadshow, MPI BPC Göttingen & Jülich Supercomputing Centre, GTC 2015]


PANDA

anti-Proton ANnihilation at DArmstadt

• State-of-the-art hadron particle physics experiment


PANDA (2)

Performance and Reasons for Improvements

• Avoiding extra PCI-e data transfers
  • Launch configuration data dependencies
• Higher launch throughput
• Reducing false dependencies between kernel launches
  • Waiting on a stream prevents enqueuing work into other streams

Source: A CUDA Dynamic Parallelism Case Study: PANDA, Andrew Adinetz, http://devblogs.nvidia.com/parallelforall/a-cuda-dynamic-parallelism-case-study-panda/


SUMMARY


WHEN TO USE CUDA DYNAMIC PARALLELISM

Three Good Reasons

• Algorithmic: “Dynamically Formed Pockets of Structured Parallelism”*
  • Unbalanced load (e.g., vertex expansion in graphs, compressed sparse row)
  • Tree traversal (fat and shallow computation trees)
  • Adaptive Mesh Refinement
• Performance:
  • Improve launch throughput
  • Reduce PCIe traffic and false dependencies
• Maintenance:
  • Simplified, more natural program flow

*) from: Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, J. Wang and S. Yalamanchili, IISWC 2014


REFERENCES

CUDA-C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-dynamic-parallelism

Adaptive Parallel Computation with CUDA Dynamic Parallelism, https://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/

FMM goes GPU, B. Kohnke and I. Kabadshow, GTC 2015, https://shar.es/1Y38Vf

April 4-7, 2016 | Silicon Valley

THANK YOU

JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join