April 4-7, 2016 | Silicon Valley
SHANKARA RAO THEJASWI NANDITALE, NVIDIA
CHRISTOPH ANGERER, NVIDIA
DEEP DIVE INTO DYNAMIC PARALLELISM
OVERVIEW AND INTRODUCTION
WHAT IS DYNAMIC PARALLELISM?
The ability to launch new kernels from the GPU
Dynamically - based on run-time data
Simultaneously - from multiple threads at once
Independently - each thread can launch a different grid
Introduced with CUDA 5.0 and compute capability 3.5 and up
[Figure: two CPU-GPU diagrams. Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.]
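The capability described above can be sketched as a minimal program (hypothetical kernel names; requires compute capability 3.5+ and compilation with nvcc -arch=sm_35 -rdc=true -lcudadevrt):

```cuda
#include <cstdio>

// Child grid: launched from the GPU, not the CPU.
__global__ void childKernel(int parent)
{
    printf("child launched by parent thread %d\n", parent);
}

// Parent grid: each thread can independently decide, at run time,
// whether and how to launch a child grid.
__global__ void parentKernel(const int *work)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (work[tid] > 0)                       // dynamic, data-dependent decision
        childKernel<<<1, work[tid]>>>(tid);  // each thread may launch a different grid
}

int main()
{
    int h_work[4] = {1, 0, 2, 3};
    int *d_work;
    cudaMalloc(&d_work, sizeof(h_work));
    cudaMemcpy(d_work, h_work, sizeof(h_work), cudaMemcpyHostToDevice);

    parentKernel<<<1, 4>>>(d_work);
    cudaDeviceSynchronize();
    cudaFree(d_work);
    return 0;
}
```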
DYNAMIC PARALLELISM
[Figure: CPU-GPU diagrams illustrating the GPU generating its own work.]
AN EASY TO PARALLELIZE PROGRAM
for i = 1 to N
  for j = 1 to M
    convolution(i, j)
  next j
next i

[Figure: an N x M grid of independent convolution tasks.]
A DIFFICULT TO PARALLELIZE PROGRAM

for i = 1 to N
  for j = 1 to x[i]
    convolution(i, j)
  next j
next i

[Figure: the iteration space is N rows of varying width x[i]. Bad alternative #1: Idle Threads (launch a grid of N x max(x[i]) threads and let the excess threads idle). Bad alternative #2: Tail Effect.]
DYNAMIC PARALLELISM

Serial Program:

for i = 1 to N
  for j = 1 to x[i]
    convolution(i, j)
  next j
next i

CUDA Program With Dynamic Parallelism:

__global__ void convolution(int x[])
{
    for j = 1 to x[blockIdx]
        kernel<<< ... >>>(blockIdx, j)
}

void main()
{
    setup(x);
    convolution<<< N, 1 >>>(x);
}
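The slide's pseudocode can be filled out into a compilable sketch; convolutionStep is a hypothetical stand-in for the real per-(i, j) computation:

```cuda
#include <cstdio>

// Hypothetical per-(i, j) work; a placeholder for the real computation.
__global__ void convolutionStep(int i, int j, float *out)
{
    atomicAdd(&out[i], (float)j);
}

// One parent block per row i; each block launches exactly x[i] child
// grids, so there are no idle threads and no tail effect.
__global__ void convolution(const int *x, float *out)
{
    int i = blockIdx.x;
    for (int j = 1; j <= x[i]; ++j)
        convolutionStep<<<1, 1>>>(i, j, out);
}

int main()
{
    const int N = 4;
    int h_x[N] = {3, 1, 4, 2};          // per-row work counts ("setup(x)")
    int *d_x;
    float *d_out;
    cudaMalloc(&d_x, sizeof(h_x));
    cudaMalloc(&d_out, N * sizeof(float));
    cudaMemset(d_out, 0, N * sizeof(float));
    cudaMemcpy(d_x, h_x, sizeof(h_x), cudaMemcpyHostToDevice);

    convolution<<<N, 1>>>(d_x, d_out);  // one block per row, as on the slide
    cudaDeviceSynchronize();

    cudaFree(d_x);
    cudaFree(d_out);
    return 0;
}
```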
EXPERIMENT
* Device/SDK = K40m/v7.5
* K40m-CPU = E5-2690
[Chart: Time (ms, lower is better) vs. Matrix Size (512 to 16384), comparing dynpar, idleThreads, and tailEffect.]
LAUNCH EXAMPLE

[Diagram sequence: four SMs feeding a Grid Scheduler; block A0 of Grid A performs device-side launches, which are recorded in per-block Task Tracking Structures. The steps below follow the animation.]

1. Block A0 calls B<<<1,1>>>(), i.e. cudaLaunchDevice( B, 1, 1 ).
2. A Task data structure is allocated.
3. The Task data structure is filled out.
4. Task B is tracked in block A0's tracking structure.
5. Task B is launched to the GPU; Grid A and Grid B now run.
6. Block A0 calls C<<<1,1>>>(), i.e. cudaLaunchDevice( C, 1, 1 ).
7. Task C is allocated, filled out, and tracked in block A0.
8. Task C is not yet runnable; it is tracked to run after B.
9. Task B completes; SKED runs the scheduler kernel.
10. The scheduler searches for work.
11. The scheduler completes B and identifies C as ready-to-run.
12. The scheduler frees B's task structure for re-use and launches C to the Grid Scheduler.
13. Task C now executes as Grid C.
BASIC RULES
Programming Model
Essentially the same as CUDA
Launch is per-thread and asynchronous
Sync is per-block
CUDA primitives are per-block (cannot pass streams/events to children)
cudaDeviceSynchronize() != __syncthreads()
Events allow inter-stream dependencies
Streams are shared within a block
Implicit NULL stream results in ordering within a block; use named streams
[Figure: timeline showing a CPU thread launching parent Grid A, a Grid A thread launching child Grid B, and Grid B completing before Grid A completes.]
CUDA API available on the device: http://docs.nvidia.com/cuda/cuda-c-programming-guide/#api-reference
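A minimal sketch of the named-streams rule, assuming hypothetical kernel names. Launches are per-thread and asynchronous; because the implicit NULL stream orders all launches within the block, each thread creates its own named stream (device streams must be created non-blocking):

```cuda
#include <cstdio>

__global__ void child(int parentThread)
{
    printf("child of parent thread %d\n", parentThread);
}

// Each parent thread launches its child into its own named stream,
// so the launches are not serialized on the implicit NULL stream.
__global__ void parent()
{
    cudaStream_t s;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);  // required flag for device streams
    child<<<1, 1, 0, s>>>(threadIdx.x);
    cudaStreamDestroy(s);
}

int main()
{
    parent<<<1, 4>>>();
    cudaDeviceSynchronize();  // children finish before the parent grid is complete
    return 0;
}
```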
MEMORY CONSISTENCY RULES
Memory Model
Launch implies membar (child sees parent state at time of launch)
Sync implies invalidate (parent sees child writes after sync)
Texture changes by child are visible to parent after sync (i.e. sync == tex cache invalidate)
Constants are immutable
Local & shared memory are private: cannot be passed as child kernel args
[Figure: the same parent/child timeline, annotated "Fully consistent" at the launch and sync points.]
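The memory rules above can be illustrated with a small sketch (hypothetical kernel names):

```cuda
__global__ void child(int *data)
{
    data[threadIdx.x] *= 2;
}

__global__ void parent(int *globalBuf)
{
    __shared__ int tile[32];
    tile[threadIdx.x] = globalBuf[threadIdx.x];
    __syncthreads();

    if (threadIdx.x == 0) {
        // OK: global memory. The launch implies a membar, so the child
        // sees the parent's writes made before the launch.
        child<<<1, 32>>>(globalBuf);

        // NOT allowed: shared (or local) memory is private to the block
        // and cannot be passed to a child kernel:
        // child<<<1, 32>>>(tile);

        // Sync implies invalidate: after this call, the parent sees
        // the child's writes.
        cudaDeviceSynchronize();
    }
    __syncthreads();
}

int main()
{
    int *buf;
    cudaMalloc(&buf, 32 * sizeof(int));
    cudaMemset(buf, 0, 32 * sizeof(int));
    parent<<<1, 32>>>(buf);
    cudaDeviceSynchronize();
    cudaFree(buf);
    return 0;
}
```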
EXPERIMENTS
DIRECTED BENCHMARKS
Kernels written to measure specific aspects of dynamic parallelism
Launch throughput
Launch latency
As a function of different configurations
SDK Versions
Varying Clocks
RESULTS – LAUNCH THROUGHPUT
LAUNCH THROUGHPUT
* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/875
* K40m-CPU = E5-2690
* Host launches are with 32 streams
[Chart: Grids/sec (0 to 1,800,000) vs. number of child kernels launched (32 to 65536), comparing K40m device-side launches with K40m-CPU host-side launches.]
LAUNCH THROUGHPUT
Observations
Device-side launch throughput is about an order of magnitude higher than from the host
Dynamic parallelism is very useful when there are many child kernels
Two major limiters of launch throughput:
Pending Launch Count
Grid Scheduler Limit
PENDING LAUNCH COUNT
* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent different pending launch count limits
[Chart: Grids/sec (0 to 1,800,000) vs. number of child kernels launched (32 to 65536), for pending launch count limits of 1024, 4096, 16384, and 32768.]
PENDING LAUNCH COUNT
Observations
A pre-allocated buffer in global memory stores kernels before their launch
Default size: 2048 kernels
If the buffer overflows, it is resized on the fly
This causes a substantial reduction in launch throughput!
Know the number of pending child kernels!
PENDING LAUNCH COUNT
CUDA APIs
Setting the limit:
cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, yourLimit);
Querying the limit:
cudaDeviceGetLimit(&yourLimit, cudaLimitDevRuntimePendingLaunchCount);
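Putting the two APIs together, a host program can size the pending-launch buffer to match the expected number of in-flight children (hypothetical kernel names; the buffer must be configured before the parent kernel launches):

```cuda
#include <cstdio>

__global__ void child() {}

__global__ void parent()
{
    // Every parent thread enqueues one child; with many parent threads
    // the default buffer of 2048 pending launches can overflow.
    child<<<1, 1>>>();
}

int main()
{
    size_t limit = 0;
    cudaDeviceGetLimit(&limit, cudaLimitDevRuntimePendingLaunchCount);
    printf("default pending launch count: %zu\n", limit);

    // Raise the limit to match the number of child kernels we expect.
    cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 16384);

    parent<<<64, 256>>>();   // 16384 child launches
    cudaDeviceSynchronize();
    return 0;
}
```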
GRID SCHEDULER LIMIT
* Device/SDK/mem-clk,gpu-clk = K40/v7.5/3004,875
* Different curves represent the total number of child kernels launched
[Chart: Grids/sec (0 to 3,000,000) vs. number of device streams (8 to 256), for 512 to 16384 total child kernels.]
GRID SCHEDULER LIMIT
Observations
The grid scheduler can track only a limited number of concurrent kernels
The limit is currently 32
If this limit is exceeded, launch throughput can drop by up to 50%
RESULTS – LAUNCH LATENCY
LAUNCH LATENCY
* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875
* K40m-CPU = E5-2690
* Host launches are with 32 streams
[Chart: Time (ns, 0 to 30,000) for initial and subsequent launches, comparing K40m device-side launches with K40m-CPU host-side launches.]
LAUNCH LATENCY
Observations
Initial and subsequent device-side launch latencies are about 2-3x higher than host-side launches
Dynamic parallelism may currently not be a good choice when there are:
Only a few child kernels
Serial kernel launches
We are working towards improving this**
** Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, Jin Wang and Sudhakar Yalamanchili, 2014 IEEE International Symposium on Workload Characterization (IISWC).
LAUNCH LATENCY - STREAMS
* Device/SDK/mem-clk,gpu-clk = K40m/v7.5/3004,875
[Charts: Time (ns) vs. number of streams (2 to 16). Left: host streams (0 to 350,000 ns). Right: device streams (0 to 18,000 ns).]
LAUNCH LATENCY - STREAMS
Observations
Host streams affect device-side launch latency
Prefer device streams for dynamic parallelism
RESULTS – DEVICE SYNCHRONIZE
DEVICE SYNCHRONIZE
cudaDeviceSynchronize is costly
Avoid it when possible, as in the example below:

__global__ void parent() {
    doSomeInitialization();
    childKernel<<<grid, blk>>>();
    cudaDeviceSynchronize();   // Unnecessary: an implicit join is enforced by the programming model!
}
DEVICE SYNCHRONIZE - COST
* Device/SDK = K40/v7.5
[Chart: Time (ms, lower is better) vs. amount of work per thread (2 to 32; the higher the number, the more the work), comparing sync and nosync variants.]
DEVICE SYNCHRONIZE DEPTH
The deepest recursion level at which cudaDeviceSynchronize still works
Controlled by the CUDA limit cudaLimitDevRuntimeSyncDepth
Default is level 2
Deeper levels come at the cost of extra global memory reserved for storing parent blocks
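The limit is set from the host before the parent kernel launches; a minimal sketch with a hypothetical recursive kernel:

```cuda
__global__ void leaf() {}

// Each level launches a child and synchronizes on it, so every level
// counts against cudaLimitDevRuntimeSyncDepth.
__global__ void nested(int depth)
{
    if (depth == 0) leaf<<<1, 1>>>();
    else            nested<<<1, 1>>>(depth - 1);
    cudaDeviceSynchronize();   // fails silently beyond the configured depth
}

int main()
{
    // Default depth is 2. Each extra level reserves additional global
    // memory for saving parent block state.
    cudaDeviceSetLimit(cudaLimitDevRuntimeSyncDepth, 4);

    nested<<<1, 1>>>(3);       // nesting levels 1..4: within the new limit
    cudaDeviceSynchronize();
    return 0;
}
```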
DEVICE SYNCHRONIZE DEPTH
Memory Usage
[Chart: Memory Reserved (MB, 0 to 800) vs. Device Synchronize Depth (2 to 5).]
DEVICE SYNCHRONIZE DEPTH
Error Handling
cudaDeviceSynchronize fails silently beyond the set SyncDepth
Use cudaGetLastError on the device to inspect the error
[Figure: kernels nested at depths 1 through 5 with SyncDepth=2.]
DYNAMIC PARALLELISM - LIMITS
DYNAMIC PARALLELISM
Limits
Recursion depth is currently 24
Maximum size of the formal parameters of a child kernel is 4096 B
A violation causes a compile-time error
Runtime exceptions in a child kernel are visible only from the host side
ERROR HANDLING
Runtime exceptions in child kernels
Visible only from the host side
Use nvcc's -lineinfo along with cuda-memcheck to locate the error

__global__ void child(float* arr) {
    arr[0] = 1.0f;   // Control never reaches here!
}

__global__ void parent() {
    child<<<1,1>>>(NULL);
    cudaDeviceSynchronize();
    printf("%d\n", cudaGetLastError());
}

// On the host:
parent<<<1,1>>>();
cudaError_t err = cudaDeviceSynchronize();   // Error caught here
SUCCESS STORIES
FMM
Fast Multipole Method
• Solving the N-body problem
• Computational complexity O(n)
• Tree-based approach
Image source: http://www.bu.edu/pasi/courses/12-steps-to-having-a-fast-multipole-method-on-gpus/
FMM (2)
Performance
• Dynamic 1: launch child grids for neighbors and children
• Dynamic 2: launch child grids for children only
• Dynamic 3: launch child grids for children only; start only p2 kernel threads; use shared GPU memory
[Chart: performance of the three variants; lower is better.]
From: FMM goes GPU - A smooth trip or bumpy ride?, B. Kohnke and I. Kabadshow, MPI BPC Göttingen & Jülich Supercomputing Centre, GTC 2015
PANDA
anti-Proton ANnihilation at DArmstadt
• State-of-the-art hadron particle physics experiment
PANDA (2)
Performance and Reasons for Improvements
• Avoiding extra PCI-e data transfers
• Launch configuration data dependencies
• Higher launch throughput
• Reducing false dependencies between kernel launches
• Waiting on a stream prevents enqueuing of work into other streams
Source: A CUDA Dynamic Parallelism Case Study: PANDA, Andrew Adinetz, http://devblogs.nvidia.com/parallelforall/a-cuda-dynamic-parallelism-case-study-panda/
SUMMARY
WHEN TO USE CUDA DYNAMIC PARALLELISM
Three Good Reasons
• Algorithmic: "Dynamically Formed Pockets of Structured Parallelism"*
  • Unbalanced load (e.g., vertex expansion in graphs, compressed sparse row)
  • Tree traversal (fat and shallow computation trees)
  • Adaptive Mesh Refinement
• Performance:
  • Improve launch throughput
  • Reduce PCIe traffic and false dependencies
• Maintenance:
  • Simplified, more natural program flow
*) from: Characterization and Analysis of Dynamic Parallelism in Unstructured GPU Applications, J. Wang and S. Yalamanchili, IISWC 2014
REFERENCES
CUDA-C Programming Guide, http://docs.nvidia.com/cuda/cuda-c-programming-guide/#cuda-dynamic-parallelism
Adaptive Parallel Computation with CUDA Dynamic Parallelism https://devblogs.nvidia.com/parallelforall/introduction-cuda-dynamic-parallelism/
FMM goes GPU, B. Kohnke and I. Kabadshow, GTC 2015, https://shar.es/1Y38Vf
April 4-7, 2016 | Silicon Valley
THANK YOU
JOIN THE NVIDIA DEVELOPER PROGRAM AT developer.nvidia.com/join