Transcript of EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY
Nikolay Sakharnykh, 3/27/2018
EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY
2
SINGLE POINTER
void *data;
data = malloc(N);
cpu_func1(data, N);
cpu_func2(data, N);
cpu_func3(data, N);
free(data);
void *data;
data = malloc(N);
cpu_func1(data, N);
gpu_func2<<<...>>>(data, N);
cudaDeviceSynchronize();
cpu_func3(data, N);
free(data);
CPU vs GPU
CPU code GPU code w/ Unified Memory
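The Unified Memory variant above is shown in slide shorthand; a minimal complete program, assuming a Pascal-or-later GPU and using cudaMallocManaged (the single-pointer allocation discussed later in the talk), might look like this. gpu_func2 is a stand-in kernel name from the slide:

```cuda
// Minimal complete version of the Unified Memory code above, assuming a
// Pascal-or-later GPU. gpu_func2 is a stand-in kernel from the slide.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_func2(int *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2;                     // some GPU work
}

int main() {
    const int n = 1 << 20;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));   // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; i++) data[i] = i;     // cpu_func1: first touch on CPU
    gpu_func2<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                     // wait before the CPU touches data again
    printf("data[42] = %d\n", data[42]);         // cpu_func3
    cudaFree(data);                              // managed memory is freed with cudaFree
    return 0;
}
```

Note that a managed allocation is released with cudaFree, not free, unlike the malloc shorthand on the slide.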
3
SINGLE POINTER
void *data, *d_data;
data = malloc(N);
cudaMalloc(&d_data, N);
cpu_func1(data, N);
cudaMemcpy(d_data, data, N, ...);
gpu_func2<<<...>>>(d_data, N);
cudaMemcpy(data, d_data, N, ...);
cudaFree(d_data);
cpu_func3(data, N);
free(data);
void *data;
data = malloc(N);
cpu_func1(data, N);
gpu_func2<<<...>>>(data, N);
cudaDeviceSynchronize();
cpu_func3(data, N);
free(data);
Explicit vs Unified Memory
Explicit Memory Management GPU code w/ Unified Memory
4
SINGLE POINTER
void *data, *d_data;
data = malloc(N);
cudaMalloc(&d_data, N);
cpu_func1(data, N);
cudaMemcpy(d_data, data, N, ...);
gpu_func2<<<...>>>(d_data, N);
cudaMemcpy(data, d_data, N, ...);
cudaFree(d_data);
cpu_func3(data, N);
free(data);
void *data;
data = malloc(N);
cpu_func1(data, N);
cudaMemPrefetchAsync(data, N, GPU);
gpu_func2<<<...>>>(data, N);
cudaMemPrefetchAsync(data, N, CPU);
cudaDeviceSynchronize();
cpu_func3(data, N);
free(data);
Full Control with Prefetching
Explicit Memory Management Unified Memory + Prefetching
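The prefetching pattern can be sketched as a complete program. cudaMemPrefetchAsync takes a destination device id (cudaCpuDeviceId for the CPU) and an optional stream; the kernel body here is a placeholder:

```cuda
// Sketch of the prefetching pattern above (CUDA 8+, Pascal or later).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_func2(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *data;
    int gpu = 0;
    cudaMallocManaged(&data, bytes);
    for (int i = 0; i < n; i++) data[i] = 0.0f;          // cpu_func1: first touch on CPU
    cudaMemPrefetchAsync(data, bytes, gpu);              // migrate to GPU before the kernel
    gpu_func2<<<(n + 255) / 256, 256>>>(data, n);
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);  // queue migration back for the CPU phase
    cudaDeviceSynchronize();
    printf("%f\n", data[0]);                             // cpu_func3
    cudaFree(data);
    return 0;
}
```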
5
SINGLE POINTER
char **data;
// allocate and initialize data on the CPU
char **d_data;
char **h_data = (char**)malloc(N*sizeof(char*));
for (int i = 0; i < N; i++) {
cudaMalloc(&h_data[i], N);
cudaMemcpy(h_data[i], data[i], N, ...);
}
cudaMalloc(&d_data, N*sizeof(char*));
cudaMemcpy(d_data, h_data, N*sizeof(char*), ...);
gpu_func<<<...>>>(d_data, N);
char **data;
// allocate and initialize data on the CPU
gpu_func<<<...>>>(data, N);
Deep Copy
Explicit Memory Management GPU code w/ Unified Memory
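For comparison, the Unified Memory version of the deep-copy case could be written as below; nesting works because managed pointers are valid on both processors. The kernel body is illustrative only:

```cuda
// Sketch: the deep copy disappears with Unified Memory because nested
// pointers can be dereferenced on either processor.
#include <cstring>
#include <cuda_runtime.h>

__global__ void gpu_func(char **data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i][0] = 'x';                   // dereference nested pointer directly
}

int main() {
    const int n = 64;
    char **data;
    cudaMallocManaged(&data, n * sizeof(char *));  // managed array of pointers
    for (int i = 0; i < n; i++) {
        cudaMallocManaged(&data[i], n);            // each row is also managed
        memset(data[i], 0, n);                     // initialize on the CPU
    }
    gpu_func<<<1, n>>>(data, n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; i++) cudaFree(data[i]);
    cudaFree(data);
    return 0;
}
```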
6
UNIFIED MEMORY BASICS
GPU A GPU B
page1 page2
Single virtual memory shared between processors
7
UNIFIED MEMORY BASICS
page1
page2
GPU A GPU B
A’s page table
A’s phys mem B’s phys mem page1
page2
B’s page table
page1 page2
Single virtual memory shared between processors
8
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr1 = 1 -> local access
*addr3 = 1 -> page fault
A’s page table B’s page table
9
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr3 = 1 -> access replay
page3 populated and mapped into B’s memory
A’s page table B’s page table
10
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = 1 -> page fault
*addr3 = 1 -> page fault
A’s page table B’s page table
11
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = 1 -> page fault
*addr3 = 1 -> page fault
page2 and page3 unmapped from B’s memory
A’s page table B’s page table
12
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = 1 -> page fault
*addr3 = 1 -> page fault
pages’ data migrated to A’s physical memory
A’s page table B’s page table
13
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = 1 -> access replay
*addr3 = 1 -> access replay
A’s page table B’s page table
14
MEMORY OVERSUBSCRIPTION
page1
page2
page3
page4
page5
A’s phys mem B’s phys mem
page1
page2
page3
page4
page5
*addr5 = 1 -> page fault
A’s page table B’s page table
15
MEMORY OVERSUBSCRIPTION
page1
page2
page3
page4
page5
A’s phys mem B’s phys mem
page1
page2
page3
page4
page5
*addr5 = 1 -> page fault
page4 unmapped from A’s memory and migrated
A’s page table B’s page table
16
MEMORY OVERSUBSCRIPTION
page1
page2
page3
page4
page5
A’s phys mem B’s phys mem
page1
page2
page3
page4
page5
*addr5 = 1 -> page fault
page4 mapped in B’s memory, page5 unmapped and migrated to A
A’s page table B’s page table
17
MEMORY OVERSUBSCRIPTION
page1
page2
page3
page4
page5
A’s phys mem B’s phys mem
page1
page2
page3
page4
page5
*addr5 = 1 -> access replay
A’s page table B’s page table
18
SIMPLIFYING DL FRAMEWORK DESIGN
Eliminated 3,000 lines of repetitive and error-prone code in Caffe
Developers can add new inherited Layer classes in a much simpler manner
The final call to a CPU function or a GPU kernel (caffe_gpu_gemm) still needs to be explicit
class ConvolutionLayer {
public:
  void cpu_data();  void cpu_diff();  void gpu_data();  void gpu_diff();
  void mutable_cpu_data();  void mutable_cpu_diff();
  void mutable_gpu_data();  void mutable_gpu_diff();
  void Forward_cpu();  void Forward_gpu();
  void forward_cpu_gemm();  void forward_gpu_gemm();
  void forward_cpu_bias();  void forward_gpu_bias();
  void Backward_cpu();  void Backward_gpu();
  void backward_cpu_gemm();  void backward_gpu_gemm();
  void backward_cpu_bias();  void backward_gpu_bias();
};

class ConvolutionLayer {
public:
  void data();  void diff();
  void mutable_data();  void mutable_diff();
  void Forward();  void forward_gemm();  void forward_bias();
  void Backward();  void backward_gemm();  void backward_bias();
};
Existing Design
Unified Memory Design
A. A. Awan, C-H Chu, H. Subramoni, X. Lu, D.K. Panda, “OC-DNN: Designing
Out-of-Core Deep Neural Network Training by Exploiting Unified Memory on
Pascal and Volta GPUs”, <double-blind submission under review>
19
CAN THIS DESIGN OFFER GOOD PERF?
DL training with Unified Memory
OC-Caffe will be released by the HiDL Team@OSU: hidl.cse.ohio-state.edu, mvapich.cse.ohio-state.edu
[Bar charts, images/sec: ResNet-50 training on 1xV100 (16GB) at batch 40 and batch 45, and VGG19 training on 1xV100 (16GB) at batch 110 and batch 120; each compares Caffe (BVLC) vs OC-Caffe for an in-memory and an out-of-core batch size]
20
CONCURRENT ACCESS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
A’s page table B’s page table
21
CONCURRENT ACCESS: Exclusive Access*
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
atomicAdd_system(addr2, 1)
page fault
atomicAdd_system(addr2, 1)
local access
A’s page table B’s page table
*this is a possible implementation and to guarantee this behavior you need to use cudaMemAdvise policies
22
CONCURRENT ACCESS: Exclusive Access
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
atomicAdd_system(addr2, 1)
page fault
page2 unmapped in B’s memory and migrated to A
A’s page table B’s page table
23
CONCURRENT ACCESS: Exclusive Access
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
atomicAdd_system(addr2, 1)
local access
A’s page table B’s page table
24
CONCURRENT ACCESS: Atomics over NVLINK*
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
atomicAdd_system(addr2, 1)
remote access
*both processors need to support atomic operations
atomicAdd_system(addr2, 1)
local access
A’s page table B’s page table
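The system-wide atomics shown above can be sketched as a small program (atomicAdd_system requires compute capability 6.0+; CPU participation additionally requires a coherent platform such as P9 + NVLink):

```cuda
// Sketch of system-wide atomics on a managed counter (Pascal or later).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void count(int *counter) {
    atomicAdd_system(counter, 1);   // atomic visible to all processors in the system
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;                   // CPU initializes the shared counter
    count<<<4, 256>>>(counter);
    cudaDeviceSynchronize();
    printf("%d\n", *counter);       // prints 1024 (4 blocks x 256 threads)
    cudaFree(counter);
    return 0;
}
```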
25
CONCURRENT ACCESS: Read duplication*
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
val = *addr2; -> local access
*each processor must maintain its own page table
val = *addr2; local access
A’s page table B’s page table
26
CONCURRENT ACCESS: Read duplication: write
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = val2; -> local access
a write will collapse all copies into one
A’s page table B’s page table
27
CONCURRENT ACCESS: Read duplication: read after write
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
val = *addr2; local access
pages are duplicated again on faults
A’s page table B’s page table
28
ANALYTICS USE CASE: Design of a Concurrent Hybrid Hash Table
GPU 0 GPU 1
Multiple CPU Cores
Hash table implemented
via Unified Memory
Concurrent Inserts and Fetches
Non-blocking updates
using atomic compare&swap
Concurrent Fetches Concurrent Fetches
S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F
29
ANALYTICS USE CASE: Concurrent Access To Hash Table
HBM
SM
L2
SM SM
SYSMEM
page fault
x86
30
SYSMEM
ANALYTICS USE CASE: Concurrent Access To Hash Table
HBM
SM
L2
SM SM
page fault
page migration
access replay
x86
31
SYSMEM
ANALYTICS USE CASE: Concurrent Access To Hash Table
HBM
SM
L2
SM SM
P9
cache
P9 can directly update hash entry in GPU memory
no page faults or migrations!
HBM
SM
L2
SM SM
page fault
page migration
access replay
x86
S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F
32
UNIFIED MEMORY + DGX-2
UNIFIED MEMORY PROVIDES
Single memory view shared by all GPUs
Automatic migration of data between GPUs
User control of data locality
GPU0
GPU1
GPU2
GPU3
GPU4
GPU5
GPU6
GPU7
GPU8
GPU9
GPU10
GPU11
GPU12
GPU13
GPU14
GPU15
512 GB Unified Memory
33
ENABLING MULTI-GPU
__global__ void kernel(int *data) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
doSomeStuff(idx, data, ...);
}
cudaMallocManaged(&data, N * sizeof(int));
// initialize data on the CPU
kernel<<<grid, block>>>(data);
__global__ void kernel(int *data, int gpuId) {
int idx = threadIdx.x + blockDim.x * (blockIdx.x
+ gpuId * gridDim.x);
doSomeStuff(idx, data, ...);
}
cudaMallocManaged(&data, N * sizeof(int));
// initialize data on the CPU
for (int i = 0; i < numGPUs; i++) {
cudaSetDevice(i);
kernel<<<grid/numGPUs, block>>>(data, i);
}
Single-GPU Multi-GPU
update launch config
update blockIdx.x
34
MULTI-GPU WITH UNIFIED MEMORY
GPU 3
GPU 0
GPU 2
GPU 1
SYSMEM
GPU kernels initiate migrations
35
MULTI-GPU WITH UNIFIED MEMORY
GPU 3
GPU 0
GPU 2
GPU 1
Data automatically partitioned between GPUs on first-touch
SYSMEM
36
MULTI-GPU WITH UNIFIED MEMORY
GPU 3
GPU 0
GPU 2
GPU 1
With policies: Remote data can be accessed directly without migrations
SYSMEM
37
MULTI-GPU WITH UNIFIED MEMORY
GPU 3
GPU 0
GPU 2
GPU 1
With policies: Read-only data can be duplicated and accessed locally
SYSMEM
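The "with policies" behavior on these slides maps to cudaMemAdvise calls. A hypothetical helper that partitions a managed buffer across GPUs, pinning each slice to one GPU while letting peers map it remotely, might look like:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: express the locality policies from these slides.
// Each slice prefers one GPU (resists migration away), and every peer GPU
// gets a direct mapping so remote accesses do not fault or migrate.
void partition_across_gpus(char *data, size_t bytes, int numGPUs) {
    size_t slice = bytes / numGPUs;
    for (int i = 0; i < numGPUs; i++) {
        char *p = data + i * slice;
        cudaMemAdvise(p, slice, cudaMemAdviseSetPreferredLocation, i);
        for (int j = 0; j < numGPUs; j++)
            if (j != i)
                cudaMemAdvise(p, slice, cudaMemAdviseSetAccessedBy, j);
    }
}

int main() {
    int numGPUs = 0;
    cudaGetDeviceCount(&numGPUs);
    char *data;
    size_t bytes = 1 << 26;
    cudaMallocManaged(&data, bytes);
    partition_across_gpus(data, bytes, numGPUs);
    // Read-only data could instead be duplicated on every reader:
    // cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, 0);
    cudaFree(data);
    return 0;
}
```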
38
GPU ARCHITECTURE AND SOFTWARE EVOLUTION
39
UNIFIED MEMORY: Evolution of GPU architectures
2012: Kepler: release of the new “single-pointer” programming model
2014: Maxwell: no new features related to Unified Memory
2016: Pascal: on-demand migration, concurrent access and atomics, oversubscription
2017: Volta: access counters; P9 only: hardware coherency, ATS support
NVLINK1
NVLINK2
*Not all features are available on all platforms
40
UNIFIED MEMORY: BEFORE PASCAL
No GPU page fault support: move all dirty pages on kernel launch
No concurrent access, no GPU memory oversubscription, no system-wide atomics
Available since CUDA 6
page1
page2
page3
A’s page table A’s phys mem B’s phys mem
page1
page2
page3
B’s page table
kernel launch
41
UNIFIED MEMORY: PASCAL AND VOLTA
GPU page fault support, concurrent access, extended VA space (48-bit)
On-demand migration to accessing processor on first touch
Available since CUDA 8
page1
page2
A’s phys mem B’s phys mem
page1
page2
A’s page table B’s page table
42
UNIFIED MEMORY ON VOLTA+P9
If memory is mapped to the GPU, migration can be triggered by access counters
New Feature: Access Counters
page1
page2
A’s phys mem B’s phys mem
page1
page2
few accesses
many accesses
A’s page table B’s page table
43
UNIFIED MEMORY ON VOLTA+P9
With access counters only hot pages will be moved to the GPU
Migrations are delayed compared to the fault-based method
New Feature: Access Counters
page1
page2
A’s phys mem B’s phys mem
page1
page2
many accesses
*When implemented this feature can be enabled with cudaMemAdvise policies
A’s page table B’s page table
44
UNIFIED MEMORY ON VOLTA+P9
CPU can directly access and cache GPU memory
Native atomics support for all accessible memory
New Feature: Hardware Coherency with NVLINK2
page1
page2
GPU mem system mem
page1
page2
V100 P9
V100’s page table P9’s page table
45
UNIFIED MEMORY ON VOLTA+P9
ATS: address translation service; CPU and GPU can share a single page table
New Feature: ATS support
page1 page2
GPU mem CPU mem
V100 P9
ATS page table
46
UNIFIED MEMORY WITH SYSTEM ALLOCATOR
System allocator support allows GPU to access all system memory
malloc, stack, global, file system
P9: Address Translation Service (ATS)
Support enabled in CUDA 9.2
x86: Heterogeneous Memory Management (HMM)
Initial version of the patchset is integrated into the 4.14 kernel
NVIDIA will be supporting upcoming versions of HMM
https://lkml.org/lkml/2017/6/23/443
47
WHAT YOU CAN DO WITH UNIFIED MEMORY
See it in action at the end of the talk!
int *data;
cudaMallocManaged(&data, sizeof(int) * n);
kernel<<<grid, block>>>(data);
int *data = (int*)malloc(sizeof(int) * n);
kernel<<<grid, block>>>(data);
int data[1024];
kernel<<<grid, block>>>(data);
int *data = (int*)alloca(sizeof(int) * n);
kernel<<<grid, block>>>(data);
extern int *data;
kernel<<<grid, block>>>(data);
Works everywhere today: cudaMallocManaged
Works on Power9 + ATS in CUDA 9.2, and in the future on x86 + HMM: malloc, stack, global, alloca
48
UNIFIED MEMORY LANGUAGES
CUDA C/C++: cudaMallocManaged
CUDA Fortran: managed attribute (per allocation)
Python: pycuda.driver.managed_empty (allocate numpy.ndarray)
OpenACC: -ta=managed compiler option (all dynamic allocations)
49
UNIFIED MEMORY + OPENACC
Literally adding a single line will get your code running on the GPU
Easy to optimize later: add loop and data directives
Effortless way to run your code on GPUs
#pragma acc kernels
{
for (i = 0; i < n; ++i) {
c[i] = a[i] + b[i];
...
}
}
...
Initiate parallel execution
50
GYROKINETIC TOROIDAL CODE: Particle-In-Cell production code
http://phoenix.ps.uci.edu/gtc_group
[Bar chart, speed-up over CPU-only: 2xV100 with data directives, 2xV100 with Unified Memory, 4xV100 with data directives, 4xV100 with Unified Memory]
CPU: Haswell E5-2698 v3 @ 2.30GHz, dual socket 16-core; MPI: MVAPICH-GDR
51
PERFORMANCE DEEP DIVE
52
STREAMING BENCHMARK: How fast is on-demand migration?
[Bar chart, GB/s streaming performance on x86 + P100 (PCIe), x86 + V100 (PCIe), P8 + P100 (NVLINK): on-demand vs prefetch vs memcpy]
__global__ void kernel(int *host, int *device) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
device[i] = host[i];
}
// allocate and initialize memory
cudaMallocManaged(&host, size);
cudaMalloc(&device, size);
memset(host, 0, size);
// benchmark CPU->GPU migration
if (prefetch)
cudaMemPrefetchAsync(host, size, gpuId);
kernel<<<grid, block>>>(host, device);
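A possible complete harness for this benchmark is sketched below; the event-based timing and grid configuration are assumptions, since the slide does not show how bandwidth was measured:

```cuda
// Sketch of a complete streaming-migration benchmark. Pass "prefetch" as
// argv[1] to prefetch the managed buffer instead of faulting on demand.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void kernel(int *host, int *device, size_t n) {
    size_t i = threadIdx.x + (size_t)blockDim.x * blockIdx.x;
    if (i < n) device[i] = host[i];   // each read of host[] pulls pages to the GPU
}

int main(int argc, char **argv) {
    bool prefetch = (argc > 1 && strcmp(argv[1], "prefetch") == 0);
    size_t n = 16 << 20, size = n * sizeof(int);
    int *host, *device, gpuId = 0;
    cudaMallocManaged(&host, size);
    cudaMalloc(&device, size);
    memset(host, 0, size);            // populate pages on the CPU first
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    if (prefetch)
        cudaMemPrefetchAsync(host, size, gpuId);
    kernel<<<(n + 255) / 256, 256>>>(host, device, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%.2f GB/s\n", size / ms / 1e6);   // bytes / (ms * 1e6) = GB/s
    cudaFree(host); cudaFree(device);
    return 0;
}
```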
53
==14487== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 23.270ms 1 23.270ms 23.270ms 23.270ms void kernel(int*, int*)
API calls: 79.56% 23.272ms 1 23.272ms 23.272ms 23.272ms cudaDeviceSynchronize
20.42% 5.9732ms 1 5.9732ms 5.9732ms 5.9732ms cudaLaunch
0.01% 2.0490us 1 2.0490us 2.0490us 2.0490us cudaConfigureCall
0.01% 1.8360us 4 459ns 138ns 833ns cudaSetupArgument
==14487== Unified Memory profiling result:
Device "Tesla V100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
3012 21.758KB 4.0000KB 952.00KB 64.00000MB 13.49043ms Host To Device
81 - - - - 23.23181ms Gpu page fault groups
UNDERSTANDING PROFILER OUTPUT
54
HEURISTIC PREFETCHING
GPU architecture supports different page sizes
Contiguous pages up to a large page size are promoted to the larger size
Driver prefetches whole regions if pages are accessed densely
Do Not Confuse with API-prefetching
GPU
CPU
4KB 4KB 952KB
1024KB
55
WHAT ARE PAGE FAULT GROUPS?
==14487== Unified Memory profiling result:
Device "Tesla V100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
3012 21.758KB 4.0000KB 952.00KB 64.00000MB 13.49043ms Host To Device
81 - - - - 23.23181ms Gpu page fault groups
Unified Memory Virtual Address Name
8 0x3900010000 [Unified Memory GPU page faults]
9 0x3900040000 [Unified Memory GPU page faults]
5 0x3900108000 [Unified Memory GPU page faults]
5 0x3900200000 [Unified Memory GPU page faults]
nvprof --print-gpu-trace ...
56
PAGE FAULTS HANDLING
[Diagram: in the thread/4B pattern, consecutive 128B accesses from several warps fall on the same GPU page; warps 2 and 3 fault, other warps stall during fault processing, then the faulting warps are resumed and their accesses replayed]
57
OPTIMIZING ON-DEMAND MIGRATION
Increase fault concurrency to reduce page fault stalls
thread/4B: multiple faults per page; warps are stalled on fault processing
warp/64KB: each warp touches its own GPU page; fewer warps are stalled and the “spread-out” pattern improves prefetching
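The two patterns can be made concrete with a copy kernel; this warp-per-64KB remapping is a sketch of the idea, not the exact benchmark code:

```cuda
// Sketch of the warp/64KB pattern: each warp strides through its own 64KB
// region, so a single fault covers the whole warp and the faults from
// different warps land on many distinct pages at once. The thread/4B
// baseline is the usual "i = threadIdx.x + blockDim.x * blockIdx.x" indexing.
__global__ void copy_warp_per_page(int *dst, const int *src, size_t n) {
    const size_t ints_per_page = 64 * 1024 / sizeof(int);  // 16K ints per 64KB
    size_t warp = (threadIdx.x + (size_t)blockDim.x * blockIdx.x) / 32;
    size_t lane = threadIdx.x % 32;
    size_t base = warp * ints_per_page;                    // this warp's 64KB region
    for (size_t i = base + lane; i < base + ints_per_page && i < n; i += 32)
        dst[i] = src[i];                                   // still coalesced within the warp
}
```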
58
OPTIMIZING ON-DEMAND MIGRATION
Count Avg Size Min Size Max Size Total Size Total Time Name
3012 21.758KB 4.0000KB 952.00KB 64.00000MB 13.49043ms Host To Device
81 - - - - 23.23181ms Gpu page fault groups
Unified Memory Virtual Address Name
8 0x3900010000 [Unified Memory GPU page faults]
9 0x3900040000 [Unified Memory GPU page faults]
5 0x3900108000 [Unified Memory GPU page faults]
Count Avg Size Min Size Max Size Total Size Total Time Name
957 68.481KB 4.0000KB 576.00KB 64.00000MB 8.242080ms Host To Device
6 - - - - 9.769984ms Gpu page fault groups
Unified Memory Virtual Address Name
1 0x39000d0000 [Unified Memory GPU page faults]
1 0x39000c0000 [Unified Memory GPU page faults]
1 0x3900080000 [Unified Memory GPU page faults]
Thread/4B
Warp/64KB
more efficient prefetching
fewer stalls
59
STREAMING BENCHMARK: How fast is on-demand migration?
[Bar chart, GB/s streaming performance on x86 + P100 (PCIe), x86 + V100 (PCIe), P8 + P100 (NVLINK): on-demand thread/4B vs on-demand warp/64K vs prefetch vs memcpy]
Also check the Parallel Forall blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/
60
GPU MEMORY OVERSUBSCRIPTION: Let’s see how perf changes as we increase the working set
[Line charts, GB/s vs working set (10 to 50 GB), with the GPU memory limit marked: fault-based migration throughput on P8/NVLINK and on x86/PCIe for thread/4B, warp/64K, thread/4B + readmostly, warp/64K + readmostly]
61
EVICTION ALGORITHM
Driver keeps a list of physical chunks of GPU memory
Chunks from the front of the list are evicted first (LRU)
A chunk is considered “in use” when it is fully-populated or migrated
Prefetching and policies may impact eviction heuristic in the future
What Pages Are Moving Out of the GPU
eviction
allocation
or
migration
62
PREFETCHING AND EVICTIONS
[Bar chart, GB/s migration throughput (x86+V100), with and without eviction: on-demand thread/4B, on-demand warp/64K, 1-way prefetch]
Prefetch can be overlapped with evictions without using CUDA streams! (enabled in CUDA 9.1)
The cudaMemcpy solution requires scheduling DtoH and HtoD copies into two separate streams
63
DRIVER ENHANCEMENTS: Tesla V100 + x86, HPGMG-AMR default (no hints)
[Bar chart, MDOF/s vs working set (1.4 to 58.6 GB): driver 384.81 (CUDA 8.0 GA2) vs 387.34 (CUDA 9.0)]
Improvements for the allocator
64
DRIVER ENHANCEMENTS: Tesla V100 + x86, HPGMG-AMR optimized (prefetches)
[Bar chart, MDOF/s vs working set (1.4 to 58.6 GB): driver 384.81 (CUDA 8.0 GA2) vs 387.34 (CUDA 9.0) vs 390.30 (CUDA 9.1)]
Prefetch + Evictions Overlap
65
LOCALITY AND ACCESS CONTROL
Default: data migrates on first touch
ReadMostly: data duplicated on first touch
PreferredLocation: resist migrating away from the preferred location
AccessedBy: establish direct mapping and avoid faults
cudaMemAdvise
66
while not converged:
  LagrangeNodal: CalcForceForNodes, CalcAccelerationForNodes, BoundaryConditionsForNodes, CalcVelocityForNodes, CalcPositionForNodes
  LagrangeElements: CalcLagrangeElements, CalcQForElems, MaterialPropertiesForElems
  CalcTimeConstraintsForElems: CalcCourantConstraintForElems, CalcHydroConstraintForElems
LULESH: CORAL OpenACC app
few heavy GPU kernels
many small GPU kernels, many small CPU functions
67
USING VISUAL PROFILER
Kernel time = 17ms Kernel time = 27ms
1st iteration; 2nd iteration is longer, why?
what’s this?
68
THRASHING MITIGATION
Before CUDA 9.2: when memory is pinned we lose any insight into the access pattern
In the future we may use access counters on Volta to find a better location
Processors frequently read or write to the same page
GPU
CPU
Page Throttling Remote Map
Memory thrashing detected
69
LULESH: PREFERRED LOCATION = GPU
No policies
PreferredLocation
lots of back-and-forth migrations
but no slowdown for the heavy kernels
anti-thrashing heuristic kicks in and pins the page to the CPU
70
LULESH: READ MOSTLY
no slowdown for the heavy kernels
almost no thrashing for the hybrid CPU/GPU part
71
[Bar chart, FOM (z/s): LULESH performance, 200^3 mesh, 100 iterations, CUDA 9.1; Default, Preferred, ReadMostly, and Improved Default Heuristics policies on Quadro GP100 vs Quadro GV100]
LULESH: PERF IMPROVEMENTS
GP100 has 500x more sysmem writes than GV100
Will be enabled in the future drivers
72
WHEN TO USE UNIFIED MEMORY
cudaMalloc vs cudaMallocManaged:
Pinned allocation: cudaMalloc | cudaMallocManaged + PreferredLocation(GPU) + SetAccessedBy(peer GPUs) + cudaMemPrefetchAsync(GPU)
cudaMemcpy ptrA -> ptrB: staging for non-pinned allocations or between non-P2P GPUs | staging or a copy kernel required in all cases
Memory migration: not possible | cudaMemPrefetchAsync
Debugging: difficult | easy
Oversubscription: no | yes
IPC support: yes | no
73
WHEN TO USE UNIFIED MEMORY
Application Code
Sequential code (CPU)
Sequential code (CPU)
Parallel code (GPU)
cudaMalloc
cudaMallocManaged/malloc
cudaMallocManaged/malloc
Parallel Code
75
UNIFIED MEMORY PLATFORMS
KEPLER | PASCAL | VOLTA
Linux + x86: no GPU fault support, no concurrent access (Kepler) | on-demand migration (Pascal) | on-demand migration (Volta)
Linux + Power: on-demand migration, 80GB/s CPU-GPU BW* (Pascal) | on-demand migration, 150GB/s CPU-GPU BW**, access counters, HW coherency, ATS support (Volta)
Windows: no GPU fault support, no concurrent access
MacOS: no GPU fault support, no concurrent access
Tegra: cached on CPU and iGPU, no concurrent access
*IBM Minsky: 4xP100 + 2xP8, 2xNVLINK1 links between P100 and P8, bi-directional aggregate BW**IBM Newell: 4xV100 + 2xP9, 3xNVLINK2 links between V100 and P9, bi-directional aggregate BW
76
READ DUPLICATION: Usage example
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
cudaMemPrefetchAsync(data, N, myGpuId, s);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The prefetch creates a copy instead of
moving data
Both processors can read data
simultaneously without faults
Writes will collapse all copies into one,
subsequent reads will fault and duplicate
GPU: my_kernel
CPU: init_data
CPU: use_data
77
PREFERRED LOCATION: Resisting migrations
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The kernel will page fault and
generate direct mapping to
data on the CPU
The driver will “resist”
migrating data away from the
preferred location
GPU: my_kernel
CPU: init_data
CPU: use_data
78
PREFERRED LOCATION: Page population on first-touch
char *data;
cudaMallocManaged(&data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The kernel will page fault,
populate pages on the CPU
and generate direct mapping to
data on the CPU
Pages are populated on the
preferred location if the
faulting processor can access it
GPU: my_kernel
CPU: use_data
79
PREFERRED LOCATION ON P9+V100: CPU can directly access GPU memory
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, gpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The kernel will page fault and
migrate data to the GPU
CPU will fault and access data
directly instead of migrating
on non P9+V100 systems the driver will migrate back to the CPU
GPU: my_kernel
CPU: init_data
CPU: use_data
80
ACCESSED BY: Usage example
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
GPU will establish direct mapping of
data in CPU memory, no page faults
will be generated
Memory can move freely to other
processors and mapping will carry
over
GPU: my_kernel
CPU: init_data
CPU: use_data
81
ACCESSED BY: Using access counters on Volta
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
GPU will establish direct mapping of
data in CPU memory, no page faults
will be generated
Access counters may eventually
trigger migration of frequently
accessed pages to the GPU
GPU: my_kernel
CPU: init_data
CPU: use_data
82
ptr = malloc(size);
doStuffOnGpu<<<...>>>(ptr, size);
MANAGED VS MALLOC ON VOLTA+P9: First touch allocation policy
GPU page faults
Unified Memory driver allocates on GPU
GPU accesses GPU memory
cudaMallocManaged(&ptr, size);
doStuffOnGpu<<<...>>>(ptr, size);
*You may alter this behavior by using cudaMemAdvise policies
GPU uses ATS, faults
OS allocates on CPU (by default)
GPU uses ATS to access CPU memory
83
MANAGED VS MALLOC ON P9: cudaMallocManaged: same behavior as x86
cudaMallocManaged(&ptr, size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size);
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size);
GPU page faults; ptr migrated to GPU
CPU page faults; ptr migrated to CPU
84
MANAGED VS MALLOC ON P9: malloc: no on-demand migrations*
ptr = malloc(size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size);
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size);
GPU uses ATS to access CPU memory (no on-demand migration except via cudaMemPrefetchAsync*)
CPU accesses CPU memory
*In the future Volta access counters will be used to migrate malloc memory
85
HYPRE-INSPIRED USE CASE
Algebraic Multi-Grid library: https://github.com/LLNL/hypre
Lots of small allocations: multiple variables may end up on the same page
If used by different processors this will result in false-sharing
512B 512B 512B 512B 512B 512B 512B 512B
GPU CPU
single 4KB page
86
FALSE-SHARING
Issues with false-sharing:
- Spurious migrations; thrashing mitigation does not solve this
- Performance hints are applied on page boundaries, so with suballocation data may inherit the wrong policies
How to mitigate this:
- Use separate allocators or memory pools for CPU and GPU data
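A minimal sketch of the mitigation: a per-processor bump allocator that rounds every request up to 4KB page granularity, so two processors never share a page. PageGranularPool is a hypothetical name; a real version would carve its slab from cudaMallocManaged, while plain allocation is used here so the sketch runs anywhere:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical per-processor pool: every allocation is rounded up to a 4KB
// page, so suballocations handed out by one pool never share a page with
// another pool's data (and cudaMemAdvise hints apply cleanly per object).
class PageGranularPool {
    static const size_t kPage = 4096;
    char *slab_;
    size_t used_, cap_;
public:
    explicit PageGranularPool(size_t cap)
        : slab_((char *)std::aligned_alloc(kPage, cap)), used_(0), cap_(cap) {}
    ~PageGranularPool() { std::free(slab_); }
    void *alloc(size_t bytes) {
        size_t rounded = (bytes + kPage - 1) / kPage * kPage;  // round up to 4KB
        if (used_ + rounded > cap_) return nullptr;            // pool exhausted
        void *p = slab_ + used_;
        used_ += rounded;
        return p;
    }
};
```

With one pool per processor, two 512-byte objects used by different processors land on different pages instead of falsely sharing one.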