Transcript of EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY
Nikolay Sakharnykh, 3/27/2018
EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY
2
SINGLE POINTER
void *data;
data = malloc(N);
cpu_func1(data, N);
cpu_func2(data, N);
cpu_func3(data, N);
free(data);
void *data;
data = malloc(N);
cpu_func1(data, N);
gpu_func2<<<...>>>(data, N);
cudaDeviceSynchronize();
cpu_func3(data, N);
free(data);
CPU vs GPU
CPU code GPU code w/ Unified Memory
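The Unified Memory variant above is shown in slide shorthand; a minimal complete program, assuming a Pascal-or-later GPU and using cudaMallocManaged (the single-pointer allocation discussed later in the talk), might look like this. gpu_func2 is a stand-in kernel name from the slide:

```cuda
// Minimal complete version of the Unified Memory code above, assuming a
// Pascal-or-later GPU. gpu_func2 is a stand-in kernel from the slide.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_func2(int *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] *= 2;                     // some GPU work
}

int main() {
    const int n = 1 << 20;
    int *data;
    cudaMallocManaged(&data, n * sizeof(int));   // one pointer, valid on CPU and GPU
    for (int i = 0; i < n; i++) data[i] = i;     // cpu_func1: first touch on CPU
    gpu_func2<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();                     // wait before the CPU touches data again
    printf("data[42] = %d\n", data[42]);         // cpu_func3
    cudaFree(data);                              // managed memory is freed with cudaFree
    return 0;
}
```

Note that a managed allocation is released with cudaFree, not free, unlike the malloc shorthand on the slide.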
3
SINGLE POINTER
void *data, *d_data;
data = malloc(N);
cudaMalloc(&d_data, N);
cpu_func1(data, N);
cudaMemcpy(d_data, data, N, ...);
gpu_func2<<<...>>>(d_data, N);
cudaMemcpy(data, d_data, N, ...);
cudaFree(d_data);
cpu_func3(data, N);
free(data);
void *data;
data = malloc(N);
cpu_func1(data, N);
gpu_func2<<<...>>>(data, N);
cudaDeviceSynchronize();
cpu_func3(data, N);
free(data);
Explicit vs Unified Memory
Explicit Memory Management GPU code w/ Unified Memory
4
SINGLE POINTER
void *data, *d_data;
data = malloc(N);
cudaMalloc(&d_data, N);
cpu_func1(data, N);
cudaMemcpy(d_data, data, N, ...);
gpu_func2<<<...>>>(d_data, N);
cudaMemcpy(data, d_data, N, ...);
cudaFree(d_data);
cpu_func3(data, N);
free(data);
void *data;
data = malloc(N);
cpu_func1(data, N);
cudaMemPrefetchAsync(data, N, GPU);
gpu_func2<<<...>>>(data, N);
cudaMemPrefetchAsync(data, N, CPU);
cudaDeviceSynchronize();
cpu_func3(data, N);
free(data);
Full Control with Prefetching
Explicit Memory Management Unified Memory + Prefetching
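The prefetching pattern can be sketched as a complete program. cudaMemPrefetchAsync takes a destination device id (cudaCpuDeviceId for the CPU) and an optional stream; the kernel body here is a placeholder:

```cuda
// Sketch of the prefetching pattern above (CUDA 8+, Pascal or later).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gpu_func2(float *data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *data;
    int gpu = 0;
    cudaMallocManaged(&data, bytes);
    for (int i = 0; i < n; i++) data[i] = 0.0f;          // cpu_func1: first touch on CPU
    cudaMemPrefetchAsync(data, bytes, gpu);              // migrate to GPU before the kernel
    gpu_func2<<<(n + 255) / 256, 256>>>(data, n);
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId);  // queue migration back for the CPU phase
    cudaDeviceSynchronize();
    printf("%f\n", data[0]);                             // cpu_func3
    cudaFree(data);
    return 0;
}
```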
5
SINGLE POINTER
char **data;
// allocate and initialize data on the CPU
char **d_data;
char **h_data = (char**)malloc(N*sizeof(char*));
for (int i = 0; i < N; i++) {
cudaMalloc(&h_data[i], N);
cudaMemcpy(h_data[i], data[i], N, ...);
}
cudaMalloc(&d_data, N*sizeof(char*));
cudaMemcpy(d_data, h_data, N*sizeof(char*), ...);
gpu_func<<<...>>>(d_data, N);
char **data;
// allocate and initialize data on the CPU
gpu_func<<<...>>>(data, N);
Deep Copy
Explicit Memory Management GPU code w/ Unified Memory
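For comparison, the Unified Memory version of the deep-copy case could be written as below; nesting works because managed pointers are valid on both processors. The kernel body is illustrative only:

```cuda
// Sketch: the deep copy disappears with Unified Memory because nested
// pointers can be dereferenced on either processor.
#include <cstring>
#include <cuda_runtime.h>

__global__ void gpu_func(char **data, int n) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    if (i < n) data[i][0] = 'x';                   // dereference nested pointer directly
}

int main() {
    const int n = 64;
    char **data;
    cudaMallocManaged(&data, n * sizeof(char *));  // managed array of pointers
    for (int i = 0; i < n; i++) {
        cudaMallocManaged(&data[i], n);            // each row is also managed
        memset(data[i], 0, n);                     // initialize on the CPU
    }
    gpu_func<<<1, n>>>(data, n);
    cudaDeviceSynchronize();
    for (int i = 0; i < n; i++) cudaFree(data[i]);
    cudaFree(data);
    return 0;
}
```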
6
UNIFIED MEMORY BASICS
GPU A GPU B
page1 page2
Single virtual memory shared between processors
7
UNIFIED MEMORY BASICS
page1
page2
GPU A GPU B
A’s page table
A’s phys mem B’s phys mem page1
page2
B’s page table
page1 page2
Single virtual memory shared between processors
8
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr1 = 1 -> local access
*addr3 = 1 -> page fault
A’s page table B’s page table
9
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr3 = 1 -> access replay
page3 populated and mapped into B’s memory
A’s page table B’s page table
10
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = 1 -> page fault
*addr3 = 1 -> page fault
A’s page table B’s page table
11
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = 1 -> page fault
*addr3 = 1 -> page fault
page2 and page3 unmapped from B’s memory
A’s page table B’s page table
12
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = 1 -> page fault
*addr3 = 1 -> page fault
pages’ data migrated to A’s physical memory
A’s page table B’s page table
13
UNIFIED MEMORY BASICS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = 1 -> access replay
*addr3 = 1 -> access replay
A’s page table B’s page table
14
MEMORY OVERSUBSCRIPTION
page1
page2
page3
page4
page5
A’s phys mem B’s phys mem
page1
page2
page3
page4
page5
*addr5 = 1 -> page fault
A’s page table B’s page table
15
MEMORY OVERSUBSCRIPTION
page1
page2
page3
page4
page5
A’s phys mem B’s phys mem
page1
page2
page3
page4
page5
*addr5 = 1 -> page fault
page4 unmapped from A’s memory and migrated
A’s page table B’s page table
16
MEMORY OVERSUBSCRIPTION
page1
page2
page3
page4
page5
A’s phys mem B’s phys mem
page1
page2
page3
page4
page5
*addr5 = 1 -> page fault
page4 mapped in B’s memory, page5 unmapped and migrated to A
A’s page table B’s page table
17
MEMORY OVERSUBSCRIPTION
page1
page2
page3
page4
page5
A’s phys mem B’s phys mem
page1
page2
page3
page4
page5
*addr5 = 1 -> access replay
A’s page table B’s page table
18
SIMPLIFYING DL FRAMEWORK DESIGN
Eliminated 3,000 lines of repetitive and error-prone code in Caffe
Developers can add new inherited Layer classes in a much simpler manner
The final call to a CPU function or a GPU kernel (caffe_gpu_gemm) still needs to be explicit
class ConvolutionLayer {
public:
  void cpu_data();  void cpu_diff();  void gpu_data();  void gpu_diff();
  void mutable_cpu_data();  void mutable_cpu_diff();
  void mutable_gpu_data();  void mutable_gpu_diff();
  void Forward_cpu();  void Forward_gpu();
  void forward_cpu_gemm();  void forward_gpu_gemm();
  void forward_cpu_bias();  void forward_gpu_bias();
  void Backward_cpu();  void Backward_gpu();
  void backward_cpu_gemm();  void backward_gpu_gemm();
  void backward_cpu_bias();  void backward_gpu_bias();
};

class ConvolutionLayer {
public:
  void data();  void diff();
  void mutable_data();  void mutable_diff();
  void Forward();  void forward_gemm();  void forward_bias();
  void Backward();  void backward_gemm();  void backward_bias();
};
Existing Design
Unified Memory Design
A. A. Awan, C-H Chu, H. Subramoni, X. Lu, D.K. Panda, “OC-DNN: Designing
Out-of-Core Deep Neural Network Training by Exploiting Unified Memory on
Pascal and Volta GPUs”, <double-blind submission under review>
19
CAN THIS DESIGN OFFER GOOD PERF?
DL training with Unified Memory
OC-Caffe will be released by the HiDL Team@OSU: hidl.cse.ohio-state.edu, mvapich.cse.ohio-state.edu
[Bar charts, images/sec: ResNet-50 training on 1xV100 (16GB) at batch 40 and batch 45, and VGG19 training on 1xV100 (16GB) at batch 110 and batch 120; each compares Caffe (BVLC) vs OC-Caffe for an in-memory and an out-of-core batch size]
20
CONCURRENT ACCESS
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
A’s page table B’s page table
21
CONCURRENT ACCESS: Exclusive Access*
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
atomicAdd_system(addr2, 1)
page fault
atomicAdd_system(addr2, 1)
local access
A’s page table B’s page table
*this is a possible implementation and to guarantee this behavior you need to use cudaMemAdvise policies
22
CONCURRENT ACCESS: Exclusive Access
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
atomicAdd_system(addr2, 1)
page fault
page2 unmapped in B’s memory and migrated to A
A’s page table B’s page table
23
CONCURRENT ACCESS: Exclusive Access
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
atomicAdd_system(addr2, 1)
local access
A’s page table B’s page table
24
CONCURRENT ACCESS: Atomics over NVLINK*
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
atomicAdd_system(addr2, 1)
remote access
*both processors need to support atomic operations
atomicAdd_system(addr2, 1)
local access
A’s page table B’s page table
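The system-wide atomics shown above can be sketched as a small program (atomicAdd_system requires compute capability 6.0+; CPU participation additionally requires a coherent platform such as P9 + NVLink):

```cuda
// Sketch of system-wide atomics on a managed counter (Pascal or later).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void count(int *counter) {
    atomicAdd_system(counter, 1);   // atomic visible to all processors in the system
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;                   // CPU initializes the shared counter
    count<<<4, 256>>>(counter);
    cudaDeviceSynchronize();
    printf("%d\n", *counter);       // prints 1024 (4 blocks x 256 threads)
    cudaFree(counter);
    return 0;
}
```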
25
CONCURRENT ACCESS: Read duplication*
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
val = *addr2; -> local access
*each processor must maintain its own page table
val = *addr2; local access
A’s page table B’s page table
26
CONCURRENT ACCESS: Read duplication: write
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
*addr2 = val2; -> local access
a write will collapse all copies into one
A’s page table B’s page table
27
CONCURRENT ACCESS: Read duplication: read after write
page1
page2
page3
A’s phys mem B’s phys mem page1
page2
page3
val = *addr2; local access
pages are duplicated again on faults
A’s page table B’s page table
28
ANALYTICS USE CASE: Design of a Concurrent Hybrid Hash Table
GPU 0 GPU 1
Multiple CPU Cores
Hash table implemented
via Unified Memory
Concurrent Inserts and Fetches
Non-blocking updates
using atomic compare&swap
Concurrent Fetches Concurrent Fetches
S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F
29
ANALYTICS USE CASE: Concurrent Access To Hash Table
HBM
SM
L2
SM SM
SYSMEM
page fault
x86
30
SYSMEM
ANALYTICS USE CASE: Concurrent Access To Hash Table
HBM
SM
L2
SM SM
page fault
page migration
access replay
x86
31
SYSMEM
ANALYTICS USE CASE: Concurrent Access To Hash Table
HBM
SM
L2
SM SM
P9
cache
P9 can directly update hash entry in GPU memory
no page faults or migrations!
HBM
SM
L2
SM SM
page fault
page migration
access replay
x86
S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 – Thu 11:00 Room 210F
32
UNIFIED MEMORY + DGX-2
UNIFIED MEMORY PROVIDES
Single memory view shared by all GPUs
Automatic migration of data between GPUs
User control of data locality
GPU0
GPU1
GPU2
GPU3
GPU4
GPU5
GPU6
GPU7
GPU8
GPU9
GPU10
GPU11
GPU12
GPU13
GPU14
GPU15
512 GB Unified Memory
33
ENABLING MULTI-GPU
__global__ void kernel(int *data) {
int idx = threadIdx.x + blockDim.x * blockIdx.x;
doSomeStuff(idx, data, ...);
}
cudaMallocManaged(&data, N * sizeof(int));
// initialize data on the CPU
kernel<<<grid, block>>>(data);
__global__ void kernel(int *data, int gpuId) {
int idx = threadIdx.x + blockDim.x * (blockIdx.x
+ gpuId * gridDim.x);
doSomeStuff(idx, data, ...);
}
cudaMallocManaged(&data, N * sizeof(int));
// initialize data on the CPU
for (int i = 0; i < numGPUs; i++) {
cudaSetDevice(i);
kernel<<<grid/numGPUs, block>>>(data, i);
}
Single-GPU Multi-GPU
update launch config
update blockIdx.x
34
MULTI-GPU WITH UNIFIED MEMORY
GPU 3
GPU 0
GPU 2
GPU 1
SYSMEM
GPU kernels initiate migrations
35
MULTI-GPU WITH UNIFIED MEMORY
GPU 3
GPU 0
GPU 2
GPU 1
Data automatically partitioned between GPUs on first-touch
SYSMEM
36
MULTI-GPU WITH UNIFIED MEMORY
GPU 3
GPU 0
GPU 2
GPU 1
With policies: Remote data can be accessed directly without migrations
SYSMEM
37
MULTI-GPU WITH UNIFIED MEMORY
GPU 3
GPU 0
GPU 2
GPU 1
With policies: Read-only data can be duplicated and accessed locally
SYSMEM
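The "with policies" behavior on these slides maps to cudaMemAdvise calls. A hypothetical helper that partitions a managed buffer across GPUs, pinning each slice to one GPU while letting peers map it remotely, might look like:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: express the locality policies from these slides.
// Each slice prefers one GPU (resists migration away), and every peer GPU
// gets a direct mapping so remote accesses do not fault or migrate.
void partition_across_gpus(char *data, size_t bytes, int numGPUs) {
    size_t slice = bytes / numGPUs;
    for (int i = 0; i < numGPUs; i++) {
        char *p = data + i * slice;
        cudaMemAdvise(p, slice, cudaMemAdviseSetPreferredLocation, i);
        for (int j = 0; j < numGPUs; j++)
            if (j != i)
                cudaMemAdvise(p, slice, cudaMemAdviseSetAccessedBy, j);
    }
}

int main() {
    int numGPUs = 0;
    cudaGetDeviceCount(&numGPUs);
    char *data;
    size_t bytes = 1 << 26;
    cudaMallocManaged(&data, bytes);
    partition_across_gpus(data, bytes, numGPUs);
    // Read-only data could instead be duplicated on every reader:
    // cudaMemAdvise(data, bytes, cudaMemAdviseSetReadMostly, 0);
    cudaFree(data);
    return 0;
}
```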
38
GPU ARCHITECTURE AND SOFTWARE EVOLUTION
39
UNIFIED MEMORY: Evolution of GPU architectures
2012: Kepler: release of the new “single-pointer” programming model
2014: Maxwell: no new features related to Unified Memory
2016: Pascal: on-demand migration, concurrent access and atomics, oversubscription
2017: Volta: access counters; P9 only: hardware coherency, ATS support
NVLINK1
NVLINK2
*Not all features are available on all platforms
40
UNIFIED MEMORY: BEFORE PASCAL
No GPU page fault support: move all dirty pages on kernel launch
No concurrent access, no GPU memory oversubscription, no system-wide atomics
Available since CUDA 6
page1
page2
page3
A’s page table A’s phys mem B’s phys mem
page1
page2
page3
B’s page table
kernel launch
41
UNIFIED MEMORY: PASCAL AND VOLTA
GPU page fault support, concurrent access, extended VA space (48-bit)
On-demand migration to accessing processor on first touch
Available since CUDA 8
page1
page2
A’s phys mem B’s phys mem
page1
page2
A’s page table B’s page table
42
UNIFIED MEMORY ON VOLTA+P9
If memory is mapped to the GPU, migration can be triggered by access counters
New Feature: Access Counters
page1
page2
A’s phys mem B’s phys mem
page1
page2
few accesses
many accesses
A’s page table B’s page table
43
UNIFIED MEMORY ON VOLTA+P9
With access counters only hot pages will be moved to the GPU
Migrations are delayed compared to the fault-based method
New Feature: Access Counters
page1
page2
A’s phys mem B’s phys mem
page1
page2
many accesses
*When implemented this feature can be enabled with cudaMemAdvise policies
A’s page table B’s page table
44
UNIFIED MEMORY ON VOLTA+P9
CPU can directly access and cache GPU memory
Native atomics support for all accessible memory
New Feature: Hardware Coherency with NVLINK2
page1
page2
GPU mem system mem
page1
page2
V100 P9
V100’s page table P9’s page table
45
UNIFIED MEMORY ON VOLTA+P9
ATS: address translation service; CPU and GPU can share a single page table
New Feature: ATS support
page1 page2
GPU mem CPU mem
V100 P9
ATS page table
46
UNIFIED MEMORY WITH SYSTEM ALLOCATOR
System allocator support allows GPU to access all system memory
malloc, stack, global, file system
P9: Address Translation Service (ATS)
Support enabled in CUDA 9.2
x86: Heterogeneous Memory Management (HMM)
Initial version of the patchset is integrated into the 4.14 kernel
NVIDIA will be supporting upcoming versions of HMM
https://lkml.org/lkml/2017/6/23/443
47
WHAT YOU CAN DO WITH UNIFIED MEMORY
See it in action at the end of the talk!
int *data;
cudaMallocManaged(&data, sizeof(int) * n);
kernel<<<grid, block>>>(data);
int *data = (int*)malloc(sizeof(int) * n);
kernel<<<grid, block>>>(data);
int data[1024];
kernel<<<grid, block>>>(data);
int *data = (int*)alloca(sizeof(int) * n);
kernel<<<grid, block>>>(data);
extern int *data;
kernel<<<grid, block>>>(data);
Works everywhere today: cudaMallocManaged
Works on Power9 + ATS in CUDA 9.2, and in the future on x86 + HMM: malloc, stack, global, alloca
48
UNIFIED MEMORY LANGUAGES
CUDA C/C++: cudaMallocManaged
CUDA Fortran: managed attribute (per allocation)
Python: pycuda.driver.managed_empty (allocate numpy.ndarray)
OpenACC: -ta=managed compiler option (all dynamic allocations)
49
UNIFIED MEMORY + OPENACC
Literally adding a single line will get your code running on the GPU
Easy to optimize later: add loop and data directives
Effortless way to run your code on GPUs
#pragma acc kernels
{
for (i = 0; i < n; ++i) {
c[i] = a[i] + b[i];
...
}
}
...
Initiate parallel execution
50
GYROKINETIC TOROIDAL CODE: Particle-In-Cell production code
http://phoenix.ps.uci.edu/gtc_group
[Bar chart, speed-up over CPU-only: 2xV100 with data directives, 2xV100 with Unified Memory, 4xV100 with data directives, 4xV100 with Unified Memory]
CPU: Haswell E5-2698 v3 @ 2.30GHz, dual socket 16-core; MPI: MVAPICH-GDR
51
PERFORMANCE DEEP DIVE
52
STREAMING BENCHMARK: How fast is on-demand migration?
[Bar chart, GB/s streaming performance on x86 + P100 (PCIe), x86 + V100 (PCIe), P8 + P100 (NVLINK): on-demand vs prefetch vs memcpy]
__global__ void kernel(int *host, int *device) {
int i = threadIdx.x + blockDim.x * blockIdx.x;
device[i] = host[i];
}
// allocate and initialize memory
cudaMallocManaged(&host, size);
cudaMalloc(&device, size);
memset(host, 0, size);
// benchmark CPU->GPU migration
if (prefetch)
cudaMemPrefetchAsync(host, size, gpuId);
kernel<<<grid, block>>>(host, device);
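A possible complete harness for this benchmark is sketched below; the event-based timing and grid configuration are assumptions, since the slide does not show how bandwidth was measured:

```cuda
// Sketch of a complete streaming-migration benchmark. Pass "prefetch" as
// argv[1] to prefetch the managed buffer instead of faulting on demand.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

__global__ void kernel(int *host, int *device, size_t n) {
    size_t i = threadIdx.x + (size_t)blockDim.x * blockIdx.x;
    if (i < n) device[i] = host[i];   // each read of host[] pulls pages to the GPU
}

int main(int argc, char **argv) {
    bool prefetch = (argc > 1 && strcmp(argv[1], "prefetch") == 0);
    size_t n = 16 << 20, size = n * sizeof(int);
    int *host, *device, gpuId = 0;
    cudaMallocManaged(&host, size);
    cudaMalloc(&device, size);
    memset(host, 0, size);            // populate pages on the CPU first
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0); cudaEventCreate(&t1);
    cudaEventRecord(t0);
    if (prefetch)
        cudaMemPrefetchAsync(host, size, gpuId);
    kernel<<<(n + 255) / 256, 256>>>(host, device, n);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms;
    cudaEventElapsedTime(&ms, t0, t1);
    printf("%.2f GB/s\n", size / ms / 1e6);   // bytes / (ms * 1e6) = GB/s
    cudaFree(host); cudaFree(device);
    return 0;
}
```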
53
==14487== Profiling result:
Type Time(%) Time Calls Avg Min Max Name
GPU activities: 100.00% 23.270ms 1 23.270ms 23.270ms 23.270ms void kernel(int*, int*)
API calls: 79.56% 23.272ms 1 23.272ms 23.272ms 23.272ms cudaDeviceSynchronize
20.42% 5.9732ms 1 5.9732ms 5.9732ms 5.9732ms cudaLaunch
0.01% 2.0490us 1 2.0490us 2.0490us 2.0490us cudaConfigureCall
0.01% 1.8360us 4 459ns 138ns 833ns cudaSetupArgument
==14487== Unified Memory profiling result:
Device "Tesla V100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
3012 21.758KB 4.0000KB 952.00KB 64.00000MB 13.49043ms Host To Device
81 - - - - 23.23181ms Gpu page fault groups
UNDERSTANDING PROFILER OUTPUT
54
HEURISTIC PREFETCHING
GPU architecture supports different page sizes
Contiguous pages up to a large page size are promoted to the larger size
Driver prefetches whole regions if pages are accessed densely
Do Not Confuse with API-prefetching
GPU
CPU
4KB 4KB 952KB
1024KB
55
WHAT ARE PAGE FAULT GROUPS?
==14487== Unified Memory profiling result:
Device "Tesla V100-PCIE-16GB (0)"
Count Avg Size Min Size Max Size Total Size Total Time Name
3012 21.758KB 4.0000KB 952.00KB 64.00000MB 13.49043ms Host To Device
81 - - - - 23.23181ms Gpu page fault groups
Unified Memory Virtual Address Name
8 0x3900010000 [Unified Memory GPU page faults]
9 0x3900040000 [Unified Memory GPU page faults]
5 0x3900108000 [Unified Memory GPU page faults]
5 0x3900200000 [Unified Memory GPU page faults]
nvprof --print-gpu-trace ...
56
PAGE FAULTS HANDLING
[Diagram: in the thread/4B pattern, consecutive 128B accesses from several warps fall on the same GPU page; warps 2 and 3 fault, other warps stall during fault processing, then the faulting warps are resumed and their accesses replayed]
57
OPTIMIZING ON-DEMAND MIGRATION
Increase fault concurrency to reduce page fault stalls
thread/4B: multiple faults per page; warps are stalled on fault processing
warp/64KB: each warp touches its own GPU page; fewer warps are stalled and the “spread-out” pattern improves prefetching
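The two patterns can be made concrete with a copy kernel; this warp-per-64KB remapping is a sketch of the idea, not the exact benchmark code:

```cuda
// Sketch of the warp/64KB pattern: each warp strides through its own 64KB
// region, so a single fault covers the whole warp and the faults from
// different warps land on many distinct pages at once. The thread/4B
// baseline is the usual "i = threadIdx.x + blockDim.x * blockIdx.x" indexing.
__global__ void copy_warp_per_page(int *dst, const int *src, size_t n) {
    const size_t ints_per_page = 64 * 1024 / sizeof(int);  // 16K ints per 64KB
    size_t warp = (threadIdx.x + (size_t)blockDim.x * blockIdx.x) / 32;
    size_t lane = threadIdx.x % 32;
    size_t base = warp * ints_per_page;                    // this warp's 64KB region
    for (size_t i = base + lane; i < base + ints_per_page && i < n; i += 32)
        dst[i] = src[i];                                   // still coalesced within the warp
}
```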
58
OPTIMIZING ON-DEMAND MIGRATION
Count Avg Size Min Size Max Size Total Size Total Time Name
3012 21.758KB 4.0000KB 952.00KB 64.00000MB 13.49043ms Host To Device
81 - - - - 23.23181ms Gpu page fault groups
Unified Memory Virtual Address Name
8 0x3900010000 [Unified Memory GPU page faults]
9 0x3900040000 [Unified Memory GPU page faults]
5 0x3900108000 [Unified Memory GPU page faults]
Count Avg Size Min Size Max Size Total Size Total Time Name
957 68.481KB 4.0000KB 576.00KB 64.00000MB 8.242080ms Host To Device
6 - - - - 9.769984ms Gpu page fault groups
Unified Memory Virtual Address Name
1 0x39000d0000 [Unified Memory GPU page faults]
1 0x39000c0000 [Unified Memory GPU page faults]
1 0x3900080000 [Unified Memory GPU page faults]
Thread/4B
Warp/64KB
more efficient prefetching
fewer stalls
59
STREAMING BENCHMARK: How fast is on-demand migration?
[Bar chart, GB/s streaming performance on x86 + P100 (PCIe), x86 + V100 (PCIe), P8 + P100 (NVLINK): on-demand thread/4B vs on-demand warp/64K vs prefetch vs memcpy]
Also check the Parallel Forall blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/
60
GPU MEMORY OVERSUBSCRIPTION: Let’s see how perf changes as we increase the working set
[Line charts, GB/s vs working set (10 to 50 GB), with the GPU memory limit marked: fault-based migration throughput on P8/NVLINK and on x86/PCIe for thread/4B, warp/64K, thread/4B + readmostly, warp/64K + readmostly]
61
EVICTION ALGORITHM
Driver keeps a list of physical chunks of GPU memory
Chunks from the front of the list are evicted first (LRU)
A chunk is considered “in use” when it is fully-populated or migrated
Prefetching and policies may impact eviction heuristic in the future
What Pages Are Moving Out of the GPU
eviction
allocation
or
migration
62
PREFETCHING AND EVICTIONS
[Bar chart, GB/s migration throughput (x86+V100), with and without eviction: on-demand thread/4B, on-demand warp/64K, 1-way prefetch]
Prefetch can be overlapped with evictions without using CUDA streams! (enabled in CUDA 9.1)
The cudaMemcpy solution requires scheduling DtoH and HtoD copies into two separate streams
63
DRIVER ENHANCEMENTS: Tesla V100 + x86, HPGMG-AMR default (no hints)
[Bar chart, MDOF/s vs working set (1.4 to 58.6 GB): driver 384.81 (CUDA 8.0 GA2) vs 387.34 (CUDA 9.0)]
Improvements for the allocator
64
DRIVER ENHANCEMENTS: Tesla V100 + x86, HPGMG-AMR optimized (prefetches)
[Bar chart, MDOF/s vs working set (1.4 to 58.6 GB): driver 384.81 (CUDA 8.0 GA2) vs 387.34 (CUDA 9.0) vs 390.30 (CUDA 9.1)]
Prefetch + Evictions Overlap
65
LOCALITY AND ACCESS CONTROL
Default: data migrates on first touch
ReadMostly: data duplicated on first touch
PreferredLocation: resist migrating away from the preferred location
AccessedBy: establish direct mapping and avoid faults
cudaMemAdvise
66
while not converged:
  LagrangeNodal: CalcForceForNodes, CalcAccelerationForNodes, BoundaryConditionsForNodes, CalcVelocityForNodes, CalcPositionForNodes
  LagrangeElements: CalcLagrangeElements, CalcQForElems, MaterialPropertiesForElems
  CalcTimeConstraintsForElems: CalcCourantConstraintForElems, CalcHydroConstraintForElems
LULESH: CORAL OpenACC app
few heavy GPU kernels
many small GPU kernels, many small CPU functions
67
USING VISUAL PROFILER
Kernel time = 17ms Kernel time = 27ms
1st iteration; 2nd iteration is longer, why?
what’s this?
68
THRASHING MITIGATION
Before CUDA 9.2: when memory is pinned we lose any insight into the access pattern
In the future we may use access counters on Volta to find a better location
Processors frequently read or write to the same page
GPU
CPU
Page Throttling Remote Map
Memory thrashing detected
69
LULESH: PREFERRED LOCATION = GPU
No policies
PreferredLocation
lots of back-and-forth migrations
but no slowdown for the heavy kernels
anti-thrashing heuristic kicks in and pins the page to the CPU
70
LULESH: READ MOSTLY
no slowdown for the heavy kernels
almost no thrashing for the hybrid CPU/GPU part
71
[Bar chart, FOM (z/s): LULESH performance, 200^3 mesh, 100 iterations, CUDA 9.1; Default, Preferred, ReadMostly, and Improved Default Heuristics policies on Quadro GP100 vs Quadro GV100]
LULESH: PERF IMPROVEMENTS
GP100 has 500x more sysmem writes than GV100
Will be enabled in the future drivers
72
WHEN TO USE UNIFIED MEMORY
cudaMalloc vs cudaMallocManaged:
Pinned allocation: cudaMalloc | cudaMallocManaged + PreferredLocation(GPU) + SetAccessedBy(peer GPUs) + cudaMemPrefetchAsync(GPU)
cudaMemcpy ptrA -> ptrB: staging for non-pinned allocations or between non-P2P GPUs | staging or a copy kernel required in all cases
Memory migration: not possible | cudaMemPrefetchAsync
Debugging: difficult | easy
Oversubscription: no | yes
IPC support: yes | no
73
WHEN TO USE UNIFIED MEMORY
Application Code
Sequential code (CPU)
Sequential code (CPU)
Parallel code (GPU)
cudaMalloc
cudaMallocManaged/malloc
cudaMallocManaged/malloc
Parallel Code
75
UNIFIED MEMORY PLATFORMS
KEPLER | PASCAL | VOLTA
Linux + x86: no GPU fault support, no concurrent access (Kepler) | on-demand migration (Pascal) | on-demand migration (Volta)
Linux + Power: on-demand migration, 80GB/s CPU-GPU BW* (Pascal) | on-demand migration, 150GB/s CPU-GPU BW**, access counters, HW coherency, ATS support (Volta)
Windows: no GPU fault support, no concurrent access
MacOS: no GPU fault support, no concurrent access
Tegra: cached on CPU and iGPU, no concurrent access
*IBM Minsky: 4xP100 + 2xP8, 2xNVLINK1 links between P100 and P8, bi-directional aggregate BW**IBM Newell: 4xV100 + 2xP9, 3xNVLINK2 links between V100 and P9, bi-directional aggregate BW
76
READ DUPLICATION: Usage example
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
cudaMemPrefetchAsync(data, N, myGpuId, s);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The prefetch creates a copy instead of
moving data
Both processors can read data
simultaneously without faults
Writes will collapse all copies into one,
subsequent reads will fault and duplicate
GPU: my_kernel
CPU: init_data
CPU: use_data
77
PREFERRED LOCATION: Resisting migrations
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The kernel will page fault and
generate direct mapping to
data on the CPU
The driver will “resist”
migrating data away from the
preferred location
GPU: my_kernel
CPU: init_data
CPU: use_data
78
PREFERRED LOCATION: Page population on first-touch
char *data;
cudaMallocManaged(&data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The kernel will page fault,
populate pages on the CPU
and generate direct mapping to
data on the CPU
Pages are populated on the
preferred location if the
faulting processor can access it
GPU: my_kernel
CPU: use_data
79
PREFERRED LOCATION ON P9+V100: CPU can directly access GPU memory
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, gpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
The kernel will page fault and
migrate data to the GPU
CPU will fault and access data
directly instead of migrating
on non P9+V100 systems the driver will migrate back to the CPU
GPU: my_kernel
CPU: init_data
CPU: use_data
80
ACCESSED BY: Usage example
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
GPU will establish direct mapping of
data in CPU memory, no page faults
will be generated
Memory can move freely to other
processors and mapping will carry
over
GPU: my_kernel
CPU: init_data
CPU: use_data
81
ACCESSED BY: Using access counters on Volta
char *data;
cudaMallocManaged(&data, N);
init_data(data, N);
cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);
mykernel<<<..., s>>>(data, N);
use_data(data, N);
cudaFree(data);
GPU will establish direct mapping of
data in CPU memory, no page faults
will be generated
Access counters may eventually
trigger migration of frequently
accessed pages to the GPU
GPU: my_kernel
CPU: init_data
CPU: use_data
82
ptr = malloc(size);
doStuffOnGpu<<<...>>>(ptr, size);
MANAGED VS MALLOC ON VOLTA+P9: First touch allocation policy
GPU page faults
Unified Memory driver allocates on GPU
GPU accesses GPU memory
cudaMallocManaged(&ptr, size);
doStuffOnGpu<<<...>>>(ptr, size);
*You may alter this behavior by using cudaMemAdvise policies
GPU uses ATS, faults
OS allocates on CPU (by default)
GPU uses ATS to access CPU memory
83
MANAGED VS MALLOC ON P9: cudaMallocManaged: same behavior as x86
cudaMallocManaged(&ptr, size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size);
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size);
GPU page faults; ptr migrated to GPU
CPU page faults; ptr migrated to CPU
84
MANAGED VS MALLOC ON P9: malloc: no on-demand migrations*
ptr = malloc(size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size);
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size);
GPU uses ATS to access CPU memory (no on-demand migration except via cudaMemPrefetchAsync*)
CPU accesses CPU memory
*In the future Volta access counters will be used to migrate malloc memory
85
HYPRE-INSPIRED USE CASE
Algebraic Multi-Grid library: https://github.com/LLNL/hypre
Lots of small allocations: multiple variables may end up on the same page
If used by different processors this will result in false-sharing
512B 512B 512B 512B 512B 512B 512B 512B
GPU CPU
single 4KB page
86
FALSE-SHARING
Issues with false-sharing:
- Spurious migrations; thrashing mitigation does not solve this
- Performance hints are applied on page boundaries, so with suballocation data may inherit the wrong policies
How to mitigate this:
- Use separate allocators or memory pools for CPU and GPU data
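A minimal sketch of the mitigation: a per-processor bump allocator that rounds every request up to 4KB page granularity, so two processors never share a page. PageGranularPool is a hypothetical name; a real version would carve its slab from cudaMallocManaged, while plain allocation is used here so the sketch runs anywhere:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Hypothetical per-processor pool: every allocation is rounded up to a 4KB
// page, so suballocations handed out by one pool never share a page with
// another pool's data (and cudaMemAdvise hints apply cleanly per object).
class PageGranularPool {
    static const size_t kPage = 4096;
    char *slab_;
    size_t used_, cap_;
public:
    explicit PageGranularPool(size_t cap)
        : slab_((char *)std::aligned_alloc(kPage, cap)), used_(0), cap_(cap) {}
    ~PageGranularPool() { std::free(slab_); }
    void *alloc(size_t bytes) {
        size_t rounded = (bytes + kPage - 1) / kPage * kPage;  // round up to 4KB
        if (used_ + rounded > cap_) return nullptr;            // pool exhausted
        void *p = slab_ + used_;
        used_ += rounded;
        return p;
    }
};
```

With one pool per processor, two 512-byte objects used by different processors land on different pages instead of falsely sharing one.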