EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY
Nikolay Sakharnykh, 3/27/2018

Transcript of the slide deck.

SINGLE POINTER: CPU vs GPU

CPU code:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cpu_func2(data, N);
    cpu_func3(data, N);
    free(data);

GPU code w/ Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
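On systems without the ATS/HMM support covered later in the talk, the same single-pointer pattern works today via cudaMallocManaged. A minimal sketch, where gpu_func2, the launch configuration, and N are illustrative stand-ins for the slide's placeholders:

```cuda
#include <cstring>
#include <cuda_runtime.h>

// Hypothetical kernel standing in for the slide's gpu_func2.
__global__ void gpu_func2(char *data, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int N = 1 << 20;
    char *data;
    cudaMallocManaged(&data, N);       // one pointer, valid on CPU and GPU
    memset(data, 0, N);                // CPU touches the pages first
    gpu_func2<<<(N + 255) / 256, 256>>>(data, N);  // pages migrate on demand
    cudaDeviceSynchronize();           // required before the CPU reads again
    // the CPU can read data[i] here; pages migrate back on access
    cudaFree(data);                    // managed memory is freed with cudaFree
    return 0;
}
```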

SINGLE POINTER: Explicit vs Unified Memory

Explicit Memory Management:

    void *data, *d_data;
    data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(data, N);
    cudaMemcpy(d_data, data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(data, d_data, N, ...);
    cudaFree(d_data);
    cpu_func3(data, N);
    free(data);

GPU code w/ Unified Memory:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    gpu_func2<<<...>>>(data, N);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);

SINGLE POINTER: Full Control with Prefetching

Explicit Memory Management:

    void *data, *d_data;
    data = malloc(N);
    cudaMalloc(&d_data, N);
    cpu_func1(data, N);
    cudaMemcpy(d_data, data, N, ...);
    gpu_func2<<<...>>>(d_data, N);
    cudaMemcpy(data, d_data, N, ...);
    cudaFree(d_data);
    cpu_func3(data, N);
    free(data);

Unified Memory + Prefetching:

    void *data;
    data = malloc(N);
    cpu_func1(data, N);
    cudaMemPrefetchAsync(data, N, GPU);
    gpu_func2<<<...>>>(data, N);
    cudaMemPrefetchAsync(data, N, CPU);
    cudaDeviceSynchronize();
    cpu_func3(data, N);
    free(data);
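The GPU/CPU placeholders on the slide correspond to a device ordinal and cudaCpuDeviceId in the real API. A sketch of the round trip, assuming GPU 0 and a size chosen for illustration:

```cuda
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 26;
    int device = 0;                        // assumption: use GPU 0
    char *data;
    cudaMallocManaged(&data, N);
    memset(data, 0, N);                    // populate on the CPU

    cudaMemPrefetchAsync(data, N, device); // bulk-migrate to the GPU, no faults
    // kernel<<<grid, block>>>(data, N);   // launch GPU work here
    cudaMemPrefetchAsync(data, N, cudaCpuDeviceId); // migrate back to the CPU
    cudaDeviceSynchronize();               // wait for the kernel and prefetches

    cudaFree(data);
    return 0;
}
```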

SINGLE POINTER: Deep Copy

Explicit Memory Management:

    char **data;
    // allocate and initialize data on the CPU
    char **d_data;
    char **h_data = (char**)malloc(N*sizeof(char*));
    for (int i = 0; i < N; i++) {
        cudaMalloc(&h_data[i], N);
        cudaMemcpy(h_data[i], data[i], N, ...);
    }
    cudaMalloc(&d_data, N*sizeof(char*));
    cudaMemcpy(d_data, h_data, N*sizeof(char*), ...);
    gpu_func<<<...>>>(d_data, N);

GPU code w/ Unified Memory:

    char **data;
    // allocate and initialize data on the CPU
    gpu_func<<<...>>>(data, N);
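The slide's point, that a pointer-based structure built on the CPU is directly usable on the GPU, can be sketched with managed allocations for both the pointer array and the rows. gpu_func and the sizes here are illustrative assumptions:

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel: dereferences nested pointers directly.
__global__ void gpu_func(char **data, int n) {
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) data[i][0] = 1;
}

int main() {
    const int N = 256;
    char **data;
    cudaMallocManaged(&data, N * sizeof(char *)); // managed pointer array
    for (int i = 0; i < N; i++)
        cudaMallocManaged(&data[i], N);           // managed rows: no deep copy
    gpu_func<<<1, N>>>(data, N);                  // same pointers work on the GPU
    cudaDeviceSynchronize();
    for (int i = 0; i < N; i++) cudaFree(data[i]);
    cudaFree(data);
    return 0;
}
```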

UNIFIED MEMORY BASICS

Single virtual memory shared between processors.

[Diagram: GPU A and GPU B each keep their own page table; page1 and page2 of the shared virtual address space are backed by A’s and B’s physical memory.]

[Diagram sequence: *addr1 = 1 is a local access because page1 is resident locally. *addr3 = 1 triggers a page fault because page3 is not yet populated; page3 is populated and mapped into B’s memory, and the access is replayed.]

[Diagram sequence: GPU A writes *addr2 and *addr3 while page2 and page3 are resident in B’s memory, so both accesses page-fault. page2 and page3 are unmapped from B’s memory, the pages’ data is migrated to A’s physical memory, and the accesses are replayed.]

MEMORY OVERSUBSCRIPTION

[Diagram sequence: A’s physical memory is full when *addr5 = 1 faults because page5 is not resident. To make room, page4 is unmapped from A’s memory and migrated; page4 is mapped in B’s memory, page5 is unmapped and migrated to A, and the access is replayed.]

SIMPLIFYING DL FRAMEWORK DESIGN

Eliminated 3,000 lines of repetitive and error-prone code in Caffe.
Developers can add new inherited Layer classes in a much simpler manner.
The final call to a CPU function or a GPU kernel (caffe_gpu_gemm) still needs to be explicit.

Existing design:

    class ConvolutionLayer {
    public:
        void cpu_data();  void cpu_diff();  void gpu_data();  void gpu_diff();
        void mutable_cpu_data();  void mutable_cpu_diff();
        void mutable_gpu_data();  void mutable_gpu_diff();
        void Forward_cpu();  void Forward_gpu();
        void forward_cpu_gemm();  void forward_gpu_gemm();
        void forward_cpu_bias();  void forward_gpu_bias();
        void Backward_cpu();  void Backward_gpu();
        void backward_cpu_gemm();  void backward_gpu_gemm();
        void backward_cpu_bias();  void backward_gpu_bias();
    };

Unified Memory design:

    class ConvolutionLayer {
    public:
        void data();  void diff();
        void mutable_data();  void mutable_diff();
        void Forward();  void forward_gemm();  void forward_bias();
        void Backward();  void backward_gemm();  void backward_bias();
    };

A. A. Awan, C-H Chu, H. Subramoni, X. Lu, D. K. Panda, “OC-DNN: Designing Out-of-Core Deep Neural Network Training by Exploiting Unified Memory on Pascal and Volta GPUs”, <double-blind submission under review>

CAN THIS DESIGN OFFER GOOD PERF? DL training with Unified Memory

OC-Caffe will be released by the HiDL Team@OSU: hidl.cse.ohio-state.edu, mvapich.cse.ohio-state.edu

[Charts: images/sec for ResNet-50 training on 1xV100 (16GB) at batch sizes 40 (in-memory) and 45 (out-of-core), and VGG19 training on 1xV100 (16GB) at batch sizes 110 (in-memory) and 120 (out-of-core), comparing Caffe (BVLC) and OC-Caffe.]

CONCURRENT ACCESS: Exclusive Access*

[Diagram: page2 is resident in A’s physical memory. atomicAdd_system(addr2, 1) from A is a local access; the same atomic from B triggers a page fault.]

*This is a possible implementation; to guarantee this behavior you need to use cudaMemAdvise policies.

[Diagram sequence: after the page fault from atomicAdd_system(addr2, 1), page2 is unmapped in B’s memory and migrated to A; the replayed atomic then completes as a local access.]

CONCURRENT ACCESS: Atomics over NVLINK*

[Diagram: page2 stays where it is; atomicAdd_system(addr2, 1) from the owning processor is a local access, while the same atomic from the other processor is performed as a remote access over NVLINK, with no migration.]

*Both processors need to support atomic operations.
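A minimal sketch of a system-scope atomic on a managed counter, assuming a Pascal-or-newer GPU and a platform where both processors support the atomic. The kernel name and launch shape are illustrative:

```cuda
#include <cuda_runtime.h>

__global__ void gpu_count(int *counter) {
    atomicAdd_system(counter, 1);   // system scope: coherent with CPU atomics
}

int main() {
    int *counter;
    cudaMallocManaged(&counter, sizeof(int));
    *counter = 0;
    gpu_count<<<4, 64>>>(counter);
    // On supported platforms the CPU may update concurrently, e.g. with
    // __sync_fetch_and_add(counter, 1); here we just wait for the GPU.
    cudaDeviceSynchronize();
    // expect *counter == 256 (4 blocks x 64 threads), assuming the launch succeeds
    cudaFree(counter);
    return 0;
}
```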

CONCURRENT ACCESS: Read Duplication*

[Diagram: with read duplication, each processor holds its own copy of page2, so val = *addr2 is a local access on both sides.]

Read duplication, write: [Diagram: a write such as *addr2 = val2 collapses all copies into one.]

Read duplication, read after write: [Diagram: pages are duplicated again on subsequent faults.]

*Each processor must maintain its own page table.
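Read duplication is requested with the cudaMemAdviseSetReadMostly hint. A sketch on a managed lookup table, assuming GPU 0 and illustrative sizes:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 20;
    int device = 0;                        // assumption: use GPU 0
    float *table;
    cudaMallocManaged(&table, N * sizeof(float));
    for (size_t i = 0; i < N; i++) table[i] = 1.0f;

    // Mark the region read-mostly: reads create local copies instead of migrating.
    cudaMemAdvise(table, N * sizeof(float), cudaMemAdviseSetReadMostly, device);
    cudaMemPrefetchAsync(table, N * sizeof(float), device); // creates the GPU copy

    // lookup_kernel<<<grid, block>>>(table, N);  // GPU reads are now local;
    // CPU reads of table[i] are local too; a write collapses the copies.
    cudaDeviceSynchronize();
    cudaFree(table);
    return 0;
}
```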

ANALYTICS USE CASE: Design of a Concurrent Hybrid Hash Table

Hash table implemented via Unified Memory, shared by GPU 0, GPU 1, and multiple CPU cores.
Concurrent inserts and fetches; non-blocking updates using atomic compare-and-swap.

S8172 - Evaluation of Hybrid Cache-Coherent Concurrent Hash Table on POWER9 System with NVLink 2.0 - Thu 11:00 Room 210F

ANALYTICS USE CASE: Concurrent Access To Hash Table

[Diagram: on an x86 system, an access to a hash entry that is not resident triggers a page fault, a page migration between SYSMEM and GPU HBM, and an access replay.]

[Diagram: on POWER9 with NVLink 2.0, the P9 CPU can directly update a hash entry in GPU memory through its cache: no page faults or migrations!]

UNIFIED MEMORY + DGX-2

Unified Memory provides:
Single memory view shared by all GPUs
Automatic migration of data between GPUs
User control of data locality

[Diagram: 16 GPUs (GPU0 through GPU15) sharing 512 GB of Unified Memory.]

ENABLING MULTI-GPU

Single-GPU:

    __global__ void kernel(int *data) {
        int idx = threadIdx.x + blockDim.x * blockIdx.x;
        doSomeStuff(idx, data, ...);
    }

    cudaMallocManaged(&data, N * sizeof(int));
    // initialize data on the CPU
    kernel<<<grid, block>>>(data);

Multi-GPU (update the launch config and the blockIdx.x computation):

    __global__ void kernel(int *data, int gpuId) {
        int idx = threadIdx.x + blockDim.x * (blockIdx.x + gpuId * gridDim.x);
        doSomeStuff(idx, data, ...);
    }

    cudaMallocManaged(&data, N * sizeof(int));
    // initialize data on the CPU
    for (int i = 0; i < numGPUs; i++) {
        cudaSetDevice(i);
        kernel<<<grid/numGPUs, block>>>(data, i);
    }

MULTI-GPU WITH UNIFIED MEMORY

[Diagram: four GPUs and SYSMEM; GPU kernels initiate migrations. Data is automatically partitioned between GPUs on first touch.]

[Diagram: with policies, remote data can be accessed directly without migrations, and read-only data can be duplicated and accessed locally.]
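Instead of relying on first-touch partitioning, each GPU's slice can be placed explicitly with a prefetch before the launches. A sketch combining this with the multi-GPU kernel from the earlier slide; the element count is assumed to divide evenly across GPUs:

```cuda
#include <cuda_runtime.h>

int main() {
    const size_t N = 1 << 24;              // assumption: divisible by numGPUs
    int numGPUs = 0;
    cudaGetDeviceCount(&numGPUs);
    int *data;
    cudaMallocManaged(&data, N * sizeof(int));

    size_t chunk = N / numGPUs;
    for (int i = 0; i < numGPUs; i++) {
        cudaSetDevice(i);
        // Place GPU i's partition locally so its kernel takes no faults.
        cudaMemPrefetchAsync(data + i * chunk, chunk * sizeof(int), i);
        // kernel<<<grid/numGPUs, block>>>(data, i);  // from the earlier slide
    }
    for (int i = 0; i < numGPUs; i++) {
        cudaSetDevice(i);
        cudaDeviceSynchronize();
    }
    cudaFree(data);
    return 0;
}
```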

GPU ARCHITECTURE AND SOFTWARE EVOLUTION

UNIFIED MEMORY: Evolution of GPU architectures*

2012 - Kepler: release of the new “single-pointer” programming model
2014 - Maxwell: no new features related to Unified Memory
2016 - Pascal (NVLINK1): on-demand migration, concurrent access and atomics, oversubscription
2017 - Volta (NVLINK2): access counters; P9 only: hardware coherency, ATS support

*Not all features are available on all platforms.

UNIFIED MEMORY: BEFORE PASCAL

Available since CUDA 6.
No GPU page fault support: move all dirty pages on kernel launch.
No concurrent access, no GPU memory oversubscription, no system-wide atomics.

[Diagram: on kernel launch, all pages are migrated from B’s physical memory into A’s and remapped in A’s page table.]

UNIFIED MEMORY: PASCAL AND VOLTA

Available since CUDA 8.
GPU page fault support, concurrent access, extended VA space (48-bit).
On-demand migration to the accessing processor on first touch.

[Diagram: pages migrate on demand between A’s and B’s physical memory.]

UNIFIED MEMORY ON VOLTA+P9
New Feature: Access Counters*

If memory is mapped to the GPU, migration can be triggered by access counters: only hot pages (many accesses) are moved to the GPU, while rarely-touched pages stay put. Migrations are delayed compared to the fault-based method.

[Diagram: a page with few accesses stays remote; a page with many accesses is migrated.]

*When implemented, this feature can be enabled with cudaMemAdvise policies.

Page 44: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

44

UNIFIED MEMORY ON VOLTA+P9

CPU can directly access and cache GPU memory

Native atomics support for all accessible memory

New Feature: Hardware Coherency with NVLINK2

page1

page2

GPU mem system mem

page1

page2

V100 P9

V100’s page table P9’s page table

Page 45: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

45

UNIFIED MEMORY ON VOLTA+P9

ATS: address translation service; CPU and GPU can share a single page table

New Feature: ATS support

page1 page2

GPU mem CPU mem

V100 P9

ATS page table

Page 46: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

46

UNIFIED MEMORY WITH SYSTEM ALLOCATOR

System allocator support allows GPU to access all system memory

malloc, stack, global, file system

P9: Address Translation Service (ATS)

Support enabled in CUDA 9.2

x86: Heterogeneous Memory Management (HMM)

Initial version of the patchset is integrated into 4.14 kernel

NVIDIA will be supporting upcoming versions of HMM

https://lkml.org/lkml/2017/6/23/443

Page 47: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

47

WHAT YOU CAN DO WITH UNIFIED MEMORYSee it in action at the end of the talk!

int *data;

cudaMallocManaged(&data, sizeof(int) * n);

kernel<<<grid, block>>>(data);

int *data = (int*)malloc(sizeof(int) * n);

kernel<<<grid, block>>>(data);

int data[1024];

kernel<<<grid, block>>>(data);

int *data = (int*)alloca(sizeof(int) * n);

kernel<<<grid, block>>>(data);

extern int *data;

kernel<<<grid, block>>>(data);

Works everywhere todayWorks on Power9 + ATS in CUDA 9.2

Will work in the future on x86 + HMM

Page 48: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

48

UNIFIED MEMORY LANGUAGES

CUDA C/C++: cudaMallocManaged

CUDA Fortran: managed attribute (per allocation)

Python: pycuda.driver.managed_empty (allocate numpy.ndarray)

OpenACC: -ta=managed compiler option (all dynamic allocations)

Page 49: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

49

UNIFIED MEMORY + OPENACC

Literally adding a single line will get your code running on the GPU

Easy to optimize later: add loop and data directives

Effortless way to run you code on GPUs

#pragma acc kernels

{

for (i = 0; i < n; ++i) {

c[i] = a[i] + b[i];

...

}

}

...

Initiate parallel execution

Page 50: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

50

GYROKINETIC TOROIDAL CODEParticle-In-Cell production code

http://phoenix.ps.uci.edu/gtc_group

0

2

4

6

8

10

12

14

16

18

CPU only 2xV100Data

Directives

2xV100UnifiedMemory

4xV100Data

Directives

4xV100UnifiedMemory

Speed-u

p

CPU: Haswell E5-2698 v3 @ 2.30GHz, dual socket 16-core; MPI: MVAPICH-GDR

Page 51: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

51

PERFORMANCE DEEP DIVE

Page 52: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

52

STREAMING BENCHMARKHow fast is on-demand migration?

0

5

10

15

20

25

30

35

x86 + P100 (PCIe) x86 + V100 (PCIe) P8 + P100 (NVLINK)G

B/s

Streaming performance

on-demand  prefetch  memcpy

__global__ void kernel(int *host, int *device) {

int i = threadIdx.x + blockDim.x * blockIdx.x;

device[i] = host[i];

}

// allocate and initialize memory

cudaMallocManaged(&host, size);

cudaMalloc(&device, size);

memset(host, 0, size);

// benchmark CPU->GPU migration

if (prefetch)

cudaMemPrefetchAsync(host, size, gpuId);

kernel<<<grid, block>>>(host, device);

Page 53: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

53

==14487== Profiling result:

Type Time(%) Time Calls Avg Min Max Name

GPU activities: 100.00% 23.270ms 1 23.270ms 23.270ms 23.270ms void kernel(int*, int*)

API calls: 79.56% 23.272ms 1 23.272ms 23.272ms 23.272ms cudaDeviceSynchronize

20.42% 5.9732ms 1 5.9732ms 5.9732ms 5.9732ms cudaLaunch

0.01% 2.0490us 1 2.0490us 2.0490us 2.0490us cudaConfigureCall

0.01% 1.8360us 4 459ns 138ns 833ns cudaSetupArgument

==14487== Unified Memory profiling result:

Device "Tesla V100-PCIE-16GB (0)"

Count Avg Size Min Size Max Size Total Size Total Time Name

3012 21.758KB 4.0000KB 952.00KB 64.00000MB 13.49043ms Host To Device

81 - - - - 23.23181ms Gpu page fault groups

UNDERSTANDING PROFILER OUTPUT

Page 54: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

54

HEURISTIC PREFETCHING

GPU architecture supports different page sizes

Contiguous pages up to a large page size are promoted to the larger size

Driver prefetches whole regions if pages are accessed densely

Do Not Confuse with API-prefetching

GPU

CPU

4KB 4KB 952KB

1024KB

Page 55: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

55

WHAT IS PAGE FAULT GROUPS?

==14487== Unified Memory profiling result:

Device "Tesla V100-PCIE-16GB (0)"

Count Avg Size Min Size Max Size Total Size Total Time Name

3012 21.758KB 4.0000KB 952.00KB 64.00000MB 13.49043ms Host To Device

81 - - - - 23.23181ms Gpu page fault groups

Unified Memory Virtual Address Name

8 0x3900010000 [Unified Memory GPU page faults]

9 0x3900040000 [Unified Memory GPU page faults]

5 0x3900108000 [Unified Memory GPU page faults]

5 0x3900200000 [Unified Memory GPU page faults]

nvprof --print-gpu-trace ...

Page 56: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

56

PAGE FAULTS HANDLING

128B 128B 128B 128B 128B

thread_4B

warp 0 warp 1

GPU page

warp 2

faulted

stalled stalled

warp 3

faulted

resumed resumed

faultprocessing

replayed replayed

Page 57: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

57

OPTIMIZING ON-DEMAND MIGRATION

64KB 64KB

warp_64KB

warp 0 warp 1

GPU page GPU page

Increase fault concurrency to reduce page fault stalls

128B 128B 128B 128B 128B

thread_4B

warp 0 warp 1

GPU page

fewer warps are stalled“spread-out” pattern improves prefetching

multiple faults per pagewarps are stalled on fault processing
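The two access patterns can be sketched as copy kernels. In the warp version each warp walks a contiguous 64KB block, so faults from different warps land on different pages; the kernel names and the fixed 64KB block size are assumptions for illustration:

```cuda
#include <cuda_runtime.h>

// One 4-byte element per thread: consecutive warps hit the same page
// and stall together on a single fault.
__global__ void copy_thread_4B(const int *src, int *dst, size_t n) {
    size_t i = threadIdx.x + (size_t)blockDim.x * blockIdx.x;
    if (i < n) dst[i] = src[i];
}

// One 64KB block per warp: each warp faults on its own range, so faults
// are processed concurrently and the spread-out pattern aids prefetching.
__global__ void copy_warp_64KB(const int *src, int *dst, size_t n) {
    const size_t elemsPerWarp = 65536 / sizeof(int);
    size_t warpId = (threadIdx.x + (size_t)blockDim.x * blockIdx.x) / 32;
    size_t lane   = threadIdx.x % 32;
    size_t end    = (warpId + 1) * elemsPerWarp;
    for (size_t i = warpId * elemsPerWarp + lane; i < end && i < n; i += 32)
        dst[i] = src[i];
}
```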

Page 58: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

58

OPTIMIZING ON-DEMAND MIGRATIONCount Avg Size Min Size Max Size Total Size Total Time Name

3012 21.758KB 4.0000KB 952.00KB 64.00000MB 13.49043ms Host To Device

81 - - - - 23.23181ms Gpu page fault groups

Unified Memory Virtual Address Name

8 0x3900010000 [Unified Memory GPU page faults]

9 0x3900040000 [Unified Memory GPU page faults]

5 0x3900108000 [Unified Memory GPU page faults]

Count Avg Size Min Size Max Size Total Size Total Time Name

957 68.481KB 4.0000KB 576.00KB 64.00000MB 8.242080ms Host To Device

6 - - - - 9.769984ms Gpu page fault groups

Unified Memory Virtual Address Name

1 0x39000d0000 [Unified Memory GPU page faults]

1 0x39000c0000 [Unified Memory GPU page faults]

1 0x3900080000 [Unified Memory GPU page faults]

Thread/4B

Warp/64KB

more efficient prefetching

fewer stalls

Page 59: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

59

STREAMING BENCHMARKHow fast is on-demand migration?

0

5

10

15

20

25

30

35

x86 + P100 (PCIe) x86 + V100 (PCIe) P8 + P100 (NVLINK)

GB

/s

Streaming performance

on-demand: thread/4B on-demand: warp/64K  prefetch  memcpy

128B 128B 128B 128B 128B

64KB 64KB

thread_4B

warp_64KB

warp 0 warp 1

warp 0 warp 1

Also check the Parallel Forall blog: https://devblogs.nvidia.com/maximizing-unified-memory-performance-cuda/

Page 60: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

60

0

5

10

15

20

25

10 20 30 40 50

GB

/s

Working set (GB)

Fault-based migration throughput(P8/NVLINK)

thread/4B warp/64K

thread/4B + readmostly warp/64K + readmostly

0

1

2

3

4

5

6

7

8

9

10

10 20 30 40 50

GB

/s

Working set (GB)

Fault-based migration throughput(x86/PCIe)

thread/4B warp/64K

thread/4B + readmostly warp/64K + readmostly

GPU MEMORY OVERSUBSCRIPTIONLet’s see how perf changes as we increase the working set

GPUmemory limit

GPUmemory limit

Page 61: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

61

EVICTION ALGORITHM

Driver keeps a list of physical chunks of GPU memory

Chunks from the front of the list are evicted first (LRU)

A chunk is considered “in use” when it is fully-populated or migrated

Prefetching and policies may impact eviction heuristic in the future

What Pages Are Moving Out of the GPU

eviction

allocation

or

migration

Page 62: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

62

PREFETCHING AND EVICTIONS

0

2

4

6

8

10

12

on-demandthread/4B

on-demandwarp/64K

1-way prefetch

GB

/s

Migration throughput (x86+V100)

Without eviction With evictionPrefetch can be overlapped with evictions

without using CUDA streams!(enabled in CUDA 9.1)

cudaMemcpy solution requires scheduling DtoH and HtoD copies

into two separate streams

Page 63: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

63

DRIVER ENHANCEMENTSTesla V100 + x86: HPGMG-AMR default (no hints)

0

50

100

150

200

250

1.4 4.7 8.6 28.9 58.6

MD

OF/

s

Working set (GB)

384.81 (CUDA 8.0 GA2)

387.34 (CUDA 9.0)

Improvements for the allocator

Page 64: EVERYTHING YOU NEED TO KNOW ABOUT UNIFIED MEMORY...UNIFIED MEMORY + DGX-2 UNIFIED MEMORY PROVIDES Single memory view shared by all GPUs Automatic migration of data between GPUs User

64

DRIVER ENHANCEMENTSTesla V100 + x86: HPGMG-AMR optimized (prefetches)

0

50

100

150

200

250

1.4 4.7 8.6 28.9 58.6

MD

OF/

s

Working set (GB)

384.81 (CUDA 8.0 GA2)

387.34 (CUDA 9.0)

390.30 (CUDA 9.1)

Prefetch + EvictionsOverlap

Page 65

LOCALITY AND ACCESS CONTROL

Default: data migrates on first touch

ReadMostly: data duplicated on first touch

PreferredLocation: resist migrating away from the preferred location

AccessedBy: establish direct mapping and avoid faults

cudaMemAdvise

[Diagrams: data placement between GPU0 and GPU1 under each policy]
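The four policies above map onto cudaMemAdvise flags; a minimal sketch (the helper name, device IDs, and the combination of hints are illustrative only):

```cuda
#include <cuda_runtime.h>

// Hypothetical helper showing the three advise flags on a managed range
void apply_policies(char *data, size_t N) {
    // ReadMostly: duplicate read-only data on first touch instead of migrating
    cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, 0);

    // PreferredLocation: resist migrating the data away from GPU 0
    cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, 0);

    // AccessedBy: map the data into GPU 1's page tables to avoid faults
    cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, 1);

    // Each policy has a matching Unset flag, e.g. cudaMemAdviseUnsetReadMostly
}
```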

Page 66

LULESH
CORAL OpenACC app

while not converged:
    LagrangeNodal: CalcForceForNodes, CalcAccelerationForNodes, BoundaryConditionsForNodes, CalcVelocityForNodes, CalcPositionForNodes
    LagrangeElements: CalcLagrangeElements, CalcQForElems, MaterialPropertiesForElems
    CalcTimeConstraintsForElems: CalcCourantConstraintForElems, CalcHydroConstraintForElems

a few heavy GPU kernels; many small GPU kernels; many small CPU functions

Page 67

USING VISUAL PROFILER

[Profiler timeline: 1st iteration kernel time = 17 ms; 2nd iteration kernel time = 27 ms. Why is the 2nd iteration longer, and what is the extra activity in between?]

Page 68

THRASHING MITIGATION

Before CUDA 9.2: when memory is pinned we lose any insight into access pattern

In the future we may use access counters on Volta to find a better location

Processors frequently read or write to the same page

[Diagram: CPU and GPU thrash on the same page; once memory thrashing is detected, the driver responds with page throttling or remote mapping]

Page 69

LULESH: PREFERRED LOCATION = GPU

No policies

PreferredLocation

lots of back-and-forth migrations

but no slowdown for the heavy kernels

anti-thrashing heuristic kicks in and pins the page to CPU

Page 70

LULESH: READ MOSTLY

no slowdown for the heavy kernels

almost no thrashing for the hybrid CPU/GPU part

Page 71

LULESH: PERF IMPROVEMENTS

[Chart: LULESH performance in FOM (z/s), 200^3 mesh, 100 iterations, CUDA 9.1, on Quadro GP100 and Quadro GV100, comparing Default, Preferred, ReadMostly, and Improved Default Heuristics]

GP100 has 500x more sysmem writes than GV100

Improved heuristics will be enabled in future drivers

Page 72

WHEN TO USE UNIFIED MEMORY

                            cudaMalloc                       cudaMallocManaged
Pinned allocation           cudaMalloc                       cudaMallocManaged +
                                                             PreferredLocation(GPU) +
                                                             SetAccessedBy(peer GPUs) +
                                                             cudaMemPrefetchAsync(GPU)
cudaMemcpy: ptrA -> ptrB    Staging for non-pinned           Staging or a copy kernel
                            allocations or between           required in all cases
                            non-P2P GPUs
Memory migration            Not possible                     cudaMemPrefetchAsync
Debugging                   Difficult                        Easy
Oversubscription            No                               Yes
IPC support                 Yes                              No
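The "pinned allocation" row can be sketched as code: making a managed allocation behave like a device-resident cudaMalloc buffer. The helper name, device and stream parameters are illustrative:

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: pin a managed allocation to one GPU while keeping it
// directly accessible from a peer GPU
void pin_to_gpu(size_t N, int gpu, int peerGpu, cudaStream_t s) {
    char *data;
    cudaMallocManaged(&data, N);

    // Keep the pages resident on the target GPU
    cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, gpu);
    // Let the peer GPU access it directly without faulting
    cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, peerGpu);
    // Populate the pages on the GPU up front
    cudaMemPrefetchAsync(data, N, gpu, s);
    cudaStreamSynchronize(s);

    // ... use data ...

    cudaFree(data);
}
```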

Page 73

WHEN TO USE UNIFIED MEMORY

[Flow diagram: application code alternates between sequential code (CPU) and parallel code (GPU); allocations are made with cudaMalloc or cudaMallocManaged/malloc depending on which code uses them]

Page 74

Page 75

UNIFIED MEMORY PLATFORMS

Linux + x86:
    Kepler: no GPU fault support, no concurrent access
    Pascal: on-demand migration
    Volta: on-demand migration

Linux + Power:
    Pascal: on-demand migration, 80 GB/s CPU-GPU BW*
    Volta: on-demand migration, 150 GB/s CPU-GPU BW**, access counters, HW coherency, ATS support

Windows: no GPU fault support, no concurrent access

MacOS: no GPU fault support, no concurrent access

Tegra: cached on CPU and iGPU, no concurrent access

*IBM Minsky: 4xP100 + 2xP8, 2xNVLINK1 links between P100 and P8, bi-directional aggregate BW
**IBM Newell: 4xV100 + 2xP9, 3xNVLINK2 links between V100 and P9, bi-directional aggregate BW

Page 76

READ DUPLICATION
Usage example

char *data;
cudaMallocManaged(&data, N);

init_data(data, N);                                   // CPU: init_data

cudaMemAdvise(data, N, cudaMemAdviseSetReadMostly, myGpuId);
cudaMemPrefetchAsync(data, N, myGpuId, s);
mykernel<<<..., s>>>(data, N);                        // GPU: my_kernel

use_data(data, N);                                    // CPU: use_data

cudaFree(data);

The prefetch creates a copy instead of moving the data. Both processors can read the data simultaneously without faults. Writes will collapse all copies into one; subsequent reads will fault and duplicate.

Page 77

PREFERRED LOCATION
Resisting migrations

char *data;
cudaMallocManaged(&data, N);

init_data(data, N);                                   // CPU: init_data

cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

mykernel<<<..., s>>>(data, N);                        // GPU: my_kernel

use_data(data, N);                                    // CPU: use_data

cudaFree(data);

The kernel will page fault and generate a direct mapping to the data on the CPU. The driver will "resist" migrating data away from the preferred location.

Page 78

PREFERRED LOCATION
Page population on first touch

char *data;
cudaMallocManaged(&data, N);

cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);

mykernel<<<..., s>>>(data, N);                        // GPU: my_kernel

use_data(data, N);                                    // CPU: use_data

cudaFree(data);

The kernel will page fault, populate the pages on the CPU, and generate a direct mapping to the data on the CPU. Pages are populated on the preferred location if the faulting processor can access it.

Page 79

PREFERRED LOCATION ON P9+V100
CPU can directly access GPU memory

char *data;
cudaMallocManaged(&data, N);

init_data(data, N);                                   // CPU: init_data

cudaMemAdvise(data, N, cudaMemAdviseSetPreferredLocation, gpuId);

mykernel<<<..., s>>>(data, N);                        // GPU: my_kernel

use_data(data, N);                                    // CPU: use_data

cudaFree(data);

The kernel will page fault and migrate the data to the GPU. The CPU will fault and access the data directly instead of migrating it (on non-P9+V100 systems the driver will migrate it back to the CPU).

Page 80

ACCESSED BY
Usage example

char *data;
cudaMallocManaged(&data, N);

init_data(data, N);                                   // CPU: init_data

cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);

mykernel<<<..., s>>>(data, N);                        // GPU: my_kernel

use_data(data, N);                                    // CPU: use_data

cudaFree(data);

The GPU will establish a direct mapping of the data in CPU memory; no page faults will be generated. The memory can move freely to other processors and the mapping will carry over.

Page 81

ACCESSED BY
Using access counters on Volta

char *data;
cudaMallocManaged(&data, N);

init_data(data, N);                                   // CPU: init_data

cudaMemAdvise(data, N, cudaMemAdviseSetAccessedBy, myGpuId);

mykernel<<<..., s>>>(data, N);                        // GPU: my_kernel

use_data(data, N);                                    // CPU: use_data

cudaFree(data);

The GPU will establish a direct mapping of the data in CPU memory; no page faults will be generated. Access counters may eventually trigger migration of frequently accessed pages to the GPU.

Page 82

MANAGED VS MALLOC ON VOLTA+P9
First-touch allocation policy*

ptr = cudaMallocManaged(size);
doStuffOnGpu<<<...>>>(ptr, size);

GPU page faults; the Unified Memory driver allocates on the GPU; the GPU accesses GPU memory

ptr = malloc(size);
doStuffOnGpu<<<...>>>(ptr, size);

GPU uses ATS and faults; the OS allocates on the CPU (by default); the GPU uses ATS to access CPU memory

*You may alter this behavior by using cudaMemAdvise policies

Page 83

MANAGED VS MALLOC ON P9
cudaMallocManaged: same behavior as x86

ptr = cudaMallocManaged(size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size);   // GPU page faults, ptr migrated to GPU
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size);            // CPU page faults, ptr migrated to CPU

Page 84

MANAGED VS MALLOC ON P9
malloc: no on-demand migrations*

ptr = malloc(size);
fillData(ptr, size);
doStuffOnGpu<<<...>>>(ptr, size);   // GPU uses ATS to access CPU memory
                                    // (no on-demand migration except via cudaMemPrefetchAsync*)
cudaDeviceSynchronize();
doStuffOnCpu(ptr, size);            // CPU accesses CPU memory

*In the future, Volta access counters will be used to migrate malloc memory

Page 85

HYPRE-INSPIRED USE CASE

Algebraic Multi-Grid library: https://github.com/LLNL/hypre

Lots of small allocations: multiple variables may end up on the same page

If used by different processors this will result in false-sharing

[Diagram: eight 512B suballocations packed into a single 4KB page, some accessed by the GPU and some by the CPU]

Page 86

FALSE-SHARING

Issues with false-sharing:

- Spurious migrations; the thrashing mitigation does not solve this

- Performance hints are applied at page granularity, so suballocated data may inherit the wrong policies

How to mitigate this:

- Use separate allocators or memory pools for CPU and GPU
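A minimal sketch of that mitigation (the pool type, sizes, and alignment are hypothetical): carve CPU-side and GPU-side suballocations out of distinct managed regions, so variables used by different processors never share a page.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Trivial bump allocator over a dedicated managed region; keep one pool per
// processor so small allocations for the CPU and GPU land on different pages
struct Pool {
    char *base = nullptr;
    size_t used = 0, cap = 0;

    void init(size_t bytes, int device) {
        cap = bytes;
        cudaMallocManaged(&base, cap);
        // Bind the whole region to one processor; its suballocations then
        // never share a page with the other processor's data
        cudaMemAdvise(base, cap, cudaMemAdviseSetPreferredLocation, device);
    }
    void *alloc(size_t bytes) {
        size_t aligned = (bytes + 511) & ~size_t(511);  // 512B granularity
        if (used + aligned > cap) return nullptr;
        void *p = base + used;
        used += aligned;
        return p;
    }
};

// Usage: gpuPool.init(1 << 26, 0); cpuPool.init(1 << 26, cudaCpuDeviceId);
```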