Transcript of An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems

Page 1: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems

ASPLOS 2010 -- Pittsburgh


An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems

Isaac Gelado, Javier Cabezas, John Stone, Sanjay Patel, Nacho Navarro and Wen-mei Hwu

3/17/2010

Page 2: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


1. Introduction: Heterogeneous Computing


• Heterogeneous parallel systems: the CPU runs sequential, control-intensive code; accelerators run massively data-parallel code

• Existing programming models are DMA-based: explicit memory copies and programmer-managed memory coherence

[Diagram: CPU and accelerator (ACC) with separate memories and explicit IN/OUT data copies between them]

Page 3: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


Outline

1. Introduction
2. Motivation
3. ADSM: Asymmetric Distributed Shared Memory
4. GMAC: Global Memory for ACcelerators
5. Experimental Results
6. Conclusions


Page 4: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


2.1 Motivation: Reference System


[Diagram: reference system. A multi-core CPU (N cores) with its own system RAM (low latency, strong consistency, small page size) is connected through a PCIe bus to a GPU-like accelerator with its own device RAM (high bandwidth, weak consistency, large page size)]

Page 5: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


2.2 Motivation: Memory Requirements

• High memory bandwidth requirements
• Non fully-coherent systems: long-latency coherence traffic, different coherence protocols
• Accelerator memory keeps growing (e.g. 6 GB on NVIDIA Fermi, 16 GB on PowerXCell 8i)


Page 6: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


2.3 Motivation: DMA-Based Programming


• Duplicated Pointers

• Explicit Coherence Management

• CUDA Sample Code

[Diagram: duplicated pointers; the same data exists as foo in CPU memory and as a separate dev_foo copy in GPU memory]

void compute(FILE *file, int size) {
    float *foo, *dev_foo;
    foo = malloc(size);                        /* CPU copy of the data  */
    fread(foo, size, 1, file);
    cudaMalloc(&dev_foo, size);                /* separate GPU copy     */
    cudaMemcpy(dev_foo, foo, size, cudaMemcpyHostToDevice);
    kernel<<<Dg, Db>>>(dev_foo, size);
    cudaMemcpy(foo, dev_foo, size, cudaMemcpyDeviceToHost);
    cpuComputation(foo);
    cudaFree(dev_foo);
    free(foo);
}

Page 7: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


3.1 ADSM: Unified Virtual Address Space

• Unified virtual shared address space
• CPU: accesses both system and accelerator memory
• Accelerator: accesses only its own memory
• Under ADSM, both use the same virtual address when referencing a shared object

[Diagram: unified virtual address space. bar and baz live only in system memory; the shared data object foo appears at the same virtual address in both system memory (CPU) and device memory (ACC)]

Page 8: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


3.2 ADSM: Simplified Code

• Simpler CPU code than in DMA-based programming models
• Hardware-independent code
• The slide's callouts (single pointer, data assignment, peer DMA, legacy support) appear as comments in the code below

void compute(FILE *file, int size) {
    float *foo;
    foo = adsmMalloc(size);            /* single pointer, valid on CPU and accelerator */
    fread(foo, size, 1, file);         /* data assignment; peer DMA for I/O            */
    kernel<<<Dg, Db>>>(foo, size);
    cpuComputation(foo);               /* legacy CPU code uses the same pointer        */
    adsmFree(foo);
}

[Diagram: a single foo pointer used by both the CPU and the GPU]

Page 9: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


3.3 ADSM: Memory Distribution

• Asymmetric Distributed Shared Memory principles:
  • The CPU accesses objects in accelerator memory, but not vice versa
  • All coherence actions are performed by the CPU
• Thrashing is unlikely to happen:
  • Synchronization variables: handled by interrupt-based and dedicated hardware mechanisms
  • False sharing: the sharing granularity is the whole data object


Page 10: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


3.4 ADSM: Consistency and Coherence

• Release consistency:
  • Consistency is only relevant from the CPU perspective
  • Implicit release/acquire at accelerator call/return (sketched at the end of this slide)

[Diagram: ownership of the shared object Foo moves from the CPU to the accelerator at an accelerator call, and back to the CPU at accelerator return]

• Memory coherence:
  • Data ownership information enables eager data transfers
  • The CPU maintains coherence
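The implicit release/acquire at accelerator call/return can be illustrated with a minimal sketch. This is not the GMAC source: kernel, Dg and Db come from the slides, while flush_to_device() and invalidate_on_host() are hypothetical helpers standing in for the runtime's coherence actions.

#include <cuda_runtime.h>

extern __global__ void kernel(float *data, int size);
extern void flush_to_device(float *data, int size);     /* copy dirty blocks CPU -> ACC */
extern void invalidate_on_host(float *data, int size);  /* mark CPU blocks Invalid      */

void accelerator_call(float *shared, int size, dim3 Dg, dim3 Db)
{
    flush_to_device(shared, size);     /* implicit release: accelerator takes ownership */
    kernel<<<Dg, Db>>>(shared, size);  /* accelerator call                              */
    cudaDeviceSynchronize();           /* accelerator return                            */
    invalidate_on_host(shared, size);  /* implicit acquire: CPU regains ownership       */
}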

Page 11: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4. Global Memory for Accelerators

• ADSM implementation

• User-level shared library

• GNU / Linux Systems

• NVIDIA CUDA GPUs


Page 12: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.1 GMAC: Overall Design

• Layered design:
  • Multiple memory consistency protocols
  • Operating system and accelerator independent code

• Layers:
  • CUDA-like front-end
  • Memory manager (different policies)
  • Kernel scheduler (FIFO)
  • Operating system abstraction layer
  • Accelerator abstraction layer (CUDA)


Page 13: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.2 GMAC: Unified Address Space

[Diagram: the GPU physical address space mapped into the system virtual address space]

• The unified virtual address space is formed by the GPU and system physical memories

• The GPU memory address range cannot be selected

• GMAC allocates the same virtual address range in both the GPU and the CPU (see the sketch below)

• Virtual memory on the accelerator would ease this process
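A minimal sketch of the idea behind this mapping, not the actual GMAC code: allocate the object on the GPU first, then ask the OS to map host memory at the very same virtual address, so a single pointer is valid on both sides. gmac_alloc_shared() is a hypothetical name.

#include <cuda_runtime.h>
#include <sys/mman.h>
#include <stddef.h>

void *gmac_alloc_shared(size_t size)
{
    void *dev_ptr = NULL;
    if (cudaMalloc(&dev_ptr, size) != cudaSuccess)   /* the GPU picks the address range */
        return NULL;

    /* Ask the OS for host memory at the same address; the address is only a
     * hint, so the request can fail if that range is already in use. */
    void *host_ptr = mmap(dev_ptr, size, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (host_ptr != dev_ptr) {
        if (host_ptr != MAP_FAILED)
            munmap(host_ptr, size);
        cudaFree(dev_ptr);
        return NULL;                                 /* caller may retry */
    }
    return host_ptr;                                 /* same value as dev_ptr */
}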

Page 14: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.3 GMAC: Coherence Protocols

• Batch-Update: copy all shared objects
• Lazy-Update: copy only modified / needed shared objects
  • Data object granularity
  • Detects CPU read/write accesses to shared objects (one detection mechanism is sketched below)
• Rolling-Update: copy only modified / needed memory
  • Memory block size granularity
  • Fixed maximum number of modified blocks in system memory; data is flushed when the maximum is reached
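One common user-level way to detect CPU accesses to shared objects is page protection plus a segmentation-fault handler. The sketch below shows that mechanism under this assumption; it is illustrative, not the GMAC source.

#include <signal.h>
#include <sys/mman.h>
#include <stdlib.h>

static void  *obj_start;   /* base of one tracked shared object (page aligned) */
static size_t obj_size;

static void fault_handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    char *addr = (char *)info->si_addr;
    if (addr >= (char *)obj_start && addr < (char *)obj_start + obj_size) {
        /* The CPU touched a shared object: record it as modified/needed,
         * then unprotect the pages so the faulting access can complete. */
        mprotect(obj_start, obj_size, PROT_READ | PROT_WRITE);
    } else {
        abort();           /* a real segmentation fault, not one of ours */
    }
}

void track_object(void *start, size_t size)
{
    obj_start = start;
    obj_size  = size;

    struct sigaction sa = {0};
    sa.sa_sigaction = fault_handler;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);

    /* Revoke CPU access: the next read or write raises SIGSEGV. */
    mprotect(obj_start, obj_size, PROT_NONE);
}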


Page 15: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


5.1 Results: GMAC vs. CUDA


• Batch-Update overheads:
  – Copies output data on call
  – Copies unused data

• Similar performance for CUDA, Lazy-Update and Rolling-Update

Page 16: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


5.2 Results: Lazy vs. Rolling on 3D Stencil

• Extra data copies for small data objects

• Trade-off between bandwidth and page-fault overhead


Page 17: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


6. Conclusions

• Unified virtual shared address space simplifies programming of heterogeneous systems

• Asymmetric Distributed Shared Memory:
  • The CPU accesses accelerator memory, but not vice versa
  • Coherence actions are executed only by the CPU

• Experimental results show no performance degradation

• Memory translation in accelerators is key to implementing ADSM efficiently


Page 18: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


Thank you for your attention

Eager to start using GMAC? http://code.google.com/p/adsm/

[email protected]@googlegroups.com


Page 19: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


Backup Slides


Page 20: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.4 GMAC: Memory Mapping

• Software: allocate a different address range and provide a translation function (gmacSafePtr(); a sketch follows at the end of this slide)

• Hardware: implement virtual memory in the GPU


[Diagram: system virtual address space and GPU physical address space]

• Allocation might fail if the range is in use
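The sketch below shows what a software translation in the spirit of gmacSafePtr() might look like, under an assumed bookkeeping layout; safe_ptr() and the mapping table are illustrative, not the GMAC implementation.

#include <stddef.h>

struct mapping {
    char  *host_base;   /* address range the CPU uses */
    char  *dev_base;    /* matching range on the GPU  */
    size_t size;
};

#define MAX_MAPPINGS 256
static struct mapping table[MAX_MAPPINGS];
static int n_mappings;

/* Translate a host pointer inside a shared object into the corresponding
 * device pointer before passing it to a kernel. */
void *safe_ptr(void *host_ptr)
{
    char *p = (char *)host_ptr;
    for (int i = 0; i < n_mappings; i++) {
        if (p >= table[i].host_base &&
            p <  table[i].host_base + table[i].size)
            return table[i].dev_base + (p - table[i].host_base);
    }
    return host_ptr;   /* not a tracked object: pass through unchanged */
}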

Page 21: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.5 GMAC: Protocol States

• Protocol States: Invalid, Read-only, Dirty


[State diagrams: a three-state automaton (Invalid, Read-Only, Dirty) driven by accelerator Call/Return, CPU Read/Write and Flush events, and a reduced two-state automaton (Invalid, Dirty) driven only by Call and Return]

• Batch-Update: Call / Return
• Lazy-Update: Call / Return, Read / Write
• Rolling-Update: Call / Return, Read / Write, Flush
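The protocol can be summarized as a small automaton over one memory block. The exact transitions below are an approximation reconstructed from the slide, not the GMAC source.

enum state { INVALID, READ_ONLY, DIRTY };
enum event { ACC_CALL, ACC_RETURN, CPU_READ, CPU_WRITE, FLUSH };

/* Next protocol state for one memory block, given the event that hit it. */
enum state next_state(enum state s, enum event e)
{
    switch (e) {
    case ACC_CALL:   return INVALID;                            /* dirty data flushed; accelerator owns the block */
    case ACC_RETURN: return INVALID;                            /* device results not yet fetched by the CPU      */
    case CPU_READ:   return (s == DIRTY) ? DIRTY : READ_ONLY;   /* fetch from device if Invalid                   */
    case CPU_WRITE:  return DIRTY;                              /* fetch if Invalid, then mark modified           */
    case FLUSH:      return (s == DIRTY) ? READ_ONLY : s;       /* Rolling-Update write-back to the device        */
    }
    return s;
}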

Page 22: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


4.6 GMAC: Rolling vs. Lazy


• Batch-Update: transfer on kernel call

• Rolling-Update: transfer while the CPU computes
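A sketch of the overlap Rolling-Update exploits, under the assumption that the shared data is produced block by block on the CPU; produce_block() is a hypothetical stand-in for the application code, and the host buffer is assumed to be pinned (e.g. allocated with cudaHostAlloc) so the asynchronous copy proceeds while the CPU keeps working.

#include <cuda_runtime.h>

extern void produce_block(float *block, size_t n);   /* hypothetical CPU work */

void rolling_transfer(float *host, float *dev, size_t n_blocks, size_t block_elems)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (size_t b = 0; b < n_blocks; b++) {
        float *h = host + b * block_elems;
        produce_block(h, block_elems);               /* CPU fills this block */

        /* Eagerly push the finished block; the copy overlaps with the CPU
         * producing the next block. */
        cudaMemcpyAsync(dev + b * block_elems, h,
                        block_elems * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
    }
    cudaStreamSynchronize(stream);   /* every block is on the device: safe to launch the kernel */
    cudaStreamDestroy(stream);
}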

Page 23: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


5.3 Results: Break-down of Execution


Page 24: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


5.4 Results: Rolling Size vs. Block Size

• No appreciable effect on most benchmarks


• Small Rolling size leads to performance aberrations

• Prefer relatively large rolling sizes

Page 25: An Asymmetric Distributed Shared Memory Model for Heterogeneous Parallel Systems


6.1 Conclusions: Wish-list

• GPU anonymous memory mappings:
  • GPU-to-CPU mappings never fail
  • Dynamic memory re-allocations
• GPU dynamic pinned memory:
  • No intermediate data copies on flush
• Peer DMA:
  • Speeds up I/O operations
  • No intermediate copies for GPU-to-GPU transfers
