Transcript of "Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided"
ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER
SPCL, ETH Zurich (spcl.inf.ethz.ch, @spcl_eth)
MPI-3.0 REMOTE MEMORY ACCESS
MPI-3.0 supports RMA ("MPI One Sided")
Designed to react to hardware trends
The majority of HPC networks support RDMA
Communication is "one sided" (no involvement of the destination)
RMA decouples communication & synchronization
Different from message passing
[Diagram: two sided (Proc A: send, Proc B: recv) combines communication + synchronization; one sided (Proc A: put) needs no matching call on Proc B, and synchronization (sync) is a separate step]
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
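For orientation, a minimal sketch of the one-sided style in standard MPI-3 C (window setup and the fence-based synchronization shown here are only one of the modes discussed later; run with at least two ranks):

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double *base;
  MPI_Win win;
  /* every rank exposes one double through an MPI window */
  MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &base, &win);
  *base = 0.0;

  MPI_Win_fence(0, win);                 /* open an epoch */
  if (rank == 0) {
    double val = 42.0;
    /* write directly into rank 1's window; rank 1 posts no receive */
    MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
  }
  MPI_Win_fence(0, win);                 /* synchronization completes the put */

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}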
PRESENTATION OVERVIEW
1. Overview of three MPI-3 RMA concepts
2. MPI window creation
3. Communication
4. Synchronization
5. Application evaluation
MPI-3 RMA COMMUNICATION OVERVIEW
Non-atomic communication calls (Put, Get)
Atomic communication calls (Acc, Get & Acc, CAS, FAO)
[Diagram: passive Process A exposes memory through an MPI window; active Processes B, C, D, ... target it with Put, Get, and atomic operations; other processes expose windows as well]
MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
Active Target Mode: Fence, Post/Start/Complete/Wait
Passive Target Mode: Lock, Lock All
[Diagram: an active process performs synchronization and communication with a passive process]
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Scalable & generic protocols
Can be used on any RDMA network (e.g., OFED/IB)
Cover window creation, communication, and synchronization
foMPI, a fully functional MPI-3 RMA implementation
DMAPP: lowest-level networking API for Cray Gemini/Aries systems
XPMEM: a portable Linux kernel module
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
PART 1: SCALABLE WINDOW CREATION
Traditional windows (backwards compatible with MPI-2)
Time bound: O(p), memory bound: O(p), where p = total number of processes
[Diagram: Processes A, B, and C each expose local memory at a different base address (0x111, 0x123, 0x120)]
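As a sketch of this variant, the traditional creation routine takes an already-allocated local buffer (buffer name and size below are illustrative):

double buf[1024];                        /* pre-existing local memory */
MPI_Win win;
MPI_Win_create(buf, 1024 * sizeof(double), sizeof(double),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);
/* ... RMA epochs ... */
MPI_Win_free(&win);

Because every process may pass a different local address, remote accesses need per-target base addresses, which is where the O(p) bounds come from.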
PART 1: SCALABLE WINDOW CREATION
Allocated windows: allows MPI to allocate the memory
Time bound: O(log p) (with high probability), memory bound: O(1), where p = total number of processes
[Diagram: Processes A, B, and C all expose memory at the same base address (0x123)]
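A matching sketch with the allocation handled by MPI (size illustrative):

double *base;                            /* MPI chooses and returns the base address */
MPI_Win win;
MPI_Win_allocate(1024 * sizeof(double), sizeof(double), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &base, &win);
/* ... RMA epochs ... */
MPI_Win_free(&win);

Letting MPI place the memory allows symmetric allocations, which is what the identical base address 0x123 in the diagram illustrates and what keeps the per-process metadata constant.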
PART 1: SCALABLE WINDOW CREATION
Dynamic windows: local attach/detach, most flexible
Time bound: O(p), memory bound: O(p), where p = total number of processes
[Diagram: Processes A, B, and C attach regions at arbitrary addresses (0x111, 0x120, 0x123, 0x129, ...)]
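A sketch of the dynamic variant (region size illustrative); the window starts empty and memory is attached locally as needed:

MPI_Win win;
MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

double *region;
MPI_Alloc_mem(1024 * sizeof(double), MPI_INFO_NULL, &region);
MPI_Win_attach(win, region, 1024 * sizeof(double));
/* targets address the region via its absolute address (obtained with MPI_Get_address) */
MPI_Win_detach(win, region);
MPI_Free_mem(region);
MPI_Win_free(&win);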
PART 2: COMMUNICATION
Put and Get: direct DMAPP put and get operations, or local (blocking) memcpy (XPMEM)
Accumulate: DMAPP atomic operations for 64-bit types, or fall back to a remote locking protocol
MPI datatype handling with the MPITypes library [1]
Fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE)
[Diagram: for contiguous memory, MPI_Put maps to dmapp_put_nbi and MPI_Compare_and_swap maps to dmapp_acswap_qw_nbi, issued directly to the remote process]
[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM'09
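At the MPI level these are the standard RMA communication calls; a minimal sketch inside a passive-target epoch (window, target rank, and displacements are illustrative):

#include <mpi.h>

/* window "win" and rank "target" are assumed to exist already */
void rma_calls(MPI_Win win, int target) {
  long val = 1, add = 5, expected = 0, desired = 1, result;
  MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
  MPI_Put(&val, 1, MPI_LONG, target, 0, 1, MPI_LONG, win);                      /* non-atomic put */
  MPI_Accumulate(&add, 1, MPI_LONG, target, 1, 1, MPI_LONG, MPI_SUM, win);      /* atomic accumulate */
  MPI_Compare_and_swap(&desired, &expected, &result, MPI_LONG, target, 2, win); /* CAS */
  MPI_Fetch_and_op(&val, &result, MPI_LONG, target, 3, MPI_SUM, win);           /* fetch-and-op (FAO) */
  MPI_Win_unlock(target, win);                                                  /* completes all of the above */
}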
PERFORMANCE INTER-NODE: LATENCY
Put inter-node: 20% faster; Get inter-node: 80% faster
[Benchmark: half ping-pong, Proc 0 puts to Proc 1 and then synchronizes memory]
PERFORMANCE INTRA-NODE: LATENCY
Put/Get intra-node: 3x faster
[Benchmark: half ping-pong, Proc 0 puts to Proc 1 and then synchronizes memory]
PERFORMANCE: OVERLAP
Inter-node overlap in %
Useful for, e.g., scientific codes: 3D FFT, MILC, AWM-Olsen seismic
[Benchmark: Proc 0 issues a put to Proc 1, computes, then synchronizes memory]
PERFORMANCE: MESSAGE RATE
Inter-node and intra-node
[Benchmark: Proc 0 issues many puts to Proc 1, then synchronizes memory]
PERFORMANCE: ATOMICS
64-bit integers
[Chart annotations: hardware-accelerated protocol (lower latency, higher bandwidth), fall-back protocol, proprietary]
PART 3: SYNCHRONIZATION
Active Target Mode: Fence, Post/Start/Complete/Wait
Passive Target Mode: Lock, Lock All
[Diagram: an active process performs synchronization and communication with a passive process]
SCALABLE FENCE IMPLEMENTATION
Collective call
Completes all outstanding memory operations
[Diagram: Procs 0-3 on Node 0 and Node 1 issue puts to one another; the fence completes all of them]

int MPI_Win_fence(...) {
  asm("mfence");        /* local completion of intra-node (XPMEM) operations */
  dmapp_gsync_wait();   /* local completion of inter-node (DMAPP) operations */
  MPI_Barrier(...);     /* global completion */
  return MPI_SUCCESS;
}
SCALABLE FENCE PERFORMANCE
Time bound: O(log p)
Memory bound: O(1)
90% faster
PSCW SYNCHRONIZATION
Posting process: post ... wait delimits the exposure epoch (allows access from other processes)
Starting process: start ... complete delimits the access epoch (allows it to access other processes)
Puts are issued inside the access epoch
post/start and complete/wait are paired by a matching algorithm
[Diagram: Proc 0 and Proc 1; the starting process's start/complete epoch is matched against the posting process's post/wait epoch]
PSCW SYNCHRONIZATION
[Diagram: six processes (Proc 0-5); some post/wait exposure epochs while others open start/complete access epochs, with several pairings active at the same time]
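A minimal sketch of these calls in standard MPI-3 C (assumes an existing window win, a double buffer buf, and exactly two ranks, with rank 0 starting and rank 1 posting):

void pscw_example(MPI_Win win, int rank, double *buf) {
  MPI_Group world_grp, peer_grp;
  MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

  if (rank == 1) {                      /* posting process: exposure epoch */
    int peer = 0;
    MPI_Group_incl(world_grp, 1, &peer, &peer_grp);
    MPI_Win_post(peer_grp, 0, win);     /* allow access from rank 0 */
    MPI_Win_wait(win);                  /* return once all access epochs completed */
  } else if (rank == 0) {               /* starting process: access epoch */
    int peer = 1;
    MPI_Group_incl(world_grp, 1, &peer, &peer_grp);
    MPI_Win_start(peer_grp, 0, win);    /* match against the post */
    MPI_Put(buf, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_complete(win);              /* close the access epoch, completing the put */
  }
}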
PSCW SCALABLE POST/START MATCHING
In general, there can be n posting and m starting processes
In this example there is one posting process and 4 starting processes
[Diagram: posting process i opens its window; starting processes j1, ..., j4 access the remote window]
PSCW SCALABLE POST/START MATCHING
Each starting process has a local list
Posting process i adds its rank i to the list at each starting process j1, ..., j4
Each starting process j waits until the rank of the posting process i is present in its local list
[Diagram: rank i appears, one by one, in the local lists of j1, ..., j4]
PSCW SCALABLE COMPLETE/WAIT MATCHING
Each starting process increments a counter stored at the posting process
When the counter is equal to the number of starting processes, the posting process returns from wait
[Diagram: the counter at posting process i goes from 0 to 4 as j1, ..., j4 increment it]
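The same matching idea can be pictured with standard MPI atomics. This is only an illustrative sketch: foMPI implements the protocol with DMAPP atomics, and the control window ctrl_win, the single list slot at displacement 0 (enough for one posting process, as in the example above), and the counter at displacement 1 are assumptions of the sketch. A lock_all epoch on ctrl_win is assumed to be open.

#include <mpi.h>

/* Post: posting process i announces its rank at each of the m starting processes.
   Wait: later, i returns once its own counter (disp 1) has reached m. */
static void pscw_post(int i, const int *starting, int m, MPI_Win ctrl_win) {
  long my_rank = i, old;
  for (int n = 0; n < m; ++n) {
    MPI_Fetch_and_op(&my_rank, &old, MPI_LONG, starting[n], 0, MPI_REPLACE, ctrl_win);
    MPI_Win_flush(starting[n], ctrl_win);
  }
}

/* Start: a starting process spins on its local list slot until rank i shows up
   (local polling omitted). Complete: it then increments the counter at i. */
static void pscw_complete(int i, MPI_Win ctrl_win) {
  long one = 1, prev;
  MPI_Fetch_and_op(&one, &prev, MPI_LONG, i, 1, MPI_SUM, ctrl_win);
  MPI_Win_flush(i, ctrl_win);
}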
PSCW PERFORMANCE
Time bound: P_start = P_wait = O(1), P_post = P_complete = O(log p)
Memory bound: O(log p) (for scalable programs)
[Benchmark: ring topology]
SCALABLE LOCK SYNCHRONIZATION
Lock/Unlock (shared or exclusive) and Lock All (always shared): an active process locks a passive process
Two-level lock hierarchy:
Local state at each process: a shared counter and an exclusive bit
Global state at a master process: a shared counter and an exclusive counter
EXCLUSIVE LOCAL LOCK: TWO PHASES
Proc 2 wants to lock Proc 1 exclusively (MPI_Win_lock(EXCL, 1))
PHASE 1: increment the global exclusive counter with a fetch-add at the master process
(Invariant 1: no global shared lock held concurrently)
PHASE 2: set the exclusive bit at the target with a compare-and-swap
(Invariant 2: no local shared/exclusive lock held concurrently)
SHARED LOCAL LOCK: ONE PHASE
Proc 0 wants to lock Proc 1 (MPI_Win_lock(SHRD, 1))
Increment the local shared counter at the target with a fetch-add
(Invariant: no local exclusive lock on this process held concurrently)
SHARED GLOBAL LOCK: ONE PHASE
Proc 2 wants to lock the whole window (MPI_Win_lock_all())
Increment the global shared counter at the master process with a fetch-add
(Invariant: no local exclusive lock is held concurrently)
Constant number of operations, independent of the number of processes p
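An illustrative sketch of the exclusive path written with MPI atomics rather than DMAPP. The packing of the counters and the bit into single 64-bit words, the master rank, the displacements, and the window lock_win are assumptions of this sketch, and the check of the fetch-add result against invariant 1 is omitted for brevity; a lock_all epoch on lock_win is assumed to be open.

#include <mpi.h>

#define MASTER 0  /* rank holding the global lock word (assumption) */

static void lock_exclusive(int target, MPI_Win lock_win) {
  long one = 1, minus_one = -1, zero = 0, prev, result;
  for (;;) {
    /* Phase 1: announce the exclusive lock by incrementing the global
       exclusive counter at the master process. */
    MPI_Fetch_and_op(&one, &prev, MPI_LONG, MASTER, 0, MPI_SUM, lock_win);
    MPI_Win_flush(MASTER, lock_win);

    /* Phase 2: try to take the target's local exclusive bit with a CAS. */
    MPI_Compare_and_swap(&one, &zero, &result, MPI_LONG, target, 0, lock_win);
    MPI_Win_flush(target, lock_win);
    if (result == 0) return;            /* lock acquired */

    /* Backoff (see the backup slide): undo phase 1, then retry. */
    MPI_Fetch_and_op(&minus_one, &prev, MPI_LONG, MASTER, 0, MPI_SUM, lock_win);
    MPI_Win_flush(MASTER, lock_win);
  }
}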
FLUSH SYNCHRONIZATION
Time bound: O(1), memory bound: O(1)
Guarantees remote completion
Issues a remote bulk synchronization and an x86 mfence
One of the most performance-critical functions; we add only 78 x86 CPU instructions to the critical path
[Diagram: Process 0 issues several inc(counter) operations on Process 1; after the flush all of them are visible (counter: 3)]
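As a usage sketch (standard MPI-3 calls; window, target rank, and displacement are illustrative), flush makes the increments visible at the target before the origin moves on:

void flush_example(MPI_Win win) {
  long one = 1, prev;
  MPI_Win_lock_all(0, win);
  for (int k = 0; k < 3; ++k)           /* inc(counter) three times at process 1 */
    MPI_Fetch_and_op(&one, &prev, MPI_LONG, 1, 0, MPI_SUM, win);
  MPI_Win_flush(1, win);                /* remote completion: counter == 3 at process 1 */
  /* ... continue while the epoch stays open ... */
  MPI_Win_unlock_all(win);
}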
PERFORMANCE
Evaluation on the Blue Waters system
22,640 Cray XE6 compute nodes
724,480 schedulable cores
All microbenchmarks, 4 applications, and one nearly full-scale run
PERFORMANCE: MOTIF APPLICATIONS
Key/Value Store: random inserts per second
Dynamic Sparse Data Exchange (DSDE) with 6 neighbors
PERFORMANCE: APPLICATIONS
NAS 3D FFT [1] performance (scaled to 65k processes); MILC [2] application execution time (scaled to 512k processes)
Annotations represent the performance gain of foMPI over Cray MPI-1.
[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS'09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS'12
CONCLUSIONS & SUMMARY
1. MPI window creation routines
2. Non-atomic & atomic communication
3. Fence / PSCW
4. Locks
5. foMPI reference implementation
ACKNOWLEDGMENTS
Thanks to: Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team), and the MPI Forum RMA WG ...
... and the institutions.
Thank you for your attention
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
BACKUP SLIDES
DYNAMIC WINDOW CREATION
[Diagram: Processes A, B, and C start with no attached regions (counts 0/0/0); as regions at addresses such as 0x111, 0x120, 0x123, and 0x129 are attached, each process's attach count grows (e.g., 2/1/2)]
Remote access protocol: the origin caches the target's attach count (id) together with the list of attached regions
Before an access, the origin fetches the target's current id with a Get(id)
If the fetched id matches the cached id, the origin accesses the window directly
If it differs, the origin first updates its cached list, then accesses the window
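An illustrative sketch of this check using MPI calls; the metadata window meta_win, the displacement ID_DISP, the cached_id array, and the fetch_region_list helper are hypothetical names for this sketch (foMPI performs the equivalent with its own metadata layout), and a passive-target epoch on meta_win is assumed to be open.

long remote_id;
MPI_Get(&remote_id, 1, MPI_LONG, target, ID_DISP, 1, MPI_LONG, meta_win);
MPI_Win_flush(target, meta_win);        /* make remote_id valid locally */

if (remote_id != cached_id[target]) {
  /* the target attached or detached memory since our last access:
     refresh the cached list of regions before touching the window */
  fetch_region_list(target, meta_win);  /* hypothetical helper */
  cached_id[target] = remote_id;
}
/* ids match (or were just refreshed): issue the RMA operation */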
PERFORMANCE INTER-NODE: LATENCY (SHMEM)
Put inter-node, Get inter-node
[Benchmark: half ping-pong, Proc 0 puts to Proc 1 and then synchronizes memory]
PERFORMANCE INTRA-NODE: LATENCY (SHMEM)
Put/Get intra-node
[Benchmark: half ping-pong, Proc 0 puts to Proc 1 and then synchronizes memory]
EXCLUSIVE LOCAL LOCK: TWO PHASES (BACKOFF)
Proc 2 wants to lock Proc 1 exclusively
BACKOFF: if the local exclusive bit cannot be acquired, decrement the global exclusive counter (Add(-1) at the master process) ... then retry
PERFORMANCE MODELING
Performance functions for synchronization protocols:
Fence: P_fence = 2.9 µs * log2(p)
PSCW: P_start = 0.7 µs, P_wait = 1.8 µs, P_post = P_complete = 350 ns * k
Locks: P_lock,excl = 5.4 µs; P_lock,shrd = P_lock_all = 2.7 µs; P_unlock = P_unlock_all = 0.4 µs
Flush/Sync: P_flush = 76 ns; P_sync = 17 ns
Performance functions for communication protocols:
Put/Get: P_put = 0.16 ns * s + 1 µs; P_get = 0.17 ns * s + 1.9 µs
Atomics: P_acc,sum = 28 ns * s + 2.4 µs; P_acc,min = 0.8 ns * s + 7.3 µs
(p = number of processes, k = number of PSCW neighbor processes, s = message size in bytes)