Transcript of "Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided"
ROBERT GERSTENBERGER, MACIEJ BESTA, TORSTEN HOEFLER
SPCL, ETH Zurich (spcl.inf.ethz.ch, @spcl_eth)
MPI-3.0 REMOTE MEMORY ACCESS
MPI-3.0 supports RMA ("MPI One Sided")
Designed to react to hardware trends
The majority of HPC networks support RDMA
Communication is "one sided" (no involvement of the destination)
RMA decouples communication & synchronization
Different from message passing
[Diagram: two sided (Proc A: send, Proc B: recv) combines communication + synchronization; one sided (Proc A: put) needs no matching call on Proc B, and synchronization (sync) is a separate step]
[1] http://www.mpi-forum.org/docs/mpi-3.0/mpi30-report.pdf
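For orientation, a minimal sketch of the one-sided style in standard MPI-3 C (window setup and the fence-based synchronization shown here are only one of the modes discussed later; run with at least two ranks):

#include <mpi.h>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  double *base;
  MPI_Win win;
  /* every rank exposes one double through an MPI window */
  MPI_Win_allocate(sizeof(double), sizeof(double), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &base, &win);
  *base = 0.0;

  MPI_Win_fence(0, win);                 /* open an epoch */
  if (rank == 0) {
    double val = 42.0;
    /* write directly into rank 1's window; rank 1 posts no receive */
    MPI_Put(&val, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
  }
  MPI_Win_fence(0, win);                 /* synchronization completes the put */

  MPI_Win_free(&win);
  MPI_Finalize();
  return 0;
}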
PRESENTATION OVERVIEW
1. Overview of three MPI-3 RMA concepts
2. MPI window creation
3. Communication
4. Synchronization
5. Application evaluation
MPI-3 RMA COMMUNICATION OVERVIEW
Non-atomic communication calls (Put, Get)
Atomic communication calls (Acc, Get & Acc, CAS, FAO)
[Diagram: passive Process A exposes memory through an MPI window; active Processes B, C, D, ... target it with Put, Get, and atomic operations; other processes expose windows as well]
MPI-3.0 RMA SYNCHRONIZATION OVERVIEW
Active Target Mode: Fence, Post/Start/Complete/Wait
Passive Target Mode: Lock, Lock All
[Diagram: an active process performs synchronization and communication with a passive process]
SCALABLE PROTOCOLS & REFERENCE IMPLEMENTATION
Scalable & generic protocols
Can be used on any RDMA network (e.g., OFED/IB)
Cover window creation, communication, and synchronization
foMPI, a fully functional MPI-3 RMA implementation
DMAPP: lowest-level networking API for Cray Gemini/Aries systems
XPMEM: a portable Linux kernel module
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
PART 1: SCALABLE WINDOW CREATION
Traditional windows (backwards compatible with MPI-2)
Time bound: O(p), memory bound: O(p), where p = total number of processes
[Diagram: Processes A, B, and C each expose local memory at a different base address (0x111, 0x123, 0x120)]
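As a sketch of this variant, the traditional creation routine takes an already-allocated local buffer (buffer name and size below are illustrative):

double buf[1024];                        /* pre-existing local memory */
MPI_Win win;
MPI_Win_create(buf, 1024 * sizeof(double), sizeof(double),
               MPI_INFO_NULL, MPI_COMM_WORLD, &win);
/* ... RMA epochs ... */
MPI_Win_free(&win);

Because every process may pass a different local address, remote accesses need per-target base addresses, which is where the O(p) bounds come from.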
PART 1: SCALABLE WINDOW CREATION
Allocated windows: allows MPI to allocate the memory
Time bound: O(log p) (with high probability), memory bound: O(1), where p = total number of processes
[Diagram: Processes A, B, and C all expose memory at the same base address (0x123)]
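A matching sketch with the allocation handled by MPI (size illustrative):

double *base;                            /* MPI chooses and returns the base address */
MPI_Win win;
MPI_Win_allocate(1024 * sizeof(double), sizeof(double), MPI_INFO_NULL,
                 MPI_COMM_WORLD, &base, &win);
/* ... RMA epochs ... */
MPI_Win_free(&win);

Letting MPI place the memory allows symmetric allocations, which is what the identical base address 0x123 in the diagram illustrates and what keeps the per-process metadata constant.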
PART 1: SCALABLE WINDOW CREATION
Dynamic windows: local attach/detach, most flexible
Time bound: O(p), memory bound: O(p), where p = total number of processes
[Diagram: Processes A, B, and C attach regions at arbitrary addresses (0x111, 0x120, 0x123, 0x129, ...)]
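A sketch of the dynamic variant (region size illustrative); the window starts empty and memory is attached locally as needed:

MPI_Win win;
MPI_Win_create_dynamic(MPI_INFO_NULL, MPI_COMM_WORLD, &win);

double *region;
MPI_Alloc_mem(1024 * sizeof(double), MPI_INFO_NULL, &region);
MPI_Win_attach(win, region, 1024 * sizeof(double));
/* targets address the region via its absolute address (obtained with MPI_Get_address) */
MPI_Win_detach(win, region);
MPI_Free_mem(region);
MPI_Win_free(&win);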
PART 2: COMMUNICATION
Put and Get: direct DMAPP put and get operations, or local (blocking) memcpy (XPMEM)
Accumulate: DMAPP atomic operations for 64-bit types, or fall back to a remote locking protocol
MPI datatype handling with the MPITypes library [1]
Fast path for contiguous data transfers of common intrinsic datatypes (e.g., MPI_DOUBLE)
[Diagram: for contiguous memory, MPI_Put maps to dmapp_put_nbi and MPI_Compare_and_swap maps to dmapp_acswap_qw_nbi, issued directly to the remote process]
[1] Ross, Latham, Gropp, Lusk, Thakur. Processing MPI datatypes outside MPI. EuroMPI/PVM'09
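At the MPI level these are the standard RMA communication calls; a minimal sketch inside a passive-target epoch (window, target rank, and displacements are illustrative):

#include <mpi.h>

/* window "win" and rank "target" are assumed to exist already */
void rma_calls(MPI_Win win, int target) {
  long val = 1, add = 5, expected = 0, desired = 1, result;
  MPI_Win_lock(MPI_LOCK_SHARED, target, 0, win);
  MPI_Put(&val, 1, MPI_LONG, target, 0, 1, MPI_LONG, win);                      /* non-atomic put */
  MPI_Accumulate(&add, 1, MPI_LONG, target, 1, 1, MPI_LONG, MPI_SUM, win);      /* atomic accumulate */
  MPI_Compare_and_swap(&desired, &expected, &result, MPI_LONG, target, 2, win); /* CAS */
  MPI_Fetch_and_op(&val, &result, MPI_LONG, target, 3, MPI_SUM, win);           /* fetch-and-op (FAO) */
  MPI_Win_unlock(target, win);                                                  /* completes all of the above */
}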
PERFORMANCE INTER-NODE: LATENCY
Put inter-node: 20% faster; Get inter-node: 80% faster
[Benchmark: half ping-pong, Proc 0 puts to Proc 1 and then synchronizes memory]
PERFORMANCE INTRA-NODE: LATENCY
Put/Get intra-node: 3x faster
[Benchmark: half ping-pong, Proc 0 puts to Proc 1 and then synchronizes memory]
PERFORMANCE: OVERLAP
Inter-node overlap in %
Useful for, e.g., scientific codes: 3D FFT, MILC, AWM-Olsen seismic
[Benchmark: Proc 0 issues a put to Proc 1, computes, then synchronizes memory]
PERFORMANCE: MESSAGE RATE
Inter-node and intra-node
[Benchmark: Proc 0 issues many puts to Proc 1, then synchronizes memory]
PERFORMANCE: ATOMICS
64-bit integers
[Chart annotations: hardware-accelerated protocol (lower latency, higher bandwidth), fall-back protocol, proprietary]
PART 3: SYNCHRONIZATION
Active Target Mode: Fence, Post/Start/Complete/Wait
Passive Target Mode: Lock, Lock All
[Diagram: an active process performs synchronization and communication with a passive process]
SCALABLE FENCE IMPLEMENTATION
Collective call
Completes all outstanding memory operations
[Diagram: Procs 0-3 on Node 0 and Node 1 issue puts to one another; the fence completes all of them]

int MPI_Win_fence(...) {
  asm("mfence");        /* local completion of intra-node (XPMEM) operations */
  dmapp_gsync_wait();   /* local completion of inter-node (DMAPP) operations */
  MPI_Barrier(...);     /* global completion */
  return MPI_SUCCESS;
}
SCALABLE FENCE PERFORMANCE
Time bound: O(log p)
Memory bound: O(1)
90% faster
PSCW SYNCHRONIZATION
Posting process: post ... wait delimits the exposure epoch (allows access from other processes)
Starting process: start ... complete delimits the access epoch (allows it to access other processes)
Puts are issued inside the access epoch
post/start and complete/wait are paired by a matching algorithm
[Diagram: Proc 0 and Proc 1; the starting process's start/complete epoch is matched against the posting process's post/wait epoch]
PSCW SYNCHRONIZATION
[Diagram: six processes (Proc 0-5); some post/wait exposure epochs while others open start/complete access epochs, with several pairings active at the same time]
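A minimal sketch of these calls in standard MPI-3 C (assumes an existing window win, a double buffer buf, and exactly two ranks, with rank 0 starting and rank 1 posting):

void pscw_example(MPI_Win win, int rank, double *buf) {
  MPI_Group world_grp, peer_grp;
  MPI_Comm_group(MPI_COMM_WORLD, &world_grp);

  if (rank == 1) {                      /* posting process: exposure epoch */
    int peer = 0;
    MPI_Group_incl(world_grp, 1, &peer, &peer_grp);
    MPI_Win_post(peer_grp, 0, win);     /* allow access from rank 0 */
    MPI_Win_wait(win);                  /* return once all access epochs completed */
  } else if (rank == 0) {               /* starting process: access epoch */
    int peer = 1;
    MPI_Group_incl(world_grp, 1, &peer, &peer_grp);
    MPI_Win_start(peer_grp, 0, win);    /* match against the post */
    MPI_Put(buf, 1, MPI_DOUBLE, 1, 0, 1, MPI_DOUBLE, win);
    MPI_Win_complete(win);              /* close the access epoch, completing the put */
  }
}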
PSCW SCALABLE POST/START MATCHING
In general, there can be n posting and m starting processes
In this example there is one posting process and 4 starting processes
[Diagram: posting process i opens its window; starting processes j1, ..., j4 access the remote window]
PSCW SCALABLE POST/START MATCHING
Each starting process has a local list
Posting process i adds its rank i to the list at each starting process j1, ..., j4
Each starting process j waits until the rank of the posting process i is present in its local list
[Diagram: rank i appears, one by one, in the local lists of j1, ..., j4]
PSCW SCALABLE COMPLETE/WAIT MATCHING
Each starting process increments a counter stored at the posting process
When the counter is equal to the number of starting processes, the posting process returns from wait
[Diagram: the counter at posting process i goes from 0 to 4 as j1, ..., j4 increment it]
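The same matching idea can be pictured with standard MPI atomics. This is only an illustrative sketch: foMPI implements the protocol with DMAPP atomics, and the control window ctrl_win, the single list slot at displacement 0 (enough for one posting process, as in the example above), and the counter at displacement 1 are assumptions of the sketch. A lock_all epoch on ctrl_win is assumed to be open.

#include <mpi.h>

/* Post: posting process i announces its rank at each of the m starting processes.
   Wait: later, i returns once its own counter (disp 1) has reached m. */
static void pscw_post(int i, const int *starting, int m, MPI_Win ctrl_win) {
  long my_rank = i, old;
  for (int n = 0; n < m; ++n) {
    MPI_Fetch_and_op(&my_rank, &old, MPI_LONG, starting[n], 0, MPI_REPLACE, ctrl_win);
    MPI_Win_flush(starting[n], ctrl_win);
  }
}

/* Start: a starting process spins on its local list slot until rank i shows up
   (local polling omitted). Complete: it then increments the counter at i. */
static void pscw_complete(int i, MPI_Win ctrl_win) {
  long one = 1, prev;
  MPI_Fetch_and_op(&one, &prev, MPI_LONG, i, 1, MPI_SUM, ctrl_win);
  MPI_Win_flush(i, ctrl_win);
}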
PSCW PERFORMANCE
Time bound: P_start = P_wait = O(1), P_post = P_complete = O(log p)
Memory bound: O(log p) (for scalable programs)
[Benchmark: ring topology]
SCALABLE LOCK SYNCHRONIZATION
Lock/Unlock (shared or exclusive) and Lock All (always shared): an active process locks a passive process
Two-level lock hierarchy:
Local state at each process: a shared counter and an exclusive bit
Global state at a master process: a shared counter and an exclusive counter
EXCLUSIVE LOCAL LOCK: TWO PHASES
Proc 2 wants to lock Proc 1 exclusively (MPI_Win_lock(EXCL, 1))
PHASE 1: increment the global exclusive counter with a fetch-add at the master process
(Invariant 1: no global shared lock held concurrently)
PHASE 2: set the exclusive bit at the target with a compare-and-swap
(Invariant 2: no local shared/exclusive lock held concurrently)
SHARED LOCAL LOCK: ONE PHASE
Proc 0 wants to lock Proc 1 (MPI_Win_lock(SHRD, 1))
Increment the local shared counter at the target with a fetch-add
(Invariant: no local exclusive lock on this process held concurrently)
SHARED GLOBAL LOCK: ONE PHASE
Proc 2 wants to lock the whole window (MPI_Win_lock_all())
Increment the global shared counter at the master process with a fetch-add
(Invariant: no local exclusive lock is held concurrently)
Constant number of operations, independent of the number of processes p
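An illustrative sketch of the exclusive path written with MPI atomics rather than DMAPP. The packing of the counters and the bit into single 64-bit words, the master rank, the displacements, and the window lock_win are assumptions of this sketch, and the check of the fetch-add result against invariant 1 is omitted for brevity; a lock_all epoch on lock_win is assumed to be open.

#include <mpi.h>

#define MASTER 0  /* rank holding the global lock word (assumption) */

static void lock_exclusive(int target, MPI_Win lock_win) {
  long one = 1, minus_one = -1, zero = 0, prev, result;
  for (;;) {
    /* Phase 1: announce the exclusive lock by incrementing the global
       exclusive counter at the master process. */
    MPI_Fetch_and_op(&one, &prev, MPI_LONG, MASTER, 0, MPI_SUM, lock_win);
    MPI_Win_flush(MASTER, lock_win);

    /* Phase 2: try to take the target's local exclusive bit with a CAS. */
    MPI_Compare_and_swap(&one, &zero, &result, MPI_LONG, target, 0, lock_win);
    MPI_Win_flush(target, lock_win);
    if (result == 0) return;            /* lock acquired */

    /* Backoff (see the backup slide): undo phase 1, then retry. */
    MPI_Fetch_and_op(&minus_one, &prev, MPI_LONG, MASTER, 0, MPI_SUM, lock_win);
    MPI_Win_flush(MASTER, lock_win);
  }
}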
FLUSH SYNCHRONIZATION
Time bound: O(1), memory bound: O(1)
Guarantees remote completion
Issues a remote bulk synchronization and an x86 mfence
One of the most performance-critical functions; we add only 78 x86 CPU instructions to the critical path
[Diagram: Process 0 issues several inc(counter) operations on Process 1; after the flush all of them are visible (counter: 3)]
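As a usage sketch (standard MPI-3 calls; window, target rank, and displacement are illustrative), flush makes the increments visible at the target before the origin moves on:

void flush_example(MPI_Win win) {
  long one = 1, prev;
  MPI_Win_lock_all(0, win);
  for (int k = 0; k < 3; ++k)           /* inc(counter) three times at process 1 */
    MPI_Fetch_and_op(&one, &prev, MPI_LONG, 1, 0, MPI_SUM, win);
  MPI_Win_flush(1, win);                /* remote completion: counter == 3 at process 1 */
  /* ... continue while the epoch stays open ... */
  MPI_Win_unlock_all(win);
}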
PERFORMANCE
Evaluation on the Blue Waters system
22,640 Cray XE6 compute nodes
724,480 schedulable cores
All microbenchmarks, 4 applications, and one nearly full-scale run
PERFORMANCE: MOTIF APPLICATIONS
Key/Value Store: random inserts per second
Dynamic Sparse Data Exchange (DSDE) with 6 neighbors
PERFORMANCE: APPLICATIONS
NAS 3D FFT [1] performance (scaled to 65k processes); MILC [2] application execution time (scaled to 512k processes)
Annotations represent the performance gain of foMPI over Cray MPI-1.
[1] Nishtala et al. Scaling communication-intensive applications on BlueGene/P using one-sided communication and overlap. IPDPS'09
[2] Shan et al. Accelerating applications at scale using one-sided communication. PGAS'12
CONCLUSIONS & SUMMARY
1. MPI window creation routines
2. Non-atomic & atomic communication
3. Fence / PSCW
4. Locks
5. foMPI reference implementation
ACKNOWLEDGMENTS
Thanks to: Timo Schneider, Greg Bauer, Bill Kramer, Duncan Roweth, Nick Wright, Paul Hargrove (and the whole UPC team), and the MPI Forum RMA WG ...
... and the institutions.
Thank you for your attention
http://spcl.inf.ethz.ch/Research/Parallel_Programming/foMPI
BACKUP SLIDES
DYNAMIC WINDOW CREATION
[Diagram: Processes A, B, and C start with no attached regions (counts 0/0/0); as regions at addresses such as 0x111, 0x120, 0x123, and 0x129 are attached, each process's attach count grows (e.g., 2/1/2)]
Remote access protocol: the origin caches the target's attach count (id) together with the list of attached regions
Before an access, the origin fetches the target's current id with a Get(id)
If the fetched id matches the cached id, the origin accesses the window directly
If it differs, the origin first updates its cached list, then accesses the window
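An illustrative sketch of this check using MPI calls; the metadata window meta_win, the displacement ID_DISP, the cached_id array, and the fetch_region_list helper are hypothetical names for this sketch (foMPI performs the equivalent with its own metadata layout), and a passive-target epoch on meta_win is assumed to be open.

long remote_id;
MPI_Get(&remote_id, 1, MPI_LONG, target, ID_DISP, 1, MPI_LONG, meta_win);
MPI_Win_flush(target, meta_win);        /* make remote_id valid locally */

if (remote_id != cached_id[target]) {
  /* the target attached or detached memory since our last access:
     refresh the cached list of regions before touching the window */
  fetch_region_list(target, meta_win);  /* hypothetical helper */
  cached_id[target] = remote_id;
}
/* ids match (or were just refreshed): issue the RMA operation */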
PERFORMANCE INTER-NODE: LATENCY (SHMEM)
Put inter-node, Get inter-node
[Benchmark: half ping-pong, Proc 0 puts to Proc 1 and then synchronizes memory]
PERFORMANCE INTRA-NODE: LATENCY (SHMEM)
Put/Get intra-node
[Benchmark: half ping-pong, Proc 0 puts to Proc 1 and then synchronizes memory]
EXCLUSIVE LOCAL LOCK: TWO PHASES (BACKOFF)
Proc 2 wants to lock Proc 1 exclusively
BACKOFF: if the local exclusive bit cannot be acquired, decrement the global exclusive counter (Add(-1) at the master process) ... then retry
PERFORMANCE MODELING
Performance functions for synchronization protocols:
Fence: P_fence = 2.9 µs * log2(p)
PSCW: P_start = 0.7 µs, P_wait = 1.8 µs, P_post = P_complete = 350 ns * k
Locks: P_lock,excl = 5.4 µs; P_lock,shrd = P_lock_all = 2.7 µs; P_unlock = P_unlock_all = 0.4 µs
Flush/Sync: P_flush = 76 ns; P_sync = 17 ns
Performance functions for communication protocols:
Put/Get: P_put = 0.16 ns * s + 1 µs; P_get = 0.17 ns * s + 1.9 µs
Atomics: P_acc,sum = 28 ns * s + 2.4 µs; P_acc,min = 0.8 ns * s + 7.3 µs
(p = number of processes, k = number of PSCW neighbor processes, s = message size in bytes)