Joachim Worringen, Andreas Gäer, Frank Reker
![Page 1: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/1.jpg)
Lehrstuhl für Betriebssysteme, RWTH Aachen
Workshop for Communication Architecture in Clusters, IPDPS 2002:
Exploiting Transparent Remote Memory Access for
Non-Contiguous- and One-Sided-Communication
Joachim Worringen, Andreas Gäer, Frank Reker
![Page 2: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/2.jpg)
Workshop CAC ´02 – April 15th 2002 Lehrstuhl für Betriebsysteme
Agenda
• Introduction to SCI via PCI
• MPI via SCI: SCI-MPICH
• Non-Contiguous Datatype Communication
• One-sided Communication
• Conclusion
![Page 3: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/3.jpg)
SCI – Principle
• Scalable Coherent Interface: Protocol for a memory-coupling interconnect
• standardized in 1992 (IEEE 1596-1992)
• Not a physical bus – but offers bus-like services
• transparent operation for source and target
• operates within a global 64-bit address space
• Good scalability through flexible topologies and small packet sizes
• optional: efficient, distributed cache-coherence
![Page 4: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/4.jpg)
SCI – Implementation
• Integrated Systems
• PCI-SCI adapter:
• 64-bit 66 MHz PCI-PCI Bridge
• Key performance values on IA-32 platform:
- Remote write (word): 1.5 µs
- Remote read (word): 3.2 µs
- Bandwidth, PIO: 170 MB/s (IA-32 PCI is the limit)
  (peak bandwidth for 512-byte block size)
- Bandwidth, DMA: 250 MB/s (DMA engine is the limit)
- Synchronization: ~15 µs
- Remote interrupt: ~30 µs
![Page 5: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/5.jpg)
PCI-SCI – Performance on IA-32: Bandwidth
[Figure: bandwidth plot; annotations: PCI, Memory, DMA engine]
![Page 6: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/6.jpg)
PCI-SCI – Performance on IA-32: Latency
[Figure: latency plot; annotation: native packet size]
![Page 7: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/7.jpg)
Message Passing over SCI
Similar to local shared memory, but:
• Different performance characteristics
- Read / write latency
- I/O bus behaves differently than system bus
- Granularity and contiguity of access are important
- Load on independent inter-node links (collective operations)
• Different memory consistency model
• Connecting / mapping incurs more overhead
• Resources are limited
• It's still a network: connection monitoring
The naive approach "map everything and do shmem" is neither efficient nor scalable
![Page 8: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/8.jpg)
Communicating Non-Contiguous Data
![Page 9: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/9.jpg)
MPI Datatypes
• Basic datatypes (char, double, ...)
• Derived datatypes
• Concatenate elements: communicate vectors
• Associations of elements: communicate structs
• Using offsets, strides and upper/lower bounds, non-contiguous datatypes can be defined
• Example: Columns of a matrix ('C' storage)
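The matrix-column example can be sketched in plain C. This is a minimal illustration, not code from SCI-MPICH: the `vector_type` descriptor and both helper functions are hypothetical names, modelled on the (count, blocklength, stride) arguments that `MPI_Type_vector` takes, with all lengths counted in elements.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical descriptor mirroring the (count, blocklength, stride)
 * arguments of MPI_Type_vector; all lengths are in elements. */
typedef struct {
    size_t count;     /* number of blocks */
    size_t blocklen;  /* elements per block */
    size_t stride;    /* elements between the starts of consecutive blocks */
} vector_type;

/* Describe one column of a rows x cols matrix in C (row-major) storage:
 * one element per row, consecutive elements `cols` apart. */
static vector_type column_type(size_t rows, size_t cols)
{
    vector_type t = { rows, 1, cols };
    return t;
}

/* Gather the elements selected by `t`, starting at `src`, into the
 * contiguous buffer `dst`; returns the number of elements copied. */
static size_t gather_vector(const double *src, vector_type t, double *dst)
{
    size_t n = 0;
    for (size_t b = 0; b < t.count; b++) {
        memcpy(dst + n, src + b * t.stride, t.blocklen * sizeof(double));
        n += t.blocklen;
    }
    return n;
}
```

To select column 2 of a 3 x 4 matrix `m`, pass `&m[0][2]` as `src` together with `column_type(3, 4)`.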
![Page 10: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/10.jpg)
Sending Non-Contiguous Data – The Generic Way
What is the problem?
• Many communication devices have a simple data specification <data location, data length>
• Result: N send operations for N separate data chunks, which is inefficient
Generic solution:
[Diagram: Sender and Receiver; stages labeled Pack, Gather, Transfer, Unpack]
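A rough sketch of the generic path, assuming a simple strided layout of doubles: the data is packed into a contiguous send buffer, moved in one piece, and unpacked again at the receiver. All function names are hypothetical, and `memcpy` merely stands in for the single contiguous send of a real communication device.

```c
#include <stddef.h>
#include <string.h>

/* Copy `count` doubles spaced `stride` elements apart into a contiguous buffer. */
static void pack_strided(const double *src, size_t count, size_t stride,
                         double *packed)
{
    for (size_t i = 0; i < count; i++)
        packed[i] = src[i * stride];
}

/* Inverse of pack_strided: scatter a contiguous buffer into a strided layout. */
static void unpack_strided(const double *packed, size_t count, size_t stride,
                           double *dst)
{
    for (size_t i = 0; i < count; i++)
        dst[i * stride] = packed[i];
}

/* Generic path: pack, one contiguous transfer, unpack.
 * Returns the number of bytes copied in total, i.e. three times the payload. */
static size_t generic_send(const double *src, double *dst, size_t count,
                           size_t stride, double *send_buf, double *recv_buf)
{
    size_t bytes = count * sizeof(double);
    pack_strided(src, count, stride, send_buf);   /* copy 1: pack */
    memcpy(recv_buf, send_buf, bytes);            /* copy 2: stand-in for the send */
    unpack_strided(recv_buf, count, stride, dst); /* copy 3: unpack */
    return 3 * bytes;
}
```

The return value makes the cost visible: every payload byte is touched three times, two of them in intermediate buffers.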
![Page 11: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/11.jpg)
Sending Non-Contiguous Data – The SCI Way
Goal: avoid intermediate copy operations!
SCI advantage: high bandwidth for small data chunks
Problem: requires a fast, space-efficient, restartable pack/unpack algorithm: direct_pack_ff
[Diagram: Sender: Pack & Transfer; Receiver: Gather & Unpack]
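The idea of merging pack and transfer can be sketched as follows. This is an illustration under assumptions, not the direct_pack_ff algorithm itself: the function names are hypothetical, the layout is a plain stride of doubles, and `incoming` is an ordinary array standing in for receiver memory that the sender has mapped via SCI.

```c
#include <stddef.h>

/* Sender side: store each strided element straight into the receiver's
 * incoming buffer. With SCI, `remote_buf` would be mapped remote memory and
 * every store a remote write (assumption; here it is a local array). */
static void pack_and_transfer(const double *src, size_t count, size_t stride,
                              double *remote_buf)
{
    for (size_t i = 0; i < count; i++)
        remote_buf[i] = src[i * stride];
}

/* Receiver side: scatter from the local incoming buffer into the user buffer. */
static void gather_and_unpack(const double *incoming, size_t count,
                              size_t stride, double *dst)
{
    for (size_t i = 0; i < count; i++)
        dst[i * stride] = incoming[i];
}

/* Direct path: no separate send buffer and no separate transfer step.
 * Returns the number of bytes copied in total, i.e. twice the payload. */
static size_t direct_send(const double *src, double *dst, size_t count,
                          size_t stride, double *incoming)
{
    pack_and_transfer(src, count, stride, incoming);
    gather_and_unpack(incoming, count, stride, dst);
    return 2 * count * sizeof(double);
}
```

Because the sender issues many small writes, this only pays off on an interconnect that sustains high bandwidth for small chunks, which is the SCI advantage named on the slide.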
![Page 12: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/12.jpg)
Performance (I) – Simple Vector
![Page 13: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/13.jpg)
Performance (II) – Complex Type
![Page 14: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/14.jpg)
Performance (III) - Alignment
![Page 15: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/15.jpg)
Reduced Cache-Pollution
![Page 16: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/16.jpg)
One-Sided Communication
SCI is one-sided communication, but:
• Remote read bandwidth < 10 MB/s
• Only fragments of each process' address space are accessible
• These fragments need to be allocated specifically (until now)
Multi-protocol approach required to
• Handle every kind of memory type
• Achieve optimal performance for each type of access
![Page 17: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/17.jpg)
Different Access Modes
Direct access of remote memory:
• Window is allocated from an SCI shared segment
• Put operations of any size
• Get operations sized up to a threshold
Emulated access of remote memory: access initiated by origin via messages, executed by target:
• Windows allocated from private memory
• Get operations beyond threshold
• Accumulate operations
Synchronous or asynchronous completion
Different overhead & synchronization semantics
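The choice between the two access modes can be written as a small decision function. This is a sketch of the rules listed above, not the SCI-MPICH implementation: the enum and function names are hypothetical, and the threshold is a parameter because the slides do not state its value.

```c
#include <stdbool.h>
#include <stddef.h>

typedef enum { OP_PUT, OP_GET, OP_ACCUMULATE } rma_op;
typedef enum { ACCESS_DIRECT, ACCESS_EMULATED } access_mode;

/* Pick the access mode for one operation.
 * `window_is_shared`: window was allocated from an SCI shared segment.
 * `get_threshold`: largest get (in bytes) still served by direct remote
 * reads; the value is an assumption supplied by the caller. */
static access_mode select_access_mode(rma_op op, bool window_is_shared,
                                      size_t nbytes, size_t get_threshold)
{
    if (!window_is_shared)
        return ACCESS_EMULATED;        /* private memory: target must execute */

    switch (op) {
    case OP_PUT:
        return ACCESS_DIRECT;          /* puts of any size */
    case OP_GET:
        return nbytes <= get_threshold ? ACCESS_DIRECT : ACCESS_EMULATED;
    case OP_ACCUMULATE:
        return ACCESS_EMULATED;        /* always executed by the target */
    }
    return ACCESS_EMULATED;
}
```

The get threshold exists because remote reads are slow (below 10 MB/s according to the previous slide), so larger gets are cheaper when the target pushes the data with remote writes.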
![Page 18: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/18.jpg)
Latency of Remote Accesses
[Figure: latency [µs] (0 to 120) versus number of elements ("double", 1 to 124); series: put shared, put priv async, put priv sync, get shared, get priv async, get priv sync]
![Page 19: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/19.jpg)
Bandwidth of Remote Accesses
[Figure: bandwidth [MiB/s] (logarithmic axis, 1 to 100) versus number of elements ("double", 1 to 124); series: put shared, put priv async, put priv sync, get shared, get priv async, get priv sync]
![Page 20: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/20.jpg)
Scalability
• SCI ringlet bandwidth: 633 MiB/s
• Outgoing peak traffic for MPI_Put() per node: 122 MiB/s
• Worst case scalability:

| nodes | offered load | per node [MiB/s] | total [MiB/s] | efficiency |
|---|---|---|---|---|
| 4 | 76.3 % | 120.7 | 482.8 | 76.3 % |
| 5 | 95.3 % | 115.8 | 579.0 | 91.5 % |
| 6 | 114.4 % | 97.8 | 586.5 | 92.7 % |
| 7 | 133.5 % | 79.3 | 555.1 | 87.7 % |
| 8 | 152.5 % | 62.8 | 502.2 | 79.3 % |
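The table columns can be reproduced with two ratios. The formulas below are inferred from the numbers rather than stated on the slide: the offered load matches the node count times 120.7 MiB/s (the per-node rate measured with 4 nodes) divided by the ringlet bandwidth, and the efficiency matches the measured total divided by the ringlet bandwidth. The function names are illustrative.

```c
/* Offered load in percent of the ringlet bandwidth, assuming every node
 * tries to send at `per_node_rate` MiB/s (inferred from the table). */
static double offered_load_pct(int nodes, double per_node_rate, double ring_bw)
{
    return 100.0 * nodes * per_node_rate / ring_bw;
}

/* Efficiency in percent: accumulated measured bandwidth relative to the
 * ringlet bandwidth. */
static double efficiency_pct(double total_bw, double ring_bw)
{
    return 100.0 * total_bw / ring_bw;
}
```

For 5 nodes, `offered_load_pct(5, 120.7, 633.0)` yields about 95.3 and `efficiency_pct(579.0, 633.0)` about 91.5, in line with the second table row. Once the offered load exceeds 100 %, the per-node bandwidth drops and efficiency falls again.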
![Page 21: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/21.jpg)
Conclusions
• Avoiding separate pack/unpack for non-contiguous datatypes
- Highly efficient communication
- Reduced cache pollution and working set size
• Performance of one-sided communication comparable to integrated systems
• Multi-protocol approach ensures unlimited usability with optimal performance for each scenario
• Techniques are directly usable for SCI and local shmem
• Limits (and possible solutions):
- Address space of IA-32 architecture (IA-64 or others)
- Limited address translation resources (better hw, or sw cache)
- Ringlet scalability (increase link clock)
- Remote write latency (PCI-X w/ 133 MHz -> down to 0.8 µs)
- Remote read bandwidth (no real solution in sight)
- Interrupt latency (interrupt-on-write will halve it)
![Page 22: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/22.jpg)
Thank you!
![Page 23: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/23.jpg)
Spare Slides
![Page 24: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/24.jpg)
Pack on the fly
Store type information on 'commit' of datatype: offset, length, stride, repetition count
• Usual data structure tree: recursive, complex restart
• Naive data structure list: not space-efficient
• Suitable data structure: linked list of stacks
• Bandwidth break-even point between
- Overhead for intermediate copies
- Reduced bandwidth for fine-grained copy operations
List sorted by size of stack entries ("leaves") to allow for optimization: buffering for small and misaligned (parts of) leaves
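A restartable packer over stored (offset, length, stride, repetition count) entries might look like the sketch below. It simplifies deliberately: the leaves sit in a flat array rather than the linked list of stacks named on the slide, the small-leaf buffering optimization is left out, and all type and function names are hypothetical. The point it illustrates is the restart: the state records exactly where packing stopped, so the next call continues mid-leaf when the output buffer was full.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical flattened leaf: `count` blocks of `length` bytes, with
 * block starts `stride` bytes apart, beginning at byte `offset`. */
typedef struct {
    size_t offset;
    size_t length;
    size_t stride;
    size_t count;
} leaf;

/* Restart position: current leaf, repetition within it, byte within block. */
typedef struct {
    size_t leaf_idx;
    size_t rep;
    size_t byte;
} pack_state;

/* Pack at most `max` bytes from `inbuf` into `outbuf`, resuming at *st.
 * Returns the number of bytes written; 0 means the datatype is exhausted. */
static size_t pack_leaves(const char *inbuf, const leaf *leaves, size_t nleaves,
                          pack_state *st, char *outbuf, size_t max)
{
    size_t out = 0;
    while (st->leaf_idx < nleaves && out < max) {
        const leaf *l = &leaves[st->leaf_idx];
        if (st->rep >= l->count || l->length == 0) {   /* leaf finished */
            st->leaf_idx++;
            st->rep = 0;
            st->byte = 0;
            continue;
        }
        size_t n = l->length - st->byte;               /* rest of this block */
        if (n > max - out)
            n = max - out;                             /* output is nearly full */
        memcpy(outbuf + out,
               inbuf + l->offset + st->rep * l->stride + st->byte, n);
        out += n;
        st->byte += n;
        if (st->byte == l->length) {                   /* block complete */
            st->rep++;
            st->byte = 0;
        }
    }
    return out;
}
```

Calling `pack_leaves` repeatedly with a small `max` and the same `pack_state` produces the same byte stream as one large call, which is what allows packing directly into a transfer buffer of limited size.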
![Page 25: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/25.jpg)
Integration in SCI-MPICH
Building the stack:

    MPIR_build_ff (struct MPIR_DATATYPE *type, int *leaves)

Moving data:

    MPID_SMI_Pack_ff (char *inbuf, struct MPIR_DATATYPE *dtype_ptr,
                      char *outbuf, int dest, int max, int *outlen)
    MPID_SMI_UnPack_ff (char *inbuf, struct MPIR_DATATYPE *dtype_ptr,
                        char *outbuf, int from, int max, int *outlen)
![Page 26: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/26.jpg)
Type Matching
Sorting of basic datatypes (leaves) by size:
• Improves performance by mixing gathering & direct write
• Changes order in incoming buffer at receiver!
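The consequence for the receiver can be illustrated with a small sort. This is a sketch under assumptions: the `leaf` type and both functions are hypothetical, and sorting largest-first is a guess, since the slide only says "by size". Whatever the direction, the data arrives in sorted-leaf order rather than memory order, so the receiver has to unpack with the same sorted list.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical leaf: a contiguous block at `offset` with `length` bytes. */
typedef struct {
    size_t offset;
    size_t length;
} leaf;

/* Order by length, largest first (assumed direction); ties are broken by
 * offset so that sender and receiver derive the identical order. */
static int by_length_desc(const void *a, const void *b)
{
    const leaf *x = a;
    const leaf *y = b;
    if (x->length != y->length)
        return x->length < y->length ? 1 : -1;
    return (x->offset > y->offset) - (x->offset < y->offset);
}

static void sort_leaves(leaf *leaves, size_t n)
{
    qsort(leaves, n, sizeof *leaves, by_length_desc);
}
```

For a struct laid out as char, int, double, the wire order after sorting is double, int, char.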
![Page 27: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/27.jpg)
Efficiency Comparison: Non-contig Vector
[Figure: y-axis 0.0 to 1.2; x-axis ticks 8, 32, 128, 512, 2048, 8192, 32768, 131072; series: ff via SCI, ff via shmem, Score Myrinet, Score shmem, SunFire Gigabit, SunFire shmem, Cray T3E]
![Page 28: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/28.jpg)
Memory Allocation
MPI_Alloc_mem():
• Below threshold 1: allocate via malloc()
• Below threshold 2: allocate from shared mem pool
• Beyond threshold 2: create specific shared memory
Attributes:
• Keys private, shared, pinned: type of memory buffer
• Key align with value: align memory buffer
MPI_Free_mem():
• Unpin memory, release shared memory segment
- Implicitly invokes remote segment callbacks
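The size-based choice made inside MPI_Alloc_mem() can be sketched as a three-way decision. The names are hypothetical and the two thresholds are parameters because the slide gives no values. How a size exactly equal to a threshold is treated is also an assumption (here it falls into the next larger class).

```c
#include <stddef.h>

typedef enum {
    ALLOC_MALLOC,        /* private memory via malloc() */
    ALLOC_SHARED_POOL,   /* carve from a pre-allocated shared memory pool */
    ALLOC_OWN_SEGMENT    /* create a dedicated shared memory segment */
} alloc_strategy;

/* Choose an allocation strategy from the request size.
 * Assumes threshold1 <= threshold2; both values are caller-supplied. */
static alloc_strategy choose_alloc(size_t size, size_t threshold1,
                                   size_t threshold2)
{
    if (size < threshold1)
        return ALLOC_MALLOC;
    if (size < threshold2)
        return ALLOC_SHARED_POOL;
    return ALLOC_OWN_SEGMENT;
}
```

With illustrative thresholds of 1 KiB and 64 KiB, a 100-byte request is served by malloc(), a 4 KiB request comes from the pool, and a 1 MiB request gets its own segment.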
![Page 29: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/29.jpg)
sparse micro-Benchmark
Simulate access to remote sparse matrix:

    MPI_Win_create (..., winsize, ...)
    for (increasing values of access_cnt) {
        offset = 0
        stride = access_cnt * sizeof(datatype)
        flush_cache()
        time = MPI_Wtime()
        while (offset + access_size < winsize) {
            MPI_Get/Put (.., partner, offset, access_cnt, ..)
            offset = offset + stride
        }
        MPI_Win_fence(...)
        time = MPI_Wtime() - time
    }
![Page 30: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/30.jpg)
Latency of Remote Accesses
![Page 31: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/31.jpg)
Bandwidth of Remote Accesses
![Page 32: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/32.jpg)
Point-to-Point Comparison
![Page 33: Joachim Worringen , Andreas Gäer, Frank Reker](https://reader035.fdocuments.in/reader035/viewer/2022062518/56814548550346895db217c6/html5/thumbnails/33.jpg)
Scaling Comparison MPI_Put