© 2008 IBM Corporation
Deep Computing Messaging Framework
Lightweight Communication for Petascale Supercomputing
Michael Blocksome, blocksom@us.ibm.com
DCMF BoF, Supercomputing 2008
DCMF Open Source Community
Open source community established January 2008
Wiki
– http://dcmf.anl-external.org/wiki
Mailing list
– dcmf@lists.anl-external.org
Git source repository
– git clone http://dcmf.anl-external.org/dcmf.git/
– helpful git resources on the wiki
Design Goals
Scalable to millions of tasks
Efficient on low-frequency embedded cores
– Inlined systems programming interface (SPI)
Supports many programming paradigms
– Active messages
– Support for multiple contexts
– Multiple levels of application interfaces
Structured component design
– Extensible to new architectures
– Software architecture for multiple networks
– Open source runtime with external contributions
Separate library for optimized collectives
– Hardware acceleration
– Software collectives
IBM® Blue Gene®/P Messaging Software Stack
[Figure: layered software stack. Applications (e.g. QCD) and DCMF applications sit on a library portability layer – MPICH2 (dcmfd ADI), Charm++, Berkeley UPC / GASNet, Global Arrays / ARMCI. These call the DCMF public API; DCMF (C++) and CCMI form the Deep Computing Messaging Framework, which runs over the DMA SPI (systems programming interface) and the BG/P network hardware. The diagram marks which components are IBM supported and which are externally supported.]
Direct DCMF Application Programming
dcmf.h – core interface
– point-to-point and utilities
– all functions implemented
Collectives interface(s)
– may or may not be implemented on a given platform
– check the return value on register! (see the sketch below)
Collective Component Messaging Interface (CCMI)
– high-level collectives library
– uses the multisend interface
– extensible to new collectives
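Why the return-value check matters: on a platform where a collective is not wired up, the register call reports DCMF_UNIMPL and the application should fall back to a software implementation. A minimal sketch, assuming a global-barrier registration along these lines; the exact prototypes, types, and protocol enum names live in dcmf_globalcollectives.h.

    #include "dcmf_globalcollectives.h"

    /* Sketch only: field and enum names are illustrative; consult the header. */
    DCMF_GlobalBarrier_Configuration_t config;
    config.protocol = DCMF_DEFAULT_GLOBALBARRIER_PROTOCOL;   /* assumed protocol name */

    DCMF_Protocol_t barrier;
    DCMF_Result rc = DCMF_GlobalBarrier_register(&barrier, &config);

    if (rc == DCMF_UNIMPL)
    {
        /* the hardware-accelerated barrier is not available on this network;
           register a CCMI or point-to-point software barrier instead */
    }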
[Figure: DCMF software organization. The application talks to an adaptor layer. CCMI provides high-level collectives (dcmf_collectives.h) built on the multisend collectives interface (dcmf_multisend.h); dcmf.h exposes all point-to-point operations and dcmf_globalcollectives.h the global collectives. Inside DCMF, the messager and sysdep layer drive a set of protocols layered over devices, which sit on the BG/P hardware.]
DCMF Blue Gene/P Performance
Point-to-point latency

  Protocol                  Latency (µs)
  DCMF Eager one-way        1.6
  MPI Eager one-way         2.4
  MPI Rendezvous one-way    5.6
  DCMF Put                  0.9
  DCMF Get                  1.6
  ARMCI blocking put        2.0
  ARMCI blocking get        3.3

MPI achieves 4300 MB/s (96% of peak) for torus near-neighbor communication on 6 links.

Collective operation performance on 512 nodes (SMP)

  MPI Barrier                   1.3 µs
  MPI Allreduce (integer sum)   4.3 µs
  MPI Broadcast                 4.3 µs
  MPI Allreduce throughput      817 MB/s
  MPI Broadcast throughput      2.0 GB/s

Barriers are accelerated via the global interrupt network.
Allreduce and broadcast operations are accelerated via the collective network.
Large broadcasts take advantage of the 6 edge-disjoint routes on the 3D torus.
Why use DCMF?
Scales on BG/P to millions of tasks
– high efficiency, low overhead
Open source
– active community support
Easily port applications and libraries to the DCMF interface
Unique features of DCMF
– see the next chart
Feature Comparison (to the best of our knowledge)

  Feature                                        MX   VERBS       LAPI  ELAN      DCMF
  Multiple contexts                              N    Y           Y     Y         Y
  Active messages                                N    N¹          Y     Y         Y
  One-sided calls                                N    Y           Y     Y         Y
  Strided or vector calls                        N¹   N¹          Y     Y         N²
  Multi-send calls                               N¹   N¹          N¹    N¹        Y
  Message ordering and consistency               N    N           N     N         Y
  Device interface for many different networks   N    Y (C-API)   N     N         Y³ (C++)
  Topology awareness                             N    N           N     N         Y
  Architecture neutral                           N    Y           Y     N         Y
  Non-blocking optimized collectives             N¹   N¹          N¹    Blocking  Y

  ¹ Can be implemented in software on top of the features this API provides, possibly at lower efficiency.
  ² Non-contiguous transfer operations are to be added.
  ³ Device-level programming is available at the protocol level, not through the API.
DCMF C API Features
Multiple Context Registration
– supports multiple, concurrent communication paradigms
Memory Consistency
– one-sided communication APIs like UPC and ARMCI need optimized support for memory consistency levels
Active Messaging
– good match for Charm++ and other active-message runtimes
– MPI can be easily supported
Multisend Protocols
– amortize startup cost across many messages sent together
Topology Awareness
Optimized Protocols
See dcmf.h for the full interface (an illustrative registration sketch follows)
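As a concrete illustration of the active-message and polling model, a minimal sketch of send-protocol registration and progress. The configuration fields, callback names (recv_short_cb, recv_cb), and the done flag are assumptions for illustration; dcmf.h has the authoritative prototypes.

    #include "dcmf.h"

    /* Sketch: register an active-message send protocol, then poll for progress. */
    DCMF_Protocol_t send_protocol;
    DCMF_Send_Configuration_t send_config;

    send_config.protocol      = DCMF_DEFAULT_SEND_PROTOCOL;  /* assumed default protocol name */
    send_config.cb_recv_short = recv_short_cb;  /* short messages: header + payload in one callback */
    send_config.cb_recv       = recv_cb;        /* longer messages: callback supplies a receive buffer */

    DCMF_Send_register(&send_protocol, &send_config);

    /* the receiver's callback fires when a message arrives; both sides
       drive progress by polling the messager */
    while (!done)
        DCMF_Messager_advance();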
Extending DCMF to other Architectures
Copy the "Linux® sockets" messager and build options
– contains the sockets device and a DCMF_Send() protocol
– implements the core API; returns DCMF_UNIMPL for collectives
A new architecture only needs to implement DCMF_Send
– the sockets device enables DCMF on Linux clusters
– the shmem device enables DCMF on multi-core systems
DCMF provides default point-to-point implementations layered over send
– DCMF_Put()
– DCMF_Get()
– DCMF_Control()
Selectively implement architecture-specific devices and optimized protocols
– assign to DCMF_USER0_SEND_PROTOCOL (for example) to test (see the sketch below)
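To make the last point concrete, a hedged sketch of exercising an experimental device's send path through the USER0 slot. Only DCMF_USER0_SEND_PROTOCOL is named on the slide; the surrounding types and the registration call mirror the normal send registration in dcmf.h.

    #include "dcmf.h"

    /* Sketch: route sends through the experimental protocol under test.
       Other configuration fields are set as for a normal send registration. */
    DCMF_Send_Configuration_t config;
    config.protocol = DCMF_USER0_SEND_PROTOCOL;

    DCMF_Protocol_t test_protocol;
    if (DCMF_Send_register(&test_protocol, &config) != DCMF_SUCCESS)
    {
        /* the experimental protocol/device is not wired up in this build */
    }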
Upcoming Features * (nothing promised)
Common Device Interface (CDI)
– POSIX shared memory
– sockets
– Infiniband
Multi-channel advance
– a thread may advance a "slice" of the messaging devices
– dedicated threads result in uncontested locks for high-level communication libraries
Add a blocking advance API (sketch below)
– eliminate explicit processor polls on supported hardware
– may degrade to a regular DCMF_Messager_advance() on unsupported hardware
Extend the API to access Blue Gene® features in a portable manner
– network and device structures
– replace the hardware struct with key-value pairs
Noncontiguous point-to-point one-sided operations
– an iterator can be used to implement all other interfaces (strided, vector, etc.)
One-sided "on the fly" collectives (ad hoc)
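For the blocking-advance item above, a sketch of the intended usage difference. DCMF_Messager_advance() is the existing call; the blocking variant is only proposed, so the name used for it here is hypothetical.

    /* today: explicit polling burns processor cycles while waiting
       ('done' stands in for an application completion flag) */
    while (!done)
        DCMF_Messager_advance();

    /* proposed: block inside the advance call until the hardware signals
       pending work, degrading to plain polling where unsupported
       (hypothetical function name) */
    while (!done)
        DCMF_Messager_advance_blocking();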
DCMF Device Abstraction
At the core of DCMF is a “Device”, with a packet API abstraction and a DMA API abstraction
In principle the functions are virtual; in practice the methods are inlined for performance
– Barton-Nackman C++ templates (sketch below)
Common Device Interface (CDI)
– if you implement this interface, you get all of DCMF "for free"
– good for rapid prototypes
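A generic C++ sketch of the Barton-Nackman / curiously-recurring-template idiom mentioned above; the class and method names are illustrative, not the actual CDI.

    // The base template calls into the concrete device type supplied as a
    // template parameter, so the "virtual" call is resolved (and inlined)
    // at compile time instead of going through a vtable.
    template <class T_Device>
    class BaseDevice
    {
    public:
        int advance()
        {
            return static_cast<T_Device *>(this)->advance_impl();
        }
    };

    class ExampleDevice : public BaseDevice<ExampleDevice>
    {
    public:
        int advance_impl()
        {
            // poll the underlying network, hand completed packets to protocols
            return 0;
        }
    };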
Current DCMF Devices
Blue Gene/P
– DMA / 3-D Torus Network
– Collective Network
– Global Interrupt Network
– Lockbox / Memory Atomics
Generic
– Sockets (hybrid compatible)
– Shared Memory (hybrid compatible)
– Infiniband (hybrid compatible)
Other DCMF Projects
IBM
– Roadrunner
Argonne National Laboratory
– MPICH2
– ZeptoOS
Pacific Northwest National Laboratory
– Global Arrays / ARMCI
Berkeley
– UPC / GASNet
University of Illinois at Urbana-Champaign
– Charm++
Open Source Project Ideas, in no particular order
Store-and-forward protocols
Stream API
Channel combining, message striping across devices
Extend to other process managers (OpenMPI, etc.)
Extend to other platforms (OS X, BSD, Windows, ?)
DCMF functional and performance test suite
Scalability improvements for sockets and IB
Combination shmem/sockets messager
GPU device? hybrid model
Shared memory collectives
How can we be a more effective open source project?
How can we improve the open source experience?
– specific needs, directions?
– missing features?
Additional Charts
DCMF on Linux Clusters
DCMF on Infiniband
DCMF on Linux Clusters
Build instructions on the wiki
– http://dcmf.anl-external.org/wiki/index.php/Building_DCMF_for_Linux
Test environment for application developers
– evaluate the DCMF API and runtime
– port applications to DCMF before reserving time on Blue Gene/P
Uses MPICH2 PMI for job launch and management
– needs a pluggable job launch and sysdep extension to remove the MPICH2 dependency
Implemented devices
– sockets device
– shmem device
DCMF Sockets Device
Standard sockets syscalls are implemented on many architectures
Uses the "packet" CDI
– a new "stream" CDI may provide better performance
Current design is not scalable
– primarily a development and porting platform
Can be used to initialize other devices that require synchronization
DCMF Shmem Device
Uses the “packet” CDI
Only point-to-point send
Thread safe, allows multiple threads to post messages to device
No collectives
DCMF on Infiniband
DCMF Infiniband Motivations
Optimize for both low-power processors and large "fat" nodes
Infiniband project lead: Charles Archer
– communicate via the dcmf mailing list
DCMF Infiniband Device
Implements the CDI "rdma" version
– direct RDMA
– memregions
Implements the CDI "packet" version
– "eager" style sends
rdma CDI design
– SRQ: scalable, but worst latency
packet CDI design
– per-destination RDMA with send/recv
– per-destination RDMA with direct DMA – best latency
DCMF Infiniband – Future Work
Remove artificial limits on scalability
– currently 32 nodes
Implement memregion caching
Multiple adaptor support (?)
Switch management routines (?)
Multiple network implementations
– SRQ and "per destination"
Async progress through IB events