Designing MPI and PGAS Libraries for Exascale Systems: The MVAPICH2 Approach
Dhabaleswar K. (DK) Panda The Ohio State University
E-mail: [email protected]
http://www.cse.ohio-state.edu/~panda
Talk at OpenFabrics Workshop (April 2016)
Parallel Programming Models Overview
Figure: three abstract machine models – the Shared Memory Model (e.g., SHMEM, DSM) with processes P1-P3 sharing one memory; the Distributed Memory Model (e.g., MPI) with a private memory per process; and the Partitioned Global Address Space (PGAS) model (e.g., Global Arrays, UPC, Chapel, X10, CAF) with a logical shared memory layered over distributed memories.
• Programming models provide abstract machine models
• Models can be mapped onto different types of systems – e.g., Distributed Shared Memory (DSM), MPI within a node, etc.
• PGAS models and hybrid MPI+PGAS models are gradually gaining importance
Partitioned Global Address Space (PGAS) Models
• Key features
  – Simple shared-memory abstractions
  – Lightweight one-sided communication
  – Easier to express irregular communication
• Different approaches to PGAS
  – Languages: Unified Parallel C (UPC), Co-Array Fortran (CAF), X10, Chapel
  – Libraries: OpenSHMEM, UPC++, Global Arrays
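The lightweight one-sided model is easiest to see in a few lines of code. The following minimal sketch (illustrative only, using standard OpenSHMEM 1.2-style calls rather than anything specific to this talk) has every PE write a value directly into its right neighbor's symmetric variable, with no matching receive on the target side:

#include <shmem.h>
#include <stdio.h>

int main(void)
{
    shmem_init();                      /* start the OpenSHMEM runtime */
    int me   = shmem_my_pe();          /* this PE's rank */
    int npes = shmem_n_pes();          /* total number of PEs */

    /* Symmetric variable: exists at the same address on every PE */
    static int neighbor_value = -1;

    int right = (me + 1) % npes;

    /* One-sided put: write "me" into neighbor_value on PE "right";
       the target PE does not post a matching receive. */
    shmem_int_p(&neighbor_value, me, right);

    shmem_barrier_all();               /* make all puts globally visible */
    printf("PE %d received %d from its left neighbor\n", me, neighbor_value);

    shmem_finalize();
    return 0;
}

Because the target does not participate in the transfer, irregular communication patterns need no two-sided coordination.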
Hybrid (MPI+PGAS) Programming
• Application sub-kernels can be re-written in MPI or PGAS based on their communication characteristics
• Benefits:
  – Best of the distributed memory computing model
  – Best of the shared memory computing model
• Exascale Roadmap*: "Hybrid Programming is a practical way to program exascale systems"
* The International Exascale Software Roadmap, Dongarra, J., Beckman, P. et al., Volume 25, Number 1, 2011, International Journal of High Performance Computer Applications, ISSN 1094-3420
Figure: an HPC application decomposed into sub-kernels (Kernel 1 ... Kernel N); some kernels remain in MPI while others (e.g., Kernel 2 and Kernel N) are re-written in PGAS, as illustrated by the sketch after this slide.
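A minimal sketch of what such a hybrid application can look like, assuming a unified runtime (as in MVAPICH2-X) that lets MPI and OpenSHMEM coexist in one executable; the exact initialization requirements and ordering depend on the runtime being used:

#include <mpi.h>
#include <shmem.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);   /* MPI kernels use MPI_COMM_WORLD */
    shmem_init();             /* PGAS kernels use OpenSHMEM PEs */

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Kernel with regular communication: an MPI collective */
    int local = rank, sum = 0;
    MPI_Allreduce(&local, &sum, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

    /* Kernel with irregular communication: an OpenSHMEM one-sided put */
    static long flag = 0;                            /* symmetric variable */
    int right = (shmem_my_pe() + 1) % shmem_n_pes();
    shmem_long_p(&flag, (long) sum, right);
    shmem_barrier_all();

    if (rank == 0)
        printf("allreduce sum = %d, flag from neighbor = %ld\n", sum, flag);

    shmem_finalize();
    MPI_Finalize();
    return 0;
}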
Overview of the MVAPICH2 Project
• High-performance, open-source MPI library for InfiniBand, 10-40 GigE/iWARP, and RDMA over Converged Enhanced Ethernet (RoCE)
– MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), Available since 2002
– MVAPICH2-X (MPI + PGAS), Available since 2011
– Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), Available since 2014
– Support for Virtualization (MVAPICH2-Virt), Available since 2015
– Support for Energy-Awareness (MVAPICH2-EA), Available since 2015
– Used by more than 2,550 organizations in 79 countries
– More than 360,000 (> 0.36 million) downloads from the OSU site directly
– Empowering many TOP500 clusters (Nov ‘15 ranking)
• 10th ranked 519,640-core cluster (Stampede) at TACC
• 13th ranked 185,344-core cluster (Pleiades) at NASA
• 25th ranked 76,032-core cluster (Tsubame 2.5) at Tokyo Institute of Technology and many others
– Available with software stacks of many vendors and Linux Distros (RedHat and SuSE)
– http://mvapich.cse.ohio-state.edu
• Empowering Top500 systems for over a decade
  – System-X from Virginia Tech (3rd in Nov 2003, 2,200 processors, 12.25 TFlops) ->
  – Stampede at TACC (10th in Nov '15, 519,640 cores, 5.168 PFlops)
MVAPICH2 Architecture
High Performance Parallel Programming Models
• Message Passing Interface (MPI)
• PGAS (UPC, OpenSHMEM, CAF, UPC++)
• Hybrid MPI + X (MPI + PGAS + OpenMP/Cilk)
High Performance and Scalable Communication Runtime with Diverse APIs and Mechanisms
• Point-to-point primitives, collective algorithms, remote memory access, active messages
• Energy-awareness, I/O and file systems, fault tolerance, virtualization, job startup, introspection and analysis
Support for Modern Networking Technology (InfiniBand, iWARP, RoCE, Omni-Path) and Modern Multi-/Many-core Architectures (Intel Xeon, OpenPower, Xeon Phi (MIC, KNL*), NVIDIA GPGPU)
• Transport protocols: RC, XRC, UD, DC; modern features: UMR, ODP*, SR-IOV, multi-rail
• Transport mechanisms: shared memory, CMA, IVSHMEM; modern features: MCDRAM*, NVLink*, CAPI*
(* upcoming)
MVAPICH2 Software Family
Requirements -> MVAPICH2 library to use
• MPI with IB, iWARP, and RoCE -> MVAPICH2
• Advanced MPI, OSU INAM, PGAS, and MPI+PGAS with IB and RoCE -> MVAPICH2-X
• MPI with IB and GPU -> MVAPICH2-GDR
• MPI with IB and MIC -> MVAPICH2-MIC
• HPC cloud with MPI and IB -> MVAPICH2-Virt
• Energy-aware MPI with IB, iWARP, and RoCE -> MVAPICH2-EA
MVAPICH2 2.2rc1
• Released on 03/30/2016
• Major features and enhancements
  – Based on MPICH-3.1.4
  – Support for OpenPower architecture
    • Optimized inter-node and intra-node communication
  – Support for Intel Omni-Path architecture
    • Thanks to Intel for contributing the patch
    • Introduction of a new PSM2 channel for Omni-Path
  – Support for RoCEv2
  – Architecture detection for the PSC Bridges system with Omni-Path
  – Enhanced startup performance and reduced memory footprint for storing InfiniBand end-point information with SLURM
    • Support for shared-memory-based PMI operations
    • Availability of an updated patch from the MVAPICH project website with this support for SLURM installations
  – Optimized point-to-point and collective tuning for the Chameleon InfiniBand systems at TACC/UoC
  – Affinity enabled by default for the TrueScale (PSM) and Omni-Path (PSM2) channels
  – Enhanced tuning for shared-memory-based MPI_Bcast
  – Enhanced debugging support and error messages
  – Update to hwloc version 1.11.2
MVAPICH2-X 2.2rc1
• Released on 03/30/2016
• Introducing UPC++ support
  – Based on Berkeley UPC++ v0.1
  – Introduces UPC++-level support for a new scatter collective operation (upcxx_scatter)
  – Optimized UPC++ collectives (improved performance for upcxx_reduce, upcxx_bcast, upcxx_gather, upcxx_allgather, upcxx_alltoall)
• MPI features
  – Based on MVAPICH2 2.2rc1 (OFA-IB-CH3 interface)
  – Support for OpenPower, Intel Omni-Path, and RoCE v2
• UPC features
  – Based on GASNet v1.26
  – Support for OpenPower, Intel Omni-Path, and RoCE v2
• OpenSHMEM features
  – Support for OpenPower and RoCE v2
• CAF features
  – Support for RoCE v2
• Hybrid programming features
  – Introduces support for hybrid MPI+UPC++ applications
  – Support for the OpenPower architecture for hybrid MPI+UPC and MPI+OpenSHMEM applications
• Unified runtime features
  – Based on MVAPICH2 2.2rc1 (OFA-IB-CH3 interface). All runtime features enabled by default in the OFA-IB-CH3 and OFA-IB-RoCE interfaces of MVAPICH2 2.2rc1 are available in MVAPICH2-X 2.2rc1
  – Introduces support for the UPC++ and MPI+UPC++ programming models
• Support for the OSU InfiniBand Network Analysis and Management (OSU INAM) tool v0.9
  – Capability to profile and report the process-to-node communication matrix for MPI processes at user-specified granularity, in conjunction with OSU INAM
  – Capability to classify data flowing over a network link at job-level and process-level granularity, in conjunction with OSU INAM
Overview of A Few Challenges Being Addressed by the MVAPICH2 Project for Exascale
• Scalability for million to billion processors
  – Support for highly efficient inter-node and intra-node communication (both two-sided and one-sided RMA)
  – Support for advanced IB mechanisms (UMR and ODP)
  – Extremely minimal memory footprint
• Collective communication
• Unified runtime for hybrid MPI+PGAS programming (MPI + OpenSHMEM, MPI + UPC, CAF, ...)
• Integrated support for GPGPUs
• Integrated support for MICs
• Energy-awareness
• InfiniBand network analysis and monitoring (INAM)
Latency & Bandwidth: MPI over IB with MVAPICH2
Figures: OSU micro-benchmark results across TrueScale-QDR, ConnectX-3 FDR, ConnectIB dual-FDR, and ConnectX-4 EDR adapters (2.8 GHz deca-core IvyBridge/Haswell hosts, PCIe Gen3). Small-message latencies range from 0.95 us to 1.26 us, and unidirectional bandwidths range from 3,387 MB/s to 12,465 MB/s. A simplified sketch of the underlying ping-pong measurement follows.
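These latency numbers come from the OSU micro-benchmarks; the measurement idea behind the latency test is the classic ping-pong loop. A simplified sketch (not the actual osu_latency source), which assumes the job is launched with at least two ranks:

#include <mpi.h>
#include <stdio.h>
#include <string.h>

#define ITERS 1000
#define SIZE  8            /* small-message size in bytes */

int main(int argc, char **argv)
{
    char buf[SIZE];
    memset(buf, 0, SIZE);

    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank == 0) {            /* ping */
            MPI_Send(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, SIZE, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {     /* pong */
            MPI_Recv(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, SIZE, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double t1 = MPI_Wtime();

    if (rank == 0)   /* one-way latency is half the average round-trip time */
        printf("latency: %.2f us\n", (t1 - t0) * 1e6 / (2.0 * ITERS));

    MPI_Finalize();
    return 0;
}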
MVAPICH2 Two-Sided Intra-Node Performance (Shared Memory and Kernel-Based Zero-Copy Support (LiMIC and CMA))
Figures: with the latest MVAPICH2 2.2rc1 on an Intel Ivy Bridge node, small-message latency is 0.18 us intra-socket and 0.45 us inter-socket; peak bandwidth reaches about 14,250 MB/s intra-socket and 13,749 MB/s inter-socket across the CMA, shared-memory, and LiMIC paths.
User-mode Memory Registration (UMR)
• Introduced by Mellanox to support direct local and remote noncontiguous memory access
  – Avoid packing at sender and unpacking at receiver
• Available since MVAPICH2-X 2.2b
Platform: Connect-IB (54 Gbps), 2.8 GHz dual ten-core (IvyBridge) Intel nodes, PCIe Gen3, Mellanox IB FDR switch.
Figures: MPI datatype latency with UMR vs. the default scheme for small/medium (4 KB-1 MB) and large (2 MB-16 MB) messages; a datatype example follows this slide.
M. Li, H. Subramoni, K. Hamidouche, X. Lu and D. K. Panda, High Performance MPI Datatype Support with User-mode Memory Registration: Challenges, Designs and Benefits, IEEE CLUSTER 2015.
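UMR targets exactly the kind of non-contiguous transfer that MPI derived datatypes describe: with UMR the HCA can gather and scatter the strided elements directly instead of the library packing them into a contiguous staging buffer. A minimal example of such a datatype (purely illustrative) sends one column of a row-major matrix:

#include <mpi.h>

#define ROWS 1024
#define COLS 1024

int main(int argc, char **argv)
{
    static double matrix[ROWS][COLS];   /* row-major storage */
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* One column: ROWS blocks of 1 double, separated by a stride of
       COLS doubles, so the data is non-contiguous in memory. */
    MPI_Datatype column;
    MPI_Type_vector(ROWS, 1, COLS, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)
        MPI_Send(&matrix[0][0], 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(&matrix[0][0], 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    MPI_Finalize();
    return 0;
}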
On-Demand Paging (ODP)
• Introduced by Mellanox to support direct remote memory access without pinning
• Memory regions are paged in/out dynamically by the HCA/OS
• The size of registered buffers can be larger than physical memory
• Will be available in a future MVAPICH2 release
Platform: Connect-IB (54 Gbps), 2.6 GHz dual octa-core (SandyBridge) Intel nodes, PCIe Gen3, Mellanox IB FDR switch.
Figures: Graph500 pin-down buffer sizes (MB) and BFS kernel execution time (s) with pin-down vs. ODP registration at 16, 32, and 64 processes. A verbs-level registration sketch follows.
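At the verbs level, ODP is requested with an extra access flag at registration time, so the region does not have to be pinned up front. The sketch below shows an explicit ODP registration (illustrative only; inside MVAPICH2 this would be handled transparently, and IBV_ACCESS_ON_DEMAND requires an ODP-capable HCA and OFED/rdma-core stack):

#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int num = 0;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA devices found\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    size_t len = 1UL << 30;            /* 1 GiB of virtual memory */
    void *buf = malloc(len);
    if (!buf) return 1;

    /* ODP registration: pages are not pinned up front; the HCA faults them
       in on demand, so the registered region can exceed what could be pinned. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_READ |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_ON_DEMAND);
    if (!mr)
        perror("ibv_reg_mr (ODP may not be supported on this HCA/stack)");
    else
        ibv_dereg_mr(mr);

    free(buf);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}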
Minimizing Memory Footprint with the Dynamically Connected (DC) Transport
Figure: four nodes (Node 0 through Node 3, two processes each, P0-P7) connected through the IB network, illustrating DC endpoints.
• Constant connection cost (One QP for any peer)
• Full Feature Set (RDMA, Atomics etc)
• Separate objects for send (DC Initiator) and receive (DC Target)
  – DC Target identified by "DCT Number"
  – Messages routed with (DCT Number, LID)
  – Requires the same "DC Key" to enable communication
• Available since MVAPICH2-X 2.2a
Figures: normalized execution time for NAMD (apoa1, large data set) at 160-620 processes and connection-memory footprint (KB) for Alltoall at 80-640 processes, comparing the RC, DC-Pool, UD, and XRC transports; RC connection memory grows from roughly 10 KB to 97 KB with process count while DC-Pool stays nearly constant.
H. Subramoni, K. Hamidouche, A. Venkatesh, S. Chakraborty and D. K. Panda, Designing MPI Library with Dynamic Connected Transport (DCT) of InfiniBand : Early Experiences. IEEE International Supercomputing Conference (ISC ’14)
Challenges Overview (revisited) - Next: Collective Communication (Offload and Non-Blocking)
Co-Design with MPI-3 Non-Blocking Collectives and Collective Offload with CORE-Direct Hardware (Available since MVAPICH2-X 2.2a)
• Modified HPL with Offload-Bcast performs up to 4.5% better than the default version (512 processes)
• Modified P3DFFT with Offload-Alltoall performs up to 17% better than the default version (128 processes)
• Modified Pre-Conjugate Gradient (PCG) solver with Offload-Allreduce performs up to 21.8% better than the default version
Figures: HPL normalized performance (HPL-Offload, HPL-1ring, HPL-Host) vs. problem size (N) as a percentage of total memory; HPL application run time vs. data size at 512 processes; PCG-Default vs. Modified-PCG-Offload run time at 64-512 processes. A generic non-blocking collective usage sketch follows this slide.
K. Kandalla et al., High-Performance and Scalable Non-Blocking All-to-All with Collective Offload on InfiniBand Clusters: A Study with Parallel 3D FFT, ISC 2011.
K. Kandalla et al., Designing Non-blocking Broadcast with Collective Offload on InfiniBand Clusters: A Case Study with HPL, HotI 2011.
K. Kandalla et al., Designing Non-blocking Allreduce with Collective Offload on InfiniBand Clusters: A Case Study with Conjugate Gradient Solvers, IPDPS '12.
K. Kandalla, A. Buluc, H. Subramoni, K. Tomko, J. Vienne, L. Oliker, and D. K. Panda, Can Network-Offload based Non-Blocking Neighborhood MPI Collectives Improve Communication Overheads of Irregular Graph Algorithms?, IWPAPS '12.
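These offload designs build on the MPI-3 non-blocking collective interface, which lets the application keep computing while the collective progresses (with CORE-Direct, in the HCA itself). A generic usage sketch, not tied to any particular application on the slide:

#include <mpi.h>

#define N 1024

/* Placeholder for application work that does not depend on the collective result */
static void do_independent_compute(double *x, int n)
{
    for (int i = 0; i < n; i++)
        x[i] = x[i] * 0.5 + 1.0;
}

int main(int argc, char **argv)
{
    double local[N], global[N], other[N] = {0};
    MPI_Init(&argc, &argv);

    for (int i = 0; i < N; i++) local[i] = i;

    /* Start the reduction, then overlap it with independent computation */
    MPI_Request req;
    MPI_Iallreduce(local, global, N, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD, &req);

    do_independent_compute(other, N);      /* overlapped work */

    MPI_Wait(&req, MPI_STATUS_IGNORE);     /* the result is needed from here on */

    MPI_Finalize();
    return 0;
}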
Challenges Overview (revisited) - Next: Unified Runtime for Hybrid MPI+PGAS Programming
MVAPICH2-X for Advanced MPI and Hybrid MPI + PGAS Applications
Figure: MPI, OpenSHMEM, UPC, CAF, UPC++, or hybrid (MPI + PGAS) applications issue MPI, OpenSHMEM, UPC, CAF, and UPC++ calls into the unified MVAPICH2-X runtime, which runs over InfiniBand, RoCE, and iWARP.
• Unified communication runtime for MPI, UPC, OpenSHMEM, CAF, and UPC++, available with MVAPICH2-X 1.9 onwards (since 2012)
• UPC++ support is available in MVAPICH2-X 2.2rc1
• Feature highlights
  – Supports MPI(+OpenMP), OpenSHMEM, UPC, CAF, UPC++, MPI(+OpenMP) + OpenSHMEM, MPI(+OpenMP) + UPC
  – MPI-3 compliant, OpenSHMEM v1.0 standard compliant, UPC v1.2 standard compliant (with initial support for UPC 1.3), CAF 2008 standard (OpenUH), UPC++
  – Scalable inter-node and intra-node communication – point-to-point and collectives
Application-Level Performance with Graph500 and Sort
• Performance of the hybrid (MPI+OpenSHMEM) Graph500 design
  – 8,192 processes: 2.4X improvement over MPI-CSR, 7.6X improvement over MPI-Simple
  – 16,384 processes: 1.5X improvement over MPI-CSR, 13X improvement over MPI-Simple
• Performance of the hybrid (MPI+OpenSHMEM) Sort application
  – 4,096 processes, 4 TB input: MPI 2,408 s (0.16 TB/min); hybrid 1,172 s (0.36 TB/min), a 51% improvement over the MPI design
Figures: Graph500 execution time for MPI-Simple, MPI-CSC, MPI-CSR, and hybrid (MPI+OpenSHMEM) at 4K-16K processes; Sort execution time for MPI vs. hybrid from 500 GB at 512 processes to 4 TB at 4K processes.
J. Jose, S. Potluri, K. Tomko and D. K. Panda, Designing Scalable Graph500 Benchmark with Hybrid MPI+OpenSHMEM Programming Models, International Supercomputing Conference (ISC '13), June 2013.
J. Jose, K. Kandalla, M. Luo and D. K. Panda, Supporting Hybrid MPI and OpenSHMEM over InfiniBand: Design and Performance Evaluation, Int'l Conference on Parallel Processing (ICPP '12), September 2012.
J. Jose, K. Kandalla, S. Potluri, J. Zhang and D. K. Panda, Optimizing Collective Communication in OpenSHMEM, Int'l Conference on Partitioned Global Address Space Programming Models (PGAS '13), October 2013.
Challenges Overview (revisited) - Next: Integrated Support for GPGPUs
GPU-Aware (CUDA-Aware) MPI Library: MVAPICH2-GPU
At sender:   MPI_Send(s_devbuf, size, ...);
At receiver: MPI_Recv(r_devbuf, size, ...);
(the GPU data movement happens inside MVAPICH2)
• Standard MPI interfaces used for unified data movement
• Takes advantage of Unified Virtual Addressing (>= CUDA 4.0)
• Overlaps data movement from the GPU with RDMA transfers
• High performance and high productivity
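Concretely, a CUDA-aware MPI program passes device pointers straight to the standard MPI calls and lets the library choose pipelining, GPUDirect RDMA, or host staging internally. A minimal sketch (error checking omitted; assumes at least two ranks, each with a GPU):

#include <mpi.h>
#include <cuda_runtime.h>

#define N (1 << 20)

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    float *devbuf;
    cudaMalloc((void **)&devbuf, N * sizeof(float));   /* GPU device memory */

    if (rank == 0)        /* sender passes the device pointer directly */
        MPI_Send(devbuf, N, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)   /* receiver also receives straight into device memory */
        MPI_Recv(devbuf, N, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(devbuf);
    MPI_Finalize();
    return 0;
}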
CUDA-Aware MPI: MVAPICH2-GDR 1.8-2.2 Releases
• Support for MPI communication from NVIDIA GPU device memory
• High-performance RDMA-based inter-node point-to-point communication (GPU-GPU, GPU-Host, and Host-GPU)
• High-performance intra-node point-to-point communication for nodes with multiple GPUs (GPU-GPU, GPU-Host, and Host-GPU)
• Takes advantage of CUDA IPC (available since CUDA 4.1) for intra-node communication between multiple GPUs per node
• Optimized and tuned collectives for GPU device buffers
• MPI datatype support for point-to-point and collective communication from GPU device buffers
Performance of MVAPICH2-GPU with GPU-Direct RDMA (GDR)
Platform: MVAPICH2-GDR 2.2b on an Intel Ivy Bridge (E5-2680 v2) node with 20 cores, an NVIDIA Tesla K40c GPU, a Mellanox Connect-IB dual-FDR HCA, CUDA 7, and Mellanox OFED 2.4 with GPUDirect RDMA.
Figures: GPU-to-GPU inter-node latency, bandwidth, and bi-directional bandwidth for MV2-GDR 2.2b, MV2-GDR 2.0b, and MV2 without GDR. MV2-GDR 2.2b achieves 2.18 us small-message latency, roughly 10x lower latency and 11x higher small-message bandwidth than MV2 without GDR, and about 2x improvements over MV2-GDR 2.0b.
Application-Level Evaluation (HOOMD-blue)
• Platform: Wilkes (Intel Ivy Bridge + NVIDIA Tesla K20c + Mellanox Connect-IB); HOOMD-blue version 1.0.5
• GDRCOPY enabled: MV2_USE_CUDA=1 MV2_IBA_HCA=mlx5_0 MV2_IBA_EAGER_THRESHOLD=32768 MV2_VBUF_TOTAL_SIZE=32768 MV2_USE_GPUDIRECT_LOOPBACK_LIMIT=32768 MV2_USE_GPUDIRECT_GDRCOPY=1 MV2_USE_GPUDIRECT_GDRCOPY_LIMIT=16384
Figures: average time steps per second (TPS) for MV2 vs. MV2+GDR at 4-32 processes, for 64K-particle and 256K-particle runs; MV2+GDR delivers about 2X higher TPS in both cases.
Exploiting GDR for OpenSHMEM
• Introduced CUDA-aware OpenSHMEM
• GDR for small/medium message sizes; host staging for large messages to avoid PCIe bottlenecks
• The hybrid design brings the best of both
• 3.13 us Put latency for 4 B (7X improvement) and 4.7 us latency for 4 KB
• 9X improvement for Get latency of 4 B
• Will be available in a future MVAPICH2-X release
Figures: small-message shmem_put D-D and shmem_get D-D latency for the Host-Pipeline vs. GDR designs (1 B-4 KB), and GPULBM (64x64x64) weak-scaling evolution time on 8-64 GPU nodes, where the enhanced GDR design improves over Host-Pipeline by about 45%.
K. Hamidouche, A. Venkatesh, A. Awan, H. Subramoni, C. Ching and D. K. Panda, Exploiting GPUDirect RDMA in Designing High Performance OpenSHMEM for GPU Clusters, IEEE Cluster 2015. Also accepted for a special issue of the journal PARCO.
Challenges Overview (revisited) - Next: Integrated Support for MICs
MVAPICH2-MIC 2.0 Design for Clusters with IB and MIC
• Offload mode
• Intra-node communication
  – Coprocessor-only and symmetric modes
• Inter-node communication
  – Coprocessor-only and symmetric modes
• Multi-MIC node configurations
• Running on three major systems: Stampede, Blueridge (Virginia Tech), and Beacon (UTK)
MIC-Remote-MIC P2P Communication with Proxy-based Communication
Figures: MIC-to-remote-MIC latency (large messages, 8 KB-2 MB) and bandwidth (1 B-1 MB) for intra-socket and inter-socket P2P with proxy-based communication; peak bandwidths of about 5,236 MB/s and 5,594 MB/s in the two configurations.
Optimized MPI Collectives for MIC Clusters (Allgather & Alltoall)
A. Venkatesh, S. Potluri, R. Rajachandrasekar, M. Luo, K. Hamidouche and D. K. Panda - High Performance Alltoall and Allgather designs for InfiniBand MIC Clusters; IPDPS’14, May 2014
Figures: 32-node Allgather (16 hosts + 16 MICs) small-message latency, 32-node Allgather (8H + 8M) large-message latency, and 32-node Alltoall (8H + 8M) large-message latency for MV2-MIC vs. MV2-MIC-Opt, with improvements of up to 76%, 58%, and 55%; P3DFFT (32 nodes, 8H + 8M, 2K x 2K x 1K) execution time broken into communication and computation.
Challenges Overview (revisited) - Next: Energy-Awareness
Energy-Aware MVAPICH2 and OSU Energy Management Tool (OEMT)
• MVAPICH2-EA 2.1 (Energy-Aware)
  – A white-box approach
  – New energy-efficient communication protocols for point-to-point and collective operations
  – Intelligently applies the appropriate energy-saving techniques
  – Application-oblivious energy saving
• OEMT
  – A library utility to measure energy consumption for MPI applications
  – Works with all MPI runtimes
  – PRELOAD option for precompiled applications
  – Does not require root permission: a safe kernel module reads only a subset of MSRs
MVAPICH2-EA: Application-Oblivious Energy-Aware MPI (EAM)
• An energy-efficient runtime that provides energy savings without application knowledge
• Automatically and transparently uses the best energy lever
• Provides guarantees on maximum degradation, with 5-41% savings at <= 5% degradation
• Pessimistic MPI applies the energy-reduction lever to each MPI call
A. Venkatesh, A. Vishnu, K. Hamidouche, N. Tallent, D. K. Panda, D. Kerbyson, and A. Hoisie, A Case for Application-Oblivious Energy-Efficient MPI Runtime, Supercomputing '15, Nov 2015 [Best Student Paper Finalist].
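Conceptually, the "pessimistic MPI" scheme mentioned above can be expressed with the standard PMPI profiling interface, wrapping each blocking call with an energy lever. The sketch below is purely illustrative: apply_energy_lever() and release_energy_lever() are hypothetical placeholders, not MVAPICH2-EA functions, and a real lever (e.g., DVFS) needs platform support:

#include <mpi.h>

/* Hypothetical hooks standing in for a real lever such as DVFS or core idling */
static void apply_energy_lever(void)   { /* e.g., lower CPU frequency */ }
static void release_energy_lever(void) { /* restore nominal frequency */ }

/* PMPI interposition: time spent blocking in MPI_Wait runs under the lever */
int MPI_Wait(MPI_Request *request, MPI_Status *status)
{
    apply_energy_lever();
    int rc = PMPI_Wait(request, status);   /* the real MPI implementation */
    release_energy_lever();
    return rc;
}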
Challenges Overview (revisited) - Next: InfiniBand Network Analysis and Monitoring (INAM)
Overview of OSU INAM
• A network monitoring and analysis tool that is capable of analyzing traffic on the InfiniBand network with inputs from the MPI runtime
  – http://mvapich.cse.ohio-state.edu/tools/osu-inam/
  – http://mvapich.cse.ohio-state.edu/userguide/osu-inam/
• Monitors IB clusters in real time by querying various subnet management entities and gathering input from the MPI runtimes
• Capability to analyze and profile node-level, job-level, and process-level activities for MPI communication (point-to-point, collectives, and RMA)
• Ability to filter data based on the type of counters using a "drop down" list
• Remotely monitor various metrics of MPI processes at user-specified granularity
• "Job Page" to display jobs in ascending/descending order of various performance metrics in conjunction with MVAPICH2-X
• Visualize the data transfer happening in a "live" or "historical" fashion for the entire network, a job, or a set of nodes
OSU INAM – Network Level View
• Show network topology of large clusters
• Visualize traffic pattern on different links
• Quickly identify congested links/links in error state
• See the history unfold – play back historical state of the network
Figures: Full Network (152 nodes); Zoomed-in View of the Network.
OSU INAM – Job and Node Level Views
Figures: Visualizing a Job (5 Nodes); Finding Routes Between Nodes.
• Job-level view
  – Show different network metrics (load, error, etc.) for any live job
  – Play back historical data for completed jobs to identify bottlenecks
• Node-level view provides details per process or per node
  – CPU utilization for each rank/node
  – Bytes sent/received for MPI operations (pt-to-pt, collective, RMA)
  – Network metrics (e.g., XmitDiscard, RcvError) per rank/node
MVAPICH2 - Plans for Exascale
• Performance and memory scalability toward 1M cores
• Hybrid programming (MPI + OpenSHMEM, MPI + UPC, MPI + CAF, ...)
  – MPI + Task*
• Enhanced optimization for GPU support and accelerators
• Taking advantage of advanced features of Mellanox InfiniBand
  – On-Demand Paging (ODP)*
  – Switch-IB2 SHArP*
  – GID-based support*
• Enhanced communication schemes for upcoming architectures
  – Knights Landing with MCDRAM*
  – NVLink*
  – CAPI*
• Extended topology-aware collectives
• Extended energy-aware designs and virtualization support
• Extended support for the MPI Tools Interface (as in MPI 3.0)
• Extended checkpoint-restart and migration support with SCR
• Features marked * will be available in future MVAPICH2 releases
Funding Acknowledgments
Funding support by (sponsors shown as logos)
Equipment support by (vendors shown as logos)
Personnel Acknowledgments Current Students
– A. Augustine (M.S.)
– A. Awan (Ph.D.)
– S. Chakraborty (Ph.D.)
– C.-H. Chu (Ph.D.)
– N. Islam (Ph.D.)
– M. Li (Ph.D.)
Past Students – P. Balaji (Ph.D.)
– S. Bhagvat (M.S.)
– A. Bhat (M.S.)
– D. Buntinas (Ph.D.)
– L. Chai (Ph.D.)
– B. Chandrasekharan (M.S.)
– N. Dandapanthula (M.S.)
– V. Dhanraj (M.S.)
– T. Gangadharappa (M.S.)
– K. Gopalakrishnan (M.S.)
– G. Santhanaraman (Ph.D.)
– A. Singh (Ph.D.)
– J. Sridhar (M.S.)
– S. Sur (Ph.D.)
– H. Subramoni (Ph.D.)
– K. Vaidyanathan (Ph.D.)
– A. Vishnu (Ph.D.)
– J. Wu (Ph.D.)
– W. Yu (Ph.D.)
Past Research Scientist – S. Sur
Current Post-Doc – J. Lin
– D. Banerjee
Current Programmer – J. Perkins
Past Post-Docs – H. Wang
– X. Besseron
– H.-W. Jin
– M. Luo
– W. Huang (Ph.D.)
– W. Jiang (M.S.)
– J. Jose (Ph.D.)
– S. Kini (M.S.)
– M. Koop (Ph.D.)
– R. Kumar (M.S.)
– S. Krishnamoorthy (M.S.)
– K. Kandalla (Ph.D.)
– P. Lai (M.S.)
– J. Liu (Ph.D.)
– M. Luo (Ph.D.)
– A. Mamidala (Ph.D.)
– G. Marsh (M.S.)
– V. Meshram (M.S.)
– A. Moody (M.S.)
– S. Naravula (Ph.D.)
– R. Noronha (Ph.D.)
– X. Ouyang (Ph.D.)
– S. Pai (M.S.)
– S. Potluri (Ph.D.) – R. Rajachandrasekar (Ph.D.)
– K. Kulkarni (M.S.)
– M. Rahman (Ph.D.)
– D. Shankar (Ph.D.)
– A. Venkatesh (Ph.D.)
– J. Zhang (Ph.D.)
– E. Mancini
– S. Marcarelli
– J. Vienne
Current Research Scientists Current Senior Research Associate – H. Subramoni
– X. Lu
Past Programmers – D. Bureddy
- K. Hamidouche
Current Research Specialist – M. Arnold
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
The MVAPICH2 Project http://mvapich.cse.ohio-state.edu/