Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected] http://www.cse.ohio-state.edu/~panda
Presentation at GTC 2014
Current and Next Generation HPC Systems and Applications
• Growth of High Performance Computing (HPC)
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features + reducing cost
  – Growth in accelerators (NVIDIA GPUs)
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Ongoing work
MVAPICH2-GPU: CUDA-Aware MPI
• Before CUDA 4: Additional copies
  – Low performance and low productivity
• After CUDA 4: Host-based pipeline
  – Unified Virtual Addressing (UVA)
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks
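For illustration only (not part of the slides), a minimal sketch of what CUDA-aware MPI enables: the application passes device pointers straight to MPI calls and the library internally picks the transfer path (pipelined host staging or GDR). The buffer name and message size are illustrative; run with two ranks.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;
    float *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0)        /* device buffer passed directly, no cudaMemcpy to host */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}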
(Figure: GPU and GPU memory attached through the chipset/CPU to an InfiniBand adapter within a node)
Data Movement on GPU Clusters
(Figure: two nodes, each with two CPU sockets connected by QPI, GPUs attached over PCIe, memory buffers on host and GPU, and an IB adapter)
Data movement paths:
1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
and more . . .
• Connected as PCIe devices – Flexibility but Complexity
• For each path, different schemes: Shared_mem, IPC, GDR, pipeline ....
• Critical for runtimes to optimize data movement while hiding the complexity
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for NVIDIA GPUs, available since 2011
  – Used by more than 2,150 organizations (HPC centers, industry and universities) in 72 countries
  – More than 205,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 7th-ranked 519,640-core cluster (Stampede) at TACC
    • 11th-ranked 74,358-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 16th-ranked 96,192-core cluster (Pleiades) at NASA
    • 105th-ranked 16,896-core cluster (Keeneland) at GaTech, and many others . . .
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Ongoing work
GPUDirect RDMA (GDR) with CUDA
• Hybrid design using GPUDirect RDMA
  – GPUDirect RDMA and host-based pipelining
  – Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX VPI adapters
(Figure: GDR path between GPU memory and the IB adapter through the chipset, bypassing system memory, on SNB E5-2670 / IVB E5-2680V2 nodes)
• P2P bandwidth through the chipset:
  – SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
  – IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s
S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13)
Performance of MVAPICH2 with GPUDirect-RDMA: Latency
GPU-GPU Internode MPI Latency
(Charts: small-message latency, 1 B-4 KB, and large-message latency, 8 KB-2 MB, in us, for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; GDR reduces small-message latency by 67%, down to 5.49 usec, and large-message latency by about 10%.)
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth
GPU-GPU Internode MPI Uni-Directional Bandwidth
(Charts: small-message bandwidth, 1 B-4 KB, and large-message bandwidth, 8 KB-2 MB, in MB/s, for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; GDR improves small-message bandwidth by about 5x, and large-message bandwidth reaches 9.8 GB/s with two rails.)
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth
GPU-GPU Internode MPI Bi-directional Bandwidth
(Charts: small-message bi-bandwidth, 1 B-4 KB, and large-message bi-bandwidth, 8 KB-2 MB, in MB/s, for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; GDR improves small-message bi-bandwidth by about 4.3x, and large-message bi-bandwidth by 19%, reaching 19 GB/s with two rails.)
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Application-level Benefits: AWP-ODC with MVAPICH2-GPU
• A widely used seismic modeling application, Gordon Bell finalist at SC 2010
• An initial version using MPI + CUDA for GPU clusters
• Takes advantage of CUDA-aware MPI; two nodes, 1 GPU/node, 64x32x32 problem size
• GPUDirect-RDMA delivers better performance with the newer architecture
Platform A: Intel Sandy Bridge + NVIDIA Tesla K20 + Mellanox ConnectX-3
Platform B: Intel Ivy Bridge + NVIDIA Tesla K40 + Mellanox Connect-IB
Based on MVAPICH2-2.0b, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA patch; two nodes, one GPU/node, one process/GPU
(Charts: total execution time for MV2 vs. MV2-GDR - 11.6% improvement on Platform A and 15.4% on Platform B; computation/communication breakdown shows communication time reduced by 21% and 30% respectively.)
Continuous Enhancements for Improved Point-to-point Performance
GPU-GPU Internode MPI Latency
(Chart: small-message latency, 1 B-16 KB, in us, for MV2-2.0b-GDR vs. the improved design - about 31% lower latency.)
• Reduced synchronization overhead while avoiding expensive copies
Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Dynamic Tuning for Point-to-point Performance
GPU-GPU Internode MPI Performance
(Charts: large-message latency, 128 KB-2 MB, in us, and bandwidth, 1 B-1 MB, in MB/s, comparing the Opt_latency, Opt_bw and Opt_dynamic configurations.)
Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
One-sided Communication
• Send/Recv semantics incur overheads
  • Distributed buffer information
  • Message matching
  • Additional copies or rendezvous exchange
• One-sided communication
  • Separates synchronization from communication
  • Direct mapping onto RDMA semantics
  • Lower overheads and better overlap

Latency (half round trip) for 4-byte messages on SandyBridge nodes with FDR Connect-IB, in usec:
                Host-Host   GPU-GPU
IB send/recv      0.98        1.84
MPI send/recv     1.25        6.95
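For illustration only (not from the slides), a minimal sketch of MPI-3 RMA over GPU buffers with a CUDA-aware MPI such as MVAPICH2-GDR: a window is created over device memory and updated with MPI_Put followed by MPI_Win_flush, in the spirit of the "Put+Flush" scheme on the next slide. Run with two ranks; the buffer name and sizes are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 4096;
    double *d_buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    /* Window backed by device memory on every rank */
    MPI_Win_create(d_buf, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win);

    if (rank == 0)          /* write directly into rank 1's GPU buffer */
        MPI_Put(d_buf, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
    MPI_Win_flush(1, win);  /* complete the Put at the target */

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}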
MPI-3 RMA Support with GPUDirect RDMA
MPI-3 RMA provides flexible synchronization and completion primitives
(Charts: small-message latency, 1 B-4 KB, for Send-Recv, Put+ActiveSync and Put+Flush - Put+Flush stays below 2 usec; small-message rate, in msgs/sec, for Send-Recv and Put+ActiveSync.)
Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
Communication Kernel Evaluation: 3D Stencil and Alltoall
(Charts: latency, 1 B-4 KB, of Send/Recv vs. one-sided for a 3D stencil and for AlltoAll, each on 16 GPU nodes - improvements of 42% and 50% respectively.)
Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
Non-contiguous Data Exchange
• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration
(Figure: halo data exchange between neighboring domains)
MPI Datatype Processing
• Comprehensive support
  • Targeted kernels for regular datatypes - vector, subarray, indexed_block (see the sketch after this list)
  • Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  • Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
  • Vector
    • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_VECTOR_YSIZE
  • Subarray
    • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_SUBARR_XDIM
    • MV2_CUDA_KERNEL_SUBARR_YDIM
    • MV2_CUDA_KERNEL_SUBARR_ZDIM
  • Indexed_block
    • MV2_CUDA_KERNEL_IDXBLK_XDIM
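For illustration only (not from the slides), a minimal sketch of the kind of regular non-contiguous layout these kernels target: a strided column of a 2D grid resident in GPU memory, described with MPI_Type_vector and handed directly to MPI_Send, so the pack/unpack can happen in GPU kernels inside the library. Grid sizes and names are illustrative; run with two ranks.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, nx = 512, ny = 512;
    double *d_grid;            /* nx x ny grid, row-major, in GPU memory */
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_grid, nx * ny * sizeof(double));

    /* One element per row, strided by the row length: a non-contiguous column */
    MPI_Type_vector(nx, 1, ny, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)             /* boundary column goes out as a single MPI message */
        MPI_Send(d_grid, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_grid, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_grid);
    MPI_Finalize();
    return 0;
}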
Application-Level Evaluation (LBMGPU-3D)
• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  • Lattice Boltzmann Method for multiphase flows with large density ratios
  • 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node
(Chart: 3D LBM-CUDA total execution time in seconds on 8, 16, 32 and 64 GPUs for MPI vs. MPI-GPU - improvements of 5.6%, 8.2%, 13.5% and 15.5% respectively.)
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
More Optimizations!!!
• Topology detection
  • Avoid inter-socket QPI bottlenecks
• Dynamic threshold selection between GDR and host-based transfers
• All these and other features will be available with the next release of MVAPICH2-GDR => coming very soon
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
OpenACC
• OpenACC is gaining popularity
  • Several sessions during GTC
• A set of compiler directives (#pragma)
• Offload specific loops or parallelizable sections in code onto accelerators

  #pragma acc region
  {
      for (i = 0; i < size; i++) {
          A[i] = B[i] + C[i];
      }
  }

• Routines to allocate/free memory on accelerators

  buffer = acc_malloc(MYBUFSIZE);
  acc_free(buffer);

• Supported for C, C++ and Fortran
• Huge list of modifiers - copy, copyout, private, independent, etc.
Using MVAPICH2 with the new OpenACC 2.0
• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from memory allocated by the compiler, once acc_deviceptr is available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Delivers the same performance as with CUDA

  A = malloc(sizeof(int) * N);
  ......
  #pragma acc data copyin(A[0:N])
  {
      #pragma acc parallel loop
      // compute for loop

      MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
  }
  ......
  free(A);
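Note (usage detail, not spelled out on the slide): acc_deviceptr(A) returns the device address only while A is present on the device, so the MPI call has to sit inside the enclosing acc data region, as in the example above.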
How can I get started with GDR experimentation?
• MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  – Mellanox OFED 2.1
  – NVIDIA Driver 331.20 or later
  – NVIDIA CUDA Toolkit 5.5
  – Plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116)
• Has optimized designs for point-to-point communication using GDR
• Work in progress on optimizing collective and one-sided communication
• Contact the MVAPICH help list with any questions related to the package: [email protected]
• MVAPICH2-GDR-RC1 with additional optimizations coming soon!!
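As a quick sanity check (a typical-usage suggestion, not covered in the slides), the OSU micro-benchmarks bundled with the library can be run between two GPU nodes with device-resident buffers (the "D D" arguments to osu_latency/osu_bw) and MV2_USE_CUDA=1 set in the launch environment; the exact launch syntax and GDR-related runtime parameters are documented in the MVAPICH2-GDR user guide.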
Conclusions
• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Provides optimized designs for point-to-point two-sided and one-sided communication, and datatype processing
• Takes advantage of CUDA features like IPC and GPUDirect RDMA
• Delivers
  – High performance
  – High productivity
  with support for the latest NVIDIA GPUs and InfiniBand adapters
Acknowledgments
Dr. Davide Rossetti and others @NVIDIA
Want to improve the TOP500 ranking of your heterogeneous GPU cluster?
Yes!!
Do not miss our next talk:
S4535 - Accelerating HPL on Heterogeneous Clusters with NVIDIA GPUs
Tuesday, 03/25 (today), Room LL21A, 17:00-17:25
Talk on Hybrid HPL for Heterogeneous Clusters
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page http://mvapich.cse.ohio-state.edu/