Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand
Dhabaleswar K. (DK) Panda, The Ohio State University
E-mail: [email protected] http://www.cse.ohio-state.edu/~panda
Presentation at GTC 2014
Current and Next Generation HPC Systems and Applications
• Growth of High Performance Computing (HPC)
  – Growth in processor performance
    • Chip density doubles every 18 months
  – Growth in commodity networking
    • Increase in speed/features + reducing cost
  – Growth in accelerators (NVIDIA GPUs)
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Ongoing work
MVAPICH2-GPU: CUDA-Aware MPI
• Before CUDA 4: Additional copies
  – Low performance and low productivity
• After CUDA 4: Host-based pipeline
  – Unified Virtual Addressing (UVA)
  – Pipeline CUDA copies with IB transfers
  – High performance and high productivity
• After CUDA 5.5: GPUDirect-RDMA support
  – GPU-to-GPU direct transfer
  – Bypass the host memory
  – Hybrid design to avoid PCIe bottlenecks
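For illustration only (not part of the slides), a minimal sketch of what CUDA-aware MPI enables: the application passes device pointers straight to MPI calls and the library internally picks the transfer path (pipelined host staging or GDR). The buffer name and message size are illustrative; run with two ranks.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 1 << 20;
    float *d_buf;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    if (rank == 0)        /* device buffer passed directly, no cudaMemcpy to host */
        MPI_Send(d_buf, n, MPI_FLOAT, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_buf, n, MPI_FLOAT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}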
(Figure: GPU and GPU memory attached through the chipset/CPU to an InfiniBand adapter within a node)
Data Movement on GPU Clusters
(Figure: two nodes, each with two CPU sockets connected by QPI, GPUs attached over PCIe, memory buffers on host and GPU, and an IB adapter)
Data movement paths:
1. Intra-GPU
2. Intra-Socket GPU-GPU
3. Inter-Socket GPU-GPU
4. Inter-Node GPU-GPU
5. Intra-Socket GPU-Host
6. Inter-Socket GPU-Host
7. Inter-Node GPU-Host
8. Inter-Node GPU-GPU with IB adapter on remote socket
and more . . .
• Connected as PCIe devices – Flexibility but Complexity
• For each path, different schemes: Shared_mem, IPC, GDR, pipeline ....
• Critical for runtimes to optimize data movement while hiding the complexity
MVAPICH2/MVAPICH2-X Software
• High-performance open-source MPI library for InfiniBand, 10GigE/iWARP and RDMA over Converged Enhanced Ethernet (RoCE)
  – MVAPICH (MPI-1), MVAPICH2 (MPI-2.2 and MPI-3.0), available since 2002
  – MVAPICH2-X (MPI + PGAS), available since 2012
  – Support for NVIDIA GPUs, available since 2011
  – Used by more than 2,150 organizations (HPC centers, industry and universities) in 72 countries
  – More than 205,000 downloads from the OSU site directly
  – Empowering many TOP500 clusters
    • 7th-ranked 519,640-core cluster (Stampede) at TACC
    • 11th-ranked 74,358-core cluster (Tsubame 2.0) at Tokyo Institute of Technology
    • 16th-ranked 96,192-core cluster (Pleiades) at NASA
    • 105th-ranked 16,896-core cluster (Keeneland) at GaTech, and many others . . .
  – Available with the software stacks of many IB, HSE and server vendors, including Linux distros (RedHat and SuSE)
  – http://mvapich.cse.ohio-state.edu
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Ongoing work
GPUDirect RDMA (GDR) with CUDA
• Hybrid design using GPUDirect RDMA
  – GPUDirect RDMA and host-based pipelining
  – Alleviates P2P bandwidth bottlenecks on SandyBridge and IvyBridge
• Support for communication using multi-rail
• Support for Mellanox Connect-IB and ConnectX VPI adapters
• Support for RoCE with Mellanox ConnectX VPI adapters
(Figure: GDR path between GPU memory and the IB adapter through the chipset, bypassing system memory, on SNB E5-2670 / IVB E5-2680V2 nodes)
• P2P bandwidth through the chipset:
  – SNB E5-2670: P2P write 5.2 GB/s, P2P read < 1.0 GB/s
  – IVB E5-2680V2: P2P write 6.4 GB/s, P2P read 3.5 GB/s
S. Potluri, K. Hamidouche, A. Venkatesh, D. Bureddy and D. K. Panda, Efficient Inter-node MPI Communication using GPUDirect RDMA for InfiniBand Clusters with NVIDIA GPUs, Int'l Conference on Parallel Processing (ICPP '13)
Performance of MVAPICH2 with GPUDirect-RDMA: Latency
GPU-GPU Internode MPI Latency
(Charts: small-message latency, 1 B-4 KB, and large-message latency, 8 KB-2 MB, in us, for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; GDR reduces small-message latency by 67%, down to 5.49 usec, and large-message latency by about 10%.)
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Performance of MVAPICH2 with GPUDirect-RDMA: Bandwidth
GPU-GPU Internode MPI Uni-Directional Bandwidth
(Charts: small-message bandwidth, 1 B-4 KB, and large-message bandwidth, 8 KB-2 MB, in MB/s, for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; GDR improves small-message bandwidth by about 5x, and large-message bandwidth reaches 9.8 GB/s with two rails.)
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Performance of MVAPICH2 with GPUDirect-RDMA: Bi-Bandwidth
GPU-GPU Internode MPI Bi-directional Bandwidth
(Charts: small-message bi-bandwidth, 1 B-4 KB, and large-message bi-bandwidth, 8 KB-2 MB, in MB/s, for 1-Rail, 2-Rail, 1-Rail-GDR and 2-Rail-GDR; GDR improves small-message bi-bandwidth by about 4.3x, and large-message bi-bandwidth by 19%, reaching 19 GB/s with two rails.)
Based on MVAPICH2-2.0b; Intel Ivy Bridge (E5-2680 v2) node with 20 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Application-level Benefits: AWP-ODC with MVAPICH2-GPU
• A widely used seismic modeling application, Gordon Bell finalist at SC 2010
• An initial version using MPI + CUDA for GPU clusters
• Takes advantage of CUDA-aware MPI; two nodes, 1 GPU/node, 64x32x32 problem size
• GPUDirect-RDMA delivers better performance with the newer architecture
Platform A: Intel Sandy Bridge + NVIDIA Tesla K20 + Mellanox ConnectX-3
Platform B: Intel Ivy Bridge + NVIDIA Tesla K40 + Mellanox Connect-IB
Based on MVAPICH2-2.0b, CUDA 5.5, Mellanox OFED 2.0 with GPUDirect-RDMA patch; two nodes, one GPU/node, one process/GPU
(Charts: total execution time for MV2 vs. MV2-GDR - 11.6% improvement on Platform A and 15.4% on Platform B; computation/communication breakdown shows communication time reduced by 21% and 30% respectively.)
Continuous Enhancements for Improved Point-to-point Performance
GPU-GPU Internode MPI Latency
(Chart: small-message latency, 1 B-16 KB, in us, for MV2-2.0b-GDR vs. the improved design - about 31% lower latency.)
• Reduced synchronization overhead while avoiding expensive copies
Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Dynamic Tuning for Point-to-point Performance
GPU-GPU Internode MPI Performance
(Charts: large-message latency, 128 KB-2 MB, in us, and bandwidth, 1 B-1 MB, in MB/s, comparing the Opt_latency, Opt_bw and Opt_dynamic configurations.)
Based on MVAPICH2-2.0b + enhancements; Intel Ivy Bridge (E5-2630 v2) node with 12 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.0 with GPUDirect-RDMA patch
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
One-sided Communication
• Send/Recv semantics incur overheads
  • Distributed buffer information
  • Message matching
  • Additional copies or rendezvous exchange
• One-sided communication
  • Separates synchronization from communication
  • Direct mapping onto RDMA semantics
  • Lower overheads and better overlap

Latency (half round trip) for 4-byte messages on SandyBridge nodes with FDR Connect-IB, in usec:
                Host-Host   GPU-GPU
IB send/recv      0.98        1.84
MPI send/recv     1.25        6.95
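For illustration only (not from the slides), a minimal sketch of MPI-3 RMA over GPU buffers with a CUDA-aware MPI such as MVAPICH2-GDR: a window is created over device memory and updated with MPI_Put followed by MPI_Win_flush, in the spirit of the "Put+Flush" scheme on the next slide. Run with two ranks; the buffer name and sizes are illustrative.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, n = 4096;
    double *d_buf;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_buf, n * sizeof(double));

    /* Window backed by device memory on every rank */
    MPI_Win_create(d_buf, n * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win);

    if (rank == 0)          /* write directly into rank 1's GPU buffer */
        MPI_Put(d_buf, n, MPI_DOUBLE, 1, 0, n, MPI_DOUBLE, win);
    MPI_Win_flush(1, win);  /* complete the Put at the target */

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    cudaFree(d_buf);
    MPI_Finalize();
    return 0;
}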
MPI-3 RMA Support with GPUDirect RDMA
MPI-3 RMA provides flexible synchronization and completion primitives
(Charts: small-message latency, 1 B-4 KB, for Send-Recv, Put+ActiveSync and Put+Flush - Put+Flush stays below 2 usec; small-message rate, in msgs/sec, for Send-Recv and Put+ActiveSync.)
Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
Communication Kernel Evaluation: 3D Stencil and Alltoall
(Charts: latency, 1 B-4 KB, of Send/Recv vs. one-sided for a 3D stencil and for AlltoAll, each on 16 GPU nodes - improvements of 42% and 50% respectively.)
Based on MVAPICH2-2.0b + extensions; Intel Sandy Bridge (E5-2670) node with 16 cores; NVIDIA Tesla K40c GPU; Mellanox Connect-IB dual-FDR HCA; CUDA 5.5; Mellanox OFED 2.1 with GPUDirect-RDMA plugin
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
Non-contiguous Data Exchange
• Multi-dimensional data
  – Row-based organization
  – Contiguous in one dimension
  – Non-contiguous in the other dimensions
• Halo data exchange
  – Duplicate the boundary
  – Exchange the boundary in each iteration
(Figure: halo data exchange between neighboring domains)
MPI Datatype Processing
• Comprehensive support
  • Targeted kernels for regular datatypes - vector, subarray, indexed_block (see the sketch after this list)
  • Generic kernels for all other irregular datatypes
• Separate non-blocking stream for kernels launched by the MPI library
  • Avoids stream conflicts with application kernels
• Flexible set of parameters for users to tune kernels
  • Vector
    • MV2_CUDA_KERNEL_VECTOR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_VECTOR_YSIZE
  • Subarray
    • MV2_CUDA_KERNEL_SUBARR_TIDBLK_SIZE
    • MV2_CUDA_KERNEL_SUBARR_XDIM
    • MV2_CUDA_KERNEL_SUBARR_YDIM
    • MV2_CUDA_KERNEL_SUBARR_ZDIM
  • Indexed_block
    • MV2_CUDA_KERNEL_IDXBLK_XDIM
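For illustration only (not from the slides), a minimal sketch of the kind of regular non-contiguous layout these kernels target: a strided column of a 2D grid resident in GPU memory, described with MPI_Type_vector and handed directly to MPI_Send, so the pack/unpack can happen in GPU kernels inside the library. Grid sizes and names are illustrative; run with two ranks.

#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv)
{
    int rank, nx = 512, ny = 512;
    double *d_grid;            /* nx x ny grid, row-major, in GPU memory */
    MPI_Datatype column;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    cudaMalloc((void **)&d_grid, nx * ny * sizeof(double));

    /* One element per row, strided by the row length: a non-contiguous column */
    MPI_Type_vector(nx, 1, ny, MPI_DOUBLE, &column);
    MPI_Type_commit(&column);

    if (rank == 0)             /* boundary column goes out as a single MPI message */
        MPI_Send(d_grid, 1, column, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1)
        MPI_Recv(d_grid, 1, column, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    MPI_Type_free(&column);
    cudaFree(d_grid);
    MPI_Finalize();
    return 0;
}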
Application-Level Evaluation (LBMGPU-3D)
• LBM-CUDA (courtesy: Carlos Rosale, TACC)
  • Lattice Boltzmann Method for multiphase flows with large density ratios
  • 3D LBM-CUDA: one process/GPU per node, 512x512x512 data grid, up to 64 nodes
• Oakley cluster at OSC: two hex-core Intel Westmere processors, two NVIDIA Tesla M2070 GPUs, one Mellanox IB QDR MT26428 adapter and 48 GB of main memory per node
(Chart: 3D LBM-CUDA total execution time in seconds on 8, 16, 32 and 64 GPUs for MPI vs. MPI-GPU - improvements of 5.6%, 8.2%, 13.5% and 15.5% respectively.)
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
More Optimizations!!!
• Topology detection
  • Avoid inter-socket QPI bottlenecks
• Dynamic threshold selection between GDR and host-based transfers
• All these and other features will be available with the next release of MVAPICH2-GDR => coming very soon
Outline
• Communication on InfiniBand Clusters with GPUs
• MVAPICH2-GPU with GPUDirect-RDMA (GDR)
  • Two-sided Communication
  • One-sided Communication
• MPI Datatype Processing
• More Optimizations
• MPI and OpenACC
• Conclusion
OpenACC
• OpenACC is gaining popularity
  • Several sessions during GTC
• A set of compiler directives (#pragma)
• Offload specific loops or parallelizable sections in code onto accelerators

  #pragma acc region
  {
      for (i = 0; i < size; i++) {
          A[i] = B[i] + C[i];
      }
  }

• Routines to allocate/free memory on accelerators

  buffer = acc_malloc(MYBUFSIZE);
  acc_free(buffer);

• Supported for C, C++ and Fortran
• Huge list of modifiers - copy, copyout, private, independent, etc.
Using MVAPICH2 with the new OpenACC 2.0
• acc_deviceptr to get the device pointer (in OpenACC 2.0)
  – Enables MPI communication from memory allocated by the compiler, once acc_deviceptr is available in OpenACC 2.0 implementations
  – MVAPICH2 will detect the device pointer and optimize communication
  – Delivers the same performance as with CUDA

  A = malloc(sizeof(int) * N);
  ......
  #pragma acc data copyin(A[0:N])
  {
      #pragma acc parallel loop
      // compute for loop

      MPI_Send(acc_deviceptr(A), N, MPI_INT, 0, 1, MPI_COMM_WORLD);
  }
  ......
  free(A);
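Note (usage detail, not spelled out on the slide): acc_deviceptr(A) returns the device address only while A is present on the device, so the MPI call has to sit inside the enclosing acc data region, as in the example above.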
How can I get started with GDR experimentation?
• MVAPICH2-2.0b with GDR support can be downloaded from https://mvapich.cse.ohio-state.edu/download/mvapich2gdr/
• System software requirements
  – Mellanox OFED 2.1
  – NVIDIA Driver 331.20 or later
  – NVIDIA CUDA Toolkit 5.5
  – Plugin for GPUDirect RDMA (http://www.mellanox.com/page/products_dyn?product_family=116)
• Has optimized designs for point-to-point communication using GDR
• Work in progress on optimizing collective and one-sided communication
• Contact the MVAPICH help list with any questions related to the package: [email protected]
• MVAPICH2-GDR-RC1 with additional optimizations coming soon!!
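As a quick sanity check (a typical-usage suggestion, not covered in the slides), the OSU micro-benchmarks bundled with the library can be run between two GPU nodes with device-resident buffers (the "D D" arguments to osu_latency/osu_bw) and MV2_USE_CUDA=1 set in the launch environment; the exact launch syntax and GDR-related runtime parameters are documented in the MVAPICH2-GDR user guide.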
Conclusions
• MVAPICH2 optimizes MPI communication on InfiniBand clusters with GPUs
• Provides optimized designs for point-to-point two-sided and one-sided communication, and datatype processing
• Takes advantage of CUDA features like IPC and GPUDirect RDMA
• Delivers
  – High performance
  – High productivity
  with support for the latest NVIDIA GPUs and InfiniBand adapters
Acknowledgments
Dr. Davide Rossetti and others @NVIDIA
Want to improve the TOP500 ranking of your heterogeneous GPU cluster?
Yes!!
Do not miss our next talk:
S4535 - Accelerating HPL on Heterogeneous Clusters with NVIDIA GPUs
Tuesday, 03/25 (today), Room LL21A, 17:00-17:25
Talk on Hybrid HPL for Heterogeneous Clusters
Thank You!
Network-Based Computing Laboratory http://nowlab.cse.ohio-state.edu/
MVAPICH Web Page http://mvapich.cse.ohio-state.edu/