Design challenges of High-performance and Scalable MPI over InfiniBand
Presented by Karthik

Source: web.cse.ohio-state.edu/~panda.2/788/slides/4d_4f_mpi_scalability.pdf

Presentation Overview

• In-depth analysis of High-Performance and Scalable MPI with Reduced Memory Usage

• Zero Copy protocol using Unreliable Datagram

• MVAPICH-Aptus : A scalable High performance Multi-Transport MPI over InfiniBand

High Performance and Scalable MPI with Reduced Memory usage

Motivation

• Does aggressively reducing communication buffer memory lead to degradation of end application performance?

• How much memory can we expect the MPI library to consume during execution of a typical application, while still providing the best available performance?

High Performance and Scalable MPI with Reduced Memory usage

IB provides several types of transport services:

• Reliable Connection (RC)
  - Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
  - Most feature-rich: supports RDMA and provides reliable service.
  - A dedicated QP must be created for each communicating peer.
• Reliable Datagram (RD)
  - Most of the same features as RC; however, a dedicated QP is not required.
  - Not implemented in current hardware.
• Unreliable Connection (UC)
  - Provides RDMA capability.
  - No guarantees on ordering or reliability.
  - A dedicated QP must be created for each communicating peer.
• Unreliable Datagram (UD)
  - Connection-less: a single QP can communicate with any other peer QP.
  - Limited message size.
  - No guarantees on ordering or reliability.
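The trade-offs listed above can be summarized as a small lookup table. The following is a minimal sketch, not part of the original slides: the `Transport` class, the `TRANSPORTS` table, and `qps_needed` are illustrative names, and RD is left out because the slide notes that it is not implemented in hardware.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transport:
    """Properties of an InfiniBand transport, as listed on the slide."""
    name: str
    reliable: bool
    connection_oriented: bool  # True: one dedicated QP per communicating peer
    rdma: bool


# Illustrative table built only from the bullet points above.
TRANSPORTS = {
    "RC": Transport("RC", reliable=True, connection_oriented=True, rdma=True),
    "UC": Transport("UC", reliable=False, connection_oriented=True, rdma=True),
    "UD": Transport("UD", reliable=False, connection_oriented=False, rdma=False),
}


def qps_needed(transport: Transport, peers: int) -> int:
    """QPs one process needs to reach `peers` peers on this transport."""
    return peers if transport.connection_oriented else 1


for t in TRANSPORTS.values():
    print(t.name, "QPs for 1024 peers:", qps_needed(t, 1024))
```

The point of the sketch is the last function: connection-oriented transports need one QP per peer, while UD needs a single QP regardless of job size.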

Upper level software service

High Performance and Scalable MPI with Reduced Memory usage

Shared Receive Queue (SRQ)
- Allows multiple QPs to be attached to one receive queue (even for connection-oriented transports).
- This approach is memory efficient.

High Performance and Scalable MPI with Reduced Memory usage

Remote Direct Memory Access (RDMA)
- An application can directly access the memory of the remote process.
- RDMA has very low latency.

High Performance and Scalable MPI with Reduced Memory usage

MVAPICH Design Overview

MVAPICH uses two major protocols:

1. Eager Protocol
  - Used to transfer small messages.
  - Messages are buffered inside the MPI library.
  - “Pre-allocated” communication buffers are required on both the sender and the receiver side.
2. Rendezvous Protocol
  - Used to transfer large messages.
  - Messages are sent directly to the receiver’s user memory.
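The split between the two protocols comes down to a message-size threshold. Below is a minimal sketch of that decision; the threshold value and the function name `choose_protocol` are assumptions made for illustration and are not taken from MVAPICH.

```python
# Hypothetical cut-over point; real MPI libraries tune this per platform.
EAGER_THRESHOLD = 8 * 1024  # bytes


def choose_protocol(nbytes: int, threshold: int = EAGER_THRESHOLD) -> str:
    """Pick the protocol for a message of `nbytes` bytes.

    Small messages go eagerly through pre-allocated library buffers;
    large messages use rendezvous and land directly in user memory.
    """
    return "eager" if nbytes <= threshold else "rendezvous"


for size in (64, 8 * 1024, 1 << 20):
    print(size, "->", choose_protocol(size))
```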

High Performance and Scalable MPI with Reduced Memory usage

1. Adaptive RDMA with Send/Receive (ARDMA-SR)
- This channel is adaptive in order to avoid a memory-scalability problem as the number of nodes increases.
- A limited number of buffers is allocated initially.
- Once a threshold number of messages has been exchanged, subsequent messages are transferred using RDMA.
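The adaptive behaviour can be sketched as a per-peer message counter. This is an illustrative model only: the class name, the default threshold, and the assumption that the count is kept per peer are not stated on the slide.

```python
class AdaptiveChannel:
    """Toy model: start on send/receive, promote busy peers to RDMA."""

    def __init__(self, threshold: int = 16):
        self.threshold = threshold  # hypothetical promotion threshold
        self.counts = {}            # messages exchanged so far, per peer
        self.rdma_peers = set()     # peers that now have RDMA buffers

    def send(self, peer: int) -> str:
        """Return the path used for the next message to `peer`."""
        if peer in self.rdma_peers:
            return "rdma"
        self.counts[peer] = self.counts.get(peer, 0) + 1
        if self.counts[peer] >= self.threshold:
            # Only now is the memory for RDMA buffers spent on this peer.
            self.rdma_peers.add(peer)
        return "send/recv"


channel = AdaptiveChannel(threshold=3)
print([channel.send(peer=7) for _ in range(5)])
```

Peers that rarely communicate never reach the threshold, so no RDMA buffers are allocated for them.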

High Performance and Scalable MPI with Reduced Memory usage

2. Adaptive RDMA with SRQ Channel (ARDMA-SRQ)
- The idea is based on ARDMA-SR; the only difference is that a Shared Receive Queue (SRQ) is used.
- Drawback: the sender does not know whether receive buffers are available at the receiver.
- Solution: set a “low-watermark” on the SRQ, so that buffers are replenished before the queue runs dry.
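The low-watermark idea can be modelled with a counter of posted receive buffers. The sketch below is a simplification under stated assumptions: the class, its parameters, and the synchronous refill are hypothetical, whereas real hardware signals the limit through an asynchronous event.

```python
class SharedReceiveQueue:
    """Toy SRQ: refill the posted buffers when they hit a low watermark."""

    def __init__(self, capacity: int = 8, low_watermark: int = 3):
        self.capacity = capacity
        self.low_watermark = low_watermark
        self.posted = capacity  # receive buffers currently posted
        self.refills = 0

    def on_receive(self) -> None:
        """Consume one posted buffer for an incoming message."""
        if self.posted == 0:
            raise RuntimeError("no receive buffer posted for incoming message")
        self.posted -= 1
        if self.posted <= self.low_watermark:
            self._refill()

    def _refill(self) -> None:
        """Post buffers back up to capacity."""
        self.posted = self.capacity
        self.refills += 1


srq = SharedReceiveQueue(capacity=8, low_watermark=3)
for _ in range(5):
    srq.on_receive()
print(srq.posted, srq.refills)
```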

High Performance and Scalable MPI with Reduced Memory usage

3. Shared Receive Queue
- This channel exclusively uses the SRQ feature.
- It follows the same “low-watermark” technique as ARDMA-SRQ.
- Although RDMA channels have low latency, they consume more memory.

High Performance and Scalable MPI with Reduced Memory usage

NAS Benchmark

High Performance and Scalable MPI with Reduced Memory usage

High Performance Linpack
- Benchmark that solves a system of linear equations.
- It is the primary measure used to rank the biannual Top 500 list of the world’s fastest supercomputers.

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Motivation

1. Performance Scalability
  - Memory copies are detrimental to the overall performance of the application.
  - The HCA cache can hold only a limited number of QPs.
2. Resource Scalability
  - With a connection-oriented transport, memory requirements increase linearly with the number of connected processes.

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Traditional Zero-Copy

1. Matched Queues Interface
  - The receiver deciphers the message tag from the sent message and matches it against the posted receive operations.
2. Rendezvous Protocol using RDMA
  - A handshake protocol is used initially, followed by RDMA.
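The matching step in the first approach can be sketched in software as two queues: posted receives and unexpected arrivals. This is an illustrative model of tag matching in general, not the interface of any particular network; the class and the `ANY` wildcard are assumed names.

```python
from collections import deque

ANY = -1  # wildcard, standing in for MPI_ANY_SOURCE / MPI_ANY_TAG


def _matches(wanted: int, actual: int) -> bool:
    return wanted == ANY or wanted == actual


class MatchedQueues:
    """Toy tag matcher with a posted-receive and an unexpected queue."""

    def __init__(self):
        self.posted = deque()      # (source, tag) of receives not yet matched
        self.unexpected = deque()  # (source, tag, payload) that arrived early

    def post_recv(self, source: int, tag: int):
        """Post a receive; return a payload if one already arrived."""
        for i, (src, t, payload) in enumerate(self.unexpected):
            if _matches(source, src) and _matches(tag, t):
                del self.unexpected[i]
                return payload
        self.posted.append((source, tag))
        return None

    def on_arrival(self, source: int, tag: int, payload: bytes) -> bool:
        """Match an incoming message; True if a posted receive was found."""
        for i, (want_src, want_tag) in enumerate(self.posted):
            if _matches(want_src, source) and _matches(want_tag, tag):
                del self.posted[i]
                return True  # data can go straight to the user buffer
        self.unexpected.append((source, tag, payload))
        return False


queues = MatchedQueues()
queues.post_recv(source=0, tag=42)
print(queues.on_arrival(source=0, tag=42, payload=b"data"))
```

A match on arrival is the zero-copy case: the destination buffer is already known, so no intermediate copy is needed.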

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

UD vs RC memory usage for 16k connections:
- UD = 40 MB / process
- RC = 240 MB / process
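As a rough back-of-the-envelope reading of these figures, the gap can be expressed per connection. The sketch assumes that "16k" means 16 × 1024 connections and that 1 MB is 1024 KiB; neither is stated on the slide.

```python
CONNECTIONS = 16 * 1024  # assumption: "16k" = 16 * 1024
UD_MB = 40               # per-process figure from the slide
RC_MB = 240              # per-process figure from the slide

# How many times more memory RC uses than UD at this scale.
ratio = RC_MB / UD_MB

# Extra memory RC spends per connection, in KiB (1 MB taken as 1024 KiB).
extra_kib_per_connection = (RC_MB - UD_MB) * 1024 / CONNECTIONS

print(f"RC uses {ratio:.0f}x the memory of UD")
print(f"about {extra_kib_per_connection:.1f} KiB extra per connection")
```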

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Challenges for true zero copy design

• Limited MTU Size
  - The UD transport has a Maximum Transfer Unit (MTU) limit of 2 KB.
  - Segmentation is required.
• Lack of dedicated Receive Buffers
  - It is difficult to post receive buffers for a particular peer, as they are all shared.
  - If no buffer is posted to a QP, a message sent to it is silently dropped.
• Lack of Reliability
  - There is no guarantee that a message will arrive at the receiver.
• Lack of Ordering
  - Messages may not arrive in the order in which they were sent.
• Lack of RDMA
  - RDMA works only on connection-oriented transports.
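The first challenge, segmentation, can be illustrated in a few lines. This is a minimal sketch that treats the whole 2 KB MTU as payload and ignores any per-packet headers; the function name is hypothetical.

```python
MTU = 2048  # bytes, the UD limit quoted above


def segment(payload: bytes, mtu: int = MTU) -> list:
    """Split a message into MTU-sized segments; the last may be shorter."""
    return [payload[i:i + mtu] for i in range(0, len(payload), mtu)]


message = b"x" * 5000
segments = segment(message)
print([len(s) for s in segments])
```

Every segment is an independent datagram, which is why the remaining challenges (loss, reordering, shared buffers) apply to each piece of a large message.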

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Proposed Design

- The design is based on serialized communication, since RDMA is not specified for the UD transport.
- Serialized means that the order of transfers is agreed beforehand and only one sender transmits to a given QP at a time.

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Solutions to design challenges

1. Efficient Segmentation
  - The design requests a completion signal only for the last packet.
  - The underlying reliability layer marks packets as missing at the receiver’s end, and the sender is notified.
2. Zero Copy Pool
  - A pool of QPs is maintained.
  - When a message transfer is initiated, a QP is taken from the pool and the application receive buffer is posted to it.
3. Optimized Reliability and Ordering for Large Messages
  - One approach is to perform a checksum over the entire receive buffer.
  - Each operation can specify a 32-bit immediate field that is made available to the receiver as part of the completion entry.
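One way the 32-bit immediate field could support loss and ordering checks is to carry a sequence number in it. The sketch below assumes a hypothetical layout (16-bit message id, 16-bit segment number) chosen for illustration; the slides do not specify how the field is used.

```python
def pack_imm(msg_id: int, seq: int) -> int:
    """Pack a message id and segment number into one 32-bit value."""
    return ((msg_id & 0xFFFF) << 16) | (seq & 0xFFFF)


def unpack_imm(imm: int) -> tuple:
    """Inverse of pack_imm: return (msg_id, seq)."""
    return (imm >> 16) & 0xFFFF, imm & 0xFFFF


def arrived_intact(imms: list, msg_id: int, expected_segments: int) -> bool:
    """True if segments 0..n-1 of msg_id arrived in order with none lost.

    With zero copy, packets land in the posted user buffer in arrival
    order, so any gap or swap means the buffer contents are wrong.
    """
    if len(imms) != expected_segments:
        return False
    return all(unpack_imm(imm) == (msg_id, i) for i, imm in enumerate(imms))


in_order = [pack_imm(5, i) for i in range(4)]
print(arrived_intact(in_order, msg_id=5, expected_segments=4))
```

Compared with a checksum over the whole buffer, a check like this only inspects completion entries, so its cost does not grow with the message size in bytes.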

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Experimental Evaluation

Ping Pong Latency

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Uni-Directional Bandwidth

Zero-Copy Protocol for MPI using InfiniBand Unreliable Datagram

Bi-Directional Bandwidth

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Motivation

This paper seeks to address two main questions:

1. What are the different protocols developed for MPI over IB? How well do they perform at scale?
2. Given this knowledge, can the MPI library be designed to dynamically select protocols to optimize for performance and scalability?

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

IB provides several types of transport services:

• Reliable Connection (RC)
  - Used as the primary transport for MVAPICH and other MPIs over InfiniBand.
  - Most feature-rich: supports RDMA and provides reliable service.
  - A dedicated QP must be created for each communicating peer.
• Reliable Datagram (RD)
  - Most of the same features as RC; however, a dedicated QP is not required.
  - Not implemented in current hardware.
• Unreliable Connection (UC)
  - Provides RDMA capability.
  - No guarantees on ordering or reliability.
  - A dedicated QP must be created for each communicating peer.
• Unreliable Datagram (UD)
  - Connection-less: a single QP can communicate with any other peer QP.
  - Limited message size.
  - No guarantees on ordering or reliability.

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Message Channel

Eager Protocol Channel

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Message Channel

Rendezvous Protocol Channel

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Performance : Eager Latency

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Performance : Uni-Directional Bandwidth

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Scalability Test : Memory Usage

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Evaluation

Scalability Test : Latency

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Characteristics Summary

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Overview of Design

• As seen from the experimental results, using only one channel is not sufficient to achieve both performance and scalability.
• The solution is to use a combination of message channels and transports to optimize for performance as well as scalability.

Design Challenges

1. When should a channel be created?
2. When should a channel be used?

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Allocation

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Channel Usage

• The experimental results show that the channels behave differently for different message sizes.
• A flexible form of send rule is defined for choosing the channel when sending a message.
• Using this flexible framework, send rules can be changed at a per-system or per-job level to meet application needs without changing the code within the MPI library.
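The idea of send rules that can be changed without touching library code can be sketched as an ordered rule list parsed from a configuration string. Everything concrete here is an assumption for illustration: the `max_bytes:channel` syntax, the channel names, and both function names are hypothetical and are not the actual MVAPICH-Aptus rule format.

```python
import math


def parse_rules(spec: str) -> list:
    """Parse 'max_bytes:channel,...' into ordered (limit, channel) rules.

    A limit of '*' means the rule applies to any message size.
    """
    rules = []
    for item in spec.split(","):
        limit, channel = item.split(":")
        rules.append((math.inf if limit == "*" else int(limit), channel))
    return rules


def select_channel(rules: list, nbytes: int) -> str:
    """Return the channel of the first rule whose size limit fits."""
    for limit, channel in rules:
        if nbytes <= limit:
            return channel
    raise ValueError("no send rule matches this message size")


# Hypothetical per-job configuration string.
rules = parse_rules("2048:eager-ud,16384:eager-rc,*:rendezvous-rc")
for size in (100, 4096, 1 << 20):
    print(size, "->", select_channel(rules, size))
```

Because the rules are data rather than code, a different job can supply a different string and get a different channel selection from the same library build.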

MVAPICH-Aptus : Scalable High-Performance Multi-Transport MPI over InfiniBand

Performance Evaluation

QUESTIONS ?