
Implementing Efficient and Scalable Flow Control Schemes in MPI over InfiniBand

Jiuxing Liu and Dhabaleswar K. Panda

Computer Science and Engineering, The Ohio State University

Presentation Outline

• Introduction and overview
• Flow control design alternatives
• Performance evaluation
• Conclusion

Introduction

• InfiniBand is becoming popular for high performance computing

• Flow control is an important issue in implementing MPI over InfiniBand
  – Performance
  – Scalability

InfiniBand Communication Model

• Different transport services
  – Focus on Reliable Connection (RC) in this paper
• Queue-pair based communication model
• Communication requests posted to send or receive queues
• Communication memory must be registered
• Completion detected through Completion Queue (CQ)
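To make this model concrete, here is a minimal sketch in C that registers a buffer, posts a receive work request to a queue pair, and polls the completion queue. It uses the libibverbs API purely for illustration; the original implementation targeted the verbs interface available on Mellanox hardware at the time, so the specific calls, the helper name, and the busy-wait loop are assumptions, not the authors' code.

```c
#include <stddef.h>
#include <stdint.h>
#include <infiniband/verbs.h>

/* Post one receive buffer on a connected queue pair and wait for the
 * corresponding completion.  pd, qp, and cq are assumed to have been
 * created and connected elsewhere. */
int post_recv_and_wait(struct ibv_pd *pd, struct ibv_qp *qp,
                       struct ibv_cq *cq, void *buf, size_t len)
{
    /* Communication memory must be registered before the HCA may access it. */
    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len, IBV_ACCESS_LOCAL_WRITE);
    if (!mr)
        return -1;

    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = (uint32_t)len,
        .lkey   = mr->lkey,
    };
    struct ibv_recv_wr wr = { .wr_id = 1, .sg_list = &sge, .num_sge = 1 };
    struct ibv_recv_wr *bad_wr;

    /* Communication requests are posted to the receive (or send) queue... */
    if (ibv_post_recv(qp, &wr, &bad_wr))
        return -1;

    /* ...and completion is detected through the Completion Queue (CQ). */
    struct ibv_wc wc;
    while (ibv_poll_cq(cq, 1, &wc) == 0)
        ;                               /* busy-poll for one completion */
    return (wc.status == IBV_WC_SUCCESS) ? 0 : -1;
}
```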

InfiniBand Send/Recv and RDMA Operations

[Figure: Send/Recv model vs. RDMA model. In the Send/Recv model, the sender posts a send descriptor and the receiver a receive descriptor; data flows from the user-level buffer through a registered buffer and the HCA, across the IB fabric, and into the receiver's registered and user-level buffers. In the RDMA model, only the sender posts a descriptor, and data is placed directly into a registered user-level buffer at the receiver.]

• For Send/Recv, a send operation must be matched with a pre-posted receive (which specifies a receive buffer)
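By contrast, an RDMA Write places data directly into remote memory without consuming a pre-posted receive at the target. The sketch below, again using libibverbs for illustration, posts such a work request; it assumes the remote virtual address and rkey were exchanged earlier (for example during a Rendezvous handshake), and the function name is hypothetical.

```c
#include <stdint.h>
#include <infiniband/verbs.h>

/* Write a registered local buffer directly into remote memory.
 * remote_addr/rkey describe the target's registered buffer and must have
 * been exchanged out of band; no receive descriptor is consumed there. */
int rdma_write(struct ibv_qp *qp, struct ibv_mr *mr, void *buf, uint32_t len,
               uint64_t remote_addr, uint32_t rkey)
{
    struct ibv_sge sge = {
        .addr   = (uintptr_t)buf,
        .length = len,
        .lkey   = mr->lkey,
    };
    struct ibv_send_wr wr = {
        .wr_id      = 2,
        .sg_list    = &sge,
        .num_sge    = 1,
        .opcode     = IBV_WR_RDMA_WRITE,
        .send_flags = IBV_SEND_SIGNALED,   /* ask for a CQ entry when done */
    };
    wr.wr.rdma.remote_addr = remote_addr;
    wr.wr.rdma.rkey        = rkey;

    struct ibv_send_wr *bad_wr;
    return ibv_post_send(qp, &wr, &bad_wr);
}
```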

End-to-End Flow Control Mechanism in InfiniBand

• Implemented at the hardware level
• When there is no receive buffer posted for an incoming send packet
  – Receiver sends an RNR NAK
  – Sender retries
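The RNR (Receiver Not Ready) behavior is governed by two queue-pair attributes set when the connection is brought up. The sketch below shows only these flow-control-related fields as they appear in libibverbs; the encoded values (12 and 7) and the helper function are illustrative assumptions, and in practice the fields are applied together with the rest of the RTR/RTS transition attributes.

```c
#include <infiniband/verbs.h>

/* Illustrative RNR-related attributes.  In a real setup these fields are
 * filled in alongside the other INIT->RTR and RTR->RTS attributes; only
 * the pieces relevant to hardware flow control are shown here. */
void fill_rnr_attributes(struct ibv_qp_attr *rtr_attr,
                         struct ibv_qp_attr *rts_attr,
                         int *rtr_mask, int *rts_mask)
{
    /* Responder side (RTR): minimum RNR NAK timer advertised to the
     * sender, i.e. how long it should back off before retrying. */
    rtr_attr->min_rnr_timer = 12;     /* encoded value, roughly 0.64 ms */
    *rtr_mask |= IBV_QP_MIN_RNR_TIMER;

    /* Requester side (RTS): how many times the HCA retries after an
     * RNR NAK; the value 7 means "retry indefinitely". */
    rts_attr->rnr_retry = 7;
    *rts_mask |= IBV_QP_RNR_RETRY;

    /* Later: ibv_modify_qp(qp, rtr_attr, *rtr_mask) for the RTR
     * transition, then ibv_modify_qp(qp, rts_attr, *rts_mask) for RTS. */
}
```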

MPI Communication Protocols

[Figure: Eager protocol – the sender transfers Eager Data directly to the receiver. Rendezvous protocol – the sender issues Rendezvous Start, the receiver answers with Rendezvous Reply, the sender transfers Rendezvous Data, and a Rendezvous Finish message completes the exchange.]

Expected and Unexpected Messages in MPI Protocols

• Expected Messages:
  – Rendezvous Reply
  – Rendezvous Data
  – Rendezvous Finish
• Unexpected Messages:
  – Eager Data
  – Rendezvous Start
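The eager/rendezvous split is usually decided by a size threshold on the sender. The simplified sketch below illustrates that decision; EAGER_THRESHOLD, send_eager, send_rndv_start, and mpi_send_path are all hypothetical names, not identifiers from the authors' MPI implementation.

```c
#include <stdio.h>
#include <stddef.h>

#define EAGER_THRESHOLD (8 * 1024)   /* assumed switch-over point */

/* Hypothetical transport hooks, stubbed out for illustration. */
static void send_eager(int dest, const void *buf, size_t len)
{ printf("eager send of %zu bytes to %d\n", len, dest); (void)buf; }

static void send_rndv_start(int dest, const void *buf, size_t len)
{ printf("rendezvous start for %zu bytes to %d\n", len, dest); (void)buf; }

/* Simplified sender-side protocol selection. */
void mpi_send_path(int dest, const void *buf, size_t len)
{
    if (len <= EAGER_THRESHOLD) {
        /* Eager Data arrives unexpectedly and consumes a pre-posted
         * receive buffer at the destination -- the resource that the
         * flow control schemes in this talk regulate. */
        send_eager(dest, buf, len);
    } else {
        /* Only Rendezvous Start is unexpected; Reply, Data, and Finish
         * are expected messages. */
        send_rndv_start(dest, buf, len);
    }
}
```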

Why Flow Control is Necessary

• Unexpected messages need resources (CPU time, buffer space, etc.)

• MPI itself does not limit the number of unexpected messages
  – The receiver may not be able to keep up
  – Resources may not be enough

• Flow control (in the MPI implementation) is needed to avoid the above problems

InfiniBand Operations used for Protocol Messages

• Send/Recv operations are used for the Eager protocol and for control messages in the Rendezvous protocol
  – RDMA could also be exploited (not used in this paper)
• RDMA is used for Rendezvous Data
• Unexpected messages come from Send/Recv

Presentation Outline

• Introduction and overview
• Flow control design alternatives
• Performance evaluation
• Conclusion

Flow Control Design Outline

• Common issues
• Hardware based
• User-level static
• User-level dynamic

Flow Control Design Objectives

• Need to be effective
  – Preventing the receiver from being overwhelmed
• Need to be efficient
  – Very little run-time overhead
  – No unnecessary stall of communication
  – Efficient buffer usage
    • How many buffers for each connection

Classification of Flow Control Schemes

• Hardware-based vs. user-level
  – Hardware-based schemes exploit InfiniBand end-to-end flow control
  – User-level schemes implement flow control in the MPI implementation
• Static vs. dynamic
  – Static schemes use a fixed number of buffers for each connection
  – Dynamic schemes can adjust the number of buffers during execution

Hardware-Based Flow Control

• No flow control in MPI
• Rely on InfiniBand end-to-end flow control
• Implemented entirely in hardware and transparent to MPI

Advantages and Disadvantages of the Hardware-Based Scheme

• Advantages
  – Almost no run-time overhead at the MPI layer during normal communication
  – The flow control mechanism makes progress independently of the application
• Disadvantages
  – Very little flexibility
  – The hardware flow control scheme may not be the best for all communication patterns
  – Separation of buffer management and flow control
    • No information is given to MPI to adjust its behavior
    • Difficult to implement dynamic schemes

User-Level Static Schemes

• Flow control handled in MPI implementation

• Fixed number of buffers for each connection

• Credit-based scheme (a sketch follows below)
  – Piggybacking
  – Explicit credit messages
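A minimal sketch of how such a credit-based static scheme could be organized, assuming a fixed pool of NUM_CREDITS pre-posted buffers per connection. Every type and helper here (conn_t, transport_send, transport_send_credit_msg, the threshold) is hypothetical; the sketch only illustrates the piggybacking and explicit-credit ideas named on the slide.

```c
#include <stdbool.h>

#define NUM_CREDITS       32                 /* assumed buffers per connection */
#define CREDIT_THRESHOLD  (NUM_CREDITS / 2)  /* when to send an explicit credit msg */

/* Hypothetical per-connection flow control state. */
typedef struct {
    int send_credits;       /* peer buffers we may still consume             */
    int credits_to_return;  /* local buffers freed but not yet reported back */
} conn_t;

/* Stubbed transport hooks (illustration only). */
static bool transport_send(conn_t *c, const void *p, int len, int piggyback)
{ (void)c; (void)p; (void)len; (void)piggyback; return true; }
static void transport_send_credit_msg(conn_t *c, int credits)
{ (void)c; (void)credits; }

/* Sender: each eager/control message consumes one credit and piggybacks
 * any credits we owe the peer in its header. */
bool send_with_credits(conn_t *c, const void *payload, int len)
{
    if (c->send_credits == 0)
        return false;                      /* caller must wait or queue */
    int piggyback = c->credits_to_return;  /* credits returned for free */
    c->credits_to_return = 0;
    c->send_credits--;
    return transport_send(c, payload, len, piggyback);
}

/* Receiver: once a pre-posted buffer has been consumed and re-posted,
 * one credit becomes available to the peer again. */
void on_buffer_freed(conn_t *c)
{
    c->credits_to_return++;
    /* If no return traffic is available to piggyback on, fall back to an
     * explicit credit message before the sender stalls. */
    if (c->credits_to_return >= CREDIT_THRESHOLD) {
        transport_send_credit_msg(c, c->credits_to_return);
        c->credits_to_return = 0;
    }
}

/* On arrival of any message, piggybacked credits replenish our quota. */
void on_message_received(conn_t *c, int piggybacked_credits)
{
    c->send_credits += piggybacked_credits;
}
```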

Problems of the User-Level Static Scheme

• More overhead at the MPI layer (for credit management)

• Flow control progress depends on application

• Buffer usage is not optimal, which may result in:
  – Wasted buffers for some connections
  – Unnecessary communication stalls for other connections

User-Level Dynamic Schemes

• Similar to the user-level static schemes

• Start with only a few buffers for each connection

• Use a feedback based control mechanism to adjust the number of buffers based on communication pattern

User-Level Dynamic Scheme Design Issues

• How to provide information feedback
  – When there are no credits, the sender puts the message into a “backlog”
  – The message is sent when more credits become available
  – Messages are tagged to indicate whether they have gone through the backlog
• How to respond to feedback (a sketch follows below)
  – Increase the number of buffers for the connection if the receiver gets a message that has gone through the backlog
  – Linear or exponential increase
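Building on the credit sketch above, a hypothetical version of the user-level dynamic scheme is outlined below. The backlog queue, the went_through_backlog tag, MAX_BUFFERS, and the exponential-increase policy are illustrative assumptions about one way to realize the feedback loop described on the slide, not the authors' implementation.

```c
#include <stdbool.h>
#include <stddef.h>

#define MAX_BUFFERS 64   /* assumed upper bound per connection */

/* Hypothetical message descriptor queued while no credits are available. */
typedef struct msg {
    struct msg *next;
    bool        went_through_backlog;   /* tag carried in the message header */
} msg_t;

typedef struct {
    int    send_credits;
    msg_t *backlog_head, *backlog_tail; /* FIFO of deferred messages    */
    int    num_buffers;                 /* receiver-side buffer count   */
} dyn_conn_t;

/* Sender: if no credits are left, defer the message instead of failing. */
void dyn_send(dyn_conn_t *c, msg_t *m)
{
    if (c->send_credits == 0) {
        m->went_through_backlog = true;       /* feedback for the receiver */
        m->next = NULL;
        if (c->backlog_tail) c->backlog_tail->next = m;
        else                 c->backlog_head = m;
        c->backlog_tail = m;
        return;
    }
    c->send_credits--;
    /* transport_send(c, m); -- as in the previous sketch */
}

/* Sender: drain the backlog when piggybacked or explicit credits arrive. */
void dyn_credits_arrived(dyn_conn_t *c, int credits)
{
    c->send_credits += credits;
    while (c->send_credits > 0 && c->backlog_head) {
        msg_t *m = c->backlog_head;
        c->backlog_head = m->next;
        if (!c->backlog_head) c->backlog_tail = NULL;
        c->send_credits--;
        /* transport_send(c, m); */
    }
}

/* Receiver: grow the buffer pool when a message reports that it was
 * delayed in the sender's backlog (exponential increase shown here;
 * a linear policy would add a constant instead). */
void dyn_on_receive(dyn_conn_t *c, const msg_t *m)
{
    if (m->went_through_backlog && c->num_buffers < MAX_BUFFERS) {
        int new_count = c->num_buffers * 2;
        if (new_count > MAX_BUFFERS) new_count = MAX_BUFFERS;
        /* post_additional_recv_buffers(c, new_count - c->num_buffers); */
        c->num_buffers = new_count;
    }
}
```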

Presentation Outline

• Introduction and overview
• Flow control design alternatives
• Performance evaluation
• Conclusion

Experimental Testbed

• 8 SuperMicro SUPER P4DL6 nodes (2.4 GHz Xeon, 400MHz FSB, 512K L2 cache)

• Mellanox InfiniHost MT23108 4X HCAs (A1 silicon), PCI-X 64-bit 133 MHz

• Mellanox InfiniScale MT43132 switch

Outline of Experiments

• Microbenchmarks
  – Latency
  – Bandwidth
    • MPI blocking and non-blocking functions
    • Small and large messages
• NAS Benchmarks
  – Running time
  – Communication characteristics related to flow control

MPI Latency

[Figure: MPI latency in microseconds (0 to 16) vs. message size (4 to 1024 bytes) for the Hardware Level, User Level Static, and User Level Dynamic schemes.]

In latency tests, flow control is usually not an issue because communication is symmetric.

All schemes perform the same, which means that the user-level overhead is very small in this case.

MPI Bandwidth (Prepost = 100 and Size = 4 Bytes)

[Figure: two panels, Blocking MPI Calls and Non-Blocking MPI Calls, showing bandwidth in MB/s (0 to 1.2) vs. window size (2 to 50) for the Hardware Level, User Level Static, and User Level Dynamic schemes.]

With enough buffers, all schemes perform comparably for small messages.

Blocking and non-blocking MPI calls perform comparably for small messages because messages are copied and sent eagerly.

MPI Bandwidth (Prepost = 10 and Size = 4 Bytes)

[Figure: two panels, Blocking MPI Calls and Non-Blocking MPI Calls, showing bandwidth in MB/s (0 to 1.2) vs. window size (2 to 50) for the Hardware Level, User Level Static, and User Level Dynamic schemes.]

Buffers are not sufficient, which triggers the flow control mechanisms. User-level dynamic performs the best; user-level static performs the worst.

MPI Bandwidth (Prepost = 10 and Size = 32 KB)

[Figure: two panels, Blocking MPI Calls (roughly 380 to 460 MB/s) and Non-Blocking MPI Calls (roughly 450 to 700 MB/s), showing bandwidth vs. window size (2 to 50) for the Hardware Level, User Level Static, and User Level Dynamic schemes.]

Large messages use the Rendezvous protocol, which has two-way traffic.

All schemes perform comparably. Non-blocking calls give better performance.

NAS Benchmarks (Pre-post = 100)

[Figure: NAS benchmark running time in seconds (log scale, 0.1 to 100) for IS, FT, LU, CG, MG, SP, and BT under the Hardware-Level, User-Level Static, and User-Level Dynamic schemes.]

All schemes perform comparably when given enough buffers.

NAS Benchmarks (Pre-post = 1)

[Figure: performance degradation for IS, FT, LU, CG, MG, SP, and BT under the Hardware-Based, User-Level Static, and User-Level Dynamic schemes; most bars are at or near 0%, with LU showing the largest degradation (up to 171%).]

Even with very few buffers, most applications still perform well. LU performs significantly worse with the hardware-based and user-level static schemes. Overall, user-level dynamic gives the best performance.

Explicit Credit Messages for User-Level Static Schemes

[Table: number of explicit credit messages (#ECM) and total number of messages (#Total Msg) for each NAS benchmark (IS, FT, LU, CG, MG, BT, SP).]

Piggybacking is quite effective. In LU, the number of explicit credit messages is high.

Maximum Number of Buffers for User-Level Dynamic Schemes

App    #Buffer
IS     4
FT     4
LU     63
CG     3
MG     6
BT     7
SP     7

Almost all applications only need a few (less than 8) buffers per connection for optimal performance

LU requires more buffers

Conclusions

• Three different flow control schemes for MPI over InfiniBand
• Evaluation in terms of overhead and buffer efficiency
• Many applications (like those in NAS) require a small number of buffers for each connection
• The User-Level Dynamic Scheme can achieve both good performance and buffer efficiency

Future Work

• More application-level evaluation
• Evaluate using larger-scale systems
• Integrate the schemes with our RDMA-based design for small messages

• Exploit the recently proposed Shared Receive Queue (SRQ) feature

Web Pointers

http://www.cis.ohio-state.edu/~panda/

NBC home page: http://nowlab.cis.ohio-state.edu/

MVAPICH home page: http://nowlab.cis.ohio-state.edu/projects/mpi-iba/