Efficient and Scalable Communication Middleware for Emerging Dense-GPU Clusters
Ching-Hsiang Chu
Advisor: Dhabaleswar K. Panda
Network Based Computing Lab, Department of Computer Science and Engineering
The Ohio State University, Columbus, OH
SC 19 Doctoral Showcase, Network Based Computing Laboratory
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Broader Impact on the HPC Community
• Expected Contributions
Trends in Modern HPC Architecture: Heterogeneous
• Multi-core/many-core technologies
• High-performance interconnects: InfiniBand, Omni-Path, EFA (<1 µs latency, 100 Gbps+ bandwidth)
• High-performance storage and compute devices: node-local SSD, NVMe-SSD, NVRAM
• Accelerators/coprocessors: high compute density, high performance per watt
• Variety of programming models (MPI, PGAS, MPI+X)
Top500 examples of dense-GPU systems: #1 Summit (27,648 GPUs), #2 Sierra (17,280 GPUs), #8 ABCI (4,352 GPUs), #10 Lassen (2,664 GPUs), #22 DGX SuperPOD (1,536 GPUs).
Trends in Modern Large-scale Dense-GPU Systems
• Scale-up (up to 150 GB/s)
  – PCIe, NVLink/NVSwitch
  – Infinity Fabric, Gen-Z, CXL
• Scale-out (up to 25 GB/s)
  – InfiniBand, Omni-Path, Ethernet
  – Cray Slingshot
GPU-enabled HPC Applications
• Various scientific applications have been ported to GPUs
  – Reportedly significant speedups compared to the CPU versions
  – High-resolution/high-precision results
• Examples: Lattice Quantum Chromodynamics (23x faster than CPU; image: Derek Leinweber, CSSM, University of Adelaide), weather simulation (2.8x faster than CPU), wave-propagation simulation (25x faster than CPU)
Fuhrer O, Osuna C, Lapillonne X, Gysi T, Bianco M, Schulthess T. "Towards GPU-accelerated operational weather forecasting." GTC 2013.
https://geodynamics.org/cig/software/specfem3d_globe/
GPU-enabled Emerging Deep Learning Applications
• Easy-to-use and high-performance frameworks
• Wide range of applications
  – Image classification
  – Speech recognition
  – Self-driving cars
  – Healthcare
  – Climate analytics
• 999 PetaFlop/s sustained and 1.13 ExaFlop/s peak FP16 performance over 4,560 nodes (27,360 GPUs)
Kurth T, Treichler S, Romero J, Mudigonda M, Luehr N, Phillips E, Mahesh A, Matheson M, Deslippe J, Fatica M, Houston M. "Exascale deep learning for climate analytics." SC 2018 (Gordon Bell Prize).
GPU-Aware (CUDA-Aware) Communication Middleware
• MPI-based generic communication middleware (e.g., inside MVAPICH2)
  – Supports and optimizes various communication patterns
  – Overlaps data movement from the GPU with RDMA transfers
• DL-specific communication middleware
  – Ring-based collective operations
  – Optimized for DL workloads on GPU systems
Broad Challenge
Can we design a generic GPU-enabled communication middleware to fully exploit GPU resources and interconnects for traditional HPC and emerging ML/DL applications?
[Radar chart comparing Ideal, Naive-user, and Advanced-user designs along four axes: Overlap, Productivity, Performance, and Resource Utilization; farther from the center is better.]
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Broader Impact on the HPC Community
• Expected Contributions
Problem Statements
• What kind of hardware capabilities can be leveraged to fully exploit the modern interconnects deployed in GPU clusters?
• Which communication patterns can benefit from the proposed GPU-enabled communication middleware?
• How can GPU resources such as high-bandwidth memory (HBM) and the massive number of streaming multiprocessors (SMs) be leveraged to accelerate communication?
• What are the design considerations for a GPU-enabled communication middleware to efficiently utilize these hardware features?
• What performance benefits can be expected from the proposed GPU-enabled communication middleware?
• How can traditional HPC and DL applications take advantage of the proposed communication middleware without application-level modifications?
Research Framework
• Applications: traditional HPC applications (MILC, COSMO, AWP-ODC, HOOMD-blue), machine/deep learning (TensorFlow, CNTK), benchmarks (OMB), streaming
• Programming models: Message Passing Interface (MPI)
• GPU-enabled communication middleware:
  – Scalable streaming broadcast
  – GPU-enabled cooperative and link-efficient reduction operations
  – Efficient non-contiguous data processing: scheduling and transfer
• Compute models: CUDA, OpenACC
• Hardware capabilities: hardware multicast, inter-GPU direct load/store, GPUDirect RDMA, GDRCOPY
• Modern HPC hardware: interconnects (InfiniBand, PCIe, NVLink/NVSwitch), multi-core processors (Xeon, OpenPOWER), accelerator (GPU)
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
  – Scalable Streaming Broadcast for InfiniBand Networks
  – Efficient Scheduling of Non-contiguous Data Transfer
  – GPU-enabled Zero-copy Transfer for Non-contiguous Data
  – GPU-enabled Reduction Operations
• Broader Impact on the HPC Community
• Expected Contributions
Motivated Example #1 – Need for Scalable Broadcast
• Streaming & deep learning applications
  – Large-scale broadcast operations
  – High computation-communication overlap
  – No application-level modification
[Figure: a data source feeds online/offline streams to a data distributor / parameter server, which performs data-streaming-like broadcast operations to workers (CPU + GPUs) spread across GPU nodes 1-4.]
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019.
Proposed Efficient and Scalable Solution
• Streaming data through the host
  – Fine-tuned chunked data
  – Three-stage pipeline: 1. data preparation, 2. IB gather, 3. IB hardware multicast with IB scatter (GDR write)
• Leverages IB scatter-gather and GDR features
  – Frees up PCIe resources for applications
[Figure: the source calls MPI_Bcast(d_out,…); data is staged from GPU memory into an intermediate host buffer (im_buf), gathered by the IB HCA, multicast through the IB switch, and scattered (GDR write) directly into the destination GPU buffers (d_in) of destinations 1..N, which call MPI_Bcast(d_in,…).]
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019.
Performance Evaluation
• Evaluated on RI2 GPU nodes
[Figure 1: streaming-workload throughput (GB/s), peak and streaming, on 4, 8, and 16 GPU nodes, comparing Knomial-GDR, Ring-GDR-Pipeline, and the proposed TA-Zcpy-MCAST-GDR-Pipeline; up to 6.9X improvement.]
[Figure 2: CA-CNTK* image-classification speedup on 8 and 16 GPU nodes for AlexNet, VGG, and ResNet-50; up to 1.2X.]
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019.
*D. S. Banerjee, K. Hamidouche and D. K. Panda, "Re-Designing CNTK Deep Learning Framework on Modern GPU Enabled Clusters," CloudCom 2016.
Performance Model Validation and Prediction
• Based on the architecture of the RI2 cluster
[Figure 1: measured vs. model-estimated latency (s) for 2-16 broadcast sources: K-nomial-based, Ring-based, and MCAST-GDR-Opt; the model is within 10% error of the experiments.]
[Figure 2: model-based prediction of latency (s) for 2-2,048 broadcast sources for the same three schemes.]
Model parameters: M = 2 MB; C = 512 KB; U = 4 KB; B_H ≈ 100 Gbps; B_PCIe = 8 Gbps; t_o(n) ≈ (1/α)·ln(n), 15 ≤ α ≤ 20.
Ching-Hsiang Chu, et al., "Exploiting Hardware Multicast and GPUDirect RDMA for Efficient Broadcast," IEEE TPDS, vol. 30, no. 3, pp. 575-588, 1 March 2019.
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
  – Scalable Streaming Broadcast for InfiniBand Networks
  – Efficient Scheduling of Non-contiguous Data Transfer
  – GPU-enabled Zero-copy Transfer for Non-contiguous Data
  – GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
Motivated Example #2 – Non-contiguous Data Transfer
• Wide use of MPI derived datatypes for non-contiguous data transfer
  – Requires low-latency, high-overlap processing
• Quantum chromodynamics: MILC with QUDA
• Weather simulation: COSMO model
Mike Clark. "GPU Computing with QUDA," Developer Technology Group, https://www.olcf.ornl.gov/wp-content/uploads/2013/02/Clark_M_LQCD.pdf
M. Martinasso, G. Kwasniewski, S. R. Alam, Thomas C. Schulthess, and T. Hoefler. "A PCIe congestion-aware performance model for densely populated accelerator servers." SC 2016.
Existing GPU-enabled MPI Datatype Processing
• Common scenario: several non-blocking sends of non-contiguous MPI datatypes followed by a wait:

MPI_Isend(A, ..., datatype, ...);
MPI_Isend(B, ..., datatype, ...);
MPI_Isend(C, ..., datatype, ...);
MPI_Isend(D, ..., datatype, ...);
...
MPI_Waitall(...);

(A, B, ... contain non-contiguous MPI datatypes)
• Existing design: for each Isend, the CPU initiates a packing kernel on a GPU stream and then busy-waits for the kernel (WFK) before starting the send; the requests are processed one after another, wasting computing resources on both CPU and GPU.
• Proposed design: initiate the packing kernels for all pending requests back to back, then progress each send as soon as its kernel completes; packing and sends overlap across requests, so the whole sequence finishes earlier.
Ching-Hsiang Chu et al., "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IEEE IPDPS 2016.
Proposed Event-based Design – Low Latency
• Each MPI_Isend() launches its packing kernel (pack_kernelN<<< >>>) on a dedicated CUDA stream and records an event on that stream with cudaEventRecord().
• MPI_Waitall() queries the events to make progress: once a kernel's event has completed, the corresponding send is posted to the HCA and the request completes.
Proposed Callback-based Design – High Overlap
• Each MPI_Isend() launches its packing kernel on a dedicated CUDA stream and enqueues a callback (addCallback) behind it.
• When a kernel finishes, the callback fires on a helper thread, which posts the send to the HCA and completes the request; the main thread stays free for CPU computations until MPI_Waitall().
Application-level (COSMO HaloExchange) Evaluation

MPI_Isend(Buf1, ..., request1);
MPI_Isend(Buf2, ..., request2);
MPI_Wait(request1, status1);
MPI_Wait(request2, status2);

[Figure: normalized execution time vs. number of GPUs for the Default, Callback-based, and Event-based designs on the Wilkes GPU cluster (4-32 GPUs) and the CSCS GPU cluster (16-96 GPUs); the proposed designs improve performance by up to 2X and 1.6X on the two clusters.]
Ching-Hsiang Chu et al., "Exploiting Maximal Overlap for Non-Contiguous Data Movement Processing on Modern GPU-enabled Systems," IEEE IPDPS 2016.
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
  – Scalable Streaming Broadcast for InfiniBand Networks
  – Efficient Scheduling of Non-contiguous Data Transfer
  – GPU-enabled Zero-copy Transfer for Non-contiguous Data
  – GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
Proposed Zero-copy (Packing-free) Datatype Transfer
• Exploiting the load-store capability of modern interconnects
  – Eliminates extra data copies and expensive packing/unpacking processing
• Existing packing scheme: data is packed in source GPU memory, copied over PCIe/NVLink to system/HCA memory, and copied again over PCIe/NVLink into destination GPU memory.
• Proposed packing-free scheme: the non-contiguous data is moved by direct load-store between source and destination GPU memory over PCIe/NVLink.
Ching-Hsiang Chu et al., “High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems”, HiPC 2019.
Performance Evaluation
• Zero-copy (packing-free) transfers for GPUs with peer-to-peer direct access over PCIe/NVLink
[Figure 1: GPU-based DDTBench mimicking the MILC communication kernel on an NVIDIA DGX-2: speedup of the proposed design over OpenMPI 4.0.0 and MVAPICH2-GDR 2.3.1 for problem sizes [6,8,8,8,8] through [6,16,16,16,16]; improved up to 15X.]
[Figure 2: communication kernel of the COSMO model (https://github.com/cosunae/HaloExchangeBenchmarks) on a Cray CS-Storm with 16-64 GPUs: execution time (s) vs. MVAPICH2-GDR 2.3.1; improved up to 3.4X.]
Ching-Hsiang Chu et al., “High-Performance Adaptive MPI Derived Datatype Communication for Modern Multi-GPU Systems”, HiPC 2019.
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
  – Scalable Streaming Broadcast for InfiniBand Networks
  – Efficient Scheduling of Non-contiguous Data Transfer
  – GPU-enabled Zero-copy Transfer for Non-contiguous Data
  – GPU-enabled Reduction Operations
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
Motivated Example #3 – Reduction Operations for DL Training
• Can GPU resources help improve compute-intensive communication?
  – E.g., MPI_Reduce, MPI_Allreduce, MPI_Scan
• Emerging distributed deep learning training
  – Exchange and update weights
  – Requires fast and high-bandwidth solutions
https://www.oreilly.com/ideas/distributed-tensorflow
Ben-Nun T, Hoefler T. "Demystifying parallel and distributed deep learning: An in-depth concurrency analysis." arXiv preprint arXiv:1802.09941, 2018.
How to Leverage GPUs for MPI Reduction Operations?
• Existing designs:
  1. Explicitly copy the data from GPU to host memory (expensive!)
  2. Host-to-host communication to remote processes
  3. Perform the computation on the CPU (good for small data, relatively slow for large data)
  4. Explicitly copy the data from host to GPU memory (expensive!)
• Proposed designs:
  1. GPU-to-GPU communication (fast)
     – NVIDIA GPUDirect RDMA (GDR)
     – Pipelined through the host for large messages
  2. Perform the computation on the GPU
     – Efficient CUDA kernels
Ching-Hsiang Chu et al., "CUDA Kernel based Collective Reduction Operations on Large-scale GPU Clusters, " IEEE/ACM CCGrid 2016
Alternative and Extended Designs

Communication         | Computation | Design            | Algorithm             | Benefit
Host<->Host           | CPU         | BR-H-HH (default) | Binomial-Reduce       | Large scale, small messages
Host<->Host           | CPU         | RD-H-HH (default) | Recursive doubling    | Large scale, small messages
Host<->Host           | CPU         | GR-H-HH           | Gather-Reduce         | Small scale, small messages
Host<->Host           | GPU         | GR-HH             | Gather-Reduce         | Small scale, small messages
Host<->Device (GDR)   | GPU         | GR-HD / GR-DH     | Gather-Reduce         | Small scale, small messages
Device<->Device (GDR) | GPU         | GR-DD             | Gather-Reduce         | Small scale, small messages
Device<->Device (GDR) | GPU         | BR-DD             | Binomial-Reduce       | Large messages at any scale
Device<->Device (GDR) | GPU         | BRB-DD            | Binomial-Reduce-Bcast | Large messages at any scale
Device<->Device (GDR) | GPU         | RD-DD             | Recursive doubling    | Large messages at any scale
Host<->Device (GDR)   | GPU         | RD-HD / RD-DH     | Recursive doubling    | Large messages at any scale
Proposed NVGroup Allreduce
• Grouping GPUs that are fully connected by NVLink
  – Contention-free communication within the group
• Cooperative reduction kernels to exploit load-compute-store primitives over NVLink
Ching-Hsiang Chu et al., “NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems, ” (to be submitted)
Preliminary Results – Allreduce Benchmark
Platform: Summit (#1): dual-socket IBM POWER9 CPUs, 6 NVIDIA Volta V100 GPUs per node, and a 2-port InfiniBand EDR interconnect.
[Figure 1: latency (µs) on 1,536 GPUs for 4 B-16 KB messages: proposed NVGroup vs. NCCL 2.4; 1.6X better.]
[Figure 2: bandwidth (GB/s) on 1,536 GPUs for 32-256 MB messages: proposed NVGroup vs. NCCL 2.4; 1.7X better.]
[Figure 3: bandwidth (GB/s) for a 128 MB message on 24-1,536 GPUs: proposed NVGroup vs. SpectrumMPI 10.2.0.11, OpenMPI 4.0.1, and NCCL 2.4; 1.7X better.]
Ching-Hsiang Chu et al., “NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems, ” (to be submitted)
Preliminary Results – Distributed Deep Learning Training
• ResNet-50 training using the TensorFlow benchmark on a DGX-2 machine (16 Volta GPUs)
[Figure 1: images per second on 1-16 GPUs: MVAPICH2-GDR 2.3.1 (proposed) vs. NCCL 2.4 and the ideal; 8% higher throughput.]
[Figure 2: scaling efficiency (%) on 1-16 GPUs: proposed vs. NCCL 2.4.]
Scaling efficiency = (actual throughput / ideal throughput at scale) × 100%
Ching-Hsiang Chu et al., “NV-Group: Cooperative and Link-Efficient Reductions for Deep Learning on NVLink-enabled Dense GPU Systems, ” (to be submitted)
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
MVAPICH2 Project
• High-performance open-source MPI library for InfiniBand, Omni-Path, Ethernet/iWARP, and RDMA over Converged Ethernet (RoCE)
  – MVAPICH (MPI-1) and MVAPICH2 (MPI-2.2 and MPI-3.1); started in 2001, first version available in 2002
  – MVAPICH2-X (MPI + PGAS), available since 2011
  – Support for GPGPUs (MVAPICH2-GDR) and MIC (MVAPICH2-MIC), available since 2014
  – Support for virtualization (MVAPICH2-Virt), available since 2015
  – Support for energy-awareness (MVAPICH2-EA), available since 2015
  – Support for InfiniBand network analysis and monitoring (OSU INAM) since 2015
  – Used by more than 3,000 organizations in 89 countries
  – More than 553,000 (> 0.5 million) downloads from the OSU site directly
  – Empowering many Top500 clusters (June '19 ranking):
    • 3rd: 10,649,640-core Sunway TaihuLight at NSC, Wuxi, China
    • 16th: 556,104-core Oakforest-PACS in Japan
    • 19th: 367,024-core Stampede2 at TACC
    • 31st: 241,108-core Pleiades at NASA, and many others
  – Available with the software stacks of many vendors and Linux distros (RedHat, SuSE, and OpenHPC)
  – http://mvapich.cse.ohio-state.edu
• Partner in the 5th-ranked TACC Frontera system
• Empowering Top500 systems for over a decade
Impact on the Community
• Accelerating GPU-enabled HPC applications worldwide
  – MVAPICH2-GDR is widely used on many top-ranked GPU clusters, including Summit (#1), Sierra (#2), ABCI (#8), Lassen (#10), and more
• Enabling fast weather forecasting
  – MeteoSwiss uses the COSMO numerical weather forecasting model for the production of regional and local forecast products
  – Uses MVAPICH2-GDR to accelerate GPU communication
• Supporting scalable and reliable data dissemination
  – DoD streaming applications use MVAPICH2-GDR to accelerate GPU-based broadcast operations
  – Tutorial sessions: PETTT '17, PETTT '18
Outline
• Introduction
• Problem Statement
• Detailed Description and Results
• Ongoing Work and Future Research Directions
• Broader Impact on the HPC Community
• Expected Contributions
Expected Contributions
• Improving scale-out and scale-up performance by exploiting features of modern interconnects
  – Enabling scalable broadcast by using InfiniBand hardware multicast and GPUDirect RDMA
  – Link-efficient schemes exploiting load-store primitives over GPU interconnects
• Efficient GPU-enabled communication middleware
  – GPU-enabled MPI derived-datatype processing
  – GPU-enabled reduction operations
• Significant impact on the community
  – An abstraction for accelerator-enabled communication middleware
  – Benefits for HPC & ML/DL workloads
  – Broader outreach through public MVAPICH2-GDR releases
Thank You!
Questions?
chu.368@osu.edu
http://web.cse.ohio-state.edu/~chu.368