Cloud Native Supercomputing
Transcript of Gilad Shainer
The New Universe of Scientific Computing
Traditional HPC Data Center
[Figure: CPU-centric compute; usage split between application and communication, ping latency vs. HPC framework latency]
RDMA-Accelerated HPC
[Figure: compute usage (application vs. communication), ping latency, and HPC framework latency]
GPUDirect RDMA Example
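The slide's figure did not survive extraction. As a hedged illustration of the general pattern (not the slide's exact example), the sketch below registers GPU device memory directly with the InfiniBand adapter so the NIC can DMA to and from the GPU without staging through host buffers. It assumes a CUDA-capable node with the GPUDirect RDMA kernel module (nvidia-peermem) loaded, and omits queue-pair setup and error handling for brevity.

    /* Hedged sketch: registering CUDA device memory for RDMA
     * (GPUDirect RDMA). Assumes the nvidia-peermem module is loaded.
     * Build (illustrative): gcc gdr.c -libverbs -lcudart */
    #include <infiniband/verbs.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        void *gpu_buf;
        size_t len = 1 << 20;                 /* 1 MB on the GPU */
        cudaMalloc(&gpu_buf, len);

        /* With GPUDirect RDMA, a device pointer can be registered like
         * host memory; the HCA then reads/writes GPU memory directly. */
        struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (mr)
            printf("GPU buffer registered: lkey=0x%x rkey=0x%x\n",
                   mr->lkey, mr->rkey);

        if (mr) ibv_dereg_mr(mr);
        cudaFree(gpu_buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }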
In-Network Computing-Accelerated HPC
[Figure: compute usage (application vs. communication), ping latency, and HPC framework latency]
In-Network Computing Accelerated Supercomputing
• Software-Defined, Hardware-Accelerated
• InfiniBand In-Network Computing Examples:
– Data reductions (SHARP)
– All-to-All
– MPI Tag-Matching
7X Better MPI Latency 2.5X Better AI Performance
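As a hedged sketch of how applications tap SHARP: MPI reductions such as MPI_Allreduce are offloaded to the switches transparently when the job runs over an InfiniBand fabric with SHARP enabled in the collectives library; the application source is unchanged.

    /* Minimal MPI_Allreduce; with SHARP enabled in the MPI stack the
     * reduction is performed in the switch network, not by host CPUs.
     * The application code is identical either way. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank, sum = 0.0;
        /* Candidate for in-network reduction (SHARP) */
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0) printf("sum of ranks = %g\n", sum);
        MPI_Finalize();
        return 0;
    }

A typical launch with an HPC-X/HCOLL stack might enable the offload via mpirun -x HCOLL_ENABLE_SHARP=3; this is an assumption about the deployment, not something stated on the slide.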
Data Aggregation Protocol
[Figure: nodes 1-8 connected through a 3-level fat-tree switch network]
Traditional Data Aggregation
[Figures: phases 1-3 of host-based aggregation moving data between nodes 1-8 through the switch network]
SHARP In-Network Computing Data Aggregation
[Figure: SHARP aggregates the data inside the switch tree; nodes 1-8 send once and the switches perform the reduction]
Data Aggregation – Comparison
• Traditional: high latency, high volume of transferred data, CPU overhead
• In-Network Computing: low latency, optimized data motion, no added CPU latency
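A back-of-the-envelope view of this comparison for the 8-node example above (a sketch with assumed latency numbers, not a benchmark): host-based aggregation needs roughly log2(N) sequential phases, each paying network plus host-CPU time, while SHARP reduces the data in a single traversal of the switch tree with no CPU in the path.

    /* Rough latency model for an N-node reduction (illustrative only).
     * Host-based tree: ceil(log2(N)) phases, each paying network + CPU.
     * SHARP: one traversal of the switch tree, no host CPU additions. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int n = 8;              /* nodes, as in the slides */
        double t_net = 1.0;     /* assumed per-phase network latency (us) */
        double t_cpu = 2.0;     /* assumed per-phase host CPU cost (us) */
        int phases = (int)ceil(log2((double)n));

        double host_based = phases * (t_net + t_cpu);
        double sharp      = phases * t_net;  /* switch hops only */

        printf("host-based tree:  %d phases, ~%.1f us\n",
               phases, host_based);
        printf("SHARP in-network: ~%.1f us (no CPU additions)\n", sharp);
        return 0;
    }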
The Generation of Secured In-Network Computing HPC
[Figure: CPU usage (application vs. communication), ping latency, and HPC framework latency]
Data Processing Units (DPUs)
• Data Center Infrastructure on a Chip
• Networking, storage, and security services
• The DPU architecture isolates data center security policies from the host CPU while building a zero-trust data center domain at the edge of every server
• Enabling Cloud Native Supercomputing: bare-metal performance, secured HPC
Cloud Native Supercomputing Architecture
Multi-Tenant Isolation
• Zero-trust architecture
• Secured network infrastructure and configuration
• Storage virtualization
• Tenant Service Level Agreements (SLAs)
• Tens of thousands of concurrent isolated users on a single subnet
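On InfiniBand, this kind of tenant isolation on a single subnet is typically carried by partition keys (P_Keys) programmed by the subnet manager; that mechanism is an inference from the bullet, not stated on the slide. The hedged sketch below, assuming a host with libibverbs, simply lists the P_Key table a tenant's port has been granted.

    /* Hedged sketch: dump the P_Key table of port 1. Each P_Key is a
     * partition membership granted by the subnet manager; tenants on
     * the same subnet see only the partitions they belong to. */
    #include <infiniband/verbs.h>
    #include <arpa/inet.h>
    #include <stdio.h>

    int main(void) {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no IB devices\n");
            return 1;
        }
        struct ibv_context *ctx = ibv_open_device(devs[0]);

        struct ibv_device_attr attr;
        ibv_query_device(ctx, &attr);

        __be16 pkey;
        for (int i = 0; i < attr.max_pkeys; ++i) {
            if (ibv_query_pkey(ctx, 1, i, &pkey) == 0 && pkey != 0)
                printf("pkey[%d] = 0x%04x\n", i, ntohs(pkey));
        }

        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }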
Higher Application Performance
• DPU Accelerated HPC Communications
• Collective offload with UCC accelerator
• Active messages
• Smart MPI progression
• Data compression
• User-defined algorithms
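What this buys applications, in sketch form: with collectives progressed off the host (by the DPU in this architecture), a non-blocking collective can overlap fully with computation. The MPI pattern below is standard; the claim that it reaches full overlap depends on the offload stack (e.g., an MVAPICH2-DPU-style library), which is an assumption here.

    /* Overlap pattern: start a non-blocking allreduce, compute while
     * the collective progresses (on the DPU when offloaded), then wait. */
    #include <mpi.h>
    #include <stdio.h>

    #define N (1 << 20)

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        static double in[N], out[N];
        for (int i = 0; i < N; ++i) in[i] = rank + i * 1e-6;

        MPI_Request req;
        MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Independent computation proceeds here; with DPU offload the
         * collective makes progress without host CPU cycles. */
        double acc = 0.0;
        for (int i = 0; i < N; ++i) acc += in[i] * in[i];

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        if (rank == 0)
            printf("overlap done, acc=%g out[0]=%g\n", acc, out[0]);
        MPI_Finalize();
        return 0;
    }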
Collective Offload
• Non-blocking collective operations are offloaded to a set of DPU ‘Worker’ processes
• Host processes prepare a set of metadata and provide it to the DPU Worker processes
• DPU Worker processes access host memory via RDMA
• DPU Worker processes progress the collective on behalf of the host processes
• DPU Worker processes notify the host processes once message exchanges are completed
[Figure: DPU workers use RDMA reads and writes to move data between host memories]
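The flow in the bullets above, as a conceptual sketch only: this is not the actual MVAPICH2-DPU or UCC implementation, all names are invented, a host thread stands in for the DPU worker, and plain memory accesses stand in for the RDMA reads and writes.

    /* Conceptual simulation of DPU collective offload: the host
     * publishes a work descriptor; a "worker" (a thread here, a DPU
     * process using RDMA on the real system) reduces the data and
     * raises a completion flag while the host computes. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N 8

    struct desc {                      /* metadata handed to the worker */
        const double *src_a, *src_b;   /* stand-ins for RDMA-read buffers */
        double *dst;                   /* stand-in for RDMA-write buffer */
        int count;
        atomic_int done;               /* completion notification */
    };

    static void *dpu_worker(void *arg) {
        struct desc *d = arg;
        for (int i = 0; i < d->count; ++i)  /* "RDMA read + reduce" */
            d->dst[i] = d->src_a[i] + d->src_b[i];
        atomic_store(&d->done, 1);          /* "notify host" */
        return NULL;
    }

    int main(void) {
        double a[N], b[N], out[N];
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 10.0 * i; }

        struct desc d = { .src_a = a, .src_b = b, .dst = out, .count = N };
        atomic_init(&d.done, 0);

        pthread_t t;
        pthread_create(&t, NULL, dpu_worker, &d);   /* "offload" */

        /* Host computes while the worker progresses the collective. */
        double acc = 0.0;
        for (int i = 0; i < N; ++i) acc += a[i];

        while (!atomic_load(&d.done)) ;             /* await notification */
        pthread_join(t, NULL);
        printf("host acc=%g, reduced out[0]=%g out[7]=%g\n",
               acc, out[0], out[7]);
        return 0;
    }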
Higher Application Performance
• 100% Communication – Computation Overlap
Test bed: 32 servers, dual-socket 16-core Intel Xeon E5-2697A v4 CPUs @ 2.60 GHz (32 processes per node), NVIDIA BlueField-2 HDR100 DPUs and ConnectX-6 HDR100 adapters, NVIDIA Quantum QM8700 40-port 200 Gb/s HDR InfiniBand switch, 256 GB DDR4-2400 RDIMM memory and a 1 TB 7.2K RPM SATA 2.5" hard drive per node. Courtesy of the Ohio State University MVAPICH team and X-ScaleSolutions.
Higher Application Performance
• 1.4x Higher App Performance, MPI Collectives Offload
Same test bed as the previous slide; courtesy of the Ohio State University MVAPICH team and X-ScaleSolutions.
HPC at Cambridge University
• EDSAC (1949): 650 FLOPS
• Darwin (2006): 18 TF
• CSD3 (2021): 10 PF
11 PF Heterogeneous Supercomputer
• Most powerful production HPC system for UK research over the last 3 years
• Current upgrade:
– 80 Dell servers, each with four NVLink-connected 80 GB A100 GPUs, dual HDR200 plus DPU
– 516 Intel Ice Lake Dell servers
• Fully software-defined multi-petaflop research cloud
• Scientific OpenStack developed, deployed and supported in collaboration with StackHPC
• DPU enables the full cloud security model over InfiniBand for the new A100 cluster
DPU Use Case 1 – Cambridge Secure Clinical HPC Cloud
• HPC systems are an attractive attack surface
• Clinical data needs a strong security model
• Users also need to share, collaborate and access large-scale dynamic resources
• Dynamic creation of secure HPC data trusts for specific projects, defined data sets, users and applications
• Enabling development of novel medical analytics platforms
DPU Use Case 2 – I/O & Data Analytics Acceleration
• SKA prototype system: large streaming I/O problem
• UK fusion modeling (UKAEA collaboration): large simulation I/O rates
• Medical imaging: real-time analytics
• Cambridge NVMe-over-InfiniBand Data Accelerator (InfiniBand DPU): NVMe-oF acceleration, data analytics functions, data reduction
DPU Use Case 3 – Accelerating MPI Collectives
Applications shown: materials modelling, cosmology, plasma physics
The Generation of Secured In-Network Computing HPC
UCX
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.
Thank You