Cloud Native Supercomputing
Transcript of Gilad Shainer
The New Universe of Scientific Computing
Traditional HPC Data Center
[Figure: CPU-centric compute; usage split between application and communication, ping latency vs. HPC framework latency]
RDMA-Accelerated HPC
[Figure: compute usage (application vs. communication), ping latency, and HPC framework latency]
GPUDirect RDMA Example
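The slide's figure did not survive extraction. As a hedged illustration of the general pattern (not the slide's exact example), the sketch below registers GPU device memory directly with the InfiniBand adapter so the NIC can DMA to and from the GPU without staging through host buffers. It assumes a CUDA-capable node with the GPUDirect RDMA kernel module (nvidia-peermem) loaded, and omits queue-pair setup and error handling for brevity.

    /* Hedged sketch: registering CUDA device memory for RDMA
     * (GPUDirect RDMA). Assumes the nvidia-peermem module is loaded.
     * Build (illustrative): gcc gdr.c -libverbs -lcudart */
    #include <infiniband/verbs.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        struct ibv_context *ctx = ibv_open_device(devs[0]);
        struct ibv_pd *pd = ibv_alloc_pd(ctx);

        void *gpu_buf;
        size_t len = 1 << 20;                 /* 1 MB on the GPU */
        cudaMalloc(&gpu_buf, len);

        /* With GPUDirect RDMA, a device pointer can be registered like
         * host memory; the HCA then reads/writes GPU memory directly. */
        struct ibv_mr *mr = ibv_reg_mr(pd, gpu_buf, len,
                                       IBV_ACCESS_LOCAL_WRITE |
                                       IBV_ACCESS_REMOTE_READ |
                                       IBV_ACCESS_REMOTE_WRITE);
        if (mr)
            printf("GPU buffer registered: lkey=0x%x rkey=0x%x\n",
                   mr->lkey, mr->rkey);

        if (mr) ibv_dereg_mr(mr);
        cudaFree(gpu_buf);
        ibv_dealloc_pd(pd);
        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }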
In-Network Computing-Accelerated HPC
[Figure: compute usage (application vs. communication), ping latency, and HPC framework latency]
In-Network Computing Accelerated Supercomputing
• Software-Defined, Hardware-Accelerated
• InfiniBand In-Network Computing Examples:
– Data reductions (SHARP)
– All-to-All
– MPI Tag-Matching
7X Better MPI Latency 2.5X Better AI Performance
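As a hedged sketch of how applications tap SHARP: MPI reductions such as MPI_Allreduce are offloaded to the switches transparently when the job runs over an InfiniBand fabric with SHARP enabled in the collectives library; the application source is unchanged.

    /* Minimal MPI_Allreduce; with SHARP enabled in the MPI stack the
     * reduction is performed in the switch network, not by host CPUs.
     * The application code is identical either way. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        double local = (double)rank, sum = 0.0;
        /* Candidate for in-network reduction (SHARP) */
        MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        if (rank == 0) printf("sum of ranks = %g\n", sum);
        MPI_Finalize();
        return 0;
    }

A typical launch with an HPC-X/HCOLL stack might enable the offload via mpirun -x HCOLL_ENABLE_SHARP=3; this is an assumption about the deployment, not something stated on the slide.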
Data Aggregation Protocol
[Figure: nodes 1-8 connected through a 3-level fat-tree switch network]
Traditional Data Aggregation
[Figures: phases 1-3 of host-based aggregation moving data between nodes 1-8 through the switch network]
SHARP In-Network Computing Data Aggregation
[Figure: SHARP aggregates the data inside the switch tree; nodes 1-8 send once and the switches perform the reduction]
Data Aggregation – Comparison
• Traditional: high latency, high volume of transferred data, CPU overhead
• In-Network Computing: low latency, optimized data motion, no added CPU latency
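A back-of-the-envelope view of this comparison for the 8-node example above (a sketch with assumed latency numbers, not a benchmark): host-based aggregation needs roughly log2(N) sequential phases, each paying network plus host-CPU time, while SHARP reduces the data in a single traversal of the switch tree with no CPU in the path.

    /* Rough latency model for an N-node reduction (illustrative only).
     * Host-based tree: ceil(log2(N)) phases, each paying network + CPU.
     * SHARP: one traversal of the switch tree, no host CPU additions. */
    #include <stdio.h>
    #include <math.h>

    int main(void) {
        int n = 8;              /* nodes, as in the slides */
        double t_net = 1.0;     /* assumed per-phase network latency (us) */
        double t_cpu = 2.0;     /* assumed per-phase host CPU cost (us) */
        int phases = (int)ceil(log2((double)n));

        double host_based = phases * (t_net + t_cpu);
        double sharp      = phases * t_net;  /* switch hops only */

        printf("host-based tree:  %d phases, ~%.1f us\n",
               phases, host_based);
        printf("SHARP in-network: ~%.1f us (no CPU additions)\n", sharp);
        return 0;
    }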
The Generation of Secured In-Network Computing HPC
[Figure: CPU usage (application vs. communication), ping latency, and HPC framework latency]
Data Processing Units (DPUs)
• Data Center Infrastructure on a Chip
• Networking, storage, and security services
• The DPU architecture isolates data center security policies from the host CPU while building a zero-trust data center domain at the edge of every server
• Enabling Cloud Native Supercomputing: bare-metal performance, secured HPC
Cloud Native Supercomputing Architecture
Multi-Tenant Isolation
• Zero-trust architecture
• Secured network infrastructure and configuration
• Storage virtualization
• Tenant Service Level Agreements (SLAs)
• Tens of thousands of concurrent isolated users on a single subnet
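On InfiniBand, this kind of tenant isolation on a single subnet is typically carried by partition keys (P_Keys) programmed by the subnet manager; that mechanism is an inference from the bullet, not stated on the slide. The hedged sketch below, assuming a host with libibverbs, simply lists the P_Key table a tenant's port has been granted.

    /* Hedged sketch: dump the P_Key table of port 1. Each P_Key is a
     * partition membership granted by the subnet manager; tenants on
     * the same subnet see only the partitions they belong to. */
    #include <infiniband/verbs.h>
    #include <arpa/inet.h>
    #include <stdio.h>

    int main(void) {
        int num;
        struct ibv_device **devs = ibv_get_device_list(&num);
        if (!devs || num == 0) {
            fprintf(stderr, "no IB devices\n");
            return 1;
        }
        struct ibv_context *ctx = ibv_open_device(devs[0]);

        struct ibv_device_attr attr;
        ibv_query_device(ctx, &attr);

        __be16 pkey;
        for (int i = 0; i < attr.max_pkeys; ++i) {
            if (ibv_query_pkey(ctx, 1, i, &pkey) == 0 && pkey != 0)
                printf("pkey[%d] = 0x%04x\n", i, ntohs(pkey));
        }

        ibv_close_device(ctx);
        ibv_free_device_list(devs);
        return 0;
    }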
Higher Application Performance
• DPU Accelerated HPC Communications
• Collective offload with UCC accelerator
• Active messages
• Smart MPI progression
• Data compression
• User-defined algorithms
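What this buys applications, in sketch form: with collectives progressed off the host (by the DPU in this architecture), a non-blocking collective can overlap fully with computation. The MPI pattern below is standard; the claim that it reaches full overlap depends on the offload stack (e.g., an MVAPICH2-DPU-style library), which is an assumption here.

    /* Overlap pattern: start a non-blocking allreduce, compute while
     * the collective progresses (on the DPU when offloaded), then wait. */
    #include <mpi.h>
    #include <stdio.h>

    #define N (1 << 20)

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        static double in[N], out[N];
        for (int i = 0; i < N; ++i) in[i] = rank + i * 1e-6;

        MPI_Request req;
        MPI_Iallreduce(in, out, N, MPI_DOUBLE, MPI_SUM,
                       MPI_COMM_WORLD, &req);

        /* Independent computation proceeds here; with DPU offload the
         * collective makes progress without host CPU cycles. */
        double acc = 0.0;
        for (int i = 0; i < N; ++i) acc += in[i] * in[i];

        MPI_Wait(&req, MPI_STATUS_IGNORE);
        if (rank == 0)
            printf("overlap done, acc=%g out[0]=%g\n", acc, out[0]);
        MPI_Finalize();
        return 0;
    }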
Collective Offload
• Non-blocking collective operations are offloaded to a set of DPU ‘Worker’ processes
• Host processes prepare a set of metadata and provide it to the DPU Worker processes
• DPU Worker processes access host memory via RDMA
• DPU Worker processes progress the collective on behalf of the host processes
• DPU Worker processes notify the host processes once message exchanges are completed
[Figure: DPU workers use RDMA reads and writes to move data between host memories]
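The flow in the bullets above, as a conceptual sketch only: this is not the actual MVAPICH2-DPU or UCC implementation, all names are invented, a host thread stands in for the DPU worker, and plain memory accesses stand in for the RDMA reads and writes.

    /* Conceptual simulation of DPU collective offload: the host
     * publishes a work descriptor; a "worker" (a thread here, a DPU
     * process using RDMA on the real system) reduces the data and
     * raises a completion flag while the host computes. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    #define N 8

    struct desc {                      /* metadata handed to the worker */
        const double *src_a, *src_b;   /* stand-ins for RDMA-read buffers */
        double *dst;                   /* stand-in for RDMA-write buffer */
        int count;
        atomic_int done;               /* completion notification */
    };

    static void *dpu_worker(void *arg) {
        struct desc *d = arg;
        for (int i = 0; i < d->count; ++i)  /* "RDMA read + reduce" */
            d->dst[i] = d->src_a[i] + d->src_b[i];
        atomic_store(&d->done, 1);          /* "notify host" */
        return NULL;
    }

    int main(void) {
        double a[N], b[N], out[N];
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 10.0 * i; }

        struct desc d = { .src_a = a, .src_b = b, .dst = out, .count = N };
        atomic_init(&d.done, 0);

        pthread_t t;
        pthread_create(&t, NULL, dpu_worker, &d);   /* "offload" */

        /* Host computes while the worker progresses the collective. */
        double acc = 0.0;
        for (int i = 0; i < N; ++i) acc += a[i];

        while (!atomic_load(&d.done)) ;             /* await notification */
        pthread_join(t, NULL);
        printf("host acc=%g, reduced out[0]=%g out[7]=%g\n",
               acc, out[0], out[7]);
        return 0;
    }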
Higher Application Performance
• 100% Communication – Computation Overlap
Test bed: 32 servers, dual-socket 16-core Intel Xeon E5-2697A v4 CPUs @ 2.60 GHz (32 processes per node), NVIDIA BlueField-2 HDR100 DPUs and ConnectX-6 HDR100 adapters, NVIDIA Quantum QM8700 40-port 200 Gb/s HDR InfiniBand switch, 256 GB DDR4-2400 RDIMM memory and a 1 TB 7.2K RPM SATA 2.5" hard drive per node. Courtesy of the Ohio State University MVAPICH team and X-ScaleSolutions.
Higher Application Performance
• 1.4x Higher App Performance, MPI Collectives Offload
Same test bed as the previous slide; courtesy of the Ohio State University MVAPICH team and X-ScaleSolutions.
HPC at Cambridge University
• EDSAC (1949): 650 FLOPS
• Darwin (2006): 18 TF
• CSD3 (2021): 10 PF
11 PF Heterogeneous Supercomputer
• Most powerful production HPC system for UK research over the last 3 years
• Current upgrade:
– 80 Dell servers, each with four NVLink-connected 80 GB A100 GPUs, dual HDR200 plus DPU
– 516 Intel Ice Lake Dell servers
• Fully software-defined multi-petaflop research cloud
• Scientific OpenStack developed, deployed and supported in collaboration with StackHPC
• DPU enables the full cloud security model over InfiniBand for the new A100 cluster
DPU Use Case 1 – Cambridge Secure Clinical HPC Cloud
• HPC systems are an attractive attack surface
• Clinical data needs a strong security model
• Users also need to share, collaborate and access large-scale dynamic resources
• Dynamic creation of secure HPC data trusts for specific projects, defined data sets, users and applications
• Enabling development of novel medical analytics platforms
DPU Use Case 2 – I/O & Data Analytics Acceleration
• SKA prototype system: large streaming I/O problem
• UK fusion modeling (UKAEA collaboration): large simulation I/O rates
• Medical imaging: real-time analytics
• Cambridge NVMe-over-InfiniBand Data Accelerator (InfiniBand DPU): NVMe-oF acceleration, data analytics functions, data reduction
DPU Use Case 3 – Accelerating MPI Collectives
Applications shown: materials modelling, cosmology, plasma physics
The Generation of Secured In-Network Computing HPC
UCX
All trademarks are property of their respective owners. All information is provided “As-Is” without any kind of warranty. The HPC Advisory Council makes no representation as to the accuracy and completeness of the information contained herein, and undertakes no duty and assumes no obligation to update or correct any information presented herein.
Thank You