Leadership AI Computing - hpcadvisorycouncil.com
AI-DRIVEN MULTISCALE SIMULATION: Largest AI+MD Simulation Ever
DOCKING ANALYSIS: 58x Full Pipeline Speedup
DEEPMD-KIT: 1,000x Speedup
SQUARE KILOMETER ARRAY: 250 GB/s Data Processed End-to-End
EXASCALE AI SCIENCE
AI SUPERCOMPUTING DELIVERING SCIENTIFIC BREAKTHROUGHS: 7 of 10 Gordon Bell Finalists Used NVIDIA AI Platforms
• CLIMATE (1.12 EF): LBNL | NVIDIA
• GENOMICS (2.36 EF): ORNL
• NUCLEAR WASTE REMEDIATION (1.2 EF): LBNL | PNNL | Brown U. | NVIDIA
• CANCER DETECTION (1.3 EF): ORNL | Stony Brook U.
FUSING SIMULATION + AI + DATA ANALYTICS: Transforming Scientific Workflows Across Multiple Domains
• PREDICTING EXTREME WEATHER EVENTS: 5-Day Forecast with 85% Accuracy
• SimNet (PINN) for CFD: 18,000x Speedup
• RAPIDS FOR SEISMIC ANALYSIS: 260x Speedup Using K-Means
FERMILAB USES TRITON TO SCALE DEEP LEARNING INFERENCE IN HIGH ENERGY PARTICLE PHYSICS: GPU-Accelerated Offline Neutrino Reconstruction Workflow
• Neutrino Event Classification from Reconstruction
• 400 TB of Data from Hundreds of Millions of Neutrino Events
• 17x Speedup of DL Model on T4 GPU vs. CPU
• Triton in Kubernetes Enables “DL Inference” as a Service (sketched below)
• Expected to Scale to Thousands of Particle Physics Client Nodes
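For readers unfamiliar with the client side of this setup, the sketch below shows roughly what an inference-as-a-service call to Triton looks like from Python. The model name, tensor names, and input shape are hypothetical stand-ins, not Fermilab's actual configuration.

```python
# Minimal Triton client sketch (assumptions: a Triton server reachable at
# localhost:8000; "neutrino_classifier" and the tensor names/shapes are hypothetical).
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

# Hypothetical batch of reconstructed detector images.
batch = np.random.rand(8, 1, 500, 500).astype(np.float32)

inp = httpclient.InferInput("input", list(batch.shape), "FP32")
inp.set_data_from_numpy(batch)
out = httpclient.InferRequestedOutput("probabilities")

result = client.infer("neutrino_classifier", inputs=[inp], outputs=[out])
print(result.as_numpy("probabilities").shape)  # per-event class scores
```

Because the server owns the model and the GPU, thousands of CPU-only client nodes can issue calls like this concurrently, which is what makes the "inference as a service" pattern scale.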
EXPANDING UNIVERSE OF SCIENTIFIC COMPUTING
[Diagram: scientific computing expanding from supercomputing outward to extreme IO, data analytics, edge streaming, cloud, simulation, visualization, and AI, spanning the network and edge appliances]
THE ERA OF EXASCALE AI SUPERCOMPUTING: AI is the New Growth Driver for Modern HPC
HPL VS. AI PERFORMANCE
[Chart: peak PFLOPS (log scale, 100-10,000), 2017-2022; HPL (based on the #1 system on the June TOP500) vs. AI (peak system FP16 FLOPS); systems include Summit, Sierra, Fugaku, JUWELS, Perlmutter, and Leonardo]
FIGHTING COVID-19 WITH SCIENTIFIC COMPUTING: $1.25 Trillion Industry | $2B R&D per Drug | 12½ Years Development | 90% Failure Rate
[Diagram: the drug-discovery pipeline from biological target to drug candidate, narrowing O(10^60) chemical compounds down to O(10^2) candidates through GENOMICS, STRUCTURE, DOCKING, SIMULATION, and IMAGING stages, with NLP-driven SEARCH over O(10^9) items of literature and real-world data]
FIGHTING COVID-19 WITH NVIDIA CLARA DISCOVERY
[Diagram: the same pipeline annotated with GPU-accelerated tools. GENOMICS: Clara Parabricks; STRUCTURE: AlphaFold, CryoSPARC, Relion; DOCKING: AutoDock; SIMULATION: Schrodinger, NAMD, VMD, OpenMM, MELD; IMAGING: Clara Imaging, MONAI; SEARCH (NLP over literature and real-world data): BioMegatron, BioBERT; RAPIDS for analytics across stages]
EXPLODING DATA AND MODEL SIZE
BIG DATA GROWTH: 90% of the World’s Data Created in the Last 2 Years
[Chart: global data volume grows from 58 zettabytes to 175 zettabytes over 2010-2025]
GROWTH IN SCIENTIFIC DATA: Fueled by Accurate Sensors & Simulations
EXPLODING MODEL SIZE: Driving Superhuman Capabilities
[Chart: NLP model parameters (log scale), 2017-2021: Transformer (65M), BERT (340M), GPT-2 8B (8.3Bn), Turing-NLG (17Bn), GPT-3 (175Bn)]
Source for Big Data Growth chart: IDC – The Digitization of the World (May 2020)
AI SUPERCOMPUTING NEEDS EXTREME IO
• ECMWF: 287 TB/day
• COVID-19 Graph Analytics: 393 TB
• SKA: 16 TB/sec
• NASA Mars Landing Simulation: 550 TB
GPUDIRECT RDMA | GPUDIRECT STORAGE
[Diagram: GPUDirect RDMA transfers data NIC-to-GPU between Node A and Node B at 200 GB/s, and GPUDirect Storage transfers data from storage directly into GPU memory at 200 GB/s; in both cases data moves through the PCIe switch, bypassing the CPU and system memory, with 8x GPUs per node]
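As a concrete illustration of the storage path, here is a minimal sketch using the RAPIDS kvikio binding to cuFile, which is one Python route onto GPUDirect Storage; the file path and buffer size are hypothetical.

```python
# Minimal GPUDirect Storage sketch using the RAPIDS kvikio binding to cuFile
# (assumptions: kvikio and CuPy installed; the path and size are hypothetical).
import cupy as cp
import kvikio

buf = cp.empty(1 << 20, dtype=cp.float32)  # destination buffer in GPU memory

# With GDS available, this reads straight from storage into GPU memory,
# skipping the CPU bounce buffer; kvikio falls back to a POSIX path otherwise.
with kvikio.CuFile("/mnt/lustre/sample.bin", "r") as f:
    f.read(buf)
```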
SHARP IN-NETWORK COMPUTING
[Diagram: reductions (∑) across nodes 0-15 are performed inside the InfiniBand network itself rather than on the nodes]
MAGNUM IO
[Stack diagram: CUDA and CUDA-X beneath application frameworks (AI, DRIVE, METRO, ISAAC, CLARA, RAPIDS, AERIAL, RTX, HPC); Magnum IO spans Storage IO, Network IO, In-Network Compute, and IO Management]
IO OPTIMIZATION IMPACT: 2X DL Inference Performance | 10X IO Performance | 6X Lower CPU Utilization
• Improving Simulation of Phosphorus Monolayer with Latest NCCL P2P: MPI 1x, NCCL 2.4x, NCCL+P2P 2.9x
• Segmenting Extreme Weather Phenomena in Climate Simulations: NumPy 1x, DALI 1.24x, DALI+GDS 1.54x (DALI sketched below)
• Remote File Reads at Peak Fabric Bandwidth on DGX A100: without GDS 1x, with GDS 3.8x
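To make the DALI and DALI+GDS bars concrete, the sketch below shows a minimal DALI pipeline whose GPU-side NumPy reader can take the GPUDirect Storage path when it is available; the data directory and pipeline parameters are hypothetical.

```python
# Minimal DALI pipeline sketch (assumptions: NVIDIA DALI installed; the GPU
# numpy reader uses GPUDirect Storage when available; paths are hypothetical).
from nvidia.dali import pipeline_def
import nvidia.dali.fn as fn

@pipeline_def(batch_size=8, num_threads=4, device_id=0)
def climate_pipeline():
    # device="gpu" lets the reader pull .npy tiles directly into GPU memory.
    return fn.readers.numpy(device="gpu", file_root="/data/climate_tiles")

pipe = climate_pipeline()
pipe.build()
(batch,) = pipe.run()  # one GPU-resident batch per call
```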
SELENE: DGX SuperPOD Deployment
• #1 on MLPerf for commercially available systems
• #5 on TOP500 (63.46 PetaFLOPS HPL)
• #5 on Green500 (23.98 GF/W); #1 on Green500 (26.2 GF/W) as a single scalable unit
• #4 on HPCG (1.6 PetaFLOPS)
• #3 on HPL-AI (250 PetaFLOPS)
Fastest Industrial System in U.S. — 3+ ExaFLOPS AI
Built with NVIDIA DGX SuperPOD Architecture
• NVIDIA DGX A100 and NVIDIA Mellanox IB
• NVIDIA’s decade of AI experience
Configuration:
• 4480 NVIDIA A100 Tensor Core GPUs
• 560 NVIDIA DGX A100 systems
• 850 Mellanox 200G HDR IB switches
• 14 PB of all-flash storage
LESSONS LEARNED: How to Build and Deploy HPC Systems with Hyperscale Sensibilities
Speed and feed matching
Thermal and power design
Interconnect design
Deployability
Operability
Flexibility
Expandability
A NEW GENERATION OF SYSTEMS: NVIDIA DGX A100
GPUs: 8x NVIDIA A100
GPU Memory: 640 GB total
Peak Performance: 5 petaFLOPS AI | 10 petaOPS INT8
NVSwitches: 6
System Power Usage: 6.5 kW max
CPU: Dual AMD Rome 7742, 128 cores total, 2.25 GHz (base), 3.4 GHz (max boost)
System Memory: 2 TB
Networking: 8x single-port Mellanox ConnectX-6 200 Gb/s HDR InfiniBand (compute network); 2x dual-port Mellanox ConnectX-6 200 Gb/s HDR InfiniBand (storage network, also used for Ethernet*)
Storage: OS: 2x 1.92 TB M.2 NVMe drives; internal storage: 15 TB (4x 3.84 TB) U.2 NVMe drives
Software: Ubuntu Linux OS (5.3+ kernel)
System Weight: 271 lbs (123 kg)
Packaged System Weight: 315 lbs (143 kg)
Height: 6U
Operating Temperature Range: 5°C to 30°C (41°F to 86°F)
* Optional upgrades
DGX SUPERPOD: Modular Architecture
1K GPU SuperPOD Cluster
• 140 DGX A100 nodes (1,120 GPUs) in a GPU POD
• 1st tier fast storage - DDN AI400x with Lustre
• Mellanox HDR 200Gb/s InfiniBand - Full Fat-tree
• Network optimized for AI and HPC
DGX A100 Nodes
• 2x AMD 7742 EPYC CPUs + 8x A100 GPUs
• NVLINK 3.0 Fully Connected Switch
• 8 Compute + 2 Storage HDR IB Ports
A Fast Interconnect
• Modular IB Fat-tree
• Separate network for Compute vs Storage
• Adaptive routing and SHARPv2 support for offload
[Diagram: a 1K GPU POD with DGX A100 nodes #1-#140 connected through leaf and spine switches; storage attached through storage leaf and storage spine switches; core switches distributed between PODs]
DGX SUPERPOD: Extensible Architecture
POD to POD
• Modular IB Fat-tree or DragonFly+
• Core IB Switches Distributed Between PODs
• Direct connect POD to POD
[Diagram: four 1K GPU PODs, each with DGX A100 nodes #1-#140 behind leaf and spine switches; distributed core switches and storage spine/leaf switches interconnect the PODs and storage]
MULTI-NODE IB COMPUTE: The Details
• Designed with Mellanox 200 Gb/s HDR IB network
• Separate compute and storage fabrics: 8 links for compute, 2 links for storage (Lustre); both networks share a similar fat-tree design
• Modular POD design: 140 DGX A100 nodes are fully connected in a SuperPOD; a SuperPOD contains compute nodes and storage; all nodes and storage are usable between SuperPODs
• SHARPv2-optimized design: leaves and spines are organized in HCA planes (see the NCCL sketch below)
  - For a SuperPOD, all HCA1 ports from the 140 DGX A100 nodes connect to an HCA1-plane fat-tree network
  - Traffic from HCA1 to HCA1 between any two nodes in a POD stays at the leaf or spine level
  - Core switches are used only when moving data between HCA planes (e.g. mlx5_0 to mlx5_1 in another system) or when moving any data between SuperPODs
• Distributed core switches
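As an illustration of how software cooperates with this rail design, NCCL can be pinned to the rail-aligned adapters through its NCCL_IB_HCA environment variable; the sketch assumes the compute HCAs enumerate as mlx5_0 through mlx5_7, which varies by system.

```python
# Illustrative sketch: restrict NCCL to the eight compute-fabric HCA planes.
# Assumption: the compute adapters enumerate as mlx5_0..mlx5_7 (system-specific).
import os

os.environ["NCCL_IB_HCA"] = ",".join(f"mlx5_{i}" for i in range(8))
# With rail-aligned planes, HCA_i-to-HCA_i traffic between nodes stays at the
# leaf/spine level, so collectives avoid crossing the distributed core switches.
```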
DESIGNING FOR PERFORMANCE: In the Data Center
Scalable Unit (SU)
The design is based on a radix-optimized approach for SHARPv2 support and fabric performance, aligned with the design of Mellanox Quantum switches.
SHARP: HDR200 Selene Early Results
128 NVIDIA DGX A100 (1,024 GPUs, 1,024 InfiniBand adapters)
[Chart: NCCL AllReduce performance increase factor with SHARP (y-axis 0-3x) as a function of message size]
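For context, the collective being accelerated here can be reproduced with a few lines of PyTorch. The sketch below is a generic NCCL all-reduce; requesting SHARP offload via the NCCL_COLLNET_ENABLE environment variable is an assumption about the runtime environment, not a setting taken from these slides.

```python
# Minimal all-reduce sketch (assumptions: PyTorch with the NCCL backend, one
# process per GPU launched by Slurm/mpirun with the usual rendezvous env vars).
import os
import torch
import torch.distributed as dist

os.environ.setdefault("NCCL_COLLNET_ENABLE", "1")  # ask NCCL to use SHARP if present

dist.init_process_group(backend="nccl")  # reads RANK/WORLD_SIZE/MASTER_ADDR from env
torch.cuda.set_device(int(os.environ.get("LOCAL_RANK", 0)))

x = torch.ones(1 << 24, device="cuda")    # 64 MB of FP32 "gradients"
dist.all_reduce(x, op=dist.ReduceOp.SUM)  # the reduction SHARP can offload in-network
```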
STORAGE
Parallel filesystem for performance, NFS for home directories. Per SuperPOD:
Fast Parallel FS: Lustre (DDN)
- 10 DDN AI400X units
- Total capacity: 2.5 PB
- Max read/write performance: 490/250 GB/s
- 80 HDR-100 cables required
- 16.6 kW
Shared FS: Oracle ZFS5-2
- HA controller pair / 768 GB total
- 8U total space (4U per disk shelf, 2U per controller)
- 76.8 TB raw (24x 3.2 TB SSD)
- 16x 40GbE
- Key features: NFS, HA, snapshots, dedupe
- 2 kW
STORAGE HIERARCHY
• Memory (file) cache (aggregate): 224 TB/sec, 1.1 PB (2 TB/node)
• NVMe cache (aggregate): 28 TB/sec, 16 PB (30 TB/node)
• Network filesystem (cache, Lustre): 2 TB/sec, 10 PB
• Object storage: 100 GB/sec, 100+ PB
SCALE TO MULTIPLE NODES: Software Stack - Application
• Deep Learning Model:
  • Hyperparameters tuned for multi-node scaling
  • Multi-node launcher scripts (see the sketch below)
• Deep Learning Container:
  • Optimized TensorFlow, GPU libraries, and multi-node software
• Host:
  • Host OS, GPU driver, IB driver, container runtime engine (docker, enroot)
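A minimal sketch of what such a multi-node launcher shim can look like under Slurm, mapping Slurm's environment variables onto the ones torch.distributed expects; the port and fallback address are hypothetical choices.

```python
# Sketch of a multi-node launcher shim (assumption: one Slurm task per GPU).
# Maps Slurm's env vars onto the rendezvous variables torch.distributed expects.
import os
import torch
import torch.distributed as dist

os.environ.setdefault("RANK", os.environ.get("SLURM_PROCID", "0"))
os.environ.setdefault("WORLD_SIZE", os.environ.get("SLURM_NTASKS", "1"))
os.environ.setdefault("LOCAL_RANK", os.environ.get("SLURM_LOCALID", "0"))
os.environ.setdefault("MASTER_ADDR", "localhost")  # job script overrides with node 0
os.environ.setdefault("MASTER_PORT", "29500")      # hypothetical fixed port

dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```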
SCALE TO MULTIPLE NODES: Software Stack - System
• Slurm: User job scheduling & management
• Enroot: NVIDIA open-source tool to convert traditional container/OS images into unprivileged sandboxes
• Pyxis: NVIDIA open-source plugin integrating Enroot with Slurm
• DeepOps: NVIDIA open-source toolbox for GPU cluster management with Ansible playbooks
[Diagram: login nodes and Slurm controller in front of a DGX POD (DGX servers with DGX Base OS); Enroot, Docker, and Pyxis as container runtimes; NGC model containers (PyTorch, TensorFlow from 19.09); DCGM for monitoring]
INTEGRATING CLUSTERS IN THE DEVELOPMENT WORKFLOW
• Integrating DL-friendly tools like GitLab and Docker with HPC systems
• Kick off 10,000s of GPU-hours of tests with a single button click in GitLab
  … build and package with Docker
  … schedule and prioritize with Slurm
  … on demand or on a schedule
  … reporting via GitLab, ELK stack, Slack, email
• Emphasis on keeping things simple for users while hiding integration complexity
• Ensure reproducibility and rapid triage
• Supercomputer-scale CI (continuous integration, internal at NVIDIA)
RESOURCES
Presentations
GTC Sessions (https://www.nvidia.com/en-us/gtc/session-catalog/):
• Under the Hood of the New DGX A100 System Architecture [S21884]
• Inside the NVIDIA Ampere Architecture [S21730]
• CUDA New Features and Beyond [S21760]
• Inside the NVIDIA HPC SDK: the Compilers, Libraries and Tools for Accelerated Computing [S21766]
• Introducing NVIDIA DGX A100: the Universal AI System for Enterprise [S21702]
• Mixed-Precision Training of Neural Networks [S22082]
• Tensor Core Performance on NVIDIA GPUs: The Ultimate Guide [S21929]
• Developing CUDA Kernels to Push Tensor Cores to the Absolute Limit on NVIDIA A100 [S21745]
Hot Chips:
• Hot Chips Tutorial: Scale-Out Training Experiences – Megatron Language Model
• Hot Chips Session: NVIDIA’s A100 GPU: Performance and Innovation for GPU Computing
Pyxis/Enroot: https://fosdem.org/2020/schedule/event/containers_hpc_unprivileged/
RESOURCES
Links and Other Docs
DGX A100 Page: https://www.nvidia.com/en-us/data-center/dgx-a100/
Blogs
• DGX SuperPOD: https://blogs.nvidia.com/blog/2020/05/14/dgx-superpod-a100/
• DDN Blog for DGX A100 Storage: https://www.ddn.com/press-releases/ddn-a3i-nvidia-dgx-a100/
• Kitchen Keynote Summary: https://blogs.nvidia.com/blog/2020/05/14/gtc-2020-keynote/
• Double-Precision Tensor Cores: https://blogs.nvidia.com/blog/2020/05/14/double-precision-tensor-cores/