THE CONVERGENCE OF HPC AND DEEP LEARNING
Axel Koehler, Principal Solution Architect
HPC Advisory Council 2018, April 10th 2018, Lugano
FACTORS DRIVING CHANGES IN HPC
End of Dennard scaling places a cap on single-threaded performance
Increasing application performance will require fine-grain parallel code with significant computational intensity
AI and data science emerging as important new components of scientific discovery
Dramatic improvements in accuracy, completeness and response time yield increased insight from huge volumes of data
Cloud-based usage models, in-situ execution and visualization emerging as new workflows critical to the science process and productivity
Tight coupling of interactive simulation, visualization, data analysis/AI
Service-Oriented Architectures (SOA)
Multiple Experiments Coming or Upgrading in the Next 10 Years
[Examples: 15 TB/day; 10x increase in data volume; Exabyte/day; 30x increase in power; personal genomics; cryo-EM]
TESLA PLATFORM: One Data Center Platform for Accelerating HPC and AI
[Platform stack diagram:]
• Tesla GPUs & systems: Tesla GPU, NVIDIA DGX / DGX-Station, NVIDIA HGX-1, system OEMs, cloud
• NVIDIA SDK: Deep Learning SDK (cuDNN, TensorRT, DeepStream SDK, NCCL), ComputeWorks (CUDA C/C++, Fortran, cuBLAS, cuSPARSE)
• Industry frameworks & tools: deep learning frameworks, ecosystem tools
• Applications: HPC (450+ applications), internet services, enterprise applications (manufacturing, automotive, healthcare, finance, retail, defense, ...)
GPUS FOR HPC AND DEEP LEARNING
NVIDIA Tesla V100
Huge requirement on compute power (FLOPS):
5120 energy-efficient cores + Tensor Cores: 7.8 TF double precision (FP64), 15.6 TF single precision (FP32), 125 Tensor TFLOP/s mixed precision
Huge requirement on communication and memory bandwidth:
NVLink: 6 links per GPU at 50 GB/s bi-directional for maximum scalability between GPUs
CoWoS with HBM2: 900 GB/s memory bandwidth, unifying compute and memory in a single package
NCCL: high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
GPUDirect / GPUDirect RDMA: direct communication between GPUs, eliminating the CPU from the critical path
TENSOR CORE: Mixed-Precision Matrix Math on 4x4 Matrices
New CUDA TensorOp instructions and data formats
4x4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Using Tensor Cores via:
• Volta-optimized frameworks and libraries (cuDNN, cuBLAS, TensorRT, ...)
• CUDA C++ warp-level matrix operations
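The operation above can be sketched in NumPy: inputs are rounded to FP16 for storage, but the multiply-accumulate runs in FP32. This is a simulation of the numerics only; real Tensor Cores perform it per 4x4x4 tile in hardware.

```python
import numpy as np

def tensor_core_mma(a, b, c):
    """Simulate the Tensor Core op D[FP32] = A[FP16] * B[FP16] + C[FP32].

    Inputs are rounded to FP16, but products and accumulation run in
    FP32, mirroring Volta's mixed-precision data path.
    """
    a16 = a.astype(np.float16).astype(np.float32)  # FP16 storage, FP32 math
    b16 = b.astype(np.float16).astype(np.float32)
    return a16 @ b16 + c.astype(np.float32)

a = np.random.rand(4, 4)
b = np.random.rand(4, 4)
c = np.zeros((4, 4), dtype=np.float32)
d = tensor_core_mma(a, b, c)  # FP32 result of a mixed-precision MMA
```

FP32 accumulation is what preserves accuracy: the rounding happens once on the inputs, not on every partial sum.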
cuBLAS GEMMs FOR DEEP LEARNING: V100 Tensor Cores + CUDA 9 over 9x Faster Matrix-Matrix Multiply
[Charts: relative cuBLAS GEMM performance vs. matrix size (M=N=K = 512, 1024, 2048, 4096), P100 (CUDA 8) vs. V100 (CUDA 9). Mixed precision (FP16 input, FP32 compute) with Tensor Cores: up to 9.3x. Single precision (FP32): up to 1.8x.]
Note: pre-production Tesla V100 and pre-release CUDA 9; CUDA 8 GA release.
COMMUNICATION BETWEEN GPUS
Large-scale models:
• Some models are too big for a single GPU and need to be spread across multiple devices and multiple nodes
• Model sizes will further increase in the future
Data-parallel training:
• Each worker trains the same layers on a different data batch
• NVLink allows the separation of data loading and gradient averaging
Model-parallel training:
• All workers train on the same batch; workers communicate as frequently as the network allows
• NVLink allows the separation of data loading and exchanges for activations
http://mxnet.io/how_to/multi_devices.html
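A minimal NumPy sketch of the data-parallel scheme: each worker computes gradients on its own shard of the batch, and the gradients are averaged before every replica applies the same update. The toy linear model is hypothetical; in practice the averaging step is an NVLink/NCCL all-reduce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear model y = x @ w_true with squared loss; 4 workers.
w_true = np.array([1.0, -2.0, 0.5])
x = rng.normal(size=(8, 3))   # one full batch of 8 samples
y = x @ w_true
w = np.zeros(3)               # identical model replica on every worker

def worker_grad(w, xs, ys):
    """Mean-squared-error gradient on one worker's shard of the batch."""
    err = xs @ w - ys
    return 2.0 * xs.T @ err / len(ys)

# Each worker trains the same layers on a different slice of the data.
shards = np.split(np.arange(8), 4)
grads = [worker_grad(w, x[s], y[s]) for s in shards]

# Gradient averaging (the all-reduce step): with equal-sized shards this
# reproduces the full-batch gradient, so replicas stay in sync.
avg = np.mean(grads, axis=0)
new_w = w - 0.1 * avg  # every worker applies the identical update
```

Because the averaged gradient equals the full-batch gradient, all replicas remain bit-identical after each step, which is what makes the scheme equivalent to large-batch training on one device.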
NVLINK AND MULTI-GPU SCALING
For Data-Parallel Training
[Diagrams: two 8-GPU server topologies, GPUs 0-7 attached to two CPUs via PCIe switches.]
PCIe-based system:
• Data loading over PCIe
• Gradient averaging over PCIe and QPI
• Data loading and gradient averaging share communication resources: congestion
NVLink-based system:
• Data loading over PCIe
• Gradient averaging over NVLink
• No sharing of communication resources: no congestion
NVLINK AND CNTK MULTI-GPU SCALING
NVIDIA Collective Communications Library (NCCL) 2
Multi-GPU and multi-node collective communication primitives
developer.nvidia.com/nccl
High-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs
Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization
Easy to integrate and MPI-compatible; uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink
Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more
Multi-node: InfiniBand verbs, IP sockets
Multi-GPU: NVLink, PCIe
Automatic topology detection
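The collective behind gradient averaging is all-reduce. A pure-Python sketch of the ring algorithm commonly used for it (a simulation of the data movement only, not NCCL's API): the buffer is split into one chunk per rank, partial sums circulate around the ring (reduce-scatter), then the completed chunks circulate once more (all-gather).

```python
import copy

def ring_allreduce(buffers):
    """Simulate a ring all-reduce; buffers[r] is rank r's list of n chunks.

    After reduce-scatter plus all-gather, every rank holds the
    element-wise sum of all input buffers. Each link carries ~2x the
    buffer size in total, independent of the number of ranks.
    """
    n = len(buffers)
    bufs = copy.deepcopy(buffers)
    # Reduce-scatter: after n-1 steps, rank r holds the completed sum
    # of chunk (r+1) % n. Snapshot sends first to model simultaneity.
    for step in range(n - 1):
        sends = [(r, (r - step) % n, bufs[r][(r - step) % n]) for r in range(n)]
        for r, c, val in sends:
            bufs[(r + 1) % n][c] += val
    # All-gather: circulate each completed chunk around the ring.
    for step in range(n - 1):
        sends = [(r, (r + 1 - step) % n, bufs[r][(r + 1 - step) % n]) for r in range(n)]
        for r, c, val in sends:
            bufs[(r + 1) % n][c] = val
    return bufs

bufs = ring_allreduce([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
# every rank now holds the element-wise sum [12, 15, 18]
```

The ring pattern is bandwidth-optimal for large messages, which is why it maps well onto NVLink-connected GPU topologies.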
NVIDIA DGX-2
1. NVIDIA Tesla V100 32GB
2. Two GPU boards: 8 V100 32GB GPUs per board, 6 NVSwitches per board, 512GB total HBM2 memory, interconnected by plane card
3. Twelve NVSwitches: 2.4 TB/s bi-section bandwidth
4. Eight EDR InfiniBand/100 GigE: 1600 Gb/s total bi-directional bandwidth
5. PCIe switch complex
6. Two Intel Xeon Platinum CPUs
7. 1.5 TB system memory
8. 30 TB NVMe SSD internal storage
9. Dual 10/25 Gb/s Ethernet
NVSWITCH
• 18 NVLink ports at 50 GB/s per port bi-directional; 900 GB/s total bi-directional
• Fully connected crossbar
• x4 PCIe Gen2 management port, GPIO, I2C
• 2 billion transistors
FULL NON-BLOCKING BANDWIDTH
NVSWITCH
Unified Memory provides:
• Single memory view shared by all GPUs
• Automatic migration of data between GPUs
• User control of data locality
NVLink provides:
• All-to-all high-bandwidth peer mapping between GPUs
• Full inter-GPU memory interconnect (incl. atomics)
VOLTA MULTI-PROCESS SERVICE
[Diagram: CPU processes A, B, C submit work through the CUDA Multi-Process Service control to GPU execution on Volta GV100, with hardware-accelerated work submission and hardware isolation.]
Volta MPS enhancements:
• MPS clients submit work directly to the work queues within the GPU
• Reduced launch latency, improved launch throughput
• Improved isolation amongst MPS clients
• Address isolation with independent address spaces
• Improved quality of service (QoS)
• 3x more clients than Pascal
VOLTA MPS FOR INFERENCE
[Chart: ResNet-50 images/sec at 7 ms latency.]
• Single Volta client, no batching, no MPS: baseline
• Multiple Volta clients, no batching, using MPS: 7x faster; efficient inference deployment without a batching system, reaching 60% of the performance of batching
• Volta with batching system
V100 measured on pre-production hardware.
DEEP LEARNING IS AN HPC WORKLOAD
HPC expertise is important for success
• HPC and deep learning require a huge amount of compute power (FLOPS)
• Mainly double-precision arithmetic for HPC
• Single, half or 8-bit precision for deep learning training/inference
• HPC and deep learning use inherently parallel algorithms
• HPC needs less memory per FLOP than deep learning
• HPC is more demanding on network bandwidth than deep learning
• Data scientists like GPU-dense systems (as many GPUs as possible per node)
• HPC has more demand for scalability than deep learning, up to now
• Distributed training frameworks like Horovod (Uber) are meanwhile available
SOFTWARE CHALLENGES
• Current DIY deep learning environments are complex and time-consuming to build, test and maintain
• The same issues affect HPC and other accelerated applications
• Multiple jobs from different users need to co-exist on the same servers
[Stack diagram: Open Source Frameworks running on NVIDIA Libraries, NVIDIA Docker, NVIDIA Driver, NVIDIA GPU]
NVIDIA GPU CLOUD REGISTRY
NVIDIA GPU Cloud containerizes GPU-optimized frameworks, applications, runtimes, libraries, and the operating system, available at no charge
Deep learning: all major frameworks with multi-GPU optimizations; uses NCCL for NVLink data exchange; multi-threaded I/O to feed the GPUs (Caffe, Caffe2, CNTK, MXNet, PyTorch, TensorFlow, Theano, Torch)
HPC: NAMD, GROMACS, LAMMPS, GAMESS, RELION, Chroma, MILC
HPC visualization: ParaView with OptiX, IndeX and Holodeck with OpenGL visualization based on NVIDIA Docker 2.0; IndeX; VMD
Single NGC account for use on GPUs everywhere: https://ngc.nvidia.com
Common software stack across NVIDIA GPUs
NVIDIA SATURNV
AI supercomputer with 660 DGX-1V systems
40 PF peak FP64 performance, 660 PF DL tensor performance
• Primarily research focused
• Used internally for applied deep learning research
• Many users testing algorithms, networks, new approaches
• Embedded, robotics, automotive, hyperscale, HPC
• Partner with university research and industry collaborations
• Study convergence of data science and HPC
• All jobs are containerized
DEEP LEARNING DATA CENTER
Reference Architecture
http://www.nvidia.com/object/dgx1-multi-node-scaling-whitepaper.html
COMBINING THE STRENGTHS OF HPC AND AI
HPC
• Proven algorithms based on first-principles theory
• Proven statistical models for accurate results in multiple science domains
• Develop training data sets using first-principles models
• Incorporate AI models in semi-empirical style applications to improve throughput
• Validate new findings from AI
AI
• Implement inference models with real-time interactivity
• Train inference models to improve accuracy and comprehend more of the physical parameter space
• Analyze data sets that are simply intractable with classic statistical models
• Control and manage complex scientific experiments
• New methods to improve predictive accuracy, insight into new phenomena and response time
MULTI-MESSENGER ASTROPHYSICS
"Despite the latest developments in computational power, there is still a large gap in linking relativistic theoretical models to observations." Max Planck Institute
Background: The aLIGO (Advanced Laser Interferometer Gravitational-Wave Observatory) experiment successfully discovered signals proving Einstein's theory of general relativity and the existence of cosmic gravitational waves. While this discovery was by itself extraordinary, it is highly desirable to combine multiple observational data sources to obtain a richer understanding of the phenomena.
Challenge: The initial aLIGO discoveries were completed using classic data analytics. The processing pipeline used hundreds of CPUs, with the bulk of the detection processing done offline. That latency is far outside the range needed to activate resources such as the Large Synoptic Survey Telescope (LSST), which observes phenomena in the electromagnetic spectrum, in time to "see" what aLIGO can "hear".
Solution: A DNN was developed and trained using a data set derived from the CACTUS simulation using the Einstein Toolkit. The DNN was shown to produce better accuracy with latencies 1000x better than the original CPU-based waveform detection.
Impact: Faster and more accurate detection of gravitational waves, with the potential to steer other observational data sources.
AI QUANTUM BREAKTHROUGH
Background: Developing a new drug costs $2.5B and takes 10-15 years. Quantum chemistry (QC) simulations are important to accurately screen millions of potential drugs down to a few most promising candidates.
Challenge: QC simulation is computationally expensive, so researchers use approximations, compromising accuracy. Screening 10M drug candidates takes 5 years to compute on CPUs.
Solution: Researchers at the University of Florida and the University of North Carolina leveraged GPU deep learning to develop ANAKIN-ME, which reproduces molecular energy surfaces at microsecond rather than several-minute speeds, with extremely high (DFT) accuracy, at 1-10 millionths of the cost of current computational methods.
Impact: Faster, more accurate screening at far lower cost.
SUMMARY
• The same GPU technology enabling powerful science is also enabling the revolution in deep learning
• Deep learning is enabling many usages in science (e.g. image recognition, classification, ...)
• Applications can use DL to train neural networks with already-simulated data, and the DL network can then predict the output
• GPUs are the right technology for HPC and DL
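The third bullet, training on already-simulated data so the trained model can stand in for the simulator, can be sketched with a toy surrogate model. This is a hypothetical example: a small least-squares fit plays the role of the DNN, and a cheap analytic function plays the role of the expensive simulation.

```python
import numpy as np

rng = np.random.default_rng(42)

def simulate(params):
    """Stand-in for an expensive simulation: a smooth function of its inputs."""
    return np.sin(params[:, 0]) + 0.5 * params[:, 1] ** 2

# Build a training set from already-simulated data.
X = rng.uniform(-1, 1, size=(500, 2))
y = simulate(X)

def featurize(p):
    """Polynomial features; the least-squares fit stands in for a DNN."""
    return np.column_stack([p[:, 0], p[:, 0] ** 3, p[:, 1] ** 2, np.ones(len(p))])

coef, *_ = np.linalg.lstsq(featurize(X), y, rcond=None)

# The trained surrogate now predicts the simulation output directly,
# without running the simulation again.
X_new = rng.uniform(-1, 1, size=(100, 2))
pred = featurize(X_new) @ coef
err = np.max(np.abs(pred - simulate(X_new)))
```

Once trained, the surrogate answers queries at a tiny fraction of the simulator's cost, which is the throughput win the slide alludes to.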
Axel Koehler (akoehler@nvidia.com)
THE CONVERGENCE OF HPC AND DEEP LEARNING