Transcript of SCALING DEEP LEARNING TO EXASCALE (ACM Gordon Bell Prize 2018)
Pedro Mario Cruz e Silva, Solutions Architect Manager, Latin America | Global Energy Team

SCALING DEEP LEARNING TO EXASCALE
ACM GORDON BELL PRIZE 2018
200B CORE HOURS OF LOST SCIENCE
Data Center Throughput is the Most Important Thing for HPC

[Chart: National Science Foundation (NSF XSEDE) supercomputing resources, 2009-2015. Computing resources requested vs. computing resources available, in billions of Normalized Units.]

Source: NSF XSEDE. Data: https://portal.xsede.org/#/gallery
NU = Normalized Computing Units, used to compare compute resources across supercomputers, based on the result of the High Performance LINPACK benchmark run on each system.
RISE OF GPU COMPUTING

[Chart, 1980-2020, log scale (10^2 to 10^7): GPU-computing performance grows 1.5X per year, reaching 1000X by 2025; single-threaded performance grew 1.5X per year before slowing to 1.1X per year.]

Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten. New plot and data collected for 2010-2015 by K. Rupp.

The full stack: APPLICATIONS | SYSTEMS | ALGORITHMS | CUDA | ARCHITECTURE
BEYOND MOORE'S LAW
Progress Of Stack In 6 Years

[Chart, March 2013 to March 2019, log scale (1 to 1000 relative performance): GPU-accelerated computing pulls far ahead of CPU performance (Moore's Law).]

2013 Accelerated Server with Fermi:
Base OS: CentOS 6.2 | Resource Mgr: r304 | CUDA 5.0 | cuBLAS 5.0 | cuSPARSE 5.0 | cuRAND 5.0 | cuFFT 5.0 | NPP 5.0 | Thrust 1.5.3

2019 Accelerated Server with Volta:
Base OS: Ubuntu 16.04 | Resource Mgr: r384 | CUDA 10.0 | cuBLAS 10.0 | cuSPARSE 10.0 | cuSOLVER 10.0 | cuRAND 10.0 | cuFFT 10.0 | NPP 10.0 | Thrust 1.9.0
DIGITAL SCIENCE: HPC + AI + DATA
FUSION OF HPC & AI
HPC AI
VOLTA TENSOR CORE GPU
GPU FUSES HPC & AI COMPUTING
MULTI-PRECISION COMPUTING
HPC (Simulation) – FP64, FP32
AI (Deep Learning) – FP16, INT8
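The multi-precision split above can be made concrete by comparing machine epsilon across the floating-point formats the slide names. A minimal numpy sketch (INT8 is integer and has no epsilon, so it is omitted):

```python
import numpy as np

# Machine epsilon for the precisions on the slide:
# FP64/FP32 serve simulation; FP16 serves deep learning.
for dtype in (np.float64, np.float32, np.float16):
    print(dtype.__name__, np.finfo(dtype).eps)

# A tiny increment FP64 resolves but FP16 rounds away entirely.
tiny = np.finfo(np.float64).eps
x64 = np.float64(1.0) + tiny
x16 = np.float16(1.0) + np.float16(tiny)  # tiny underflows to 0 in FP16
assert x64 > 1.0
assert x16 == np.float16(1.0)
```

This is why mixed-precision training keeps a higher-precision copy of values that accumulate many small updates.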
AI – A NEW INSTRUMENT FOR SCIENCE
Dramatically Improves Accuracy and Time-to-Solution

HPC:
> Algorithms based on first-principles theory
> Proven models for accurate results

AI:
> Neural networks that learn patterns from large data sets
> Improved predictive accuracy and faster response time

Target problems: commercially viable fusion energy; understanding cosmological dark energy and matter; clinically viable precision medicine; improvement and validation of the Standard Model of physics; climate/weather forecasts with ultra-high fidelity
“ACCELERATING EULERIAN FLUID SIMULATION WITH CONVOLUTIONAL NETWORKS”
Tompson, J., Schlachter, K., Sprechmann, P., & Perlin, K. (2016). Accelerating Eulerian Fluid Simulation With Convolutional Networks. arXiv preprint arXiv:1607.03597.
AI FOR SCIENCE
Transformative Tool To Accelerate The Pace of Scientific Innovation

Improves Accuracy: enabling realization of full scientific potential
Accelerates Time to Solution: unlocking the use of science in exciting new ways

300,000X faster: predict molecular energetics (drug discovery)
5,000X faster: process LIGO signal (understanding the universe)
Weeks to 10 milliseconds: analyze gravitational lensing (astrophysics)
14X faster: generate Bose-Einstein condensate (physics)
90% accuracy: fusion sustainment (clean energy)
33% faster: track neutrinos (particle physics)
70% accuracy: score protein-ligand binding (drug discovery)
11% higher accuracy: monitor Earth's vitals (climate)
THE PROBLEM
IMAGE SEGMENTATION: Pattern Detection for Characterizing Extreme Weather
Atmospheric rivers (ARs) are labeled in blue, while tropical cyclones (TCs) are labeled in red
CLIMATE DATASET AND GROUND TRUTH LABELS

0.25-degree Community Atmosphere Model (CAM5) output.
Climate variables are stored on a 1152x768 spatial grid, with a temporal resolution of 3 hours.
All 16 available variables are used (water vapor, wind, precipitation, temperature, pressure, etc.).
Climate model output is processed with the Toolkit for Extreme Climate Analysis to identify TCs.
A flood-fill algorithm creates the spatial masks of ARs.
There are about 63K high-resolution samples in total, split into 80% training, 10% test, and 10% validation sets.
The pixel mask labels correspond to 3 classes:
1) Tropical Cyclone (TC)
2) Atmospheric River (AR)
3) Background (BG)
The climate data used currently totals 3.5 TB.
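As a rough illustration of the 80/10/10 split described above (a sketch, not the authors' actual pipeline; the sample count is scaled down from ~63K for brevity):

```python
import numpy as np

# Shuffle sample indices, then carve out disjoint train/test/validation sets.
rng = np.random.default_rng(0)
n_samples = 1000  # stand-in for the ~63K climate samples
idx = rng.permutation(n_samples)

n_train = int(0.8 * n_samples)
n_test = int(0.1 * n_samples)
train, test, val = np.split(idx, [n_train, n_train + n_test])

assert len(train) == 800 and len(test) == 100 and len(val) == 100
assert len(set(train) | set(test) | set(val)) == n_samples  # disjoint cover
```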
THE TEAM
NERSC & NVIDIA
HARDWARE
NVIDIA POWERS TODAY'S FASTEST SUPERCOMPUTERS
22 of Top 25 Greenest

ORNL Summit (World's Fastest): 27,648 GPUs | 149 PF
LLNL Sierra (World's 2nd Fastest): 17,280 GPUs | 95 PF
Piz Daint (Europe's Fastest): 5,704 GPUs | 21 PF
ABCI (Japan's Fastest): 4,352 GPUs | 20 PF
Total Pangea 3 (Fastest Industrial): 3,348 GPUs | 18 PF
NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER
27,648 Volta Tensor Core GPUs
Summit Becomes First System To Scale The 100 Petaflops Milestone
HPC: 122 PF | AI: 3 EF
IBM AC922: 6x V100 + 2x P9 (Water Cooled)
TESLA V100 TENSOR CORE GPU
World's Most Powerful Data Center GPU

5,120 CUDA cores
640 NEW Tensor cores
7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
20MB SM RF | 16MB Cache
32 GB HBM2 @ 900GB/s | 300GB/s NVLink
TENSOR CORE: 4x4x4 matrix multiply and accumulate
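The 4x4x4 multiply-accumulate can be emulated in numpy to show the key property: FP16 inputs, higher-precision (FP32) accumulation of D = A @ B + C. A sketch of the semantics, not actual tensor core code:

```python
import numpy as np

# 4x4 FP16 input tiles, FP32 accumulator, as in the tensor core operation.
rng = np.random.default_rng(42)
A = rng.standard_normal((4, 4)).astype(np.float16)
B = rng.standard_normal((4, 4)).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# Emulate FP32 accumulation: upcast the FP16 inputs before the matmul.
D = A.astype(np.float32) @ B.astype(np.float32) + C
assert D.dtype == np.float32

# An all-FP16 matmul agrees only up to FP16 rounding error.
D16 = (A @ B).astype(np.float32) + C
assert np.allclose(D, D16, atol=0.05)
```

Accumulating the partial products in FP32 is what lets tensor cores deliver FP16 throughput without FP16's accumulation error growth.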
TENSOR CORES FOR SCIENCE
Multi-precision computing

[Chart: V100 TFLOPS. FP64: 7.8, FP32: 15.7, Tensor (multi-precision): 125.]

AI-POWERED WEATHER PREDICTION: FP16/FP32, 1.15x ExaOPS
PLASMA FUSION APPLICATION: FP16 solver, 3.5x faster
EARTHQUAKE SIMULATION: FP16-FP21-FP32-FP64, 25x faster
SOFTWARE: PERFORMANCE & PRODUCTIVITY
POWERING THE DEEP LEARNING ECOSYSTEM
NVIDIA SDK Accelerates Every Major Framework
COMPUTER VISION
OBJECT DETECTION IMAGE CLASSIFICATION
SPEECH & AUDIO
VOICE RECOGNITION LANGUAGE TRANSLATION
NATURAL LANGUAGE PROCESSING
RECOMMENDATION ENGINES SENTIMENT ANALYSIS
DEEP LEARNING FRAMEWORKS
NVIDIA DEEP LEARNING SDK and CUDA
developer.nvidia.com/deep-learning-software
GPU-ACCELERATED LIBRARIES
"Drop-in" Acceleration for Your Applications

DEEP LEARNING: cuDNN | TensorRT | DeepStream SDK
LINEAR ALGEBRA: cuBLAS | cuSPARSE | cuSOLVER
PARALLEL ALGORITHMS: nvGRAPH | NCCL
SIGNAL, IMAGE & VIDEO: cuFFT | NVIDIA NPP | CODEC SDK
MATH & UTILITY: CUDA math library | cuRAND
NVIDIA DEEP LEARNING SDK
Powerful tools and libraries for designing and deploying GPU-accelerated deep learning applications
High performance building blocks for training and deploying deep neural networks on NVIDIA GPUs
Industry vetted deep learning algorithms and linear algebra subroutines for developing novel deep neural networks
Multi-GPU and multi-node scaling that accelerates training on up to eight GPUs
High Performance GPU-acceleration for Deep Learning
developer.nvidia.com/deep-learning-software
Deep Learning Primitives
Multi-GPU Communication
Linear Algebra
Programmable Inference Accelerator
Sparse Matrix Operations
Deep Learning for Video Analytics
NVIDIA cuDNN
Deep Learning Primitives
developer.nvidia.com/cudnn

High-performance building blocks for deep learning frameworks

Drop-in acceleration for widely used deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, PyTorch, TensorFlow and others

Accelerates industry-vetted deep learning algorithms, such as convolutions, LSTM RNNs, fully connected, and pooling layers

Fast deep learning training performance tuned for NVIDIA GPUs

"NVIDIA has improved the speed of cuDNN with each release while extending the interface to more operations and devices at the same time."
— Evan Shelhamer, Lead Caffe Developer, UC Berkeley

[Chart: Deep learning training performance (images/second, up to ~12,000): 8x K80 (cuDNN 2), 8x Maxwell (cuDNN 4), DGX-1 (cuDNN 6, NCCL 1.6), DGX-1V (cuDNN 7, NCCL 2).]
NVIDIA COLLECTIVE COMMUNICATIONS LIBRARY (NCCL)
Multi-GPU and Multi-node Collective Communication Primitives
developer.nvidia.com/nccl

Open-source, high-performance multi-GPU and multi-node collective communication primitives optimized for NVIDIA GPUs

Fast routines for multi-GPU, multi-node acceleration that maximize inter-GPU bandwidth utilization

Easy to integrate and MPI compatible. Uses automatic topology detection to scale HPC and deep learning applications over PCIe and NVLink

Accelerates leading deep learning frameworks such as Caffe2, Microsoft Cognitive Toolkit, MXNet, PyTorch and more

Multi-node: InfiniBand verbs, IP sockets | Multi-GPU: NVLink, PCIe | Automatic topology detection

[Chart: Scaling training to 2048 GPUs. ResNet-50, ImageNet, 90 epochs. Preferred Networks (Feb '17, 128 TITAN X Maxwell): 4.4 hours; Facebook (Jun '17, 256 Tesla P100): 60 min; IBM (Aug '17, 256 Tesla P100): 48 min; Preferred Networks (Nov '17, 1024 Tesla P100): 15 min; Tencent (Jul '18, 2048 Tesla P40): 6.6 min.]
HOROVOD (UBER)
Horovod: fast and easy distributed deep learning in TensorFlow
Sergeev, Alexander, and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow." arXiv preprint arXiv:1802.05799 (2018).
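Conceptually, Horovod wraps each training step with an allreduce that averages gradients across ranks. A pure-Python sketch of that averaging semantics (no Horovod or MPI here; the "ranks" are simulated in-process):

```python
import numpy as np

def allreduce_average(grads_per_rank):
    """Element-wise average of per-rank gradients (emulates what
    Horovod's averaging allreduce computes across workers)."""
    return np.mean(np.stack(grads_per_rank), axis=0)

# Four simulated ranks, each with a gradient from its own data shard.
grads = [np.full(3, float(rank)) for rank in range(4)]  # ranks 0..3
avg = allreduce_average(grads)
assert np.allclose(avg, [1.5, 1.5, 1.5])  # mean of 0, 1, 2, 3
```

Because every rank applies the same averaged gradient, all model replicas stay identical step after step, which is what makes data parallelism equivalent to training with a larger batch.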
FULLY CONVOLUTIONAL NETWORKS (FCN)
"Fully Convolutional Networks for Semantic Segmentation", Long, Shelhamer & Darrell, 2015
NVIDIA CONFIDENTIAL. DO NOT DISTRIBUTE.
SEMANTIC SEGMENTATION: FCN vs SDS
DEEP NEURAL NETWORKS: Tiramisu (left) and DeepLabv3+ (right)
INNOVATIONS (1)
• Weighted loss
• Layer-wise adaptive rate control (LARC)
• Multi-channel segmentation
• Gradient lag
• Modifications to the neural network architectures
Deep Learning Innovations
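One of the listed techniques, layer-wise adaptive rate control (LARC), can be sketched in a few lines: each layer's learning rate is scaled by the ratio of its weight norm to its gradient norm, capped at the base rate, which stabilizes very large-batch training. The function name and the trust coefficient `eta` below are illustrative, not the paper's exact formulation:

```python
import numpy as np

def larc_lr(weights, grad, base_lr, eta=0.02, eps=1e-8):
    """Layer-wise rate: eta * ||w|| / ||g||, clipped at the base rate."""
    local_lr = eta * np.linalg.norm(weights) / (np.linalg.norm(grad) + eps)
    return min(base_lr, local_lr)

w = np.ones(10)           # ||w|| = sqrt(10)
g = np.ones(10) * 100.0   # a very large gradient for this layer
lr = larc_lr(w, g, base_lr=1.0)
assert lr < 1.0  # the large gradient forces a smaller layer-wise rate
```

The effect is that layers with disproportionately large gradients take proportionally smaller steps, instead of one global learning rate destabilizing them.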
INNOVATIONS (2)
• High speed parallel data staging
• Optimized data ingestion pipeline
• Hierarchical all-reduce
System Innovations
INNOVATIONS (2)
• High speed parallel data staging
• Read 1500 images per node (250 per GPU)
• 800GB of high-speed SSD storage on each node
• Distributed data staging system that first divides the data set into disjoint pieces, reads each piece onto some nodes, then copies the pieces to the other nodes
System Innovations
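The disjoint-pieces idea can be sketched as simple round-robin sharding over the file list (file names and counts below are illustrative, not the actual staging system):

```python
def shard(files, n_nodes):
    """Split a file list into n_nodes disjoint, near-equal pieces;
    each node reads its piece from the parallel file system, then
    pieces are exchanged node-to-node onto local SSD."""
    return [files[i::n_nodes] for i in range(n_nodes)]

files = [f"sample_{i:05d}.h5" for i in range(10)]
pieces = shard(files, n_nodes=3)

assert sum(len(p) for p in pieces) == len(files)          # nothing lost
assert len(set().union(*map(set, pieces))) == len(files)  # disjoint cover
```

Reading each piece from the shared file system exactly once, then exchanging node-to-node, avoids every node hammering the parallel file system for the full dataset.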
INNOVATIONS (2)
• Optimized data ingestion pipeline
• A TensorFlow input pipeline reads the input files and converts them into the tensors that are fed through the network (read and convert to TFRecords)
• Serialization is eliminated by enabling the prefetching option of TensorFlow datasets
• The HDF5 library serializes calls, so parallel worker threads alone do not scale
• Using the Python multiprocessing module, these parallel worker threads become parallel worker processes, each using its own instance of the HDF5 library
System Innovations
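The thread-to-process switch can be sketched with the multiprocessing module. The HDF5 read is simulated below; with h5py, each worker process would open its own file handle, giving each its own HDF5 library instance:

```python
import multiprocessing as mp

def read_sample(index):
    # Placeholder for a real per-process read, e.g.:
    #   with h5py.File(path, "r") as f: return f["data"][index]
    return index * index

if __name__ == "__main__":
    # Processes, not threads: each worker escapes the HDF5 serialization
    # (and the GIL) because it runs in its own interpreter.
    with mp.Pool(processes=4) as pool:
        results = pool.map(read_sample, range(8))
    assert results == [i * i for i in range(8)]
```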
INNOVATIONS (2)
• Hierarchical all-reduce
• Horovod is a Python module that uses MPI to transform a single-process TensorFlow application into a data-parallel implementation
• Each MPI rank creates its own identical copy of the TensorFlow operation graph.
• The first issue was a bottleneck on the first rank, which acts as a centralized scheduler for Horovod operations. Solution: organize the ranks in a communication tree.
• The existing Horovod implementation is able to reduce data residing on GPUs in two different ways, either by a standard MPI_Allreduce or by using the NVIDIA Collective Communications Library (NCCL)
• NCCL is better for intra-node (exploits NVLINK) and Standard MPI is better for inter-node
System Innovations
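The two-level reduction can be sketched in plain Python: sum within each node first (where NCCL exploits NVLink), then across nodes (where standard MPI performs better). Nodes and GPUs are simulated here as nested lists:

```python
import numpy as np

def hierarchical_allreduce(grads_by_node):
    """Two-level allreduce sketch: intra-node reduction first,
    then inter-node reduction of the per-node partial sums."""
    node_sums = [np.sum(np.stack(gpus), axis=0) for gpus in grads_by_node]
    total = np.sum(np.stack(node_sums), axis=0)
    return total  # every GPU would then receive this broadcast result

grads = [[np.ones(2), np.ones(2)],   # node 0: 2 GPUs
         [np.ones(2), np.ones(2)]]   # node 1: 2 GPUs
total = hierarchical_allreduce(grads)
assert np.allclose(total, [4.0, 4.0])
```

Splitting the reduction this way sends only one pre-reduced buffer per node over the network, instead of one per GPU.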
RESULTS
CLIMATE RESULTS: Atmospheric Rivers (AR) in Blue and Tropical Cyclones (TC) in Red
SCALING RESULTS
OVERALL RESULTS
Tiramisu (Piz Daint) and DeepLabv3+ (Summit)

| Network | #GPUs | GPU Arch | Peak (PFLOPS) | Sustained (PFLOPS) | Efficiency |
| --- | --- | --- | --- | --- | --- |
| Tiramisu (FP32) | 5,300 | P100 (CUDA) | 26.6 | 21.0 | 79.0% |
| DeepLabv3+ (FP32) | 27,360 | V100 (CUDA) | 359.2 | 325.8 | 90.7% |
| DeepLabv3+ (FP16) | 27,360 | V100 (Tensor Cores) | 1130.0 | 999.0 | 88.4% |
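The efficiency column is simply sustained over peak throughput; spot-checking the FP16 row:

```python
# Efficiency = sustained / peak, for the DeepLabv3+ (FP16) row.
peak_pflops, sustained_pflops = 1130.0, 999.0
efficiency = sustained_pflops / peak_pflops
assert round(100 * efficiency, 1) == 88.4
```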
PUSHING AI COMPUTING LIMITS
NVSWITCH
World's Highest Bandwidth On-node Switch

7.2 Terabits/sec (900 GB/sec)
18 NVLink ports | 50 GB/s per port, bi-directional
Fully connected crossbar
2 billion transistors | 47.5mm x 47.5mm package
NVIDIA DGX-2
THE LARGEST GPU EVER CREATED
2 PFLOPS | 512GB HBM2 | 10 kW | 350 lbs
MORE SCIENTIFIC EXAMPLES
GALAXY CLASSIFICATION: Merging vs Not-Merging
GALAXY CLASSIFICATION: Merging vs Not-Merging
1st LATIN AMERICA SUPERCOMPUTER: PETROBRAS FENIX

Fenix is at the 142nd position in the Top500 list.
576x V100 = 288 nodes with 2 GPUs/node
Source: Top500.org
TRAINING SET
Features (Seismic) Labels
TRAINING IMAGES: Parihaka dataset (SEG-Y)
https://wiki.seg.org/wiki/Parihaka-3D
METHOD
Conditional Probabilistic Deep Auto-Encoder

Based on the state-of-the-art image compression work (CVPR 2018): "Conditional Probability Models for Deep Image Compression", Mentzer et al.

The original work operates on 8-bit depth images. Changes for 32-bit data and specific training protocols were made for 3D post-stack seismic data.
EXPERIMENTS: Visual Comparison
NEW SUPERCOMPUTERS IN BRAZIL
LNCC (Rio de Janeiro): 376x V100 = 96 nodes with 4 GPUs/node
SENAI-CIMATEC (Salvador): 312x V100 = 78 nodes with 4 GPUs/node
LEARN & SHARE MORE
CONNECT
Connect with hundreds of experts from top industry, academic, startup, and government organizations
LEARN
Gain insight and valuable hands-on training through more than 500 sessions
DISCOVER
See how GPU technology is creating breakthroughs in deep learning, cybersecurity, data science, healthcare and more
INNOVATE
Explore disruptive innovations that can transform your work
JOIN US AT GTC 2020 | USE VIP CODE XXXXX FOR 25% OFF
March 22—26, 2020 | Silicon Valley
Don’t miss the premier AI conference.
www.nvidia.com/gtc
THE LATEST DEEP LEARNING DEVELOPER TOOLS
March 22 | Full-Day Workshops
March 23 - 26 | Conference & Training

Get the hands-on experience you need to transform the future of AI, high-performance computing and more with NVIDIA's Deep Learning Institute (DLI).

Register for GTC 2020 to earn certification in full-day workshops, join instructor-led sessions, and start self-paced training.

www.nvidia.com/en-us/gtc/sessions/training/
developer.nvidia.com
NVIDIA DEEP LEARNING INSTITUTE
Hands-on self-paced and instructor-led training in deep learning and accelerated computing for developers

Request onsite instructor-led workshops at your organization: www.nvidia.com/requestdli
Take self-paced labs online: www.nvidia.com/dlilabs
Download the course catalog, view upcoming workshops, and learn about the University Ambassador Program: www.nvidia.com/dli

Course areas: Deep Learning Fundamentals | Accelerated Computing Fundamentals | Game Development & Digital Content | Finance | Intelligent Video Analytics | Medical Image Analysis | Autonomous Vehicles | Genomics
More industry-specific training coming soon…
NVIDIA HW GRANT PROGRAM

Titan V (Volta): Scientific Computing | HPC | Deep Learning
Jetson TX2 (Dev Kit): Robotics | Autonomous Machines
Quadro P6000: Scientific Visualization | Virtual Reality

https://developer.nvidia.com/academic_gpu_seeding