ERAD-RS’2019 TESLA PLATFORM HPC & AI · 2019. 4. 27.
Pedro Mario Cruz e Silva, Solutions Architect Manager, Latin America | Global Energy Team
ERAD-RS’2019: TESLA PLATFORM – HPC & AI
RISE OF GPU COMPUTING
[Figure: performance on a log scale (10^2 to 10^7), 1980 to 2020. Single-threaded CPU performance grew 1.5X per year, then slowed to 1.1X per year; GPU-computing performance grows 1.5X per year, projected to reach 1000X by 2025. Stack layers shown: applications, systems, algorithms, CUDA, architecture.]
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010 to 2015 by K. Rupp.
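The two growth rates on this chart compound year over year. As a rough check, assuming a constant annual factor (the measured data only approximates one), this sketch counts the whole years a given factor needs to reach a target multiple; `years_to_reach` and `growth_after` are illustrative names, not from any library.

```c
#include <assert.h>

/* Illustrative helper: smallest whole number of years n with factor^n >= target,
 * assuming the same growth factor every year. */
int years_to_reach(double factor, double target)
{
    assert(factor > 1.0);            /* growth must actually compound */
    double level = 1.0;
    int years = 0;
    while (level < target) {
        level *= factor;             /* one more year of growth */
        years++;
    }
    return years;
}

/* Multiple reached after a given number of years at a constant factor. */
double growth_after(double factor, int years)
{
    double level = 1.0;
    for (int i = 0; i < years; i++)
        level *= factor;
    return level;
}
```

At 1.5X per year a 1000X gain takes 18 years; at 1.1X per year it takes 73, which is why the two curves on the log-scale plot diverge so quickly.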
ELEVEN YEARS OF GPU COMPUTING
Timeline, 2006 to 2017:
• 2006: CUDA Launched
• 2008: World’s First GPU Top500 System
• 2010: Fermi: World’s First HPC GPU
• Discovered How H1N1 Mutates to Resist Drugs
• 2012: Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs
• 2012: AlexNet Beats Expert Code by Huge Margin Using GPUs
• Stanford Builds AI Machine Using GPUs
• World’s First Atomic Model of HIV Capsid
• World’s First 3-D Mapping of Human Genome
• Google Outperforms Humans in ImageNet
• GPU-Trained AI Machine Beats World Champion in Go
• 2017: Top 13 Greenest Supercomputers Powered by NVIDIA GPUs
200B CORE HOURS OF LOST SCIENCE
Data Center Throughput Is the Most Important Thing for HPC
[Chart: National Science Foundation (NSF XSEDE) supercomputing resources, 2009 to 2015, in normalized units (billions; axis 0 to 400): computing resources requested vs. computing resources available.]
Source: NSF XSEDE data, https://portal.xsede.org/#/gallery. NU = Normalized Computing Units, used to compare compute resources across supercomputers; they are based on the result of the High Performance LINPACK benchmark run on each system.
BEYOND MOORE’S LAW
[Chart: relative performance (log scale, 1 to 1000), March 2013 to March 2019: GPU-accelerated computing vs. CPU (Moore’s Law).]
Progress of Stack in 6 Years (2013 accelerated server with Fermi vs. 2019 accelerated server with Volta):

| Component | 2013: Fermi | 2019: Volta |
|---|---|---|
| Base OS | CentOS 6.2 | Ubuntu 16.04 |
| Resource Mgr | r304 | r384 |
| CUDA | 5.0 | 10.0 |
| Thrust | 1.5.3 | 1.9.0 |
| NPP | 5.0 | 10.0 |
| cuSPARSE | 5.0 | 10.0 |
| cuSOLVER | not listed | 10.0 |
| cuRAND | 5.0 | 10.0 |
| cuFFT | 5.0 | 10.0 |
| cuBLAS | 5.0 | 10.0 |
NVIDIA DATA CENTER PLATFORM
Single Platform Drives Utilization and Productivity
CUSTOMER USE CASES
• Consumer Internet & Industry Applications: Speech, Translate, Recommender; Manufacturing, Healthcare, Finance
• Scientific Applications: Molecular Simulations, Weather Forecasting, Seismic Mapping
• Virtual Graphics: Creative & Technical, Knowledge Workers
APPS & FRAMEWORKS, CUDA-X (NVIDIA SDK & LIBRARIES)
• Deep Learning: cuDNN, CUTLASS, TensorRT
• Machine Learning: cuML, cuDF, cuGRAPH
• HPC: cuFFT, OpenACC; +550 applications, e.g. Amber, NAMD
• Virtual GPU: vDWS, vPC, vAPPS; +600 applications
• CUDA & Core Libraries: cuBLAS | NCCL
TESLA GPUs & SYSTEMS: Tesla GPU, NVIDIA DGX Family, NVIDIA HGX, System OEM, Cloud
TRADITIONAL HPC
“SCALABILITY OF CPU AND GPU SOLUTIONS OF THE PRIME ELLIPTIC CURVE DISCRETE LOGARITHM PROBLEM”
Jairo Panetta (ITA), Paulo Souza (ITA), Luiz Laranjeira (UnB), Carlos Teixeira Jr (UnB)
[Chart: visit speed (×10^6) by platform: 1 STI PS3: 25.99 | K40 + CUDA 8.0: 29.77 | P100 + CUDA 8.0: 77.84 | V100 + CUDA 9.0: 197.33]
INDUSTRY EMBRACING GPU SUPERCOMPUTING
• Real-time Fleet Analytics: streamline routes to save >$28M
• Engineering Design: accelerate from hours to minutes
• Oil and Gas Discovery: 10X increase in data processing
“IBM-NVIDIA SERVERS ACHIEVE HIGH-PERFORMANCE COMPUTING MILESTONE IN OIL INDUSTRY” (25 April 2017)
https://www.forbes.com/sites/aarontilley/2017/04/25/ibm-nvidia-servers-achieve-high-performance-computing-milestone-in-oil-industry/#8e3b56626330
1 Billion Cells Reservoir Model
• ExxonMobil using the Blue Waters facility at NCSA: 22,400 servers × 24 processors = 537,600 total CPUs
• ECHELON simulation on GPUs (Stone Ridge Technology): 30 servers × 4 GPUs = 120 total GPUs
RESERVOIR SIMULATION
Performance Comparison

| Company | Simulator/Method | Model | Production | Simulation Runtime | Reference | Cores/Servers |
|---|---|---|---|---|---|---|
| Saudi Aramco | GIGAPOWERS, three-phase black oil | 1.03 billion cells, 3,000 wells | 60 years | 4 days | [1] | |
| Saudi Aramco | GIGAPOWERS, three-phase black oil | 1.03 billion cells, 3,000 wells | 60 years | 21 hours | [2] | 5,640 cores / 470 servers |
| Total/Schlumberger | INTERSECT | 1.1 billion cells, 361 wells | 20 years | 10.5 hours | [3] | 576 cores / 288 servers |
| ExxonMobil | ? | 1 billion cells | ? | ? | ? | 716,800 cores / 22,400 servers |
| Stone Ridge | Echelon, three-phase black oil | 1.01 billion cells, 1,000 wells | 45 years | 92 minutes | ? | 120 GPUs / 30 servers |

[1] SPE 119272, “A Next-Generation Parallel Reservoir Simulator for Giant Reservoirs”, A. Dogru et al., 2009 SPE Reservoir Simulation Symposium.
[2] SPE 142297, “New Frontiers in Large Scale Reservoir Simulation”, A. Dogru et al., 2011 SPE Reservoir Simulation Symposium.
[3] IPTC 17648, “Giga Cell Compositional Simulation”, E. Obi et al., 2014 International Petroleum Technology Conference.
ENI HPC4 – GREEN DATA CENTER: The World’s Most Powerful Industrial System
3,200 NVIDIA Tesla P100 GPUs
100,000 high-resolution reservoir model simulation runs, taking into account geological uncertainties, in a record time of 15 hours.
Source: https://www.eni.com/en_IT/innovation/technological-platforms/maximize-recovery/hpc.page#
DIGITAL SCIENCE: HPC + AI + DATA
FUSION OF HPC & AI
VOLTA TENSOR CORE GPU: GPU FUSES HPC & AI COMPUTING
MULTI-PRECISION COMPUTING
• HPC (Simulation): FP64, FP32
• AI (Deep Learning): FP16, INT8
AI – A NEW INSTRUMENT FOR SCIENCE
Dramatically Improves Accuracy and Time-to-Solution
• HPC: algorithms based on first-principles theory; proven models for accurate results
• AI: neural networks that learn patterns from large data sets; improved predictive accuracy and faster response time
Example targets: commercially viable fusion energy; understanding cosmological dark energy and matter; clinically viable precision medicine; improvement and validation of the Standard Model of Physics; climate/weather forecasts with ultra-high fidelity.
AI FOR SCIENCE: Transformative Tool to Accelerate the Pace of Scientific Innovation
Accelerates Time to Solution: unlocking the use of science in exciting new ways
• 300,000X faster: predict molecular energetics (Drug Discovery)
• 5,000X faster: process LIGO signal (Understanding Universe)
• Weeks to 10 milliseconds: analyze gravitational lensing (Astrophysics)
• 14X faster: generate Bose-Einstein condensate (Physics)
• 33% faster: track neutrinos (Particle Physics)
Improves Accuracy: enabling realization of full scientific potential
• 90% accuracy: fusion sustainment (Clean Energy)
• 70% accuracy: score protein ligand (Drug Discovery)
• 11% higher accuracy: monitor Earth’s vitals (Climate)
TESLA V100 TENSOR CORE GPU: World’s Most Powerful Data Center GPU
• 5,120 CUDA cores
• 640 new Tensor Cores
• 7.8 FP64 TFLOPS | 15.7 FP32 TFLOPS | 125 Tensor TFLOPS
• 20MB SM RF | 16MB cache
• 32 GB HBM2 @ 900GB/s | 300GB/s NVLink
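The headline TFLOPS figures can be reproduced from unit counts and clock speed. This sketch assumes the 1.53 GHz clock quoted for the Tesla V100-SXM2 in this deck's SPEC ACCEL footnote, one fused multiply-add (2 FLOPs) per CUDA core per clock, an FP64 rate of half the FP32 rate, and 64 FMAs per Tensor Core per clock (one 4x4x4 multiply-accumulate); these are our stated assumptions so the arithmetic can be checked, not vendor formulas.

```c
#include <assert.h>

/* Assumed V100-SXM2 parameters (see the lead-in); not read from hardware. */
#define V100_CLOCK_GHZ    1.53
#define V100_CUDA_CORES   5120.0
#define V100_TENSOR_CORES 640.0

/* Peak TFLOPS = units x FLOPs per unit per clock x clock in GHz / 1000. */
double peak_tflops(double units, double flops_per_clock, double clock_ghz)
{
    assert(units > 0.0 && flops_per_clock > 0.0 && clock_ghz > 0.0);
    return units * flops_per_clock * clock_ghz / 1000.0;
}

/* FP32: one fused multiply-add (2 FLOPs) per CUDA core per clock. */
double v100_fp32_tflops(void)
{
    return peak_tflops(V100_CUDA_CORES, 2.0, V100_CLOCK_GHZ);
}

/* FP64: assumed to run at half the FP32 rate. */
double v100_fp64_tflops(void)
{
    return v100_fp32_tflops() / 2.0;
}

/* Tensor: 64 FMAs per 4x4x4 operation = 128 FLOPs per Tensor Core per clock. */
double v100_tensor_tflops(void)
{
    return peak_tflops(V100_TENSOR_CORES, 64.0 * 2.0, V100_CLOCK_GHZ);
}
```

The results, about 15.7, 7.8, and 125.3 TFLOPS, match the rounded numbers on the slide; these are peaks, and sustained application performance is lower.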
TENSOR CORE: 4x4x4 matrix multiply and accumulate
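What one Tensor Core operation computes can be written as a plain scalar loop: D = A × B + C on 4x4 tiles, which is 4 × 4 × 4 = 64 multiply-adds. On V100 the A and B inputs are FP16 and the accumulator is FP16 or FP32; standard C has no portable half-precision type, so this reference sketch (our illustration of the arithmetic, not how the hardware is programmed) uses `float` throughout.

```c
#include <assert.h>

#define TILE 4

/* Reference 4x4x4 matrix multiply-accumulate: D = A * B + C.
 * A, B and C are only read. Returns the number of multiply-adds (64). */
int mma_4x4x4(float A[TILE][TILE], float B[TILE][TILE],
              float C[TILE][TILE], float D[TILE][TILE])
{
    int fma_count = 0;
    for (int i = 0; i < TILE; i++) {
        for (int j = 0; j < TILE; j++) {
            float acc = C[i][j];              /* start from the accumulator tile */
            for (int k = 0; k < TILE; k++) {
                acc += A[i][k] * B[k][j];     /* one multiply-add per (i, j, k) */
                fma_count++;
            }
            D[i][j] = acc;
        }
    }
    assert(fma_count == TILE * TILE * TILE);
    return fma_count;
}
```

In CUDA, Tensor Cores are reached through warp-level matrix operations (the "Warp Matrix" feature listed for CUDA 10.0) rather than through scalar code like this.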
TENSOR CORES FOR SCIENCE: Multi-precision Computing
[Chart: V100 peak TFLOPS: FP64 7.8 | FP32 15.7 | Tensor (multi-precision) 125]
• AI-Powered Weather Prediction: FP16/FP32, 1.15 ExaOPS
• Plasma Fusion Application: FP16 solver, 3.5x faster
• Earthquake Simulation: FP16-FP21-FP32-FP64, 25x faster
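The slide does not describe how these applications combine precisions. One widely used pattern, shown here as an illustration and not as the method of any of the cited codes, is mixed-precision iterative refinement: solve in low precision, compute the residual in high precision, and apply a low-precision correction. In this sketch `float` stands in for FP16 and `double` for FP64, and the Gaussian elimination has no pivoting, so it assumes a small, well-conditioned, diagonally dominant system.

```c
#include <assert.h>

#define N 3

static double absd(double x) { return x < 0.0 ? -x : x; }

/* Low-precision solve of A y = r: Gaussian elimination carried out in float.
 * No pivoting, so A is assumed diagonally dominant. */
static void solve_float(double A[N][N], double r[N], double y[N])
{
    float M[N][N], v[N];
    for (int i = 0; i < N; i++) {
        v[i] = (float)r[i];
        for (int j = 0; j < N; j++) M[i][j] = (float)A[i][j];
    }
    for (int k = 0; k < N; k++)                  /* forward elimination */
        for (int i = k + 1; i < N; i++) {
            float f = M[i][k] / M[k][k];
            for (int j = k; j < N; j++) M[i][j] -= f * M[k][j];
            v[i] -= f * v[k];
        }
    for (int i = N - 1; i >= 0; i--) {           /* back substitution */
        float s = v[i];
        for (int j = i + 1; j < N; j++) s -= M[i][j] * (float)y[j];
        y[i] = (double)(s / M[i][i]);
    }
}

/* Iterative refinement: residuals in double, corrections in float.
 * Returns the number of correction steps taken (max_iter if not converged). */
int refine(double A[N][N], double b[N], double x[N], int max_iter, double tol)
{
    double r[N], d[N];
    assert(max_iter > 0 && tol > 0.0);
    solve_float(A, b, x);                        /* initial low-precision solve */
    for (int it = 0; it < max_iter; it++) {
        double worst = 0.0;
        for (int i = 0; i < N; i++) {            /* r = b - A x in double */
            r[i] = b[i];
            for (int j = 0; j < N; j++) r[i] -= A[i][j] * x[j];
            if (absd(r[i]) > worst) worst = absd(r[i]);
        }
        if (worst < tol) return it;
        solve_float(A, r, d);                    /* low-precision correction */
        for (int i = 0; i < N; i++) x[i] += d[i];
    }
    return max_iter;
}
```

The bulk of the arithmetic runs in the cheap precision while the final accuracy is set by the double-precision residual. A real implementation would factor the matrix once and reuse the factors for every correction; this sketch re-eliminates each time for brevity.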
NVIDIA POWERS WORLD'S FASTEST SUPERCOMPUTER
Summit Becomes First System to Scale the 100 Petaflops Milestone
27,648 Volta Tensor Core GPUs | 122 PF (HPC) | 3 EF (AI)
NVIDIA POWERS FASTEST SUPERCOMPUTERS IN US, EUROPE, JAPAN, INDUSTRY
17 of World’s 20 Most Energy-efficient Supercomputers
• Piz Daint, Europe’s Fastest: 5,320 GPUs | 20 PF
• ORNL Summit, World’s Fastest: 27,648 GPUs | 122 PF
• ABCI, Japan’s Fastest: 4,352 GPUs | 20 PF
• ENI HPC4, Fastest Industrial: 3,200 GPUs | 12 PF
• LLNL Sierra, US 2nd Fastest: 17,280 GPUs | 72 PF
DRAMATICALLY MORE FOR YOUR MONEY: 5X Better HPC TCO for Same Throughput
Mixed HPC workload: Amber, CHROMA, GTC, LAMMPS, MILC, NAMD, Quantum Espresso, SPECFEM3D
• 160 self-hosted Skylake CPU servers: 96 kW
• 8 accelerated servers with 4 V100 GPUs each: 13 kW
Same throughput at 1/5 the cost, 1/7 the space, and 1/7 the power.
BUILDING A PETAFLOP(*) MACHINE
How many GPUs do you need?
• 1 PFLOPS = 1000 TFLOPS
• Tesla Volta V100 32GB: 7.8 TFLOPS FP64
• N = 1000 / 7.8 ≈ 128 GPUs
• Servers with 8 GPUs in 4U each (strong nodes): 128 / 8 = 16 servers
• 1 rack of 48U holds 12 × 4U servers
• 16 / 12 ≈ 1.33 racks!
*Peak (with GPU)
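The sizing arithmetic can be written as a small calculator. It assumes, as the slide does, peak FP64 only, 8 GPUs per 4U server, and 48U racks. Rounding up strictly gives 129 GPUs for a full 1,000 TFLOPS; the slide's 128 GPUs (998.4 TFLOPS) is the natural power-of-two choice and fills exactly 16 servers. The function and struct names are illustrative.

```c
#include <assert.h>

/* Ceiling of a / b for positive integers. */
static int ceil_div(int a, int b) { return (a + b - 1) / b; }

/* Smallest GPU count whose combined peak reaches target_tflops. */
int gpus_for_peak(double target_tflops, double tflops_per_gpu)
{
    assert(target_tflops > 0.0 && tflops_per_gpu > 0.0);
    int n = 0;
    while (n * tflops_per_gpu < target_tflops)
        n++;
    return n;
}

/* Illustrative result type: servers and rack space for a GPU count. */
typedef struct {
    int servers;      /* servers needed, gpus_per_server GPUs each */
    int rack_units;   /* total rack units occupied */
    double racks;     /* fractional number of racks */
} Layout;

Layout layout_for(int gpus, int gpus_per_server, int units_per_server, int units_per_rack)
{
    assert(gpus > 0 && gpus_per_server > 0 && units_per_server > 0 && units_per_rack > 0);
    Layout l;
    l.servers = ceil_div(gpus, gpus_per_server);
    l.rack_units = l.servers * units_per_server;
    l.racks = (double)l.rack_units / units_per_rack;
    return l;
}
```

With 128 GPUs, `layout_for(128, 8, 4, 48)` gives 16 servers in 64U, about 1.33 racks, matching the slide; the strict 129-GPU answer needs a 17th server.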
TESLA PLATFORM FOR DEVELOPERS
HOW GPU ACCELERATION WORKS
Application code is split between the two processors: the compute-intensive functions, about 5% of the code, run on the GPU, while the rest of the sequential code runs on the CPU.
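How much offloading a small fraction of the code pays off is captured by Amdahl's law: if a fraction p of the runtime (not of the lines of code) is spent in the offloaded functions and the GPU runs them s times faster, the overall speedup is 1 / ((1 - p) + p / s). The values of p and s used below are hypothetical inputs, not measurements from the slide.

```c
#include <assert.h>

/* Amdahl's law: overall speedup when a fraction p of the runtime
 * is accelerated by a factor s and the rest stays on the CPU. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}

/* Upper bound as s grows without limit: only the CPU part remains. */
double amdahl_limit(double p)
{
    assert(p >= 0.0 && p < 1.0);
    return 1.0 / (1.0 - p);
}
```

For example, with 90% of runtime in the offloaded functions and a 10x GPU speedup, the application runs about 5.3x faster, and no GPU speedup can push it past 10x; the remaining sequential CPU code sets the ceiling.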
HOW TO START WITH GPUS
[Diagram: four ways to accelerate: Applications; Libraries (easy to use, most performance); Compiler Directives (easy to start, portable code); Programming Languages such as CUDA (most performance, most flexibility).]
1. Review available GPU-accelerated applications
2. Check for GPU-accelerated applications and libraries
3. Add OpenACC directives for quick acceleration results and portability
4. Dive into CUDA for highest performance and flexibility
GPU ACCELERATED LIBRARIES: “Drop-in” Acceleration for Your Applications
• Deep Learning: cuDNN, TensorRT, DeepStream SDK
• Linear Algebra: cuBLAS, cuSPARSE, cuSOLVER
• Signal, Image & Video: cuFFT, NVIDIA NPP, CODEC SDK
• Parallel Algorithms: nvGRAPH, NCCL
• Also shown: CUDA Math library, cuRAND
WHAT IS OPENACC
• Powerful & Portable: a directives-based programming model for parallel computing, designed for performance portability on CPUs and GPUs
• Simple: add a simple compiler directive

    int main(void) { /* serial code */
        #pragma acc kernels
        { /* parallel code */ }
    }

Programming Model for an Easy Onramp to GPUs
Read more at www.openacc.org/about
OpenACC is an open specification developed by the OpenACC.org consortium.
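A concrete version of the directive pattern is SAXPY inside an OpenACC `kernels` region. The `copyin` and `copy` data clauses are standard OpenACC and tell the compiler which arrays to move to the device and back; the function name is ours. A compiler without OpenACC support, or with it disabled, ignores the pragma and runs the loop serially, which is what keeps the code portable.

```c
#include <assert.h>
#include <stddef.h>

/* y = a * x + y. Inside the kernels region an OpenACC compiler may offload
 * the loop to a GPU or multicore CPU; otherwise it runs serially. */
void saxpy(size_t n, float a, const float *restrict x, float *restrict y)
{
    assert(x != NULL && y != NULL);
#pragma acc kernels copyin(x[0:n]) copy(y[0:n])
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The `restrict` qualifiers promise the compiler that `x` and `y` do not overlap, which helps it prove the loop iterations are independent and safe to parallelize.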
PGI: THE NVIDIA HPC SDK
• Fortran, C & C++ Compilers: optimizing, SIMD vectorizing, OpenMP
• Accelerated Computing Features: CUDA Fortran, OpenACC directives
• Multi-Platform Solution: x86-64 and OpenPOWER multicore CPUs; NVIDIA Tesla GPUs; supported on Linux, macOS, Windows
• MPI/OpenMP/OpenACC Tools: debugger, performance profiler, interoperable with DDT, TotalView
Fortran, C and C++ for the Tesla Platform
• V100 Tensor Cores
• Full C++17 language
• OpenACC printf()
• CUDA 10.x support
• OpenACC 2.6
• OpenMP 4.5 for multicore
• OpenACC Deep Copy
• PGI in the Cloud
pgicompilers.com/whats-new
SPEC ACCEL 1.2 BENCHMARKS
[Chart, OpenMP 4.5 (Intel 2018 and PGI 18.1 compilers): geomean seconds (0 to 200) on 2-socket Skylake (40 cores / 80 threads), 2-socket EPYC (48 cores / 48 threads), and 2-socket Broadwell (40 cores / 80 threads).]
[Chart, OpenACC (PGI 18.1): geomean seconds (0 to 200) on 2-socket Broadwell vs. 1x Volta V100: 4.4x speed-up.]
Performance measured February 2018. Skylake: two 20-core Intel Xeon Gold 6148 CPUs @ 2.4GHz with 376GB memory, hyperthreading enabled. EPYC: two 24-core AMD EPYC 7451 CPUs @ 2.3GHz with 256GB memory. Broadwell: two 20-core Intel Xeon E5-2698 v4 CPUs @ 3.6GHz with 256GB memory, hyperthreading enabled. Volta: NVIDIA DGX-1 system with two 20-core Intel Xeon E5-2698 v4 CPUs @ 2.20GHz, 256GB memory, one NVIDIA Tesla V100-SXM2 GPU @ 1.53GHz. SPEC® is a registered trademark of the Standard Performance Evaluation Corporation (www.spec.org).
SINGLE CODE FOR MULTIPLE PLATFORMS
Compiler Options

    pgcc -fast <myCode>.c -o myApp                  [serial]
    pgcc -fast -ta=multicore <myCode>.c -o myApp    [parallel CPU]
    pgcc -fast -ta=tesla <myCode>.c -o myApp        [parallel GPU]
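A single source file that fits all three command lines might look like this sketch. It relies only on standard OpenACC features: the `parallel loop` directive with a `reduction` clause, and the `_OPENACC` macro, which the OpenACC specification has compilers define when directives are enabled. Under `-ta=multicore` or `-ta=tesla` the loop is parallelized; in the serial build the pragma is ignored. The function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Reports how this translation unit was compiled. */
const char *build_mode(void)
{
#ifdef _OPENACC
    return "openacc";   /* directives enabled, e.g. -ta=multicore or -ta=tesla */
#else
    return "serial";    /* OpenACC pragmas are ignored */
#endif
}

/* Dot product; the reduction clause combines the per-thread partial sums. */
double dot(size_t n, const double *x, const double *y)
{
    assert(x != NULL && y != NULL);
    double sum = 0.0;
#pragma acc parallel loop reduction(+:sum) copyin(x[0:n], y[0:n])
    for (size_t i = 0; i < n; i++)
        sum += x[i] * y[i];
    return sum;
}
```

Because the parallel builds add the partial sums in a different order, floating-point results can differ from the serial build in the last bits; exact comparisons across builds only hold for inputs like small integers.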
OPENACC.ORG RESOURCES
Guides ● Talks ● Tutorials ● Videos ● Books ● Spec ● Code Samples ● Teaching Materials ● Events ● Success Stories ● Courses ● Slack ● Stack Overflow
• Resources: https://www.openacc.org/resources
• Success Stories: https://www.openacc.org/success-stories
• Events: https://www.openacc.org/events
• Compilers and Tools: https://www.openacc.org/tools
• Slack: https://www.openacc.org/community#slack
• Open Source Compiler: GCC 7 includes initial support for OpenACC 2.5
CUDA TOOLKIT 10.0
• TURING AND NEW SYSTEMS: new GPU architecture, Tensor Cores, NVSwitch fabric
• CUDA PLATFORM: CUDA Graphs, Vulkan & DX12 interop, Warp Matrix
• LIBRARIES: GPU-accelerated hybrid JPEG decoding, symmetric eigenvalue solvers, FFT scaling
• DEVELOPER TOOLS: new Nsight products, Nsight Systems and Nsight Compute
Scientific Computing
POWERING THE DEEP LEARNING ECOSYSTEM: NVIDIA SDK Accelerates Every Major Framework
• Computer Vision: object detection, image classification
• Speech & Audio: voice recognition, language translation
• Natural Language Processing: recommendation engines, sentiment analysis
Deep learning frameworks run on the NVIDIA Deep Learning SDK and CUDA
developer.nvidia.com/deep-learning-software
DEEP LEARNING
LEARNING FROM DATA, AND SOME BUZZWORDS
• ARTIFICIAL INTELLIGENCE: knowledge & reason, learning, planning, communicating, perceiving
• MACHINE LEARNING: learning from data; expert systems; handcrafted features
• DEEP LEARNING: learning from data; neural networks; computer-learned features
A NEW COMPUTING MODEL
• TRAINING: training data (input plus “label”) → trained neural network
• INFERENCE: new input → trained neural network → output “label”
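The split between training and inference is visible even at toy scale. This sketch trains a single perceptron (one neuron with a step activation, far simpler than the deep networks on these slides) on labeled examples of logical AND, then runs inference with the frozen weights. The learning rate, epoch limit, data set, and the names `Perceptron`, `train`, and `predict` are illustrative choices.

```c
#include <assert.h>

typedef struct { double w[2]; double bias; } Perceptron;

/* Inference: weighted sum followed by a step activation. */
int predict(const Perceptron *p, const double in[2])
{
    double z = p->w[0] * in[0] + p->w[1] * in[1] + p->bias;
    return z > 0.0 ? 1 : 0;
}

/* Training: the perceptron learning rule nudges the weights toward each
 * misclassified label. Returns the epoch in which no errors remained,
 * or max_epochs if training did not converge. */
int train(Perceptron *p, double inputs[][2], const int labels[],
          int count, double rate, int max_epochs)
{
    assert(count > 0 && rate > 0.0);
    p->w[0] = p->w[1] = p->bias = 0.0;
    for (int epoch = 1; epoch <= max_epochs; epoch++) {
        int errors = 0;
        for (int k = 0; k < count; k++) {
            int err = labels[k] - predict(p, inputs[k]);
            if (err != 0) {
                p->w[0] += rate * err * inputs[k][0];
                p->w[1] += rate * err * inputs[k][1];
                p->bias += rate * err;
                errors++;
            }
        }
        if (errors == 0) return epoch;   /* every example classified correctly */
    }
    return max_epochs;
}
```

Training loops over the labeled data many times; inference is a single cheap `predict` call per input. Deep learning keeps the same split but scales it to millions of parameters and examples, which is where GPUs come in.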
A NEW COMPUTING MODEL: Outperform experts, facts, rules with software that writes software
• Traditional Computer Vision: Experts + Time
• Deep Learning Object Detection: DNN + Data + GPU
Deep Learning Achieves “Superhuman” Results
[Chart: ImageNet accuracy (0% to 100%), 2009 to 2016: traditional CV vs. deep learning.]
“ACCELERATING EULERIAN FLUID SIMULATION WITH CONVOLUTIONAL NETWORKS”
Tompson, J., Schlachter, K., Sprechmann, P., & Perlin, K. (2016). Accelerating Eulerian Fluid Simulation With Convolutional Networks. arXiv preprint arXiv:1607.03597.
"ACCELERATING EULERIAN FLUID SIMULATION WITH CONVOLUTIONAL NETWORKS"HTTPS://WWW.YOUTUBE.COM/WATCH?V=W71ZXKNIJFO
TESLA REVOLUTIONIZES DEEP LEARNING
GOOGLE BRAIN APPLICATION

| | BEFORE TESLA | AFTER TESLA |
|---|---|---|
| Cost | $5,000K | $200K |
| Servers | 1,000 servers | 16 Tesla servers |
| Energy | 600 kW | 4 kW |
| Performance | 1x | 6x |
NEW AI DRIVING
Training on NVIDIA DGX-1; driving with DriveWorks on NVIDIA DRIVE PX
Components shown: KALDI, LOCALIZATION, MAPPING, DRIVENET, DAVENET
NVIDIA DRIVE PEGASUS: First AI Computer to Make Robotaxis a Reality
First Industry Benchmark for Measuring AI Performance
https://mlperf.org/
MLPerf Results, December 2018
MLPERF RESULTS AT SCALE: Results Are Time to Complete Model Training

| Benchmark | Model | Time to Train |
|---|---|---|
| Image Classification | ResNet-50 v1.5 | 6.3 minutes |
| Object Detection (Heavy Weight) | Mask R-CNN | 72.1 minutes |
| Object Detection (Light Weight) | SSD | 5.6 minutes |
| Translation (recurrent) | GNMT | 2.7 minutes |
| Translation (non-recurrent) | Transformer | 6.2 minutes |

Test Platform: for Image Classification and Translation (non-recurrent), DGX-1V cluster; for Object Detection (Heavy Weight), Object Detection (Light Weight), and Translation (recurrent), DGX-2H cluster. Each DGX-1V: dual-socket Xeon E5-2698 v4, 512GB system RAM, 8 × 16 GB Tesla V100 SXM-2 GPUs. Each DGX-2H: dual-socket Xeon Platinum 8174, 1.5TB system RAM, 16 × 32 GB Tesla V100 SXM-3 GPUs connected via NVSwitch.
TESLA PLATFORM ENABLES DRAMATIC REDUCTION IN TIME TO TRAIN
Relative Time to Train Improvements (ResNet-50)
• 2x CPU: 25 days
• Single node, 1X P100: 4.8 days
• Single node, 1X V100: 30 hours
• DGX-1, 8x V100: 3.3 hours
• At scale, 2176x V100: <4 minutes
ResNet-50, 90 epochs to solution | CPU server: dual-socket Intel Xeon Gold 6140 | Sony 2176x V100 record on https://nnabla.org/paper/imagenet_in_224sec.pdf
![Page 56: ERAD-RS’2019 TESLA PLATFORM HPC & AI · 2019. 4. 27. · 5 1 10 100 1000 M ar-12 M ar-13 M ar-14 M ar-15 M ar-16 M ar-17 M ar-18 R e l a t i v e P e r f o r m a n c e Mar-19 2013](https://reader036.fdocuments.in/reader036/viewer/2022071109/5fe4a6a38603d10d312c0114/html5/thumbnails/56.jpg)
56
57
NVSWITCH
World’s Highest Bandwidth On-Node Switch
7.2 Terabits/sec or 900 GB/sec
18 NVLINK ports | 50 GB/s per port, bi-directional
Fully-connected crossbar
2 billion transistors | 47.5 mm x 47.5 mm package
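The headline bandwidth follows from the port count. A minimal Python check, assuming decimal units (1 Tb = 1,000 Gb) and that the 50 GB/s per-port figure is already the bi-directional total, as the slide states:

```python
"""Aggregate NVSwitch bandwidth from the per-port figures on the slide."""

NVLINK_PORTS = 18
GB_PER_S_PER_PORT = 50  # bi-directional, per port (slide figure)
BITS_PER_BYTE = 8


def aggregate_bandwidth(ports: int, gb_per_s_per_port: float) -> tuple:
    """Return (GB/s, Tb/s) for the whole switch, using decimal units."""
    gb_per_s = ports * gb_per_s_per_port
    tbit_per_s = gb_per_s * BITS_PER_BYTE / 1000
    return gb_per_s, tbit_per_s


if __name__ == "__main__":
    gb, tb = aggregate_bandwidth(NVLINK_PORTS, GB_PER_S_PER_PORT)
    print(f"{gb} GB/s = {tb} Tb/s")  # matches the slide's 900 GB/s and 7.2 Tb/s
```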
58
ANNOUNCING NVIDIA DGX-2
THE LARGEST GPU EVER CREATED
2 PFLOPS | 512GB HBM2 | 10 kW | 350 lbs
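The 2 PFLOPS and 512 GB figures are per-GPU numbers multiplied by the 16 V100s in a DGX-2 (the GPU count and 32 GB per GPU come from the MLPerf test-platform note). The 125 TFLOPS per-GPU Tensor Core peak below is an assumption taken from NVIDIA's published V100 SXM specification, not from this slide:

```python
"""Where the DGX-2 headline numbers come from (sketch; per-GPU peak is an assumption)."""

GPUS_PER_DGX2 = 16
HBM2_GB_PER_GPU = 32
TENSOR_TFLOPS_PER_V100 = 125  # assumed: published V100 SXM mixed-precision Tensor Core peak


def dgx2_totals(gpus: int, tflops_per_gpu: float, gb_per_gpu: int) -> tuple:
    """Return (peak PFLOPS, total HBM2 in GB) for a node of identical GPUs."""
    return gpus * tflops_per_gpu / 1000, gpus * gb_per_gpu


if __name__ == "__main__":
    pflops, hbm2 = dgx2_totals(GPUS_PER_DGX2, TENSOR_TFLOPS_PER_V100, HBM2_GB_PER_GPU)
    print(f"{pflops} PFLOPS | {hbm2} GB HBM2")
```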
59
10X PERFORMANCE GAIN IN LESS THAN A YEAR
HGX-2 vs HGX-1 Performance Benchmark (time to train, days)
HGX-1, Sep ’17: 15 days
HGX-2, May ’18: 1.5 days
Software improvements across the stack, including NCCL, cuDNN, etc.
Workload: FairSeq, trained on the WMT’14 English-French dataset for 55 epochs
HGX-1: 9/2017 SW stack (run on NVIDIA DGX-1)
HGX-2: 3/2018 SW stack (run on NVIDIA DGX-2)
60
SCALING-UP PERFORMANCE WITH NVSWITCH
Tokens/second (0 to 180,000) vs. number of V100 GPUs (0 to 16): V100 (NVLink, NVSwitch) compared with V100 (PCIe)
Workload: Transformer with MoE Layers | Training Dataset: 1B Word Benchmark for Language Modeling | Batch size of 8,192 per GPU
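A common way to read a chart like this is parallel scaling efficiency: measured throughput at N GPUs divided by N times the single-GPU throughput. The sketch below computes it; the throughput values in the example are hypothetical placeholders, not numbers read from the NVSwitch or PCIe curves.

```python
"""Parallel scaling efficiency from throughput measurements (illustrative sketch)."""


def scaling_efficiency(tokens_per_s: dict) -> dict:
    """Map GPU count -> efficiency, relative to linear scaling from the smallest run.

    tokens_per_s maps a GPU count to measured throughput (tokens/second).
    """
    base_gpus = min(tokens_per_s)
    per_gpu_baseline = tokens_per_s[base_gpus] / base_gpus
    return {n: tps / (n * per_gpu_baseline) for n, tps in sorted(tokens_per_s.items())}


if __name__ == "__main__":
    # Hypothetical measurements, for illustration only.
    measured = {1: 10_000.0, 4: 38_000.0, 8: 72_000.0, 16: 128_000.0}
    for n, eff in scaling_efficiency(measured).items():
        print(f"{n:2d} GPUs: {eff:.0%} of linear")
```

A configuration whose interconnect becomes the bottleneck would show efficiency dropping faster as GPUs are added, which is the kind of gap the slide's NVSwitch vs. PCIe comparison is meant to highlight.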
61
AI AND HPC BENCHMARKS: HGX-2 VS CPU
Replace CPU Nodes - Save Money, Power and Space in the Data Center
AI Training: HGX-2 Replaces 300 CPU-Only Server Nodes
Speed-up of single node: Dual-Socket CPU 1X | HGX-2 300X
Workload: ResNet-50, 90 epochs to solution | CPU Server: Dual-Socket Intel Xeon Gold 6140 | Dataset: ImageNet2012
HPC: HGX-2 Replaces 60 CPU-Only Server Nodes
Speed-up of single node: Dual-Socket CPU 1X | HGX-2 60X
Workload: MILC (particle physics HPC application) | CPU Server: Dual-Socket Intel Xeon Gold 6140
62
DEEP LEARNING INFERENCE
63
GPU INFERENCE ADOPTION IS ACCELERATING
VISUAL SEARCH: 60X Latency Improvement, Real-Time Search
VIDEO ANALYSIS: 12X Faster Inference, Live Video Analysis
ADVERTISING: 40X Higher Performance, Real-Time Brand Impact
Tesla P4, TensorRT Adoption
INFERENCE USE CASES: Video | Maps | Image | NLP | Speech | Search
64
WORLD’S LEADING TECH COMPANIES ADOPT NVIDIA TO ACCELERATE AI DEPLOYMENT
7X TensorRT Downloads: 40K (2017) to 300K (2018)
PayPal: Fraud Detection
Twitter: Video Analytics
Bytedance: NLP
Snap: Recommendation
Clarifai: Computer Vision
Pinterest: Visual Search
John Deere: Smart Farming
iFlyTek: Speech Recognition
65
TENSORRT HYPERSCALE INFERENCE PLATFORM
WORLD’S MOST ADVANCED SCALE-OUT GPU
INTEGRATED INTO TENSORFLOW & ONNX SUPPORT
TENSORRT INFERENCE SERVER
66
TESLA T4
WORLD’S MOST ADVANCED SCALE-OUT GPU
320 Turing Tensor Cores
2,560 CUDA Cores
65 FP16 TFLOPS | 130 INT8 TOPS | 260 INT4 TOPS
16 GB | 320 GB/s
70 W
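One way to put the T4 numbers in context is a simple roofline estimate, an analysis technique brought in here rather than something the slide presents. Dividing peak throughput by memory bandwidth gives the arithmetic intensity (operations per byte moved) a kernel needs before it stops being memory-bound; the sketch also reports peak throughput per watt. Decimal units and the slide's peak figures are assumed.

```python
"""Roofline ridge points and peak efficiency for Tesla T4, from the slide's peak figures."""

PEAK_TERAOPS = {"FP16": 65, "INT8": 130, "INT4": 260}  # TFLOPS / TOPS from the slide
MEM_BW_GB_PER_S = 320
BOARD_POWER_W = 70


def ridge_point(peak_teraops: float, bw_gb_per_s: float) -> float:
    """Ops per byte at which the compute peak and the memory-bandwidth limit meet."""
    return (peak_teraops * 1e12) / (bw_gb_per_s * 1e9)


if __name__ == "__main__":
    for precision, peak in PEAK_TERAOPS.items():
        print(
            f"{precision}: ridge at {ridge_point(peak, MEM_BW_GB_PER_S):.0f} ops/byte, "
            f"{peak / BOARD_POWER_W:.2f} peak tera-ops per watt"
        )
```

For FP16 the ridge point is about 203 FLOP/byte, so under this simple model kernels with lower arithmetic intensity are likely to be limited by the 320 GB/s memory rather than by the Tensor Cores.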
67
MACHINE LEARNING RAPIDS
68
THE BIG PROBLEM IN DATA SCIENCE
All Data | ETL | Manage Data: Structured Data Store | Training: Data Preparation, Model Training, Visualization | Evaluate | Inference: Deploy
Slow Training Times for Data Scientists
69
ACCELERATING MACHINE LEARNING
The RAPIDS Ecosystem
Open Source Community | Enterprise Data Science Platforms | Startups | Deep Learning Integration | GPU Servers | Storage Partners
70
RAPIDS - OPEN GPU DATA SCIENCE
Software Stack
Data Preparation | Model Training | Visualization
RAPIDS: CUDF | CUML | CUGRAPH
DEEP LEARNING FRAMEWORKS | CUDNN
CUDA | PYTHON | APACHE ARROW | DASK
71
DRAMATICALLY MORE FOR YOUR MONEY
Machine Learning: XGBoost
CPU-Only Cluster: 300 self-hosted Broadwell CPU servers, 180 kW
GPU-Accelerated: 1 DGX-2, 10 kW
SAME THROUGHPUT | 1/8 THE COST | 1/18 THE POWER | 1/30 THE SPACE
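The power and space ratios can be reproduced from the slide's own numbers. In this sketch the 10 RU height of a DGX-2 comes from the DGX POD reference architecture later in the deck, while 1 RU per CPU server is an assumption; the 1/8 cost figure depends on pricing the slide does not give, so it is not derived here.

```python
"""Power and rack-space ratios for the XGBoost comparison (sketch)."""

CPU_SERVERS = 300
CPU_CLUSTER_KW = 180
DGX2_KW = 10
DGX2_RU = 10        # from the DGX POD slide (3 x 10 RU)
CPU_SERVER_RU = 1   # assumption: 1U dual-socket Broadwell servers


def ratios() -> dict:
    """Return CPU-cluster-to-DGX-2 ratios plus the implied power per CPU server."""
    return {
        "power": CPU_CLUSTER_KW / DGX2_KW,
        "space": CPU_SERVERS * CPU_SERVER_RU / DGX2_RU,
        "watts_per_cpu_server": CPU_CLUSTER_KW * 1000 / CPU_SERVERS,
    }


if __name__ == "__main__":
    for name, value in ratios().items():
        print(f"{name}: {value:g}")
```

The power ratio (18) and the space ratio (30, under the 1 RU assumption) match the slide's 1/18 and 1/30; the implied draw is about 600 W per CPU server.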
72
DGX POD
73
ANNOUNCING NVIDIA SATURNV WITH VOLTA
40 PetaFLOPS Peak FP64 Performance | 660 PetaFLOPS DL FP16 Performance | 660 NVIDIA DGX-1 Server Nodes
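The SaturnV totals are consistent with multiplying per-GPU peaks across 660 DGX-1 nodes of eight V100s each (the eight-GPU count appears in the MLPerf platform note). The per-GPU peaks below (125 TFLOPS Tensor Core, 7.8 TFLOPS FP64) are assumptions taken from NVIDIA's published V100 SXM2 specifications; with them the FP64 total comes out near 41 PFLOPS, close to the slide's 40 PetaFLOPS figure.

```python
"""Cluster peak from node count and assumed per-GPU peaks (sketch)."""

DGX1_NODES = 660
GPUS_PER_DGX1 = 8
V100_PEAK_TFLOPS = {"FP16 Tensor Core": 125, "FP64": 7.8}  # assumed published V100 SXM2 peaks


def cluster_pflops(nodes: int, gpus_per_node: int, tflops_per_gpu: float) -> float:
    """Peak PFLOPS for a homogeneous cluster."""
    return nodes * gpus_per_node * tflops_per_gpu / 1000


if __name__ == "__main__":
    for precision, tflops in V100_PEAK_TFLOPS.items():
        print(f"{precision}: {cluster_pflops(DGX1_NODES, GPUS_PER_DGX1, tflops):.1f} PFLOPS")
```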
74
DGX POD - DGX-1
Reference Architecture in a Single 35 kW High-Density Rack
Fits within a standard-height 42 RU data center rack:
• Nine DGX-1 servers (9 x 3 RU = 27 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
In real-life DL application development, one to two DGX-1 servers per developer are often required.
One DGX POD supports five developers (AV workload)
Each developer works on two experiments per day
One DGX-1/developer/experiment/day*
*300,000 0.5M images * 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment
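Two budgets on this slide can be checked with simple arithmetic: the rack-unit total against the 42 RU rack, and the per-experiment training time implied by the footnote. The sketch reads the footnote as 300,000 images per epoch (treating the "0.5M" as describing the images rather than their count, which is an assumption); under that reading one experiment takes about 21 hours, which fits the one-DGX-1-per-experiment-per-day sizing.

```python
"""Rack-unit and experiment-time budget for the DGX-1 POD (sketch)."""

RACK_RU = 42


def total_ru(components: list) -> int:
    """Sum rack units over (count, ru_each) pairs."""
    return sum(count * ru_each for count, ru_each in components)


def experiment_hours(images: int, epochs: int, images_per_sec: float) -> float:
    """Wall-clock hours to process images * epochs at a fixed throughput."""
    return images * epochs / images_per_sec / 3600


# (count, RU each): DGX-1 servers, storage servers, 10 GbE switch, 100 Gbps switching
POD_MIN = [(9, 3), (12, 1), (1, 1), (1, 1)]
POD_MAX = [(9, 3), (12, 1), (1, 1), (1, 2)]  # high-speed switching taking 2 RU in total

if __name__ == "__main__":
    print(f"Rack units used: {total_ru(POD_MIN)}-{total_ru(POD_MAX)} of {RACK_RU}")
    # Assumption: 300,000 images per epoch (see lead-in).
    print(f"Hours per experiment: {experiment_hours(300_000, 120, 480):.1f}")
```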
75
DGX POD - DGX-2
Reference Architecture in a Single 35 kW High-Density Rack
Fits within a standard-height 48 RU data center rack:
• Three DGX-2 servers (3 x 10 RU = 30 RU)
• Twelve storage servers (12 x 1 RU = 12 RU)
• 10 GbE (min) storage and management switch (1 RU)
• Mellanox 100 Gbps intra-rack high-speed network switches (1 or 2 RU)
In real-life DL application development, one DGX-2 per developer minimizes model training time.
One DGX POD supports at least three developers (AV workload)
Each developer works on two experiments per day
One DGX-2/developer/2 experiments/day*
*300,000 0.5M images * 120 epochs @ 480 images/sec, ResNet-18 backbone detection network per experiment
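The same rack-unit check for the DGX-2 variant, again as a minimal sketch using only the counts listed for this configuration, shows it occupying 44 to 45 of the 48 RU:

```python
"""Rack-unit budget for the DGX-2 POD (sketch)."""

RACK_RU = 48

# (count, RU each): DGX-2 servers, storage servers, 10 GbE switch, 100 Gbps switching
POD_MIN = [(3, 10), (12, 1), (1, 1), (1, 1)]
POD_MAX = [(3, 10), (12, 1), (1, 1), (1, 2)]  # high-speed switching taking 2 RU in total


def total_ru(components: list) -> int:
    """Sum rack units over (count, ru_each) pairs."""
    return sum(count * ru_each for count, ru_each in components)


if __name__ == "__main__":
    used_min, used_max = total_ru(POD_MIN), total_ru(POD_MAX)
    print(f"Rack units used: {used_min}-{used_max} of {RACK_RU}; "
          f"{RACK_RU - used_max}-{RACK_RU - used_min} RU spare")
```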
76
NVIDIA GPU CLOUD (NGC)
77
DOWNLOAD AND DEPLOY
NGC: Cloud | On-premises
Source code, libraries, packages
Source available on GitHub | Container available from NGC and Docker Hub | PIP available at a later date
78
50+ GPU-OPTIMIZED SOFTWARE CONTAINERS
DEEP LEARNING: TensorFlow | PyTorch | more
MACHINE LEARNING: RAPIDS | H2O | more
HPC: NAMD | GROMACS | more
VISUALIZATION: ParaView | IndeX | more
INFERENCE: TensorRT | DeepStream | more
GENOMICS: Parabricks
79
NGC-READY SYSTEMS
VALIDATED FOR PERFORMANCE & FUNCTIONALITY OF NGC SOFTWARE
T4 & V100-ACCELERATED
* Only V100 systems
80
DGX POD MANAGEMENT SOFTWARE
81
DGX POD MANAGEMENT SOFTWARE
For Large-Scale Multi-User AI Software Development Teams
82
SUPPORT PROGRAMS
83
IEEE - IPDPS 2019
May 20-24, Rio de Janeiro
Keynote @ ScaDL Workshop
“Scalable Deep Learning over Parallel and Distributed Infrastructures”
May 24
OpenACC Hands-On Training
May 21
85
NVIDIA DEEP LEARNING INSTITUTE
Hands-on self-paced and instructor-led training in deep learning and accelerated computing for developers
Deep Learning Fundamentals | Accelerated Computing Fundamentals
Game Development & Digital Content | Finance | Intelligent Video Analytics | Medical Image Analysis | Autonomous Vehicles | Genomics
More industry-specific training coming soon…
Request onsite instructor-led workshops at your organization: www.nvidia.com/requestdli
Take self-paced labs online: www.nvidia.com/dlilabs
Download the course catalog, view upcoming workshops, and learn about the University Ambassador Program: www.nvidia.com/dli
86
NVIDIA HW GRANT PROGRAM
Titan V (Volta)
• Scientific Computing
• HPC
• Deep Learning
Jetson TX2 (Dev Kit)
• Robotics
• Autonomous Machines
Quadro P6000
• Scientific Visualization
• Virtual Reality
https://developer.nvidia.com/academic_gpu_seeding