MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW &...
Transcript of MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW &...
![Page 1: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/1.jpg)
July 2017
MACHINE LEARNING WITH NVIDIA AND IBM POWER AI
Joerg KrallSr. Business Ddevelopment ManagerMFG [email protected]
![Page 2: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/2.jpg)
2
A NEW ERA OF COMPUTING
PC INTERNETWinTel, Yahoo!1 billion PC users
1995 2005 2015
MOBILE-CLOUDiPhone, Amazon AWS2.5 billion mobile users
AI & IOTDeep Learning, GPU100s of billions of devices
![Page 3: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/3.jpg)
3
Artificial IntelligenceComputer GraphicsGPU Computing
NVIDIA“THE AI COMPUTING COMPANY”
![Page 4: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/4.jpg)
4
TEN YEARS OF GPU COMPUTING
2006 2008 2012 20162010 2014
Fermi: World’s First HPC GPU
Oak Ridge Deploys World’s Fastest Supercomputer w/ GPUs
World’s First Atomic Model of HIV Capsid
GPU-Trained AI Machine Beats World Champion in Go
Stanford Builds AI Machine using GPUs
World’s First 3-D Mapping of Human Genome
CUDA Launched
World’s First GPU Top500 System
Google Outperforms Humans in ImageNet
Discovered How H1N1 Mutates to Resist Drugs
AlexNet beats expert code by huge margin using GPUs
![Page 5: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/5.jpg)
5
AI,ML,DL……..?
![Page 6: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/6.jpg)
6
GPU COMPUTING HAS REACHED A TIPPING POINT
3x GPU Developers in 2 Years
120,000
400,000
2014 2016
13x Organizations Engaged with NVIDIA for DL
Higher Ed
Internet
Healthcare
Finance
Automotive
Developer Tools
Government
Others
1,549
19,439
2014 2016
![Page 7: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/7.jpg)
7
![Page 8: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/8.jpg)
8
TESLA PLATFORM POWERS
LEADING DATA CENTERS
FOR HPC AND AI
![Page 9: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/9.jpg)
9
DEEP LEARNING
![Page 10: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/10.jpg)
10
A NEW COMPUTING MODEL
Traditional Computer Vision
Domain experts design feature detectors
Quality = patchwork of algorithms
Need CV experts and time
Deep Learning Object Detection
DNN learn features from large data
Quality = data & training method
Needs lots of data and compute
![Page 11: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/11.jpg)
11
DEEP LEARNING —A NEW COMPUTING MODEL
“Software that writes software”
“little girl is eating
piece of cake"
LEARNING
ALGORITHM
“millions of trillions
of FLOPS”
![Page 12: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/12.jpg)
12
PERCEPTION AI PERCEPTION AI LOCALIZATION DRIVING AI
DEEP LEARNING
SELF-DRIVING CARS ARE AN AI CHALLENGE
![Page 13: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/13.jpg)
13
AI FOR SELF-DRIVING CARS
FREE SPACE DETECTION CAR 3D DETECTION
![Page 14: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/14.jpg)
14
AI IS EVERYWHERE
“Find where I parked my car” “Find the bag I just saw in this magazine”
“What movie should I watch next?”
![Page 15: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/15.jpg)
15
TOUCHING OUR LIVES
Bringing grandmother closer to family by bridging language barrier
Predicting sick baby’s vitals like heart rate, blood pressure, survival rate
Enabling the blind to “see” their surrounding, read emotions on faces
![Page 16: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/16.jpg)
16
FUELING ALL INDUSTRIES
Increasing public safety with smart video surveillance at airports & malls
Providing intelligent services in hotels, banks and stores
Separating weeds as it harvests, reduces chemical usage by 90%
![Page 17: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/17.jpg)
17
TESLA REVOLUTIONIZES DEEP LEARNING
GOOGLE BRAIN APPLICATION
BEFORE TESLA AFTER TESLA
Cost $5,000K $200K
Servers 1,000 Servers 16 Tesla Servers
Energy 600 KW 4 KW
Performance 1x 6x
![Page 18: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/18.jpg)
18
DEEP LEARNING WORKFLOWS
IMAGE CLASSIFICATION OBJECT DETECTION IMAGE SEGMENTATION
Classify images into classes or categories
Object of interest could be anywhere in the image
Find instances of objects in an image
Objects are identified with bounding boxes
Partition image into multiple regions
Regions are classified at the pixel level
98% Dog
2% Cat
![Page 19: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/19.jpg)
19
Training
Device
Datacenter
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
TRAINING
Billions of Trillions of Operations
GPU train larger models, accelerate
time to market
![Page 20: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/20.jpg)
20
Training
Device
Datacenter
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
DATACENTER INFERENCING
10s of billions of image, voice, video
queries per day
GPU inference for fast response,
maximize datacenter throughput
![Page 21: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/21.jpg)
21
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
Training
Device
Datacenter
TRAINING
Billions of Trillions of Operations
GPU train larger models, accelerate
time to market
10s of billions of image, voice, video
queries per day
GPU inference for fast response,
maximize datacenter throughput
DATACENTER INFERENCING
Billions of intelligent devices
& machines
Recognition, reasoning, problem solving
GPU inference: real-time
accurate response
DEVICE INFERENCING
![Page 22: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/22.jpg)
22
NEW AI SERVICES POSSIBLE WITH GPU CLOUD
SPOTIFYSONG RECOMMENDATIONS
NETFLIXVIDEO RECOMMENDATIONS
YELPSELECTING COVER PHOTOS
![Page 23: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/23.jpg)
23
DL INFERENCE ON GPU CLOUD COMPUTING
PINTERESTVISUAL SEARCH
TWITTERVIDEO CLASSIFICATION
KLMCUSTOMER SERVICE
![Page 24: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/24.jpg)
24
DEEP LEARNING REQUIREMENTS
DEEP LEARNING NEEDS… DEEP LEARNING CHALLENGES NVIDIA DELIVERS
Data Scientists Demand far exceeds supply DIGITS, DLI Training
Latest Algorithms Rapidly evolvingDL SDK, GPU-Accelerated
Frameworks
Fast Training Impossible -> Practical DGX-1, P100, P40
Deployment Platform Must be available everywhereTensorRT, P40, P4, Jetson,
Drive PX
![Page 25: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/25.jpg)
25
TESLA PLATFORM FOR HPC AND AI
![Page 26: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/26.jpg)
26
ONE ARCHITECTURE BUILT FOR BOTHDATA SCIENCE & COMPUTATIONAL SCIENCE
0x
10x
20x
30x
40x
1 4 8 16 32 64 128
0x
1x
2x
3x
4x
5x
6x
7x
8x
9x
1 4 8 16 32 64
AlexNet Training8x P100 Faster than 128 Knights Landing
Servers
GTC-P: Plasma Turbulence8x P100 Faster than 64 Knights Landing Servers
Knights Landing Servers Knights Landing Servers 1x DGX11x DGX1
Speed-u
p v
s 1x K
NL S
erv
er
GTC-P, Grid Size A, Systems: NVIDIA DGX-1, 8xP100, Intel KNL 7250
68 core Flat-Quadrant mode, OmnipathBased on AlexNet Batch size 256, weak scaling up to 32 KNL servers, 64 & 128 estimated based on ideal scaling, Xeon Phi 7250 Nodes
Tesla Platform
![Page 27: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/27.jpg)
27
TWO BIG INEFFICIENCIES WITH CPU NODES
Sources: Microsoft Research on Datacenter Costs, Amber Simulations on SDSU Comet Supercomputer
Typical Data Center Budget
Compute Servers, 39%
Non-compute, 61%
Most of budget spent on non-compute overhead
Rack, Cabling
Infrastructure
Networking0
20
40
60
# of CPUs
Typical Strong-Scaling Performance
Linear scaling
ns/
day
AMBER Benchmark
0 20 40 60 80 100
Network overhead causes performance inefficiencies
![Page 28: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/28.jpg)
28
WEAK NODESLots of Nodes Interconnected with
Vast Network Overhead
STRONG NODESFew Lightning-Fast Nodes with
Performance of Hundreds of Weak Nodes
Network Fabric
Server Racks
![Page 29: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/29.jpg)
29
TESLA PLATFORMLeading Data Center Platform for HPC and AI
TESLA GPU & SYSTEMS
NVIDIA SDK
INDUSTRY TOOLS
APPLICATIONS & SERVICES
C/C++
ECOSYSTEM TOOLS & LIBRARIES
HPC
+400 More Applications
cuBLAS
cuDNN TensorRT
DeepStreamSDK
FRAMEWORKS MODELS
Cognitive Services
AI TRAINING & INFERENCE
Machine Learning Services
ResNetGoogleNetAlexNet
DeepSpeechInceptionBigLSTM
DEEP LEARNING SDK
NCCL
COMPUTEWORKS
NVLINK SYSTEM OEM CLOUDTESLA GPU
![Page 30: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/30.jpg)
30
ONE PLATFORM BUILT FOR BOTHDATA SCIENCE & COMPUTATIONAL SCIENCE
Accelerating AITesla Platform Accelerating HPC
![Page 31: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/31.jpg)
31
TESLA P100New GPU Architecture to Enable the World’s Fastest Compute Node
Pascal Architecture NVLink CoWoS HBM2 Page Migration Engine
Highest Compute Performance GPU Interconnect for Maximum Scalability
Unifying Compute & Memory in Single Package
Simple Parallel Programming with Virtually Unlimited Memory Space
Unified Memory
CPU
Tesla P100
![Page 32: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/32.jpg)
32
GIANT LEAPS
IN EVERYTHING
NVLINK
PAGE MIGRATION ENGINE
PASCAL ARCHITECTURE
CoWoS HBM2 Stacked Mem
K40Tera
flops
(FP32/FP16)
5
10
15
20
P100
(FP32)
P100
(FP16)
M40
K40
Bi-
dir
ecti
onal BW
(G
B/Sec)
40
80
120
160P100
M40
K40Bandw
idth
(G
B/s)
200
400
600
P100
M40 K40
Addre
ssable
Mem
ory
(G
B)
10
100
1000
P100
M40
21 Teraflops of FP16 for Deep Learning 5x GPU-GPU Bandwidth
3x Higher for Massive Data Workloads Virtually Unlimited Memory Space
10000800
![Page 33: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/33.jpg)
33
Training
Device
Datacenter
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
TRAINING
Billions of Trillions of Operations
GPU train larger models, accelerate
time to market
![Page 34: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/34.jpg)
34
TESLA P100 ACCELERATORS
Compute 5.3 TF DP ∙ 10.6 TF SP ∙ 21.2 TF HP 4.7 TF DP ∙ 9.3 TF SP ∙ 18.7 TF HP
Memory HBM2: 732 GB/s ∙ 16 GBHBM2 16GB: 732 GB/s
HBM2 12GB: 549 GB/s
Interconnect NVLink (160 GB/s) + PCIe Gen3 (32 GB/s) PCIe Gen3 (32 GB/s)
ProgrammabilityPage Migration Engine
Unified Memory
Page Migration Engine
Unified Memory
Power 300W 250W
Tesla P100 with NVLink Tesla P100 for PCIe
Interconnect Speed is measured bi-directional
![Page 35: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/35.jpg)
35
Training
Device
Datacenter
GPU DEEP LEARNING IS A NEW COMPUTING MODEL
DATACENTER INFERENCING
10s of billions of image, voice, video
queries per day
GPU inference for fast response,
maximize datacenter throughput
![Page 36: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/36.jpg)
36
TESLA P40
P40
# of CUDA Cores 3840
Peak Single Precision 12 TeraFLOPS
Peak INT8 47 TOPS
Low Precision4x 8-bit vector dot product
with 32-bit accumulate
Video Engines 1x decode engine, 2x encode engines
GDDR5 Memory 24 GB @ 346 GB/s
Power 250W
0
20,000
40,000
60,000
80,000
100,000
GoogLeNet AlexNet
8x M40 (FP32) 8x P40 (INT8)
Images/
Sec
4x Boost in Less than One Year
GoogLeNet, AlexNet, batch size = 128, CPU: Dual Socket Intel E5-2697v4
Highest Throughput for Scale-up Servers
![Page 37: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/37.jpg)
37
40x Efficient vs CPU, 8x Efficient vs FPGA
0
50
100
150
200
AlexNet
CPU FPGA 1x M4 (FP32) 1x P4 (INT8)
Images/
Sec/W
att
Maximum Efficiency for Scale-out Servers P4
# of CUDA Cores 2560
Peak Single Precision 5.5 TeraFLOPS
Peak INT8 22 TOPS
Low Precision4x 8-bit vector dot product
with 32-bit accumulate
Video Engines 1x decode engine, 2x encode engine
GDDR5 Memory 8 GB @ 192 GB/s
Power 50W & 75 W
AlexNet, batch size = 128, CPU: Intel E5-2690v4 using Intel MKL 2017, FPGA is Arria10-1151x M4/P4 in node, P4 board power at 56W, P4 GPU power at 36W, M4 board power at 57W, M4 GPU power at 39W, Perf/W chart using GPU power
TESLA P4
![Page 38: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/38.jpg)
38
GPU SERVERS WITH NVLINK
![Page 39: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/39.jpg)
39
S822LC FOR HPC: RECOMMENDED CONFIGURATION FOR POWERAI
2 SOCKET, 4 GPU SYSTEM WITH NVLINK
Required:
2 POWER8 10 Core CPUs
4 NVIDIA P100 ”Pascal” GPUs
256 GB System Memory
2 SSD storage devices
High-speed interconnect (IB or Ethernet, depending on infrastructure)
Optional:
Up to 1 TB System Memory
PCIe attached NVMe storage
![Page 40: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/40.jpg)
40
POWERAI TAKES ADVANTAGE OF NVLINK BETWEEN POWER8 & P100 TO INCREASE SYSTEM BANDWIDTH
NVLink between CPUs and GPUs enables fast memory access to large data sets in system memory
Two NVLink connections between each GPU and CPU-GPU leads to faster data exchange P100
GPU
Power8CPU
GPUMemory
System Memory
P100GPU
80 GB/s
GPUMemory
NVLink
115 GB/s
P100GPU
Power8CPU
GPUMemory
System Memory
P100GPU
80 GB/s
GPUMemory
NVLink
115 GB/s
![Page 41: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/41.jpg)
41
NVLINK AND P100 ADVANTAGE:REDUCING COMMUNICATION TIME, INCORPORATING THE FASTEST GPU FOR
DEEP LEARNING
NVLink reduces communication time and overhead
Data gets from GPU-GPU, Memory-GPU faster, for shorter training times
x86 based
GPU system
POWER8 +
Tesla
P100+NVLin
k
ImageNet / Alexnet: Minibatch size = 128
170 ms
78 ms
IBM advantage: data communication
and GPU performance
![Page 42: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/42.jpg)
42
0.0x
0.5x
1.0x
1.5x
2.0x
2.5x
Caffe/AlexnetOWT Caffe/GoogLenet Caffe/VGG-D Caffe/Incep v3 Caffe/ResNet-50
8x M40 8x P40 8x P100 PCIE 8x P100 NVLINK
Refer to “Measurement Config.” slide for HW & SW details, run with real data set
AlexnetOWT/GoogLenet use total batch size=1024, VGG-D uses total batch size=512, Incep-v3/ResNet-50 use total batch size=256
Speedup
Img/sec4980 2177 516 433 653
1.45x
1.25x
2.5x
13539
3963
1157
732
1173
P100 FOR FASTEST TRAININGFP32 Training, 8 GPU Node
1.2x1.15x
![Page 43: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/43.jpg)
43
NVIDIA DEEP LEARNING PARTNERS
Graph Analytics Enterprises Data ManagementDL Frameworks Enterprise DL
Services Core Analytics Tech
![Page 44: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/44.jpg)
44
TESLA PASCAL FAMILY
![Page 45: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/45.jpg)
45
END-TO-END PRODUCT FAMILY
TRAININGINFERENCE
EMBEDDED
Jetson TX1
DATA CENTER
Tesla P4
Tesla P40
AUTOMOTIVE
Drive PX2
Tesla P100
Tesla P40 & P4
Tesla P100
Quadro GP100
DATA CENTERDESKTOP
FULLY INTERGRATED DL SUPERCOMPUTER
![Page 46: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/46.jpg)
46
TESLA PRODUCTS RECOMMENDATION
PRODUCT P100 NVLINK P100 PCIE P40 P4
Target Use Cases
• Highest DL training perf
• Fastest time-to-solution
• Larger “Model Parallel”
DL model with 16GB x 8
• HPC DC running mix of
CPU and GPU workload
• Best throughput / $
with mix workload
• Highest inference perf
• Larger “Data Parallel”
DL model with 24GB
• Low power, low profile
optimized for scale out
deployment
• Most efficient inference
and video processing
Best Configs.
• 8 way Hybrid Cube Mesh • 2-4 GPU/node (HPC) • Up to 8 GPU/node • 1-2 GPU/node
1st Server Ship
• Available Now • Available Now • OEM starting Oct’16 • OEM starting Nov’16
![Page 47: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/47.jpg)
47
K80 M40 M4P100
(SXM2)
P100
(PCIE)P40 P4
GPU 2x GK210 GM200 GM206 GP100 GP100 GP102 GP104
PEAK FP64 (TFLOPs) 2.9 NA NA 5.3 4.7 NA NA
PEAK FP32 (TFLOPs) 8.7 7 2.2 10.6 9.3 12 5.5
PEAK FP16 (TFLOPs) NA NA NA 21.2 18.7 NA NA
PEAK TIOPs NA NA NA NA NA 47 22
Memory Size 2x 12GB GDDR5 24 GB GDDR5 4 GB GDDR5 16 GB HBM2 16/12 GB HBM2 24 GB GDDR5 8 GB GDDR5
Memory BW 480 GB/s 288 GB/s 80 GB/s 732 GB/s 732/549 GB/s 346 GB/s 192 GB/s
Interconnect PCIe Gen3 PCIe Gen3 PCIe Gen3NVLINK +
PCIe Gen3PCIe Gen3 PCIe Gen3 PCIe Gen3
ECC Internal + GDDR5 GDDR5 GDDR5 Internal + HBM2 Internal + HBM2 GDDR5 GDDR5
Form Factor PCIE Dual Slot PCIE Dual Slot PCIE LP SXM2 PCIE Dual Slot PCIE Dual Slot PCIE LP
Power 300 W 250 W 50-75 W 300 W 250 W 250 W 50-75 W
TESLA PRODUCTS DECODER
![Page 48: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/48.jpg)
48
TESLA VOLTA IS COMMING …
![Page 49: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/49.jpg)
49
TESLA V100THE MOST ADVANCED DATA CENTER GPU EVER BUILT
5,120 CUDA cores
640 NEW Tensor cores
7.5 FP64 TFLOPS | 15 FP32 TFLOPS
120 Tensor TFLOPS
20MB SM RF | 16MB Cache | 16GB HBM2 @ 900 GB/s
300 GB/s NVLink
![Page 50: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/50.jpg)
50
NEW TENSOR CORE BUILT FOR AIDelivering 120 TFLOPS of DL Performance
TENSOR CORE
ALL MAJOR FRAMEWORKSVOLTA-OPTIMIZED cuDNN
MATRIX DATA OPTIMIZATION:
Dense Matrix of Tensor Compute
TENSOR-OP CONVERSION:
FP32 to Tensor Op Data for
Frameworks
TENSOR CORE
VOLTA TENSOR CORE 4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]Optimized For Deep Learning
![Page 51: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/51.jpg)
51
REVOLUTIONARY AI PERFORMANCE3X Faster DL Training Performance
Over 80x DL Training Performance in 3 Years
1x K80cuDNN2
4x M40cuDNN3
8x P100cuDNN6
8x V100cuDNN7
0x
20x
40x
60x
80x
100x
Q1
15
Q3
15
Q2
17
Q2
16
Googlenet Training Performance(Speedup Vs K80)
Speedup v
s K80
85% Scale-Out EfficiencyScales to 64 GPUs with Microsoft
Cognitive Toolkit
0 5 10 15
64X V100
8X V100
8X P100
Multi-Node Training with NCCL2.0(ResNet-50)
ResNet50 Training for 90 Epochs with 1.28M images dataset | Cognitive Toolkit with NCCL 2.0 | V100 performance measured on pre-production
hardware.
1 Hour
7.4 Hours
18 Hours
3X Reduction in Time to Train Over P100
0 10 20
1XV100
1XP100
2XCPU
LSTM Training(Neural Machine Translation)
Neural Machine Translation Training for 13 Epochs |German ->English, WMT15 subset | CPU = 2x Xeon E5 2699 V4 | V100 performance
measured on pre-production hardware.
15 Days
18 Hours
6 Hours
![Page 52: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/52.jpg)
52
SINGLE UNIVERSAL GPU FOR ALL ACCELERATED WORKLOADS
10M Users 40 years of video/day
270M Items sold/day43% on mobile devices
V100 UNIVERSAL GPU
BOOSTS ALL ACCELERATED WORKLOADS
HPC AI Training AI Inference Virtual Desktop
1.5XVs P100
3XVs P100
3XVs P100
2XVs M60
![Page 53: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/53.jpg)
53
TESLA V100 SPECIFICATIONS
Compute 7.5 TF DP ∙ 15 TF SP ∙ 120 TF DL
Memory HBM2: 900 GB/s ∙ 16 GB
InterconnectNVLink (up to 300 GB/s) +
PCIe Gen3 (up to 32 GB/s)
Availability
DGX-1: Q3 2017
OEM : Q4 2017
![Page 54: MACHINE LEARNING WITH NVIDIA AND IBM POWER AI · Refer to “Measurement Config.” slide for HW & SW details, run with real data set AlexnetOWT/GoogLenet use total batch size=1024,](https://reader033.fdocuments.in/reader033/viewer/2022042307/5ed388b381ffeb588c307dcc/html5/thumbnails/54.jpg)