Axel Koehler, Principal Solution Architect… · GPU REST Engine (GRE) SDK Accelerated...
HPC Advisory Council Meeting Lugano | 22 March 2016
The Tesla Accelerated Computing Platform
Axel Koehler, Principal Solution Architect
Agenda
Introduction
TESLA Platform for HPC
TESLA Platform for HYPERSCALE
TESLA Platform for MACHINE LEARNING
TESLA System Software and Tools
Data Center GPU Manager, Docker
ENTERPRISE · AUTO · GAMING · DATA CENTER · PRO VISUALIZATION
TESLA PLATFORM PRODUCT STACK
Segments: HPC | Enterprise Virtualization | Hyperscale (DL Training, Web Services)
Software: Accelerated Computing Toolkit | GRID 2.0 | Hyperscale Suite
System Tools & Services: Enterprise Services · Data Center GPU Manager · Mesos · Docker
Accelerators: Tesla K80 | Tesla M60, M6 | Tesla M40, Tesla M4
TESLA PLATFORM FOR HPC
HETEROGENEOUS COMPUTING MODEL: Complementary Processors Work Together
CPU: Optimized for Serial Tasks
GPU Accelerator: Optimized for Parallel Tasks
COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS
x86
Libraries: AmgX, cuBLAS
Programming Languages
Compiler Directives
370 GPU-Accelerated Applications
www.nvidia.com/appscatalog
TESLA K80: World’s Fastest Accelerator for HPC & Data Analytics
5x Faster AMBER Performance: Simulation Time from 1 Month to 1 Week
[Chart: # of days, Tesla K80 Server vs. Dual CPU Server]
AMBER Benchmark: PME-JAC-NVE simulation for 1 microsecond. CPU: E5-2698v3 @ 2.3 GHz, 64 GB system memory, CentOS 6.2
CUDA Cores: 2496
Peak DP: 1.9 TFLOPS
Peak DP w/ Boost: 2.9 TFLOPS
GDDR5 Memory: 24 GB
Bandwidth: 480 GB/s
Power: 300 W
GPU Boost: Dynamic
VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE
Traditional: Slower Time to Discovery
CPU Supercomputer, Viz Cluster
Simulation: 1 Week; Viz: 1 Day
Data Transfer: Days
Multiple Iterations
Time to Discovery = Months
Tesla Platform: Faster Time to Discovery
GPU-Accelerated Supercomputer
Visualize while you simulate, without data transfers
Restart Simulation Instantly; Multiple Iterations
Time to Discovery = Weeks
Flexible · Scalable · Interactive
EGL CONTEXT MANAGEMENT: Leaving it to the driver
Top systems support OpenGL under X
EGL: driver-based context management
Support for full OpenGL*, not only GL ES
Available in e.g. VTK
New opportunities for CUDA/OpenGL** interop
*Full OpenGL in r355.11; **CUDA interop in r358.7
[Diagram labels: ParaView/VMD, X-server, Tesla driver with EGL, Tesla GPU]
NVIDIA INDEX: SCALABLE RENDERING AND COMPOSITING
Large-scale (volume) data visualization
Interactive visualization of TB of data
Stand-alone or coupling into simulation
HW-accelerated remote rendering
Plugin for ParaView available
http://www.nvidia-arc.com/products/nvidia-index.html
[Image: dataset from NCSA Blue Waters]
NVLINK: A HIGH-SPEED GPU INTERCONNECT
GPU to CPU via NVLink
Pascal GPU: HBM, 16-32 GB, 1 TB/s
CPU (NVLink enabled): DDR4 memory, 10s-100s GB, 50-75 GB/s
GPU to GPU via NVLink
Two Pascal GPUs linked by NVLink
CPU (x86) attached through a PCIe switch
Whitepaper: http://www.nvidia.com/object/nvlink.html
U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS Powered by the Tesla Platform
100-300 PFLOPS Peak
10x in Scientific App Performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink High Speed Interconnect
40 TFLOPS per Node, >3,400 Nodes
2017
Major Step Forward on the Path to Exascale
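As a rough sanity check, the per-node and node-count figures above can be multiplied out to see that they land inside the quoted 100-300 PFLOPS range. The sketch below is plain arithmetic; the names are illustrative, and 3,400 is treated as a lower bound because the slide says more than 3,400 nodes.

```python
# Lower-bound peak estimate from the slide's figures (names are illustrative).
TFLOPS_PER_NODE = 40   # "40 TFLOPS per Node"
MIN_NODES = 3400       # ">3,400 Nodes", so this is only a lower bound


def peak_pflops(tflops_per_node: float, nodes: int) -> float:
    """Aggregate peak in PFLOPS (1 PFLOPS = 1000 TFLOPS)."""
    return tflops_per_node * nodes / 1000


lower_bound = peak_pflops(TFLOPS_PER_NODE, MIN_NODES)
print(f"at least {lower_bound:.0f} PFLOPS peak")  # at least 136 PFLOPS peak
```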
TESLA PLATFORM FOR HYPERSCALE
EXABYTES OF CONTENT PRODUCED DAILY: User-Generated Content Dominates Web Services
10M Users; 40 years of video/day
1.7M Broadcasters; users watch 1.5 hours/day
6B Queries/day; 10% use speech
270M Items sold/day; 43% on mobile devices
8B Video views/day; 400% growth in 6 months
300 hours of video/minute; 50% on mobile devices
Challenge: Harnessing the Data Tsunami in Real-time
TESLA FOR HYPERSCALE
10M Users; 40 years of video/day
270M Items sold/day; 43% on mobile devices
TESLA M40: POWERFUL, Fastest Deep Learning Performance
TESLA M4: LOW POWER, Highest Hyperscale Throughput
HYPERSCALE SUITE: GPU Accelerated FFmpeg, Image Compute Engine, GPU REST Engine
GPU REST Engine (GRE) SDK: Accelerated Microservices for Web and Mobile Applications
Supercomputer performance for hyper-scale datacenters
Powerful nodes with low response time (~10 ms)
Easy to develop new microservices
Open source, integrates with existing infrastructure
Easy to deploy & scale
Ready-to-run Dockerfile
[Diagram: HTTP (~10 ms) requests to GPU REST Engine services: Image Classification, Speech Recognition, Image Scaling, …]
developer.nvidia.com/gre
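To make the idea of an HTTP-fronted inference microservice concrete, here is a minimal sketch using only the Python standard library. It is not the GRE SDK and does not use its API: the `/classify` path, the `classify` placeholder, and the helper names are all hypothetical, and the GPU work is stubbed out with a function that only reports the payload size.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


def classify(image_bytes: bytes) -> dict:
    """Hypothetical stand-in for GPU inference; a real service would run a model here."""
    return {"label": "unknown", "num_bytes": len(image_bytes)}


class ClassifyHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Only one (hypothetical) endpoint is exposed.
        if self.path != "/classify":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        payload = json.dumps(classify(self.rfile.read(length))).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)

    def log_message(self, fmt, *args):
        pass  # keep the example quiet


def start_server(port: int = 0) -> HTTPServer:
    """Start the service on localhost in a background thread (port 0 = any free port)."""
    server = HTTPServer(("127.0.0.1", port), ClassifyHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def post_image(server: HTTPServer, data: bytes) -> dict:
    """POST raw bytes to the running service and decode the JSON reply."""
    url = f"http://127.0.0.1:{server.server_address[1]}/classify"
    request = urllib.request.Request(url, data=data, method="POST")
    # Bypass any configured proxy so the request stays on localhost.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    with opener.open(request, timeout=5) as response:
        return json.loads(response.read())
```

A caller would start the service with `server = start_server()`, send data with `post_image(server, image_bytes)`, and stop it with `server.shutdown()`. Keeping the model loaded in a long-running process, rather than starting one per request, is the kind of design that makes response times in the ~10 ms range plausible.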
TESLA M4: Highest Throughput Hyperscale Workload Acceleration
CUDA Cores: 1024
Peak SP: 2.2 TFLOPS
GDDR5 Memory: 4 GB
Bandwidth: 88 GB/s
Form Factor: PCIe Low Profile
Power: 50-75 W
Video Processing: Stabilization and Enhancements
Image Processing: Resize, Filter, Search, Auto-Enhance
Video Transcode: H.264 & H.265, SD & HD
Machine Learning Inference
JETSON TX1: Embedded Deep Learning
• Unmatched performance under 10 W
• Advanced tech for autonomous machines
• Smaller than a credit card
GPU: 1 TFLOP/s, 256-core Maxwell
CPU: 64-bit ARM A57 CPUs
Memory: 4 GB LPDDR4 | 25.6 GB/s
Storage: 16 GB eMMC
Wifi/BT: 802.11 2x2 ac / BT Ready
Networking: 1 Gigabit Ethernet
Size: 50 mm x 87 mm
Interface: 400-pin board-to-board connector
HYPERSCALE DATACENTER NOW ACCELERATED: Tesla Platform
SERVERS FOR TRAINING: Scales with Data
SERVERS FOR INFERENCE, WEB SERVICES: Scales with Users
Exabytes of Content / Day → Trained Model → Model Deployed on Every Server → Billions of Devices
TESLA PLATFORM FOR MACHINE LEARNING
DEEP LEARNING EVERYWHERE
INTERNET & CLOUD: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
MEDIA & ENTERTAINMENT: Video Captioning, Video Search, Real Time Translation
AUTONOMOUS MACHINES: Pedestrian Detection, Lane Tracking, Recognize Traffic Sign
SECURITY & DEFENSE: Face Detection, Video Surveillance, Satellite Imagery
MEDICINE & BIOLOGY: Cancer Cell Detection, Diabetic Grading, Drug Discovery
Why is Deep Learning Hot Now?
Big Data Availability · GPU Acceleration · New ML Techniques
350 million images uploaded per day
2.5 Petabytes of customer data hourly
300 hours of video uploaded every minute
WHAT IS DEEP LEARNING?
Image → “Volvo XC90”
Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, ICML 2009 & Comm. ACM 2011. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.
Cars That See Better … And Learn
NVIDIA GPU DEEP LEARNING SUPERCOMPUTER → Neural Net Model → DRIVE PX AUTO-PILOT CAR COMPUTER
Camera Inputs → Classified Object
Deep Learning Platform In Medical
Medical Compute Center (Training) → Neural Net Model → Hospital/Doctor (Inference)
Camera Inputs, Med. device inputs → Classified Object
Feedback
GPUS AND DEEP LEARNING
Property: NEURAL NETWORKS | GPUS
Inherently Parallel: ✓ | ✓
Matrix Operations: ✓ | ✓
FLOPS: ✓ | ✓
Bandwidth: ✓ | ✓
GPUs deliver:
- same or better prediction accuracy
- faster results
- smaller footprint
- lower power
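The "Matrix Operations" and "Inherently Parallel" rows can be illustrated with a single fully connected layer. The sketch below is plain Python with hypothetical names, not code from any framework listed in this deck: each output value is an independent dot product of one weight row with the input, which is why such layers map naturally onto many parallel GPU threads.

```python
def dense_forward(weights, bias, x):
    """Forward pass of one fully connected layer with a ReLU activation.

    weights: list of rows, one row per output neuron
    bias:    one bias value per output neuron
    x:       input vector

    Every output element depends only on its own weight row, so all of them
    could be computed at the same time.
    """
    outputs = []
    for row, b in zip(weights, bias):
        activation = sum(w * xi for w, xi in zip(row, x)) + b  # dot product + bias
        outputs.append(max(0.0, activation))                   # ReLU
    return outputs


print(dense_forward([[1.0, 2.0], [-1.0, 0.5]], [0.5, 0.0], [2.0, 1.0]))  # [4.5, 0.0]
```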
NVIDIA GPU: THE ENGINE OF DEEP LEARNING
Frameworks: WATSON, CHAINER, THEANO, MATCONVNET, TENSORFLOW, CNTK, TORCH, CAFFE
NVIDIA CUDA ACCELERATED COMPUTING PLATFORM
cuDNN: Deep Learning Primitives
IGNITING ARTIFICIAL INTELLIGENCE
• GPU-accelerated Deep Learning subroutines
• High performance neural network training
• Accelerates major Deep Learning frameworks: Caffe, Theano, Torch
• Up to 3.5x faster AlexNet training in Caffe than baseline GPU
[Chart: Millions of Images Trained Per Day, cuDNN 1 through cuDNN 4]
[Chart: Tiled FFT up to 2x faster than FFT]
developer.nvidia.com/cudnn
NVIDIA DIGITS: Interactive Deep Learning GPU Training System
Process Data · Configure DNN · Monitor Progress · Visualize Layers · Test Image
http://developer.nvidia.com/digits
TESLA M40: World’s Fastest Accelerator for Deep Learning Training
13x Faster Training with Caffe: Reduce Training Time from 13 Days to just 1 Day
[Chart: number of days, GPU Server with 4x TESLA M40 vs. Dual CPU Server]
Note: Caffe benchmark with AlexNet; CPU server uses 2x E5-2680v3 12-core 2.5 GHz CPUs, 128 GB system memory, Ubuntu 14.04
CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250 W
Facebook’s deep learning machine: Purpose-Built for Deep Learning Training
2x Faster Training for Faster Deployment
2x Larger Networks for Higher Accuracy
Powered by Eight Tesla M40 GPUs
Open Rack Compliant
“Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models”
Serkan Piantino, Engineering Director of Facebook AI Research
Designed for AI Computing at large scale
Built on the NVIDIA Tesla Platform
• 8 Tesla M40s deliver aggregate 96 GB GDDR5 memory and 56 teraflops of SP performance
• Leverages world’s leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN
Operational Efficiency and Serviceability
• Free-air Cooled Design Optimizes Thermal and Power Efficiency
• Components swappable without tools
• Configurable PCI-e for versatility
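The aggregate figures quoted for the eight-GPU system follow directly from the per-GPU Tesla M40 specifications given earlier in this deck (12 GB GDDR5, 7 TFLOPS peak SP). A small sketch of that arithmetic, with illustrative names:

```python
# Per-GPU values are taken from the Tesla M40 slide; names are illustrative.
GPUS = 8
MEMORY_GB_PER_GPU = 12        # "GDDR5 Memory 12 GB"
PEAK_SP_TFLOPS_PER_GPU = 7    # "Peak SP 7 TFLOPS"


def aggregate(per_gpu: float, gpus: int = GPUS) -> float:
    """Simple sum across identical GPUs; ignores any scaling losses."""
    return per_gpu * gpus


print(aggregate(MEMORY_GB_PER_GPU), "GB total GDDR5")       # 96 GB total GDDR5
print(aggregate(PEAK_SP_TFLOPS_PER_GPU), "TFLOPS peak SP")  # 56 TFLOPS peak SP
```

These are peak, additive numbers; how much of that a training job actually sees depends on how well the work scales across the GPUs.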
NCCL: Accelerating Multi-GPU Communications for Deep Learning
GOAL:
• Build a research library of accelerated collectives that is easily integrated and topology-aware, so as to improve the scalability of multi-GPU applications
APPROACH:
• Pattern the library after MPI’s collectives
• Handle the intra-node communication in an optimal way
• Provide the necessary functionality for MPI to build on top to handle inter-node communication
github.com/NVIDIA/nccl
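To show what an MPI-style collective does, here is a didactic simulation of an all-reduce (sum) over a ring of ranks, written in plain Python. It is not NCCL's API, and the slide does not say which algorithm NCCL uses; a ring is simply one well-known way to perform the operation. Each list stands in for one GPU's buffer, and the function name is hypothetical.

```python
def ring_all_reduce(buffers):
    """Simulate a ring all-reduce (sum) over equal-length per-rank buffers.

    Phase 1 (reduce-scatter): after n-1 steps each rank holds one fully
    summed chunk. Phase 2 (all-gather): the summed chunks circulate around
    the ring until every rank holds the complete result.
    """
    n = len(buffers)
    length = len(buffers[0])
    if any(len(b) != length for b in buffers):
        raise ValueError("all ranks must contribute buffers of equal length")
    data = [list(b) for b in buffers]  # work on copies
    # Chunk c covers indices [bounds[c][0], bounds[c][1]).
    bounds = [(c * length // n, (c + 1) * length // n) for c in range(n)]

    def exchange(chunk_of, combine):
        # Snapshot what every rank sends first, so all sends in a step are simultaneous.
        outgoing = []
        for rank in range(n):
            lo, hi = bounds[chunk_of(rank)]
            outgoing.append((lo, data[rank][lo:hi]))
        for rank, (lo, values) in enumerate(outgoing):
            neighbor = data[(rank + 1) % n]
            for i, value in enumerate(values):
                neighbor[lo + i] = combine(neighbor[lo + i], value)

    for step in range(n - 1):  # reduce-scatter: receiver adds the incoming chunk
        exchange(lambda r: (r - step) % n, lambda old, new: old + new)
    for step in range(n - 1):  # all-gather: receiver overwrites with the final chunk
        exchange(lambda r: (r + 1 - step) % n, lambda old, new: new)
    return data


result = ring_all_reduce([[1, 2, 3, 4], [10, 20, 30, 40], [100, 200, 300, 400]])
print(result[0])  # [111, 222, 333, 444]
```

In each step a rank talks only to its ring neighbor and sends about 1/n of the buffer, which is why this pattern is often used when per-link bandwidth is the bottleneck.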
TESLA SYSTEM SOFTWARE AND TOOLS
DATA CENTER GPU MANAGEMENT
Device Management: Board-level GPU Configuration & Monitoring (All GPUs Supported)
• Device Identification
• Configuration & Monitoring
• Clock Management
Active Diagnostics: Diagnostics, Recovery & System Validation (Tesla GPUs Only)
• GPU Recovery & Isolation
• System Validation
• Comprehensive Diagnostics
Health & Governance: Proactive Health, Policy & Power Mgmt. (Tesla GPUs Only)
• Real-time Monitoring & Analysis
• Governance Policies
• Power & Clock Management
Timeline: Today → Data Center GPU Manager (DCGM)
DATA CENTER GPU MANAGER (DCGM)
[Diagram labels: Admin, Management Node, DC Cluster Management SW, Network, Compute Node, Mgmt. SW Agent, DC GPU Manager, APIs, CLI, Tesla Enterprise Driver, GPUs]
DCGM available as library & CLI
Ready for integration into ISV management software, e.g. Bright Cluster Manager, IBM Platform Cluster Manager
Ready for integration with HPC job schedulers, e.g. Altair PBS Works, Moab & Maui, IBM Platform LSF, SLURM, Univa Grid Engine
DCGM currently in Public Beta
http://www.nvidia.com/object/data-center-gpu-manager.html
GROWING CONTAINER ADOPTION IN DATA CENTER
“Docker spreads like wildfire, especially in the enterprise” (RightScale 2016 Cloud Survey Report)
>2X growth in Docker adoption in a year
Across Enterprise, Cloud and HPC
GPU CONTAINERIZATION USING NVIDIA-DOCKER
Key Highlights
Single command-line interface to take care of all deployment steps
• Discovery, Config/setup, Device allocation
Pre-built images on Docker Hub: CUDA, Caffe, DIGITS
• Reproducible builds across heterogeneous targets
Remote deployment using NVIDIA-Docker-Plugin and REST interface
Availability
• NVIDIA Docker on GitHub (experimental): Available Now
• Bundled with CUDA Product: Future Versions (In planning)
Axel Koehler