Transcript of “The Tesla Accelerated Computing Platform”, Axel Koehler, Principal Solution Architect, NVIDIA
HPC Advisory Council Meeting Lugano | 22 March 2016
The Tesla Accelerated Computing Platform Axel Koehler , Principal Solution Architect
2
Agenda
Introduction
TESLA Platform for HPC
TESLA Platform for HYPERSCALE
TESLA Platform for MACHINE LEARNING
TESLA System Software and Tools
Data Center GPU Manager, Docker
3
ENTERPRISE AUTO GAMING DATA CENTER PRO VISUALIZATION
4
TESLA PLATFORM PRODUCT STACK
Software: Accelerated Computing Toolkit (HPC) · GRID 2.0 (Enterprise Virtualization) · Hyperscale Suite (Web Services)
System Tools & Services: Enterprise Services · Data Center GPU Manager · Mesos · Docker
Accelerators: Tesla K80 (HPC) · Tesla M60, M6 (Enterprise Virtualization) · Tesla M40 (DL Training), Tesla M4 (Hyperscale)
5
TESLA PLATFORM FOR HPC
6
CPU Optimized for Serial Tasks
GPU Accelerator Optimized for Parallel Tasks
HETEROGENEOUS COMPUTING MODEL Complementary Processors Work Together
7
COMMON PROGRAMMING MODELS ACROSS MULTIPLE CPUS
x86
Libraries
Programming Languages
Compiler Directives
Example libraries: AmgX, cuBLAS
8
370 GPU-Accelerated Applications
www.nvidia.com/appscatalog
9
TESLA K80 World’s Fastest Accelerator for HPC & Data Analytics
[Chart: simulation time in days, Tesla K80 server vs. dual-CPU server]
AMBER Benchmark: PME-JAC-NVE simulation for 1 microsecond. CPU: E5-2698 v3 @ 2.3 GHz, 64 GB system memory, CentOS 6.2
CUDA Cores: 2496
Peak DP: 1.9 TFLOPS
Peak DP w/ Boost: 2.9 TFLOPS
GDDR5 Memory: 24 GB
Bandwidth: 480 GB/s
Power: 300 W
GPU Boost: Dynamic
Simulation Time from 1 Month to 1 Week
5x Faster AMBER Performance
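The headline claims on this slide are consistent with each other; a back-of-envelope check (assuming a 30-day month, which is not stated on the slide):

```python
# Sanity check: a 5x speedup turns a ~1 month CPU simulation
# into roughly one week on a K80.
cpu_days = 30          # "1 Month" on the slide, assumed to be 30 days
speedup = 5            # "5x Faster AMBER Performance"
gpu_days = cpu_days / speedup
print(gpu_days)        # 6.0 days, i.e. about one week
```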
10
VISUALIZE DATA INSTANTLY FOR FASTER SCIENCE
Traditional: Slower Time to Discovery
• CPU supercomputer + viz cluster: simulation 1 week, data transfer, visualization 1 day
• Multiple iterations → Time to Discovery = Months

Tesla Platform: Faster Time to Discovery
• GPU-accelerated supercomputer: visualize while you simulate, without data transfers
• Restart simulation instantly; multiple iterations → Time to Discovery = Weeks
• Flexible, Scalable, Interactive
11
EGL CONTEXT MANAGEMENT
Today, systems support OpenGL under X
EGL: driver-based context management, no X server required
Support for full OpenGL*, not only OpenGL ES
Available in e.g. VTK
New opportunities for CUDA/OpenGL** interop
*Full OpenGL in r355.11; **CUDA interop in r358.7
[Diagram, “Leaving it to the driver”: ParaView/VMD render through the Tesla driver with EGL directly on a Tesla GPU, without an X server]
12
SCALABLE RENDERING AND COMPOSITING
Large-scale (volume) data visualization
Interactive visualization of TB of data
Stand-alone or coupling into simulation
HW Accelerated remote rendering
Plugin for ParaView available
http://www.nvidia-arc.com/products/nvidia-index.html
NVIDIA INDEX
Dataset from NCSA Blue Waters
13
NVLINK: A HIGH-SPEED GPU INTERCONNECT
Whitepaper: http://www.nvidia.com/object/nvlink.html
GPU-to-CPU via NVLink: a Pascal GPU connects to an NVLink-enabled CPU; the CPU side has DDR4 memory (10s-100s GB at 50-75 GB/s), the GPU side has HBM (16-32 GB at 1 TB/s)
GPU-to-GPU via NVLink: Pascal GPUs connect to each other over NVLink while attaching to an x86 CPU through a PCIe switch
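Why the link speeds matter can be seen with a small calculation; the PCIe 3.0 x16 figure of ~16 GB/s is an assumption, the other numbers come from the slide:

```python
# Idealized time to move a 16 GB (HBM-sized) working set over each link,
# ignoring protocol overhead.
def transfer_time_s(gigabytes, gb_per_s):
    """Transfer time in seconds at a given sustained bandwidth."""
    return gigabytes / gb_per_s

data_gb = 16                            # low end of the HBM range on the slide
print(transfer_time_s(data_gb, 16))     # PCIe 3.0 x16, ~16 GB/s (assumed): 1.0 s
print(transfer_time_s(data_gb, 1000))   # HBM at 1 TB/s: ~0.016 s
```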
14
U.S. TO BUILD TWO FLAGSHIP SUPERCOMPUTERS Powered by the Tesla Platform
100-300 PFLOPS Peak
10x in Scientific App Performance
IBM POWER9 CPU + NVIDIA Volta GPU
NVLink High Speed Interconnect
40 TFLOPS per Node, >3,400 Nodes
2017
Major Step Forward on the Path to Exascale
15
TESLA PLATFORM FOR HYPERSCALE
16
EXABYTES OF CONTENT PRODUCED DAILY User-Generated Content Dominates Web Services
10M Users 40 years of video/day
1.7M Broadcasters Users watch 1.5 hours/day
6B Queries/day 10% use speech
270M Items sold/day 43% on mobile devices
8B Video views/day 400% growth in 6 months
300 hours of video/minute 50% on mobile devices
Challenge: Harnessing the Data Tsunami in Real-time
17
TESLA FOR HYPERSCALE
10M Users 40 years of video/day
270M Items sold/day 43% on mobile devices
TESLA M4 TESLA M40
HYPERSCALE SUITE
POWERFUL: Fastest Deep Learning Performance LOW POWER: Highest Hyperscale Throughput
GPU-Accelerated FFmpeg
Image Compute Engine
GPU REST Engine
18
GPU REST Engine (GRE) SDK Accelerated Microservices for Web and Mobile Applications
Supercomputer performance for hyper-scale datacenters
Powerful nodes with low response time (~10ms)
Easy to develop new microservices
Open source, integrates with existing infrastructure
Easy to deploy & scale
Ready-to-run Dockerfile
GPU REST Engine
Image Classification
Speech Recognition …
Image Scaling
developer.nvidia.com/gre
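GRE ships its own service scaffolding; as a language-agnostic illustration of the pattern (a thin HTTP layer in front of a GPU inference call), here is a minimal Python sketch. The `/api/classify` route and the `classify()` stub are invented for illustration; a real service would invoke a GPU inference library at that point.

```python
# Minimal sketch of a GRE-style microservice: HTTP in front of inference.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def classify(payload: bytes) -> dict:
    """Stand-in for a GPU-backed classifier (stubbed result)."""
    return {"label": "cat", "confidence": 0.99, "bytes_received": len(payload)}

class InferenceHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/api/classify":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        result = json.dumps(classify(self.rfile.read(length))).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(result)))
        self.end_headers()
        self.wfile.write(result)

    def log_message(self, *args):   # keep the example quiet
        pass
```

Serving is then `HTTPServer(("0.0.0.0", 8000), InferenceHandler).serve_forever()`; in a real deployment the ready-to-run Dockerfile mentioned above handles the packaging.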
19
TESLA M4: Highest-Throughput Hyperscale Workload Acceleration
CUDA Cores: 1024
Peak SP: 2.2 TFLOPS
GDDR5 Memory: 4 GB
Bandwidth: 88 GB/s
Form Factor: PCIe Low Profile
Power: 50-75 W
Video Transcode: H.264 & H.265, SD & HD
Video Processing: Stabilization and Enhancements
Image Processing: Resize, Filter, Search, Auto-Enhance
Machine Learning Inference
20
JETSON TX1 Embedded
Deep Learning
• Unmatched performance under 10W
• Advanced tech for autonomous machines
• Smaller than a credit card
JETSON TX1
GPU: 1 TFLOP/s 256-core Maxwell
CPU: 64-bit ARM A57 CPUs
Memory: 4 GB LPDDR4 | 25.6 GB/s
Storage: 16 GB eMMC
Wi-Fi/BT: 802.11 2x2 ac / BT Ready
Networking: 1 Gigabit Ethernet
Size: 50 mm x 87 mm
Interface: 400-pin board-to-board connector
21
HYPERSCALE DATACENTER NOW ACCELERATED Tesla Platform
SERVERS FOR TRAINING Scales with Data
SERVERS FOR INFERENCE, WEB SERVICES Scales with Users
Exabytes of content/day → trained model → model deployed on every server → billions of devices
22
TESLA PLATFORM FOR MACHINE LEARNING
23
DEEP LEARNING EVERYWHERE
INTERNET & CLOUD
Image Classification Speech Recognition
Language Translation Language Processing Sentiment Analysis Recommendation
MEDIA & ENTERTAINMENT
Video Captioning Video Search
Real Time Translation
AUTONOMOUS MACHINES
Pedestrian Detection Lane Tracking
Recognize Traffic Sign
SECURITY & DEFENSE
Face Detection Video Surveillance Satellite Imagery
MEDICINE & BIOLOGY
Cancer Cell Detection Diabetic Grading Drug Discovery
24
Why is Deep Learning Hot Now?
Big Data Availability GPU Acceleration New ML Techniques
350 million images uploaded per day
2.5 Petabytes of customer data hourly
300 hours of video uploaded every minute
25
WHAT IS DEEP LEARNING?
[Figure: hierarchical feature learning, from raw image pixels up to the classification “Volvo XC90”]
Image source: “Unsupervised Learning of Hierarchical Representations with Convolutional Deep Belief Networks”, ICML 2009 & Comm. ACM 2011. Honglak Lee, Roger Grosse, Rajesh Ranganath, and Andrew Ng.
26
DRIVE PX AUTO-PILOT CAR COMPUTER
NVIDIA GPU DEEP LEARNING SUPERCOMPUTER
Neural Net Model
Classified Object
Camera Inputs
Cars That See Better … And Learn
27
Camera Inputs
Medical Compute Center (Training)
Hospital/Doctor (Inference)
Classified Object
Med. device inputs
Neural Net Model
Deep Learning Platform In Medical
Feedback
28
GPUs deliver:
• same or better prediction accuracy
• faster results
• smaller footprint
• lower power

Neural networks and GPUs are a natural match: both are inherently parallel, both are built around matrix operations, and both thrive on high FLOPS and memory bandwidth.
GPUS AND DEEP LEARNING
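The matrix-operations point is the crux: a fully connected layer's forward pass is just a matrix-vector product plus a bias, exactly the kind of work GPUs parallelize well. A minimal pure-Python sketch (frameworks dispatch this same computation to GPU libraries such as cuBLAS):

```python
def dense_forward(W, x, b):
    """Forward pass of a fully connected layer: y = ReLU(W·x + b)."""
    y = [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
         for row, b_i in zip(W, b)]
    return [max(0.0, v) for v in y]            # ReLU activation

# Toy layer: 2 inputs -> 2 outputs
W = [[1.0, 2.0],
     [-1.0, 1.0]]
x = [3.0, 0.5]
b = [0.0, -3.0]
print(dense_forward(W, x, b))                  # [4.0, 0.0]
```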
29
NVIDIA CUDA ACCELERATED COMPUTING PLATFORM
WATSON CHAINER THEANO MATCONVNET
TENSORFLOW CNTK TORCH CAFFE
NVIDIA GPU THE ENGINE OF DEEP LEARNING
cuDNN Deep Learning Primitives
IGNITING ARTIFICIAL INTELLIGENCE
• GPU-accelerated deep learning subroutines
• High-performance neural network training
• Accelerates major deep learning frameworks: Caffe, Theano, Torch
• Up to 3.5x faster AlexNet training in Caffe than baseline GPU
Millions of Images Trained Per Day
Tiled FFT up to 2x faster than FFT
developer.nvidia.com/cudnn
[Charts: images trained per day increasing across cuDNN 1-4; tiled FFT speedup relative to standard FFT, up to ~2.5x on the axis]
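The primitive behind these numbers is the convolution that the listed frameworks hand off to cuDNN. A reference implementation in plain Python shows what the GPU kernels compute: a "valid" cross-correlation, which is how deep learning frameworks define convolution.

```python
def conv2d_valid(image, kernel):
    """Direct 2D convolution (cross-correlation, 'valid' padding),
    the operation cuDNN provides GPU-optimized kernels for."""
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            row.append(sum(image[i + di][j + dj] * kernel[di][dj]
                           for di in range(kh) for dj in range(kw)))
        out.append(row)
    return out

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
print(conv2d_valid(image, kernel))   # [[6, 8], [12, 14]]
```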
31
NVIDIA DIGITS Interactive Deep Learning GPU Training System
Test Image
Monitor Progress Configure DNN Process Data Visualize Layers
http://developer.nvidia.com/digits
32
TESLA M40 World’s Fastest Accelerator for Deep Learning Training
[Chart: Caffe AlexNet training time in days: dual-CPU server ~13 days vs. GPU server with 4x Tesla M40 ~1 day; 13x faster training]
CUDA Cores: 3072
Peak SP: 7 TFLOPS
GDDR5 Memory: 12 GB
Bandwidth: 288 GB/s
Power: 250 W
Reduce Training Time from 13 Days to just 1 Day
Note: Caffe benchmark with AlexNet, CPU server uses 2x E5-2680v3 12 Core 2.5GHz CPU, 128GB System Memory, Ubuntu 14.04
33
Facebook’s deep learning machine Purpose-Built for Deep Learning Training
2x Faster Training for Faster Deployment
2x Larger Networks for Higher Accuracy
Powered by Eight Tesla M40 GPUs
Open Rack Compliant
Serkan Piantino Engineering Director of Facebook AI Research
“Most of the major advances in machine learning and AI in the past few years have been contingent on tapping into powerful GPUs and huge data sets to build and train advanced models”
34
Designed for AI Computing at large scale
Built on the NVIDIA Tesla Platform
• 8 Tesla M40s deliver aggregate 96 GB GDDR5 memory and 56 teraflops of SP performance
• Leverages world’s leading deep learning platform to tap into frameworks such as Torch and libraries such as cuDNN
Operational Efficiency and Serviceability
• Free-air Cooled Design Optimizes Thermal and Power Efficiency
• Components swappable without tools
• Configurable PCI-e for versatility
35
NCCL
GOAL: build a research library of accelerated collectives that is easily integrated and topology-aware, so as to improve the scalability of multi-GPU applications
APPROACH:
• Pattern the library after MPI’s collectives
• Handle the intra-node communication in an optimal way
• Provide the necessary functionality for MPI to build on top to handle inter-node communication
Accelerating Multi-GPU Communications for Deep Learning
github.com/NVIDIA/nccl
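NCCL's collectives are built around ring algorithms, whose data flow can be simulated in plain Python. This is a sketch of the ring all-reduce pattern under that assumption, not NCCL's actual implementation:

```python
def ring_allreduce(buffers):
    """Simulate a ring all-reduce: afterwards every rank's buffer holds the
    elementwise sum across all ranks. Each buffer is split into one chunk
    per rank; chunks travel around the ring in two phases."""
    n, m = len(buffers), len(buffers[0])
    bounds = lambda c: (c * m // n, (c + 1) * m // n)   # slice of chunk c

    # Phase 1, reduce-scatter: at step s, rank r sends chunk (r - s) mod n
    # to rank (r + 1) mod n, which accumulates it into its own copy.
    for s in range(n - 1):
        sends = []
        for r in range(n):                              # snapshot outgoing data
            lo, hi = bounds((r - s) % n)
            sends.append(((r - s) % n, buffers[r][lo:hi]))
        for r in range(n):                              # deliver and reduce
            c, data = sends[(r - 1) % n]
            lo, _ = bounds(c)
            for i, v in enumerate(data):
                buffers[r][lo + i] += v

    # Phase 2, all-gather: fully reduced chunks circulate and overwrite.
    for s in range(n - 1):
        sends = []
        for r in range(n):
            lo, hi = bounds((r + 1 - s) % n)
            sends.append(((r + 1 - s) % n, buffers[r][lo:hi]))
        for r in range(n):
            c, data = sends[(r - 1) % n]
            lo, hi = bounds(c)
            buffers[r][lo:hi] = data
    return buffers
```

The appeal of the ring is that each rank only ever talks to its neighbor, so every link is used at every step, which is what makes the pattern bandwidth-optimal and topology-aware.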
TESLA SYSTEM SOFTWARE AND TOOLS
DATA CENTER GPU MANAGEMENT
Device Management (all GPUs supported): board-level GPU configuration & monitoring
• Device identification
• Configuration & monitoring
• Clock management

Active Diagnostics (Tesla GPUs only): diagnostics, recovery & system validation
• GPU recovery & isolation
• System validation
• Comprehensive diagnostics

Health & Governance (Tesla GPUs only): proactive health, policy & power management
• Real-time monitoring & analysis
• Governance policies
• Power & clock management

Available today via the Data Center GPU Manager (DCGM)
DATA CENTER GPU MANAGER (DCGM)
[Diagram: DCGM runs on each compute node alongside the Tesla enterprise driver and a management software agent, exposing APIs over the GPUs; a management node runs the data center cluster management software, reaching the compute nodes over the network; admins can also drive DCGM directly through its CLI]
DCGM Available as library & CLI
Ready for integration into ISV management software, e.g. Bright Cluster Manager, IBM Platform Cluster Manager
Ready for integration with HPC job schedulers, e.g. Altair PBS Works, Moab & Maui, IBM Platform LSF, SLURM, Univa Grid Engine
DCGM currently in Public Beta
http://www.nvidia.com/object/data-center-gpu-manager.html
GROWING CONTAINER ADOPTION IN DATA CENTER
“Docker spreads like wildfire, especially in the enterprise”, RightScale 2016 Cloud Survey Report
>2X growth in Docker adoption in a year
Across Enterprise, Cloud and HPC
GPU CONTAINERIZATION USING NVIDIA-DOCKER
Single command-line interface to take care of all deployment steps: discovery, config/setup, device allocation
Pre-built images on Docker Hub (CUDA, Caffe, DIGITS); reproducible builds across heterogeneous targets
Remote deployment using nvidia-docker-plugin and REST interface
Key Highlights
• NVIDIA Docker on GitHub (experimental): available now
• Bundled with CUDA product: future versions (in planning)
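The deployment steps the wrapper automates amount to discovering GPU device nodes and turning them into `docker run` arguments. A hedged sketch of that idea follows; it is not the actual nvidia-docker implementation, and the device glob and volume name are assumptions:

```python
# Illustrative sketch: build a GPU-aware `docker run` argument list by
# discovering NVIDIA device nodes (paths and volume name are assumptions).
import glob

def gpu_docker_args(image, command, dev_glob="/dev/nvidia*"):
    """Return a docker run argument list exposing NVIDIA device nodes."""
    args = ["docker", "run", "--rm"]
    for dev in sorted(glob.glob(dev_glob)):
        args += ["--device", dev]          # pass each device node through
    # the real wrapper also mounts a matching driver volume into the container
    args += ["--volume", "nvidia_driver:/usr/local/nvidia:ro"]
    return args + [image] + list(command)
```

In practice one would simply run the wrapper itself, e.g. `nvidia-docker run nvidia/cuda nvidia-smi`, and let it perform this discovery and mounting automatically.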
Axel Koehler [email protected]