IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX · 2019-05-08 · IBM SPECTRUM STORAGE FOR AI WITH...

Dr. Adolf Hohl, SA AUTO Datacenter EMEA

IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX: DGX REFERENCE ARCHITECTURE SOLUTION

HOW DO WE TRAIN THESE NETWORKS?

• SINGLE GPU CODE is a dying species

• All our AV DL code is built for MULTI-GPU and is scalable:

• Runs on Single GPU

• Runs on Multi GPU

• Runs on Multi Nodes with Multiple GPUs

• We use a Cluster for DL Training

• Just ONE codebase

• Just ONE way to orchestrate

I talked about these in a previous IBM Meetup (https://www.youtube.com/watch?v=8xj4CK4ZUMQ)
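The "just ONE codebase" idea can be sketched as a script that reads its position in the job from the environment and falls back to a single-process default when no launcher is present. The sketch below assumes an Open MPI launcher, which exports `OMPI_COMM_WORLD_RANK`, `OMPI_COMM_WORLD_SIZE` and `OMPI_COMM_WORLD_LOCAL_RANK`. The `WorkerContext` type and `worker_context` helper are hypothetical names for illustration, not code from the talk.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class WorkerContext:
    rank: int        # global index of this process across all nodes
    size: int        # total number of processes (one per GPU)
    local_rank: int  # index of this process on its node, used to pick the GPU


def worker_context(env=None):
    """Build the context from Open MPI variables; default to a single GPU."""
    env = os.environ if env is None else env
    return WorkerContext(
        rank=int(env.get("OMPI_COMM_WORLD_RANK", 0)),
        size=int(env.get("OMPI_COMM_WORLD_SIZE", 1)),
        local_rank=int(env.get("OMPI_COMM_WORLD_LOCAL_RANK", 0)),
    )


ctx = worker_context()
print(f"process {ctx.rank} of {ctx.size}, using local GPU {ctx.local_rank}")
```

Started directly, the script sees no launcher variables and behaves as single-GPU code. Started under `mpirun` with one process per GPU, the same file runs as a multi-GPU or multi-node job.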

THE TRUE TCO OF AN AI PLATFORM

1. Designing and Building an AI Compute Platform, from Scratch

[Timeline figure: “DIY” TCO from Day 1 to Month 3, showing CAPEX and OPEX, with these phase labels]

• Study & exploration

• Platform design

• HW & SW integration

• Troubleshooting

• Software engineering

• Software optimization

• Design and build for scale

• Software re-optimization

• Productive experimentation

• Training at scale

• Insights

Time and budget spent on things other than data science

NVIDIA DGX-1: THE ESSENTIAL TOOL OF AI

Fastest Start, Effortless Productivity, Revolutionary Performance

1 PFLOPS | 8x Tesla V100 32GB | 300 GB/s NVLink Hybrid Cube Mesh

2x Xeon | 8 TB RAID 0 | Quad 100Gbps, Dual 10GbE | 3U, 3500W

[Figure labels: 8 TB SSD; 8x Tesla V100, 16GB or 32GB]
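The headline figures on this slide can be cross-checked with a little arithmetic. The sketch below uses only numbers printed on the slide (1 PFLOPS, 8 GPUs, 32 GB per GPU, four 100 Gb/s ports). The variable names are illustrative, and the network figure is a raw line rate that ignores protocol overhead.

```python
# Figures copied from the slide.
PEAK_PFLOPS = 1.0        # system peak
NUM_GPUS = 8             # Tesla V100 GPUs per DGX-1
GPU_MEMORY_GB = 32       # per GPU, 32 GB variant
FAST_PORTS = 4           # "Quad 100Gbps"
PORT_GBIT_PER_S = 100    # line rate per port

per_gpu_tflops = PEAK_PFLOPS * 1000 / NUM_GPUS           # share of the peak per GPU
total_gpu_memory_gb = NUM_GPUS * GPU_MEMORY_GB           # aggregate GPU memory
network_gbyte_per_s = FAST_PORTS * PORT_GBIT_PER_S / 8   # bits to bytes

print(f"{per_gpu_tflops:.0f} TFLOPS per GPU")
print(f"{total_gpu_memory_gb} GB of GPU memory in total")
print(f"{network_gbyte_per_s:.0f} GB/s aggregate line rate on the fast ports")
```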

STACKING DGX

Aggregating Resources: Scaling Out

InfiniBand/Converged Ethernet

Interconnected Nodes

• Precondition to Scale

• Precondition for effective MultiNode-MultiGPU scaling

• Precondition for aggregating leftover resources

Storage

SCALING WITH HOROVOD

One Process per GPU, One Data Pipeline per GPU

InfiniBand/Converged Ethernet

Storage

Tower (individual process)
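"One data pipeline per GPU" means each process reads only its own slice of the dataset. A minimal sketch of that sharding in plain Python follows. In a real Horovod job the `rank` and `size` arguments would come from `hvd.rank()` and `hvd.size()`; the `shard_for_rank` helper and the file names are hypothetical.

```python
def shard_for_rank(items, rank, size):
    """Return the strided slice of `items` owned by process `rank` of `size`."""
    if not 0 <= rank < size:
        raise ValueError(f"rank {rank} is outside 0..{size - 1}")
    return items[rank::size]


# Hypothetical file list; two DGX-1 nodes would run 16 processes, one per GPU.
files = [f"train-{i:05d}.tfrecord" for i in range(64)]
size = 16
shards = [shard_for_rank(files, rank, size) for rank in range(size)]

# Every file is read by exactly one process.
assert sorted(f for shard in shards for f in shard) == files
print(len(shards[0]), "files per process")
```

Because the shards are disjoint, all processes pull from shared storage at the same time without reading the same file twice within a pass over the data.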

SOFTWARE STACK TO SCALE OUT

NVIDIA GPU CLOUD (NGC)

Ready to scale

Optimized

MPI, Horovod

NCCL

ngc.nvidia.com

IBM PowerAI

ibmcom/powerai

Ready to scale

Optimized

hub.docker.com/r/ibmcom/powerai/

IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX

The Engine to Power Your AI Data Pipeline

HARDWARE

• NVIDIA DGX-1 | up to 9x DGX-1 Systems

• IBM Spectrum Scale NVMe Appliance | 40 GB/s per node, 120 GB/s in 6RU | 300 TB per node

• NETWORK: Mellanox SB7700 Switch | 2x EDR IB with RDMA

SOFTWARE

• NVIDIA DGX SOFTWARE STACK | NVIDIA Optimized Frameworks

• IBM: High performance, low latency, parallel file system

• IBM: Extensible and composable
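The storage and compute figures on this slide can be combined into a rough bandwidth budget. The sketch below is a back-of-the-envelope calculation, not a sizing method from the reference architecture. It assumes that the 120 GB/s figure is three 40 GB/s nodes added together, that the bandwidth is shared evenly across all nine DGX-1 systems, and that each DGX-1 holds 8 GPUs.

```python
NODE_GBYTE_PER_S = 40    # per Spectrum Scale NVMe node (slide)
RACK_GBYTE_PER_S = 120   # "120GB/s in 6RU" (slide)
NODE_CAPACITY_TB = 300   # per node (slide)
MAX_DGX1 = 9             # "up to 9x DGX-1 Systems" (slide)
GPUS_PER_DGX1 = 8        # Tesla V100 GPUs per DGX-1

# Assumption: the 6RU figure is the sum of identical nodes.
nodes = RACK_GBYTE_PER_S // NODE_GBYTE_PER_S
capacity_tb = nodes * NODE_CAPACITY_TB
total_gpus = MAX_DGX1 * GPUS_PER_DGX1

# Assumption: bandwidth is divided evenly among all consumers.
per_dgx_gbyte_per_s = RACK_GBYTE_PER_S / MAX_DGX1
per_gpu_gbyte_per_s = RACK_GBYTE_PER_S / total_gpus

print(f"{nodes} storage nodes, {capacity_tb} TB in total")
print(f"{per_dgx_gbyte_per_s:.2f} GB/s per DGX-1, {per_gpu_gbyte_per_s:.2f} GB/s per GPU")
```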

IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX: SCALABLE REFERENCE ARCHITECTURES

Scaling with NVIDIA DGX-1

• Start with a single IBM Spectrum Scale NVMe appliance and a single DGX-1

• Grow capacity in a cost-effective, modular approach

• Each config delivers balanced performance, capacity and scale

• The IBM Spectrum Scale NVMe all-flash appliance is power efficient, allowing maximum flexibility when designing rack space and addressing power requirements

Performance at Scale

For multiple DGX-1 servers, the IBM Spectrum Scale on NVMe architecture demonstrates linear scaling up to full saturation of all DGX-1 server GPUs.

The multi-DGX server image processing rates shown demonstrate scalability for the Inception-v4, ResNet-152, VGG-16, Inception-v3, ResNet-50, GoogLeNet and AlexNet models.

IBM STORAGE WITH NVIDIA DGX: FULLY-OPTIMIZED AND QUALIFIED
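A claim of linear scaling can be checked against measured image rates by comparing each multi-node rate with the single-node rate multiplied by the node count. The sketch below shows that calculation. The `scaling_efficiency` helper is hypothetical, and the sample rates are made-up placeholders, not results from the reference architecture.

```python
def scaling_efficiency(rates_by_nodes):
    """Map node count -> measured rate / (node count * single-node rate)."""
    base = rates_by_nodes[1]
    return {n: rate / (n * base) for n, rate in sorted(rates_by_nodes.items())}


# Placeholder images/s values for illustration only (not measured data).
example = {1: 1000.0, 2: 1980.0, 4: 3900.0}
for nodes, eff in scaling_efficiency(example).items():
    print(f"{nodes} node(s): {eff:.1%} of linear")
```

An efficiency close to 1.0 at every node count is what "linear" means here. A value that drops as nodes are added usually points to a bottleneck in storage, the network, or the input pipeline.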

BUSINESS IMPACT OF IBM SPECTRUM STORAGE FOR AI WITH NVIDIA DGX

THE IMPACT OF IBM STORAGE + NVIDIA DGX ON TIMELINE

2. Deploying an Integrated, Full-Stack AI Solution using DGX Systems

[Timeline figure: “DIY” TCO compared with DGX TCO (CAPEX) from Day 1 to Month 3, with these phase labels]

• Study & exploration

• Install and deploy DGX RA solution

• Platform design

• Troubleshooting

• Software engineering

• Software optimization

• Design and build for scale

• Software re-optimization

• Productive experimentation

• Training at scale

• Insights

DGX TCO: deployment cycle shortened

Wasted time/effort eliminated

THE IMPACT OF IBM STORAGE + NVIDIA DGX ON TIMELINE

2. Deploying an Integrated, Full-Stack AI Solution using DGX Systems

[Timeline figure: “DIY” TCO compared with DGX TCO (CAPEX) from Day 1 to Week 1, with these phase labels]

• Study & exploration

• Install and deploy DGX RA solution

• Productive experimentation

• Training at scale

• Insights

IBM & NVIDIA REFERENCE ARCHITECTURE

Validated design for deploying DGX at scale with IBM Storage

Download at https://bit.ly/2GcYbgO

Learn more about DGX RA Solutions at: https://bit.ly/2OpXYeC
