Transcript: IBM, NVIDIA, and Client under NDA

Page 1:

Page 2:


Page 3:

Internet Services
➢ Image/Video classification
➢ Speech recognition
➢ Natural language processing

Medicine
➢ Cancer cell detection
➢ Diabetic grading
➢ Drug discovery

Media & Entertainment
➢ Video captioning
➢ Content-based search
➢ Real-time translation

Security & Defense
➢ Face recognition
➢ Video surveillance
➢ Cyber security

Autonomous Machines
➢ Pedestrian detection
➢ Lane tracking
➢ Recognize traffic signs

Page 4:

* Bidirectional bandwidths, GPU links 40+40 GB/sec

Page 5:

PowerAI: World’s Fastest Platform for Enterprise

Page 6:

PowerAI: World’s Fastest Platform for Enterprise

Deep Learning Frameworks & Building Blocks
Accelerated Servers and Infrastructure for Scaling

Page 7:

Page 8:

Data centers near every major metro area enabling low-latency connectivity to cloud infrastructure.

Page 9:

Recently Announced!

Page 10:

Leveraging Cloud Computing for Deep Learning

Pipeline: Preprocess → Train → Validate/Test

● Advantages of training models in the cloud
● Effective training on the Rescale platform
● Leveraging IBM Cloud and P100s

Page 11:

Leveraging Cloud Computing for Deep Learning

Page 12:

Scaling Up - Training with On-Demand GPUs

Pipeline: Preprocess → Train → Validate/Test

● Higher-capacity GPUs
● Multi-GPU (sketched below)
● Multi-node
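
A minimal sketch of the multi-GPU option above, assuming PyTorch and torchvision are available (the slide does not prescribe a framework, so this is purely illustrative): torch.nn.DataParallel replicates the model and splits each batch across the visible GPUs.

    # Minimal multi-GPU data-parallel training step (illustrative; assumes PyTorch + torchvision).
    import torch
    import torch.nn as nn
    import torchvision.models as models

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = models.resnet50(num_classes=1000)
    if torch.cuda.device_count() > 1:
        # Replicate the model and split each input batch across all visible GPUs.
        model = nn.DataParallel(model)
    model = model.to(device)

    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

    # Dummy batch standing in for a real data loader.
    images = torch.randn(256, 3, 224, 224, device=device)
    labels = torch.randint(0, 1000, (256,), device=device)

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    print("step done, loss =", loss.item())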

Page 13:

Cloud Provides Choice in Sizing GPU Resources

Workflow use case                            | GPU resources  | Examples
Exploratory data analysis, debugging models  | K80            | OpenCV, Numba, small batch sizes
Big models, big inputs, big batches          | P100           | ResNet152, batch size=64
Big + data parallel                          | multiple P100s | Google Brain Grasping Dataset

Page 14:

Scaling Out - Model Design Exploration

Parallel runs (each: Preprocess → Train → Validate/Test):
● ResNet101, batch size=128, learning rate=0.01
● ResNet101, batch size=256, learning rate=0.01
● ResNet152, batch size=256, learning rate=0.01
● ResNet152, batch size=128, learning rate=0.1
● . . .

Dynamically allocate many GPU clusters for large parameter sweeps
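
Such a sweep amounts to expanding a template over the hyperparameter grid and launching one cluster per combination. A sketch of the expansion step, where the model, batch-size, and learning-rate values mirror the slide and the job-spec structure and submission call are illustrative assumptions:

    # Expand a hyperparameter template into one job spec per (model, batch size, learning rate)
    # combination; each spec would then be submitted to its own on-demand GPU cluster.
    from itertools import product

    template = {
        "model": ["ResNet101", "ResNet152"],
        "batch_size": [128, 256],
        "learning_rate": [0.01, 0.1],
    }

    jobs = [
        {"model": m, "batch_size": b, "learning_rate": lr,
         "stages": ["preprocess", "train", "validate"]}
        for m, b, lr in product(template["model"], template["batch_size"], template["learning_rate"])
    ]

    for job in jobs:
        # Submission is left abstract here; print stands in for a scheduler/API call.
        print("would launch cluster for:", job)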

Page 15:

Cloud scalability will cut result turnaround time

[Figure: on-premise, jobs J1-J4 wait in a queue for a fixed-size GPU resource and run one after another; with cloud, the same jobs are submitted and run concurrently on dynamically allocated GPUs (including multi-GPU runs for model scalability), so turnaround approaches the time of the longest single job.]

Page 16:

Range of Storage for Large Datasets

● Local storage or distributed FS for the training set
● Object storage for the different data preparations in an experiment
● Archival (cold) storage for reproducibility and compliance

Trade-off across these tiers: I/O performance and $$$

Page 17:

Centralized location to break silos, minimize file transfer, and allow for data and tool connectivity

Data is Generated → Stored → Managed → Processed → Shared

● Cross-organization collaboration
● Efficient cross-geo transfer

Page 18:

Challenges: Data Transfer for On-Demand Resources (On-premise ↔ Cloud)

● On-premise → Object Storage: bulk import, streaming updates
● Object Storage → cluster: sync to cluster on training start
● Cluster → Object Storage: sync back on cluster teardown

Page 19:

Challenges: Data Transfer for On-Demand Resources (On-premise ↔ Cloud)

Approaches:
● Multithreaded transfer tools (sketched below)
● Streaming from Object Storage
● DFS
● WAN accelerators
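
A sketch of the multithreaded-transfer idea, assuming the object store exposes an S3-compatible API (IBM Cloud Object Storage does) and boto3 is available; the endpoint, bucket, prefix, and destination paths are placeholders, not values from the presentation:

    # Pull a dataset prefix from S3-compatible object storage with several download threads.
    import os
    from concurrent.futures import ThreadPoolExecutor

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example-objectstore.net")  # placeholder endpoint

    BUCKET = "training-data"        # placeholder bucket
    PREFIX = "imagenet/tfrecords"   # placeholder prefix
    DEST = "/scratch/dataset"

    def download(key):
        local_path = os.path.join(DEST, os.path.relpath(key, PREFIX))
        os.makedirs(os.path.dirname(local_path), exist_ok=True)
        s3.download_file(BUCKET, key, local_path)
        return key

    keys = [
        obj["Key"]
        for page in s3.get_paginator("list_objects_v2").paginate(Bucket=BUCKET, Prefix=PREFIX)
        for obj in page.get("Contents", [])
    ]

    # Keeping many transfers in flight hides per-object latency over the WAN.
    with ThreadPoolExecutor(max_workers=16) as pool:
        for key in pool.map(download, keys):
            print("synced", key)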

Page 20:

Leveraging Cloud Computing for Deep Learning

Page 21:

Rescale’s ScaleX on IBM Cloud: Turnkey DL Solution
Full-stack solution, seamless SaaS deployment

● ScaleX SaaS: purpose-built portals, intuitive workflows (engineers & data scientists, IT admins & managers, partners)
● ScaleX SW Library: 180+ turn-key SW solutions (open source, commercial ISV, custom in-house)
● ScaleX Platform: automated HPC IT deployment; seamless hybrid, multi-core environment
● IBM Cloud: bare metal servers, virtual servers (+/- customer’s existing on-premise HW)

Page 22:

Rescale’s ScaleX on IBM Cloud: Turnkey DL Solution
Increasing accessibility, usability, and utilization

● ScaleX SaaS: purpose-built portals, intuitive workflows (engineers & data scientists, IT admins & managers, partners)
● ScaleX SW Library: 180+ turn-key SW solutions (open source, commercial ISV, custom in-house)
● ScaleX Platform: automated HPC IT deployment; seamless hybrid, multi-core environment
● IBM Cloud: bare metal servers, virtual servers (+/- customer’s existing on-premise HW)

Workflow: 1. SaaS access → 2. Specify input file → 3. Select SW → 4. Select compute → 5. Run job

Page 23:

Training Deep Learning Models on Rescale
● Case study: ILSVRC distributed TensorFlow training

imagenet_distributed_train --batch_size=64 --data_dir=$DATADIR --train_dir=out/train \
  --job_name=worker --task_id=1 \
  --ps_hosts="node1:2220...node16:2220" \
  --worker_hosts="node1:2222,node1:2223...node16:2240"
…
imagenet_distributed_train --batch_size=64 --data_dir=$DATADIR --train_dir=out/train \
  --job_name=ps --task_id=1 \
  --ps_hosts="node1:2220...node16:2220" \
  --worker_hosts="node1:2222,node1:2223...node16:2240"

● Data prep: tfrecords synced to local storage on job start
● Host strings: synthesized from machinefiles (see the sketch below)
● Worker launch: use mpirun to launch workers and parameter servers
● Security: isolated cluster network
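
As a sketch of how the host strings above could be synthesized from an MPI machinefile before mpirun launches the processes; the file name, port numbers, and workers-per-node count are illustrative assumptions, not the presenters' exact script:

    # Build --ps_hosts / --worker_hosts strings for distributed TensorFlow from a machinefile:
    # one parameter server per node (port 2220) and several workers per node (ports 2222+).
    WORKERS_PER_NODE = 4  # assumed value

    with open("machinefile") as f:            # assumed machinefile path
        nodes = [line.split()[0] for line in f if line.strip()]

    ps_hosts = ",".join(f"{n}:2220" for n in nodes)
    worker_hosts = ",".join(f"{n}:{2222 + i}" for n in nodes for i in range(WORKERS_PER_NODE))

    print(f'--ps_hosts="{ps_hosts}" --worker_hosts="{worker_hosts}"')
    # Each imagenet_distributed_train process (launched, e.g., via mpirun) receives these
    # strings plus its own --job_name and --task_id.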

Page 24:

Scale out: Hyperparameter Search as a Service (GTC2016)

Hyperparameter template expanded into parallel runs (each: Preprocess → Train → Validate/Test):
● ResNet101, batch size=128, learning rate=0.01
● ResNet101, batch size=256, learning rate=0.01
● ResNet152, batch size=256, learning rate=0.01
● ResNet152, batch size=128, learning rate=0.1
● . . .

Spawn shared-nothing clusters, evaluate in parallel, summarize results.
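
A sketch of the evaluate-in-parallel / summarize step, assuming each configuration can be driven by a single Python call; run_config() is a placeholder for submitting a training job to its own shared-nothing cluster and collecting its metrics:

    # Evaluate hyperparameter configurations in parallel and summarize the results.
    from concurrent.futures import ThreadPoolExecutor

    configs = [
        {"model": "ResNet101", "batch_size": 128, "lr": 0.01},
        {"model": "ResNet101", "batch_size": 256, "lr": 0.01},
        {"model": "ResNet152", "batch_size": 256, "lr": 0.01},
        {"model": "ResNet152", "batch_size": 128, "lr": 0.1},
    ]

    def run_config(cfg):
        # Placeholder: submit preprocess/train/validate for cfg, block until it finishes,
        # and return the metrics reported by the job.
        return {"config": cfg, "precision": 0.0, "runtime_s": 0}

    with ThreadPoolExecutor(max_workers=len(configs)) as pool:
        results = list(pool.map(run_config, configs))

    best = max(results, key=lambda r: r["precision"])
    print("best configuration:", best["config"])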

Page 25:

Scale out: Bayesian Optimization as a Service (GTC2016)

A hyperparameter optimizer (Spearmint, SMAC) fills a template and launches runs (each: Preprocess → Train → Validate/Test):
● Batch size=128, LR=0.01
● Batch size=128, LR=0.1
● Batch size=64, LR=0.1
● Batch size=256, LR=0.1
● . . .

Results reported back per run: Precision=0.23, Runtime=145500; Precision=0.30, Runtime=105500; Precision=0.35, Runtime=200566; Precision=0.20, Runtime=97000
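
Spearmint and SMAC each have their own job interface; as an illustrative stand-in for the loop on this slide, here is a similar suggest-evaluate cycle with scikit-optimize, where train_and_score() is a placeholder for filling the template and running one cloud training job:

    # Bayesian optimization over batch size and learning rate (sketch using scikit-optimize;
    # Spearmint or SMAC would play the role of gp_minimize in the presented setup).
    from skopt import gp_minimize
    from skopt.space import Categorical, Real

    space = [
        Categorical([64, 128, 256], name="batch_size"),
        Real(1e-3, 1e-1, prior="log-uniform", name="learning_rate"),
    ]

    def train_and_score(params):
        batch_size, lr = params
        # Placeholder: fill the job template with (batch_size, lr), run it in the cloud,
        # and return the value to minimize (e.g. negative validation precision).
        return -0.25 + 0.01 * lr  # dummy objective so the sketch runs end to end

    result = gp_minimize(train_and_score, space, n_calls=12, random_state=0)
    print("best hyperparameters:", result.x, "objective:", result.fun)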

Page 26:

Leveraging Cloud Computing for Deep Learning

Page 27:

Deep Learning on IBM Cloud and Rescale

Rescale:
● Web portal, API, CLI
● Dataset and code management
● Workflows and job management
● Design of experiments / optimization
● Persistent cluster
● Software libraries
● Hybrid cluster management

IBM Cloud:
● Bare metal servers (P100s and K80s)
● Cloud Object Storage

Page 28:

Scale Up to P100s on IBM Cloud

● TensorFlow InceptionV3
● ~½ training time compared to previous-generation K80s

Page 29:

IBM Bare Metal Networking + P100s = Fast Multi-Node

● Multi-node distributed TensorFlow (InceptionV3)
● ~1.3x faster vs. 4x more K80s on a competing provider

Page 30:

PyTorch Large Model Training

● P100s enable larger batch sizes for big networks
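
A minimal sketch of the memory-headroom point, assuming PyTorch/torchvision and a single GPU: one training step of a deep network with a batch size chosen for a 16 GB P100, with the peak GPU memory printed afterwards (the batch size and model are illustrative, not the presenters' benchmark):

    # One training step of a large network with a big batch; report peak GPU memory.
    # Assumes a CUDA-capable GPU is present.
    import torch
    import torch.nn as nn
    import torchvision.models as models

    device = torch.device("cuda")
    model = models.resnet152(num_classes=1000).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    criterion = nn.CrossEntropyLoss()

    images = torch.randn(64, 3, 224, 224, device=device)   # illustrative batch size
    labels = torch.randint(0, 1000, (64,), device=device)

    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()

    print(f"peak GPU memory: {torch.cuda.max_memory_allocated() / 2**30:.1f} GiB")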

Page 31:

XGBoost - Boosted Tree Construction

● P100 >1.25x faster than CPU and K80
● K80 acceleration does not provide a benefit over CPU
● P100 1.5x faster than CPU
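
A sketch of GPU-accelerated tree construction in XGBoost; tree_method='gpu_hist' is the flag used by XGBoost releases of that era (newer versions use tree_method='hist' with device='cuda'), and the synthetic data is just a placeholder for a real tabular dataset:

    # Train a boosted-tree model on the GPU with XGBoost's histogram tree builder.
    import numpy as np
    import xgboost as xgb

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100_000, 50))                                   # synthetic features
    y = (X[:, 0] + rng.normal(scale=0.5, size=100_000) > 0).astype(int)  # synthetic labels

    dtrain = xgb.DMatrix(X, label=y)
    params = {
        "objective": "binary:logistic",
        "max_depth": 8,
        "tree_method": "gpu_hist",  # newer XGBoost: tree_method="hist", device="cuda"
    }
    booster = xgb.train(params, dtrain, num_boost_round=200)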

Page 32:

Streaming Training Data from Object Storage

Timing comparison: data sync (25 minutes, 150 GB) vs. training on streaming data vs. training on cached data

● File system loads images on demand from IBM Object Storage
● Large blocks (128 MB) to ensure efficient download
● Local cache sized to hold the entire dataset so subsequent epochs are local
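
A sketch of the load-on-demand-with-local-cache behavior described above, assuming an S3-compatible interface to IBM Cloud Object Storage via boto3; the endpoint, bucket, key, and cache paths are placeholders. The first epoch streams each object over the network, and later epochs hit the local cache:

    # Fetch training objects lazily from object storage, caching them on local disk.
    import os

    import boto3

    s3 = boto3.client("s3", endpoint_url="https://s3.example-objectstore.net")  # placeholder
    BUCKET, CACHE_DIR = "training-data", "/scratch/cache"                        # placeholders

    def fetch(key):
        """Return a local path for `key`, downloading it on first use."""
        local_path = os.path.join(CACHE_DIR, key)
        if not os.path.exists(local_path):
            os.makedirs(os.path.dirname(local_path), exist_ok=True)
            s3.download_file(BUCKET, key, local_path)   # streamed from object storage
        return local_path                               # later epochs are local reads

    # Usage: the data loader resolves each record's key through fetch() before decoding it.
    path = fetch("imagenet/train/shard-00001.tfrecord")   # placeholder key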

Page 33:

Questions?

Rescale
● Mark Whitney, Head of Deep Learning, [email protected]
● Tyler Smith, Head of Partnerships, [email protected]

IBM Cloud
● Jerry Gutierrez, Global HPC Sales Lead, [email protected]
● Casey Knott, IBM Cloud Platform Specialist, [email protected]

Rescale on IBM Cloud
● http://www.rescale.com/ibm/

Page 34: