Internet Services
➢ Image/video classification
➢ Speech recognition
➢ Natural language processing
Medicine
➢ Cancer cell detection
➢ Diabetic grading
➢ Drug discovery
Media & Entertainment
➢ Video captioning
➢ Content-based search
➢ Real-time translation
Security & Defense
➢ Face recognition
➢ Video surveillance
➢ Cyber security
Autonomous Machines
➢ Pedestrian detection
➢ Lane tracking
➢ Traffic sign recognition
* Bidirectional bandwidth, GPU links: 40+40 GB/sec
PowerAI: World’s Fastest Platform for Enterprise
● Deep learning frameworks & building blocks
● Accelerated servers and infrastructure for scaling
Data centers near every major metro area enabling low-latency connectivity to cloud infrastructure.
Recently Announced!
Leveraging Cloud Computing for Deep Learning
Pipeline: Preprocess → Train → Validate/Test
● Advantages to training models in cloud
● Effective training on Rescale platform
● Leveraging IBM Cloud and P100s
Scaling Up - Training with On Demand GPUs
Pipeline: Preprocess → Train → Validate/Test, with the Train stage scaled up via:
● Higher-capacity GPUs
● Multi-GPU
● Multi-node
(Multi-GPU training sketch below.)
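
A minimal multi-GPU sketch in PyTorch, assuming a single node with several GPUs; torch.nn.DataParallel splits each batch across the visible devices. The model and batch size are illustrative placeholders, not from the slides:

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet152()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # split each batch across all GPUs
model = model.cuda()

criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One illustrative step on random data (stand-in for a real data loader).
images = torch.randn(64, 3, 224, 224).cuda()
labels = torch.randint(0, 1000, (64,)).cuda()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()

Multi-node training is shown later with TensorFlow Distributed.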
Cloud Provides Choice in Sizing GPU Resources
GPU resources   Workflow use case                              Examples
K80             Exploratory data analysis, debugging models    OpenCV, Numba, small batch sizes
P100            Big models, big inputs, big batches            ResNet152, batch size=64
P100s           Big + data parallel                            Google Brain Grasping Dataset
Scaling Out - Model Design Exploration
Each configuration runs the full Preprocess → Train → Validate/Test pipeline on its own cluster:
● ResNet101, batch size=128, learning rate=0.01
● ResNet101, batch size=256, learning rate=0.01
● ResNet152, batch size=256, learning rate=0.01
● ResNet152, batch size=128, learning rate=0.1
● . . .
Dynamically allocate many GPU clusters for large parameter sweeps (sweep sketch below).
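
A sketch of expanding such a sweep from a template; submit_training_job is a hypothetical placeholder for launching one on-demand GPU cluster per configuration, not a Rescale API:

import itertools

# Sweep template mirroring the configurations above.
grid = {
    "model": ["resnet101", "resnet152"],
    "batch_size": [128, 256],
    "learning_rate": [0.01, 0.1],
}

def submit_training_job(config):
    # Hypothetical: allocate a GPU cluster, run Preprocess -> Train ->
    # Validate/Test with `config`, tear the cluster down afterwards.
    print("would submit:", config)

for values in itertools.product(*grid.values()):
    submit_training_job(dict(zip(grid.keys(), values)))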
Cloud scalability will cut result turnaround time
[Figure: jobs J1–J4 over time, on-premise vs. cloud. On-premise, jobs queue on a fixed-size cluster and run one after another; in the cloud, the same jobs are submitted concurrently on right-sized GPU clusters (including multi-GPU where the model scales), so turnaround drops from the sum of the job runtimes to roughly the longest single job.]
Range of Storage for Large Datasets
Ordered from highest I/O performance (and cost, $$$) to lowest:
● Local storage or a distributed FS for the training set
● Object storage for the different data preparations in an experiment
● Archival (cold) storage for reproducibility and compliance
A centralized location breaks down silos, minimizes file transfer, and allows for data and tool connectivity across the data lifecycle: generated, stored, managed, processed, and shared.
● Cross-organization collaboration
● Efficient cross-geo transfer
Challenges: Data Transfer for On Demand Resources
On-premise → Cloud: bulk import and streaming updates land in Object Storage; data syncs to the cluster on training start and syncs back on cluster teardown.
Mitigations (transfer sketch below):
● Multithreaded transfer tools
● Streaming from object storage
● Distributed file systems (DFS)
● WAN accelerators
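
A hedged sketch of the multithreaded-transfer idea: parallel downloads from an S3-compatible object store with boto3 and a thread pool. The endpoint, bucket, key names, and local scratch path are made-up examples (IBM Cloud Object Storage exposes an S3-compatible API):

import concurrent.futures
import boto3

# Assumed S3-compatible endpoint and bucket; substitute real values.
s3 = boto3.client("s3", endpoint_url="https://s3.example-cos-endpoint.com")
BUCKET = "training-data"
keys = ["shard-%04d.tfrecord" % i for i in range(64)]

def fetch(key):
    # Each thread pulls one shard down to local scratch space.
    s3.download_file(BUCKET, key, "/scratch/" + key)
    return key

# Many transfers in flight at once help fill the WAN pipe.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for done in pool.map(fetch, keys):
        print("fetched", done)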
Leveraging Cloud Computing for Deep Learning
Pipeline: Preprocess → Train → Validate/Test
● Advantages to training models in cloud
● Effective training on Rescale platform
● Leveraging IBM Cloud and P100s
Rescale’s ScaleX on IBM Cloud: Turnkey DL Solution
Full-stack solution, seamless SaaS deployment:
● Users: engineers & data scientists, IT admins & managers, partners
● ScaleX SaaS: purpose-built portals, intuitive workflows
● ScaleX SW Library: 180+ turnkey SW solutions (open source, commercial ISV, custom in-house)
● ScaleX Platform: automated HPC IT deployment; seamless hybrid, multi-core environment
● IBM Cloud: bare metal servers and virtual servers, +/- customer’s existing on-premise HW
Rescale’s ScaleX on IBM Cloud: Turnkey DL Solution
Increasing accessibility, usability and utilization. The workflow on the same stack (ScaleX SaaS → SW Library → Platform → IBM Cloud bare metal and virtual servers, +/- customer’s existing on-premise HW):
1. SaaS access
2. Specify input file
3. Select SW
4. Select compute
5. Run job
Training Deep Learning Models on Rescale
● Case study: ILSVRC distributed TensorFlow training

imagenet_distributed_train --batch_size=64 --data_dir=$DATADIR --train_dir=out/train \
    --job_name=worker --task_id=1 \
    --ps_hosts="node1:2220...node16:2220" \
    --worker_hosts="node1:2222,node1:2223...node16:2240"
…
imagenet_distributed_train --batch_size=64 --data_dir=$DATADIR --train_dir=out/train \
    --job_name=ps --task_id=1 \
    --ps_hosts="node1:2220...node16:2220" \
    --worker_hosts="node1:2222,node1:2223...node16:2240"

● Data prep: tfrecords synced on job start to local storage
● Host strings: synthesized from machinefiles (see the sketch below)
● Worker launch: use mpirun to launch worker and parameter server processes
● Security: isolated cluster network
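
A sketch of synthesizing those --ps_hosts/--worker_hosts strings from an MPI machinefile; the file format (one hostname per line) and the port layout are assumptions that mirror the command above:

# Build TensorFlow distributed host strings from a machinefile:
# one parameter server per node on port 2220, workers from 2222 up.
def host_strings(machinefile, workers_per_node=2):
    with open(machinefile) as f:
        nodes = [line.split()[0] for line in f if line.strip()]
    ps_hosts = ",".join("%s:2220" % n for n in nodes)
    worker_hosts = ",".join(
        "%s:%d" % (n, 2222 + i)
        for n in nodes
        for i in range(workers_per_node)
    )
    return ps_hosts, worker_hosts

ps_hosts, worker_hosts = host_strings("machinefile")
print("--ps_hosts=" + ps_hosts)
print("--worker_hosts=" + worker_hosts)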
Scale out: Hyperparameter Search as a Service (GTC2016)
From a hyperparameter template, spawn shared-nothing clusters, evaluate configurations in parallel, and summarize the results (fan-out sketch below). Each cluster runs the full Preprocess → Train → Validate/Test pipeline for one configuration:
● ResNet101, batch size=128, learning rate=0.01
● ResNet101, batch size=256, learning rate=0.01
● ResNet152, batch size=256, learning rate=0.01
● ResNet152, batch size=128, learning rate=0.1
● . . .
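
A minimal sketch of the fan-out/summarize pattern; launch_cluster_and_train is a hypothetical placeholder for spawning a shared-nothing cluster and returning a validation metric:

import concurrent.futures

configs = [
    {"model": "resnet101", "batch_size": 128, "lr": 0.01},
    {"model": "resnet101", "batch_size": 256, "lr": 0.01},
    {"model": "resnet152", "batch_size": 256, "lr": 0.01},
    {"model": "resnet152", "batch_size": 128, "lr": 0.1},
]

def launch_cluster_and_train(config):
    # Hypothetical: run the full pipeline on its own cluster and
    # return validation precision. Dummy value for illustration.
    return 0.0

# Evaluate every configuration in parallel, then summarize.
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = list(pool.map(launch_cluster_and_train, configs))
best = max(zip(results, configs), key=lambda pair: pair[0])
print("best:", best)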
Scale out: Bayesian Optimization as a Service (GTC2016)
A hyperparameter optimizer (Spearmint, SMAC) starts from a template and proposes configurations, e.g. batch size=128/LR=0.01, batch size=128/LR=0.1, batch size=64/LR=0.1, batch size=256/LR=0.1, . . . Each proposal runs the full Preprocess → Train → Validate/Test pipeline on its own cluster, and the measured results, e.g. precision=0.23/runtime=145500, precision=0.30/runtime=105500, precision=0.35/runtime=200566, precision=0.20/runtime=97000, are fed back to the optimizer to guide the next proposals. (Loop sketch below.)
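
A hedged sketch of that loop using scikit-optimize's ask/tell interface as a stand-in for Spearmint/SMAC, collapsing the slide's two outputs (precision, runtime) into a single objective for simplicity; run_training is a hypothetical placeholder:

from skopt import Optimizer
from skopt.space import Categorical, Real

# Search space mirroring the knobs on the slide.
space = [Categorical([64, 128, 256], name="batch_size"),
         Real(1e-3, 1e-1, prior="log-uniform", name="lr")]
opt = Optimizer(space)

def run_training(batch_size, lr):
    # Hypothetical: train on a fresh cluster and return a loss to
    # minimize, e.g. 1 - precision. Dummy value for illustration.
    return 0.75

for _ in range(8):
    batch_size, lr = opt.ask()           # optimizer proposes a config
    loss = run_training(batch_size, lr)  # evaluate it on its own cluster
    opt.tell([batch_size, lr], loss)     # feed the result back
print("best observed loss:", min(opt.yi))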
Leveraging Cloud Computing for Deep Learning
Pipeline: Preprocess → Train → Validate/Test
● Advantages to training models in cloud
● Effective training on Rescale platform
● Leveraging IBM Cloud and P100s
Deep Learning on IBM Cloud and Rescale
Rescale:
● Dataset and code management
● Workflows and job management
● Design of experiments / optimization
● Persistent cluster
● Software libraries
● Web portal, API, CLI
● Hybrid cluster management
IBM Cloud:
● Bare metal servers (P100s and K80s)
● Cloud Object Storage
Scale Up to P100s on IBM Cloud
● TensorFlow InceptionV3
● ~½ training time compared to previous-generation K80s
IBM Bare Metal Networking + P100s = Fast Multi-Node
● Multi-node TensorFlow Distributed (InceptionV3)
● ~1.3x faster vs. 4x more K80s on a competing provider
PyTorch Large Model Training
● P100s enable larger batch sizes for big networks (memory-probe sketch below)
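
As a hedged illustration of why the P100's larger memory matters, a PyTorch probe that doubles the batch size until the GPU runs out of memory; the model is a placeholder and the exact fitting batch depends on the card (16 GB P100 vs. 12 GB per K80 GPU):

import torch
from torchvision import models

model = models.resnet152().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch = 16
while True:
    try:
        x = torch.randn(batch, 3, 224, 224, device="cuda")
        model(x).sum().backward()   # full training-style memory use
        optimizer.step()
        optimizer.zero_grad()
        print("batch size", batch, "fits")
        batch *= 2
    except RuntimeError:            # CUDA out of memory
        print("batch size", batch, "does not fit")
        break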
XGBoost - Boosted Tree Construction
● P100 >1.25x faster than CPU and K80
● K80 acceleration does not provide a benefit over CPU
● P100 1.5x faster than CPU
(GPU tree construction sketched below.)
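
A sketch of enabling GPU tree construction in XGBoost via the gpu_hist tree method (present in XGBoost releases of this era); random data stands in for a real training set:

import numpy as np
import xgboost as xgb

# Random data standing in for a real tabular training set.
X = np.random.rand(100000, 50)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# 'gpu_hist' builds split histograms on the GPU; 'hist' is the CPU analogue.
params = {"objective": "binary:logistic",
          "tree_method": "gpu_hist",
          "max_depth": 8}
booster = xgb.train(params, dtrain, num_boost_round=100)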
Streaming Training Data from Object Storage
● File system loads images on demand from IBM Object Storage
● Large blocks (128 MB) ensure efficient download
● Local cache sized to hold the entire dataset, so subsequent epochs read locally
Timeline: an upfront sync of the 150 GB dataset takes 25 minutes; with streaming, the first epoch trains on streaming data and later epochs train on cached data. (Cache-on-read sketch below.)
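
A minimal sketch of the cache-on-first-read idea; get_object is a hypothetical stand-in for an IBM Cloud Object Storage GET, and a real deployment would use a FUSE file system or similar caching layer instead:

import os

CACHE_DIR = "/scratch/cache"  # sized to hold the entire dataset

def get_object(key):
    # Hypothetical large-block (128 MB) GET from object storage.
    raise NotImplementedError

def read(key):
    # Serve from the local cache; on a miss, stream and cache the object.
    path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if not os.path.exists(path):       # first epoch: streaming read
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            f.write(get_object(key))
    with open(path, "rb") as f:        # later epochs: local read
        return f.read()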
Questions?
Rescale
● Mark Whitney, Head of Deep Learning, [email protected]
● Tyler Smith, Head of Partnerships, [email protected]
IBM Cloud
● Jerry Gutierrez, Global HPC Sales Lead, [email protected]
● Casey Knott, IBM Cloud Platform Specialist, [email protected]
Rescale on IBM Cloud
● http://www.rescale.com/ibm/