Internet Services
➢ Image/video classification
➢ Speech recognition
➢ Natural language processing
Medicine
➢ Cancer cell detection
➢ Diabetic grading
➢ Drug discovery
Media & Entertainment
➢ Video captioning
➢ Content-based search
➢ Real-time translation
Security & Defense
➢ Face recognition
➢ Video surveillance
➢ Cyber security
Autonomous Machines
➢ Pedestrian detection
➢ Lane tracking
➢ Traffic sign recognition
* Bidirectional bandwidth, GPU links: 40+40 GB/sec
PowerAI: World’s Fastest Platform for Enterprise
● Deep learning frameworks & building blocks
● Accelerated servers and infrastructure for scaling
Data centers near every major metro area enabling low-latency connectivity to cloud infrastructure.
Recently Announced!
Leveraging Cloud Computing for Deep Learning
Pipeline: Preprocess → Train → Validate/Test
● Advantages to training models in cloud
● Effective training on Rescale platform
● Leveraging IBM Cloud and P100s
Scaling Up - Training with On Demand GPUs
Pipeline: Preprocess → Train → Validate/Test, with the Train stage scaled up via:
● Higher-capacity GPUs
● Multi-GPU
● Multi-node
(Multi-GPU training sketch below.)
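
A minimal multi-GPU sketch in PyTorch, assuming a single node with several GPUs; torch.nn.DataParallel splits each batch across the visible devices. The model and batch size are illustrative placeholders, not from the slides:

import torch
import torch.nn as nn
from torchvision import models

model = models.resnet152()
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # split each batch across all GPUs
model = model.cuda()

criterion = nn.CrossEntropyLoss().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# One illustrative step on random data (stand-in for a real data loader).
images = torch.randn(64, 3, 224, 224).cuda()
labels = torch.randint(0, 1000, (64,)).cuda()
optimizer.zero_grad()
loss = criterion(model(images), labels)
loss.backward()
optimizer.step()

Multi-node training is shown later with TensorFlow Distributed.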
Cloud Provides Choice in Sizing GPU Resources
GPU resources   Workflow use case                              Examples
K80             Exploratory data analysis, debugging models    OpenCV, Numba, small batch sizes
P100            Big models, big inputs, big batches            ResNet152, batch size=64
P100s           Big + data parallel                            Google Brain Grasping Dataset
Scaling Out - Model Design Exploration
Each configuration runs the full Preprocess → Train → Validate/Test pipeline on its own cluster:
● ResNet101, batch size=128, learning rate=0.01
● ResNet101, batch size=256, learning rate=0.01
● ResNet152, batch size=256, learning rate=0.01
● ResNet152, batch size=128, learning rate=0.1
● . . .
Dynamically allocate many GPU clusters for large parameter sweeps (sweep sketch below).
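
A sketch of expanding such a sweep from a template; submit_training_job is a hypothetical placeholder for launching one on-demand GPU cluster per configuration, not a Rescale API:

import itertools

# Sweep template mirroring the configurations above.
grid = {
    "model": ["resnet101", "resnet152"],
    "batch_size": [128, 256],
    "learning_rate": [0.01, 0.1],
}

def submit_training_job(config):
    # Hypothetical: allocate a GPU cluster, run Preprocess -> Train ->
    # Validate/Test with `config`, tear the cluster down afterwards.
    print("would submit:", config)

for values in itertools.product(*grid.values()):
    submit_training_job(dict(zip(grid.keys(), values)))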
Cloud scalability will cut result turnaround time
[Figure: jobs J1–J4 over time, on-premise vs. cloud. On-premise, jobs queue on a fixed-size cluster and run one after another; in the cloud, the same jobs are submitted concurrently on right-sized GPU clusters (including multi-GPU where the model scales), so turnaround drops from the sum of the job runtimes to roughly the longest single job.]
Range of Storage for Large Datasets
Ordered from highest I/O performance (and cost, $$$) to lowest:
● Local storage or a distributed FS for the training set
● Object storage for the different data preparations in an experiment
● Archival (cold) storage for reproducibility and compliance
A centralized location breaks down silos, minimizes file transfer, and allows for data and tool connectivity across the data lifecycle: generated, stored, managed, processed, and shared.
● Cross-organization collaboration
● Efficient cross-geo transfer
Challenges: Data Transfer for On Demand Resources
On-premise → Cloud: bulk import and streaming updates land in Object Storage; data syncs to the cluster on training start and syncs back on cluster teardown.
Mitigations (transfer sketch below):
● Multithreaded transfer tools
● Streaming from object storage
● Distributed file systems (DFS)
● WAN accelerators
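
A hedged sketch of the multithreaded-transfer idea: parallel downloads from an S3-compatible object store with boto3 and a thread pool. The endpoint, bucket, key names, and local scratch path are made-up examples (IBM Cloud Object Storage exposes an S3-compatible API):

import concurrent.futures
import boto3

# Assumed S3-compatible endpoint and bucket; substitute real values.
s3 = boto3.client("s3", endpoint_url="https://s3.example-cos-endpoint.com")
BUCKET = "training-data"
keys = ["shard-%04d.tfrecord" % i for i in range(64)]

def fetch(key):
    # Each thread pulls one shard down to local scratch space.
    s3.download_file(BUCKET, key, "/scratch/" + key)
    return key

# Many transfers in flight at once help fill the WAN pipe.
with concurrent.futures.ThreadPoolExecutor(max_workers=16) as pool:
    for done in pool.map(fetch, keys):
        print("fetched", done)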
Leveraging Cloud Computing for Deep Learning
Pipeline: Preprocess → Train → Validate/Test
● Advantages to training models in cloud
● Effective training on Rescale platform
● Leveraging IBM Cloud and P100s
Rescale’s ScaleX on IBM Cloud: Turnkey DL Solution
Full-stack solution, seamless SaaS deployment:
● Users: engineers & data scientists, IT admins & managers, partners
● ScaleX SaaS: purpose-built portals, intuitive workflows
● ScaleX SW Library: 180+ turnkey SW solutions (open source, commercial ISV, custom in-house)
● ScaleX Platform: automated HPC IT deployment; seamless hybrid, multi-core environment
● IBM Cloud: bare metal servers and virtual servers, +/- customer’s existing on-premise HW
Rescale’s ScaleX on IBM Cloud: Turnkey DL Solution
Increasing accessibility, usability and utilization. The workflow on the same stack (ScaleX SaaS → SW Library → Platform → IBM Cloud bare metal and virtual servers, +/- customer’s existing on-premise HW):
1. SaaS access
2. Specify input file
3. Select SW
4. Select compute
5. Run job
Training Deep Learning Models on Rescale
● Case study: ILSVRC distributed TensorFlow training

imagenet_distributed_train --batch_size=64 --data_dir=$DATADIR --train_dir=out/train \
    --job_name=worker --task_id=1 \
    --ps_hosts="node1:2220...node16:2220" \
    --worker_hosts="node1:2222,node1:2223...node16:2240"
…
imagenet_distributed_train --batch_size=64 --data_dir=$DATADIR --train_dir=out/train \
    --job_name=ps --task_id=1 \
    --ps_hosts="node1:2220...node16:2220" \
    --worker_hosts="node1:2222,node1:2223...node16:2240"

● Data prep: tfrecords synced on job start to local storage
● Host strings: synthesized from machinefiles (see the sketch below)
● Worker launch: use mpirun to launch worker and parameter server processes
● Security: isolated cluster network
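
A sketch of synthesizing those --ps_hosts/--worker_hosts strings from an MPI machinefile; the file format (one hostname per line) and the port layout are assumptions that mirror the command above:

# Build TensorFlow distributed host strings from a machinefile:
# one parameter server per node on port 2220, workers from 2222 up.
def host_strings(machinefile, workers_per_node=2):
    with open(machinefile) as f:
        nodes = [line.split()[0] for line in f if line.strip()]
    ps_hosts = ",".join("%s:2220" % n for n in nodes)
    worker_hosts = ",".join(
        "%s:%d" % (n, 2222 + i)
        for n in nodes
        for i in range(workers_per_node)
    )
    return ps_hosts, worker_hosts

ps_hosts, worker_hosts = host_strings("machinefile")
print("--ps_hosts=" + ps_hosts)
print("--worker_hosts=" + worker_hosts)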
Scale out: Hyperparameter Search as a Service (GTC2016)
From a hyperparameter template, spawn shared-nothing clusters, evaluate configurations in parallel, and summarize the results (fan-out sketch below). Each cluster runs the full Preprocess → Train → Validate/Test pipeline for one configuration:
● ResNet101, batch size=128, learning rate=0.01
● ResNet101, batch size=256, learning rate=0.01
● ResNet152, batch size=256, learning rate=0.01
● ResNet152, batch size=128, learning rate=0.1
● . . .
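
A minimal sketch of the fan-out/summarize pattern; launch_cluster_and_train is a hypothetical placeholder for spawning a shared-nothing cluster and returning a validation metric:

import concurrent.futures

configs = [
    {"model": "resnet101", "batch_size": 128, "lr": 0.01},
    {"model": "resnet101", "batch_size": 256, "lr": 0.01},
    {"model": "resnet152", "batch_size": 256, "lr": 0.01},
    {"model": "resnet152", "batch_size": 128, "lr": 0.1},
]

def launch_cluster_and_train(config):
    # Hypothetical: run the full pipeline on its own cluster and
    # return validation precision. Dummy value for illustration.
    return 0.0

# Evaluate every configuration in parallel, then summarize.
with concurrent.futures.ThreadPoolExecutor() as pool:
    results = list(pool.map(launch_cluster_and_train, configs))
best = max(zip(results, configs), key=lambda pair: pair[0])
print("best:", best)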
Scale out: Bayesian Optimization as a Service (GTC2016)
A hyperparameter optimizer (Spearmint, SMAC) starts from a template and proposes configurations, e.g. batch size=128/LR=0.01, batch size=128/LR=0.1, batch size=64/LR=0.1, batch size=256/LR=0.1, . . . Each proposal runs the full Preprocess → Train → Validate/Test pipeline on its own cluster, and the measured results, e.g. precision=0.23/runtime=145500, precision=0.30/runtime=105500, precision=0.35/runtime=200566, precision=0.20/runtime=97000, are fed back to the optimizer to guide the next proposals. (Loop sketch below.)
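
A hedged sketch of that loop using scikit-optimize's ask/tell interface as a stand-in for Spearmint/SMAC, collapsing the slide's two outputs (precision, runtime) into a single objective for simplicity; run_training is a hypothetical placeholder:

from skopt import Optimizer
from skopt.space import Categorical, Real

# Search space mirroring the knobs on the slide.
space = [Categorical([64, 128, 256], name="batch_size"),
         Real(1e-3, 1e-1, prior="log-uniform", name="lr")]
opt = Optimizer(space)

def run_training(batch_size, lr):
    # Hypothetical: train on a fresh cluster and return a loss to
    # minimize, e.g. 1 - precision. Dummy value for illustration.
    return 0.75

for _ in range(8):
    batch_size, lr = opt.ask()           # optimizer proposes a config
    loss = run_training(batch_size, lr)  # evaluate it on its own cluster
    opt.tell([batch_size, lr], loss)     # feed the result back
print("best observed loss:", min(opt.yi))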
Leveraging Cloud Computing for Deep Learning
Pipeline: Preprocess → Train → Validate/Test
● Advantages to training models in cloud
● Effective training on Rescale platform
● Leveraging IBM Cloud and P100s
Deep Learning on IBM Cloud and Rescale
Rescale:
● Dataset and code management
● Workflows and job management
● Design of experiments / optimization
● Persistent cluster
● Software libraries
● Web portal, API, CLI
● Hybrid cluster management
IBM Cloud:
● Bare metal servers (P100s and K80s)
● Cloud Object Storage
Scale Up to P100s on IBM Cloud
● TensorFlow InceptionV3
● ~½ training time compared to previous-generation K80s
IBM Bare Metal Networking + P100s = Fast Multi-Node
● Multi-node TensorFlow Distributed (InceptionV3)
● ~1.3x faster vs. 4x more K80s on a competing provider
PyTorch Large Model Training
● P100s enable larger batch sizes for big networks (memory-probe sketch below)
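
As a hedged illustration of why the P100's larger memory matters, a PyTorch probe that doubles the batch size until the GPU runs out of memory; the model is a placeholder and the exact fitting batch depends on the card (16 GB P100 vs. 12 GB per K80 GPU):

import torch
from torchvision import models

model = models.resnet152().cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

batch = 16
while True:
    try:
        x = torch.randn(batch, 3, 224, 224, device="cuda")
        model(x).sum().backward()   # full training-style memory use
        optimizer.step()
        optimizer.zero_grad()
        print("batch size", batch, "fits")
        batch *= 2
    except RuntimeError:            # CUDA out of memory
        print("batch size", batch, "does not fit")
        break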
XGBoost - Boosted Tree Construction
● P100 >1.25x faster than CPU and K80
● K80 acceleration does not provide a benefit over CPU
● P100 1.5x faster than CPU
(GPU tree construction sketched below.)
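
A sketch of enabling GPU tree construction in XGBoost via the gpu_hist tree method (present in XGBoost releases of this era); random data stands in for a real training set:

import numpy as np
import xgboost as xgb

# Random data standing in for a real tabular training set.
X = np.random.rand(100000, 50)
y = (X[:, 0] > 0.5).astype(int)
dtrain = xgb.DMatrix(X, label=y)

# 'gpu_hist' builds split histograms on the GPU; 'hist' is the CPU analogue.
params = {"objective": "binary:logistic",
          "tree_method": "gpu_hist",
          "max_depth": 8}
booster = xgb.train(params, dtrain, num_boost_round=100)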
Streaming Training Data from Object Storage
● File system loads images on demand from IBM Object Storage
● Large blocks (128 MB) ensure efficient download
● Local cache sized to hold the entire dataset, so subsequent epochs read locally
Timeline: an upfront sync of the 150 GB dataset takes 25 minutes; with streaming, the first epoch trains on streaming data and later epochs train on cached data. (Cache-on-read sketch below.)
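
A minimal sketch of the cache-on-first-read idea; get_object is a hypothetical stand-in for an IBM Cloud Object Storage GET, and a real deployment would use a FUSE file system or similar caching layer instead:

import os

CACHE_DIR = "/scratch/cache"  # sized to hold the entire dataset

def get_object(key):
    # Hypothetical large-block (128 MB) GET from object storage.
    raise NotImplementedError

def read(key):
    # Serve from the local cache; on a miss, stream and cache the object.
    path = os.path.join(CACHE_DIR, key.replace("/", "_"))
    if not os.path.exists(path):       # first epoch: streaming read
        os.makedirs(CACHE_DIR, exist_ok=True)
        with open(path, "wb") as f:
            f.write(get_object(key))
    with open(path, "rb") as f:        # later epochs: local read
        return f.read()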
Questions?
Rescale
● Mark Whitney, Head of Deep Learning, [email protected]
● Tyler Smith, Head of Partnerships, [email protected]
IBM Cloud
● Jerry Gutierrez, Global HPC Sales Lead, [email protected]
● Casey Knott, IBM Cloud Platform Specialist, [email protected]
Rescale on IBM Cloud
● http://www.rescale.com/ibm/