Scalable Distributed Training with Parameter Hub: a whirlwind tour
TVM Stack
Optimization
AutoTVM
AutoVTA
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC Hardware Fleet
Active Topology Probing
Your Cloud
Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.
Parameter Hub
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee and Arvind Krishnamurthy
An optimized, topology-aware, and dynamic mechanism for inter-machine communication, in the cloud-based training context.
Deep Learning constitutes an important workload in cloud today.
Major cloud providers all have an ecosystem for cloud learning.
Server demand for DL inference across data centers nearly quadrupled in less than 2 years. Source: Facebook
EC2 reclaims your GPU instances when it runs out of capacity.
Distributed Training
INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE
[Timeline figure: Worker 1 and Worker 2 run their forward (F1, F2, …) and backward (B1, B2, …) passes independently; between iterations the parameter server performs aggregation (A1, A2, …) and optimization (O1, O2, …) before the workers proceed. Legend: (F)orward Pass, (B)ackward Pass, (A)ggregation, (O)ptimization.]
Distributed Training Today
IN THE CONTEXT OF THE CLOUD
[Topology figure: racks of machines, some with GPUs, each behind a ToR switch, connected through the network core.]
Distributed Training Today
FORWARD AND BACKWARD PASSES IN WORKER
[Topology figure: Worker 1 and Worker 2, in different racks, run the forward and backward passes; PS 1 and PS 2 hold the parameters.]
Distributed Training Today
AGGREGATION AND OPTIMIZATION IN PS
[Topology figure: gradients flow across racks to PS 1 and PS 2, which perform aggregation and optimization.]
Distributed training is communication bound
[Figure: per-iteration time (0–1.8 s) for ResNet 269 on GRID 520 (2012), K80 (2014), M60 (2015), and V100 (2017), split into "GPU and Network active" and "GPU idle, waiting on network". Similar plots for Inception V3, AlexNet, and GoogleNet.]
- The problem gets worse over time: the bottleneck shifts from compute to communication.
- With modern GPUs, most of the time is spent on communication.
- Making GPUs faster will do little to increase throughput, wasting compute resources.
Bottlenecks in DDNN training
MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT.
[Topology figure: workers and parameter servers spread across racks behind ToR switches and the network core.]
Bottlenecks in DDNN training
FRAMEWORK BOTTLENECKS
[Figure: inside a worker, data flows GPU → training framework → network; the framework stages sit on the critical path.]
Bottlenecks in DDNN training
FRAMEWORK BOTTLENECKS
[Figure: per-iteration time breakdown (0–1.6 s) for ResNet 269, Inception, GoogleNet, and AlexNet, split into Compute, Data Copy and Communication, Aggregator, Optimizer, and Synchronization and other Overheads.]
Bottlenecks in DDNN training
BANDWIDTH BOTTLENECK
[Topology figure: gradient traffic from all workers converges on the parameter servers' links.]
Bottlenecks in Cloud-based DDNN training
INSUFFICIENT BANDWIDTH
Minimum bandwidth required for each popular NN so that communication does not bottleneck computation (8 workers, GTX 1080 Ti, central parameter servers, MxNet):
- GoogleNet / Inception: 40 Gbps
- ResNet: 100 Gbps
- AlexNet: 1200 Gbps
For comparison, cloud bandwidth today is 10–25 Gbps.
Bottlenecks in Cloud-based DDNN training
DEPLOYMENT-RELATED OVERHEAD
• Transient congestion, or oversubscription by design.
• Cross-rack communication cost is higher than intra-rack communication.
[Figure: measured pairwise bandwidth between 8 hosts, ranging from about 4 Gbps to 9 Gbps. Pairs in the same rack sustain ~8.9–9.0 Gbps, while cross-rack pairs sustain only ~4.7 Gbps. Clustering the measurements recovers the placement: Cluster 1 = {1, 3, 4, 5, 7}, Cluster 2 = {2, 6, 8}.]
Parameter Hub Optimizations
CODESIGNING SOFTWARE, HARDWARE WITH CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING
PHub Optimizations: streamlining the DDNN training pipeline
Eliminating framework bottlenecks:
[Figure: within the server, the pipeline GPU → data copy → aggregation → optimization → network is streamlined end to end.]
Software Optimizations
[Figure: gradients arrive from the network and traverse the parameter server's CPU and memory.]
Software Optimizations
GRADIENT AGGREGATION AND OPTIMIZATION
- Wide aggregation (used in MxNet): for each input queue, launch a series of threads for aggregation; each core reads the input queue from different workers and writes to different locations in the output queue. Requires synchronization.
- Tall aggregation: each core sequentially aggregates the same portion of the gradients within each queue. Great locality, no synchronization.
- NUMA-aware tree reduction: organize processors into a hierarchy (NUMA 0 / NUMA 1) and perform tree reduction. Too much coherence and synchronization traffic.
Software Optimizations
TALL AGGREGATION AND OPTIMIZATION
- Chunk a gradient into a series of virtual gradients deterministically.
- Each virtual gradient is mapped to a particular core on the server.
- Virtual gradients are transferred independently.
- A chunk is processed by only a single core, maintaining maximum locality.
[Figure: gradient array for key 0 from 8 workers, with the core mappings and the aggregated result.]
Software Optimizations
TALL AGGREGATION AND OPTIMIZATION
When aggregation is done, PHub:
- optimizes a chunk with the same core that aggregated that chunk;
- uses FP32-level streaming aggregation and optimization to hide communication latency.
[Figure: chunks of the gradient array for key 0 move from aggregated to optimized as they arrive.]
Eliminating deployment bottlenecks:
PHub hierarchical reduction, reducing cross-rack traffic.
[Topology figure: workers and parameter servers across racks, with aggregation pulled toward the ToR switches.]
Two-Phase Hierarchical Aggregation
RACK SCALE PARAMETER SERVICE
[Figure: each rack of Worker/PS machines sits behind its ToR; a PBox aggregator per rack connects to the cluster network.]
Two-Phase Hierarchical Aggregation
ADAPTING TO THE DATACENTER NETWORK TOPOLOGY
1. Intra-rack central aggregation at each rack's aggregator.
2. Inter-rack aggregation across the aggregators.
N times traffic reduction on the cross-rack links!
Efficient DDNN Training in Commercial Cloud
ACTIVE TOPOLOGY PROBING
VMs on Azure/EC2 → DPDK-based latency probe → distance matrix → clustering algorithms → inferred network topology* → automatic schedule generation → hierarchical reduction plan
Performance in commercial cloud with PHub
[Figure: training throughput speedup on 32 instances. Versus Facebook Gloo: 1.4x on Azure (Standard NC6), 2.4x on EC2 (P3.2xlarge). Versus ring reduction: 1.8x on Azure, 9.6x on EC2.]
Windows Azure and Amazon EC2. 32 instances. Up to 10 Gbps. Standard_NC6: Nvidia K80, batch size = 512. P3.2xlarge: Nvidia V100, batch size = 512. Facebook Caffe2/PyTorch. ResNet 50.
Framework Integration
Support for MxNet/PyTorch/Caffe2.

    auto pHub = std::make_shared<PHub>(cfg.redisIp, nMap, keySize, appAddrs,
                                       cntr, sizeof(float), cfg.rank, plp);
    pHub->ToggleUseSchedule(pSchedule);
    pHub->Reduce();
Hardware Parameter Hub
Balanced computation and communication resources:
- 10 ConnectX-3 cards
- 560+ Gbps network bandwidth
- 800 Gbps PCIe
- Fully supported by software
35 GB/s aggregation throughput. Supports 100+ ResNet-50 training nodes with a single machine.
[Figure: aggregation throughput comparison of Gloo HD, Gloo Ring, PS-Lite, and PHub SW.]
Better training throughput/$, up to 25%. ResNet-50; see paper for detailed estimates.