Transcript of “Distributed Training: A Gentle Introduction” · 2019-05-31
Distributed Training: A Gentle Introduction
Stephen Balaban, Lambda
lambdalabs.com (650) 479-5530
Single GPU Training
Examples:
- TensorFlow
- PyTorch
- Caffe / Caffe2
- MXNet
- etc.
Problems:
- Small batch size => noisier stochastic approximation of the gradient => lower learning rate => slower training.
Computation Happens:
- On one GPU
Gradient transfers:
- N/A
Model transfers:
- N/A
[Diagram: a single GPU over time; gradient computation (∇) and the model update (w) both occur on the GPU.]
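The single-GPU loop above can be sketched as follows. This is a toy NumPy stand-in for GPU training code, not any framework's API; the dataset, learning rate, and batch size are all illustrative.

```python
import numpy as np

# Minimal single-device SGD sketch (NumPy stands in for the GPU): each
# step computes a gradient from one mini-batch and updates w in place.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)
lr = 0.1
batch_size = 32  # small batches -> noisier gradient estimates
for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size  # d/dw of mean squared error
    w -= lr * grad
```

The smaller `batch_size` is, the noisier `grad` is as an estimate of the full-batch gradient, which is exactly the problem the slide names.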
Multi GPU Training (CPU Parameter Server)
Examples:
- TensorFlow w/ Graph Replication AKA Parameter Server AKA “Towers”
- PyTorch
- Caffe
Problems:
- Not good for low arithmetic intensity models.
- Performance highly dependent on PCIe topology.
Computation Happens:
- On all GPUs and CPU
Gradient transfers:
- From GPU to CPU (reduce)
Model transfers:
- From CPU to GPU (broadcast)
[Diagram: over time, each GPU sends its gradient (∇) to the CPU, which averages the gradients (reduce), updates the model, and broadcasts the new weights (w) back to every GPU.]
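The reduce/update/broadcast cycle can be sketched in-process. NumPy arrays stand in for per-GPU replicas here; the function name and all numbers are illustrative, not a real framework API.

```python
import numpy as np

# Toy sketch of the CPU parameter-server pattern.
def worker_gradient(w, xb, yb):
    """Each 'GPU' computes a gradient on its own slice of the batch."""
    return 2 * xb.T @ (xb @ w - yb) / len(xb)

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)        # the master copy lives on the "CPU"
lr, n_gpus = 0.1, 4
for step in range(200):
    idx = rng.integers(0, len(X), size=128)
    shards = np.array_split(idx, n_gpus)
    grads = [worker_gradient(w, X[s], y[s]) for s in shards]  # one per GPU
    w -= lr * np.mean(grads, axis=0)   # reduce on the CPU, then update
    # the updated w is then broadcast back to every GPU (implicit here)
```

Every gradient crosses the PCIe bus to the CPU and every updated model crosses back, which is why PCIe topology dominates performance in this scheme.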
Multi GPU Training (Multi GPU all-reduce)
Examples:
- TensorFlow + NCCL
- PyTorch + NCCL
Problems:
- Not good for high arithmetic intensity models.
- Performance highly dependent on PCIe topology.
Computation Happens:
- On all GPUs
Gradient transfers:
- GPU to GPU during NCCL all-reduce
Model transfers:
- GPU to GPU during NCCL all-reduce
[Diagram: over time, gradient averaging and model update communication (∇, w) occur in a distributed fashion, GPU to GPU, within a single node.]
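The key difference from the parameter-server picture is that there is no central copy of the model: each GPU keeps a replica, and averaging gradients with an all-reduce keeps the replicas identical. A simulated NumPy sketch (real implementations call NCCL through the frameworks' distributed APIs):

```python
import numpy as np

# Sketch of single-node data parallelism with an all-reduce.
def all_reduce_mean(tensors):
    """Every replica ends up holding the same averaged tensor."""
    avg = np.mean(tensors, axis=0)
    return [avg.copy() for _ in tensors]

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

n_gpus, lr = 4, 0.1
ws = [np.zeros(4) for _ in range(n_gpus)]   # one model replica per GPU
for step in range(200):
    idx = rng.integers(0, len(X), size=128)
    shards = np.array_split(idx, n_gpus)
    grads = [2 * X[s].T @ (X[s] @ w - y[s]) / len(s)
             for w, s in zip(ws, shards)]
    grads = all_reduce_mean(grads)                 # GPU-to-GPU exchange
    ws = [w - lr * g for w, g in zip(ws, grads)]   # identical local updates
```

Because every replica applies the same averaged gradient, the copies never diverge, and no broadcast step is needed.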
Asynchronous Distributed SGD
Examples:
- Hogwild!
- Async SGD (AKA Downpour SGD)
Problems:
- Stale gradients.
- Code that is difficult to write and maintain.
- Difficult to reason about order of operations.
Computation Happens:
- On all workers and parameter servers
Gradient transfers:
- Worker to parameter server (asynchronously)
Model transfers:
- Parameter server to worker (asynchronously)
[Diagram: Workers A, B, and C send their gradients (∇A, ∇B, ∇C) to the parameter server asynchronously; the model parameters are also updated asynchronously, so between t=0 and t=1 some workers still hold w0 while others have already pulled w1.]
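A lock-free sketch in the spirit of Hogwild!, simulated with Python threads (this is an in-process illustration, not the paper's implementation): several workers read and update one shared weight vector with no synchronization, so some gradients are computed against stale parameters.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)   # shared parameters, updated without any lock
lr = 0.05

def worker(seed):
    r = np.random.default_rng(seed)
    for _ in range(300):
        idx = r.integers(0, len(X), size=32)
        w_read = w.copy()   # this read may already be stale
        grad = 2 * X[idx].T @ (X[idx] @ w_read - y[idx]) / 32
        w[:] -= lr * grad   # unsynchronized in-place update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

On this convex toy problem the staleness is harmless; in general, stale gradients and the unpredictable interleaving of updates are exactly the problems the slide lists.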
Synchronous Distributed SGD
Examples:
- TensorFlow Distributed
- torch.distributed
Problems:
- Needs lots of worker-to-parameter-server bandwidth.
- Requires extra code and hardware for the parameter server.
Computation Happens:
- On all workers and parameter servers
Gradient transfers:
- Worker to parameter server
Model transfers:
- Parameter server to worker
[Diagram: Workers A, B, and C send gradients (∇) to the parameter server synchronously; the model parameters (w) are updated synchronously and sent back to all workers before the next step.]
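The synchronous contract can be sketched as a server that applies exactly one update per round, only after every worker has pushed its gradient. All names here are illustrative, and the "network" is simulated in-process.

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr, n_workers):
        self.w = np.zeros(dim)
        self.lr = lr
        self.n_workers = n_workers
        self.pending = []

    def push(self, grad):
        """Workers push gradients; the update waits for the whole group."""
        self.pending.append(grad)
        if len(self.pending) == self.n_workers:  # synchronization barrier
            self.w -= self.lr * np.mean(self.pending, axis=0)
            self.pending = []

    def pull(self):
        """Workers pull the current model before the next round."""
        return self.w.copy()

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

ps = ParameterServer(dim=4, lr=0.1, n_workers=4)
for step in range(200):
    w = ps.pull()                      # every worker starts from the same w
    idx = rng.integers(0, len(X), size=128)
    for shard in np.array_split(idx, 4):
        g = 2 * X[shard].T @ (X[shard] @ w - y[shard]) / len(shard)
        ps.push(g)                     # round completes on the 4th push
```

Every worker's full gradient crosses to the server each round, which is where the worker-to-parameter-server bandwidth requirement comes from.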
Multiple Parameter Servers
Examples:
- TensorFlow Distributed
- PaddlePaddle
Problems:
- Need to tune the ratio of parameter servers to workers.
- Again, even more complicated and difficult-to-maintain code.
Computation Happens:
- On all workers and parameter servers
Gradient transfers:
- Worker gradient shards to parameter servers
Model transfers:
- Parameter server model shards to workers
[Diagram: Gradients are split into N shards (here ∇1/2 and ∇2/2) and sent to their respective parameter servers; each server updates its model shard (w1/2, w2/2), and the workers join the model parameter shards back together.]
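Sharding spreads the bandwidth load across servers. An in-process sketch with two servers, each owning half of the weight vector (all names and numbers are illustrative):

```python
import numpy as np

n_servers, lr = 2, 0.1
server_shards = [np.zeros(2), np.zeros(2)]   # server i owns shard i

def round_trip(grad):
    """Split the gradient, update each server's shard, rejoin the model."""
    global server_shards
    grad_shards = np.array_split(grad, n_servers)
    # each server updates only its own shard (in parallel, in a real system)
    server_shards = [s - lr * g for s, g in zip(server_shards, grad_shards)]
    return np.concatenate(server_shards)     # workers rejoin the shards

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.concatenate(server_shards)
for step in range(200):
    idx = rng.integers(0, len(X), size=128)
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
    w = round_trip(grad)
```

Each server now sees only 1/N of the gradient traffic, but choosing N (the server-to-worker ratio) becomes one more thing to tune.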
Ring all-reduce Distributed Training
Examples:
- Horovod¹
- tensorflow-allreduce
¹ Horovod uses NCCL 2.0’s implementation of multi-node all-reduce.
Computation Happens:
- On all workers
Gradient transfers:
- Worker transfers gradient to peers during all-reduce
Model transfers:
- Model “update” happens at the end of the multi-node all-reduce operation
[Diagram: Workers A, B, and C pass gradients (∇) and weights (w) around a ring over time; the gradient average and the model update are both handled as part of the multi-node ring all-reduce.]
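The ring algorithm itself is small enough to sketch. This follows Patarasuk and Yuan's bandwidth-optimal scheme (cited at the end of the deck) in NumPy; it illustrates the algorithm, not NCCL's implementation. Each of the N workers splits its vector into N chunks; a reduce-scatter phase (N−1 steps) leaves each worker with the full sum of one chunk, then an all-gather phase (N−1 steps) circulates the reduced chunks until every worker holds the complete sum. Each worker sends only 1/N of the data per step, so link usage stays balanced.

```python
import numpy as np

def ring_allreduce(vectors):
    n = len(vectors)
    chunks = [list(np.array_split(v.astype(float), n)) for v in vectors]
    # reduce-scatter: at step s, worker i sends chunk (i - s) mod n to
    # worker i + 1, which adds it to its own copy of that chunk
    for s in range(n - 1):
        sends = [chunks[i][(i - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - s) % n] += sends[i]
    # now worker i holds the fully reduced chunk (i + 1) mod n;
    # all-gather: circulate the reduced chunks around the same ring
    for s in range(n - 1):
        sends = [chunks[i][(i + 1 - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - s) % n] = sends[i]
    return [np.concatenate(c) for c in chunks]

reduced = ring_allreduce([np.arange(8.0), np.ones(8), 2 * np.ones(8)])
# every worker now holds the elementwise sum of all three vectors
```

Because the averaged gradient arrives at every worker, each one applies the same update locally and no separate model broadcast is needed.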
Parameter Servers vs Multi-node Ring all-reduce
Multi-node Ring all-reduce:
- Good for communication intensive workloads. (Low arithmetic intensity.)
- High node-to-node communication.
Parameter Servers:
- Good for compute intensive workloads. (High arithmetic intensity.)
- High node-to-parameter-server communication.
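One way to see which regime a given model falls into is a back-of-the-envelope comparison of per-step compute time against gradient-communication time. Every number below is an illustrative assumption except the 42 GB/s figure, which is the measured inter-node bandwidth quoted later in this deck.

```python
params = 25_000_000          # a ResNet-50-sized parameter count (assumed)
flops_per_sample = 8e9       # forward + backward FLOPs per sample (assumed)
batch_per_gpu = 64           # assumed per-GPU batch size
gpu_flops = 14e12            # assumed sustained FP32 throughput
link_bytes_per_s = 42e9      # inter-node bandwidth from this deck

compute_s = batch_per_gpu * flops_per_sample / gpu_flops
comm_s = params * 4 / link_bytes_per_s   # one FP32 gradient copy per step

# With these assumptions compute dominates communication; shrink the batch
# or grow the parameter count relative to the FLOPs and the balance tips
# toward communication-bound.
print(f"compute {compute_s * 1e3:.1f} ms, communication {comm_s * 1e3:.1f} ms")
```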
GPU RDMA over InfiniBand
[Diagram: two data pathways from GPU 4 to the InfiniBand NIC. With RDMA, data goes directly out the door via the PCIe switch; without RDMA, an additional copy to CPU memory (through CPU 1) is required.]
Hardware Configuration for Multi-node all-reduce
Four InfiniBand NICs are placed alongside the NVLink fabric underneath a PCIe switch. They are used for GPU RDMA during the distributed all-reduce operation.
[Diagram: eight GPUs (GPU 0–7), four PCIe switches split between CPU 1 and CPU 2, and four InfiniBand NICs, one under each PCIe switch. Legend: arrow = 16x PCIe connection; green double arrow = NVLink; red dashed line = InfiniBand connection.]
Four Non-overlapping Pathways
Each GPU has six NVLink connections, which allows four non-overlapping pathways to be drawn through all eight GPUs on the system and out an InfiniBand card, optimizing both GPU-to-GPU and node-to-node communication during the distributed all-reduce. (See Sylvain Jeaugey’s “NCCL 2.0” presentation for more information.)
[Diagram: the same eight-GPU, four-PCIe-switch, four-NIC topology as above, used to draw the four non-overlapping pathways.]
Inter-node Bandwidth
42 GB/s
The GPU RDMA (Remote Direct Memory Access) capabilities of the InfiniBand cards and the V100 GPUs allow for an inter-node memory bandwidth of 42 GB/s: 84% of the 50 GB/s theoretical peak allowed by the four cards (50 GB/s = 4 cards × 100 Gb/s ÷ 8 bits/byte).
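The slide's arithmetic, reproduced step by step:

```python
cards = 4
gbit_per_card = 100                        # 100 Gb/s per InfiniBand card
peak_GBps = cards * gbit_per_card / 8      # 8 bits per byte -> 50 GB/s
measured_GBps = 42
efficiency = measured_GBps / peak_GBps     # -> 0.84, i.e. 84% of peak
```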
Citations
- Sergeev, Alexander and Mike Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. https://arxiv.org/pdf/1802.05799.pdf
- Patarasuk, Pitch and Xin Yuan. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput., 69:117–124, 2009. https://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf
- Jeaugey, Sylvain. NCCL 2.0 (2017). http://on-demand.gputechconf.com/gtc/2017/presentation/s7155-jeaugey-nccl.pdf
- WikiChip: NVLink. https://fuse.wikichip.org/news/1224/a-look-at-nvidias-nvlink-interconnect-and-the-nvswitch/
Additional thanks to Chuan Li and Steve Clarkson.
About Me
● CEO of Lambda.
● Started using CNNs for face recognition in 2012.
● First employee at Perceptio. We developed image recognition CNNs that ran locally on the iPhone. Acquired by Apple in 2015.
● Published in SPIE and NeurIPS.