Transcript of “Distributed Training: A Gentle Introduction” · 2019-05-31
Distributed Training: A Gentle Introduction
Stephen Balaban, Lambda
lambdalabs.com (650) 479-5530
Single GPU Training
Examples:
- TensorFlow
- PyTorch
- Caffe / Caffe2
- MXNet
- etc.
Problems:
- Small batch size => noisier stochastic approximation of the gradient => lower learning rate => slower training.
Computation Happens:
- On one GPU
Gradient transfers:
- N/A
Model transfers:
- N/A
[Diagram: a single GPU over time; gradient computation (∇) and the model update (w) both occur on the GPU.]
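The single-GPU loop above can be sketched as follows. This is a toy NumPy stand-in for GPU training code, not any framework's API; the dataset, learning rate, and batch size are all illustrative.

```python
import numpy as np

# Minimal single-device SGD sketch (NumPy stands in for the GPU): each
# step computes a gradient from one mini-batch and updates w in place.
rng = np.random.default_rng(0)
X = rng.normal(size=(256, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)
lr = 0.1
batch_size = 32  # small batches -> noisier gradient estimates
for step in range(200):
    idx = rng.integers(0, len(X), size=batch_size)
    xb, yb = X[idx], y[idx]
    grad = 2 * xb.T @ (xb @ w - yb) / batch_size  # d/dw of mean squared error
    w -= lr * grad
```

The smaller `batch_size` is, the noisier `grad` is as an estimate of the full-batch gradient, which is exactly the problem the slide names.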
Multi GPU Training (CPU Parameter Server)
Examples:
- TensorFlow w/ Graph Replication AKA Parameter Server AKA “Towers”
- PyTorch
- Caffe
Problems:
- Not good for low arithmetic intensity models.
- Performance highly dependent on PCIe topology.
Computation Happens:
- On all GPUs and CPU
Gradient transfers:
- From GPU to CPU (reduce)
Model transfers:
- From CPU to GPU (broadcast)
[Diagram: over time, each GPU sends its gradient (∇) to the CPU, which averages the gradients (reduce), updates the model, and broadcasts the new weights (w) back to every GPU.]
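The reduce/update/broadcast cycle can be sketched in-process. NumPy arrays stand in for per-GPU replicas here; the function name and all numbers are illustrative, not a real framework API.

```python
import numpy as np

# Toy sketch of the CPU parameter-server pattern.
def worker_gradient(w, xb, yb):
    """Each 'GPU' computes a gradient on its own slice of the batch."""
    return 2 * xb.T @ (xb @ w - yb) / len(xb)

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)        # the master copy lives on the "CPU"
lr, n_gpus = 0.1, 4
for step in range(200):
    idx = rng.integers(0, len(X), size=128)
    shards = np.array_split(idx, n_gpus)
    grads = [worker_gradient(w, X[s], y[s]) for s in shards]  # one per GPU
    w -= lr * np.mean(grads, axis=0)   # reduce on the CPU, then update
    # the updated w is then broadcast back to every GPU (implicit here)
```

Every gradient crosses the PCIe bus to the CPU and every updated model crosses back, which is why PCIe topology dominates performance in this scheme.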
Multi GPU Training (Multi GPU all-reduce)
Examples:
- TensorFlow + NCCL
- PyTorch + NCCL
Problems:
- Not good for high arithmetic intensity models.
- Performance highly dependent on PCIe topology.
Computation Happens:
- On all GPUs
Gradient transfers:
- GPU to GPU during NCCL all-reduce
Model transfers:
- GPU to GPU during NCCL all-reduce
[Diagram: over time, gradient averaging and model update communication (∇, w) occur in a distributed fashion, GPU to GPU, within a single node.]
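The key difference from the parameter-server picture is that there is no central copy of the model: each GPU keeps a replica, and averaging gradients with an all-reduce keeps the replicas identical. A simulated NumPy sketch (real implementations call NCCL through the frameworks' distributed APIs):

```python
import numpy as np

# Sketch of single-node data parallelism with an all-reduce.
def all_reduce_mean(tensors):
    """Every replica ends up holding the same averaged tensor."""
    avg = np.mean(tensors, axis=0)
    return [avg.copy() for _ in tensors]

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

n_gpus, lr = 4, 0.1
ws = [np.zeros(4) for _ in range(n_gpus)]   # one model replica per GPU
for step in range(200):
    idx = rng.integers(0, len(X), size=128)
    shards = np.array_split(idx, n_gpus)
    grads = [2 * X[s].T @ (X[s] @ w - y[s]) / len(s)
             for w, s in zip(ws, shards)]
    grads = all_reduce_mean(grads)                 # GPU-to-GPU exchange
    ws = [w - lr * g for w, g in zip(ws, grads)]   # identical local updates
```

Because every replica applies the same averaged gradient, the copies never diverge, and no broadcast step is needed.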
Asynchronous Distributed SGD
Examples:
- Hogwild!
- Async SGD (AKA Downpour SGD)
Problems:
- Stale gradients.
- Code that is difficult to write and maintain.
- Difficult to reason about order of operations.
Computation Happens:
- On all workers and parameter servers
Gradient transfers:
- Worker to parameter server (asynchronously)
Model transfers:
- Parameter server to worker (asynchronously)
[Diagram: Workers A, B, and C send their gradients (∇A, ∇B, ∇C) to the parameter server asynchronously; the model parameters are also updated asynchronously, so between t=0 and t=1 some workers still hold w0 while others have already pulled w1.]
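A lock-free sketch in the spirit of Hogwild!, simulated with Python threads (this is an in-process illustration, not the paper's implementation): several workers read and update one shared weight vector with no synchronization, so some gradients are computed against stale parameters.

```python
import threading
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.zeros(4)   # shared parameters, updated without any lock
lr = 0.05

def worker(seed):
    r = np.random.default_rng(seed)
    for _ in range(300):
        idx = r.integers(0, len(X), size=32)
        w_read = w.copy()   # this read may already be stale
        grad = 2 * X[idx].T @ (X[idx] @ w_read - y[idx]) / 32
        w[:] -= lr * grad   # unsynchronized in-place update

threads = [threading.Thread(target=worker, args=(s,)) for s in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

On this convex toy problem the staleness is harmless; in general, stale gradients and the unpredictable interleaving of updates are exactly the problems the slide lists.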
Synchronous Distributed SGD
Examples:
- TensorFlow Distributed
- torch.distributed
Problems:
- Needs lots of worker-to-parameter-server bandwidth.
- Requires extra code and hardware for the parameter server.
Computation Happens:
- On all workers and parameter servers
Gradient transfers:
- Worker to parameter server
Model transfers:
- Parameter server to worker
[Diagram: Workers A, B, and C send gradients (∇) to the parameter server synchronously; the model parameters (w) are updated synchronously and sent back to all workers before the next step.]
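The synchronous contract can be sketched as a server that applies exactly one update per round, only after every worker has pushed its gradient. All names here are illustrative, and the "network" is simulated in-process.

```python
import numpy as np

class ParameterServer:
    def __init__(self, dim, lr, n_workers):
        self.w = np.zeros(dim)
        self.lr = lr
        self.n_workers = n_workers
        self.pending = []

    def push(self, grad):
        """Workers push gradients; the update waits for the whole group."""
        self.pending.append(grad)
        if len(self.pending) == self.n_workers:  # synchronization barrier
            self.w -= self.lr * np.mean(self.pending, axis=0)
            self.pending = []

    def pull(self):
        """Workers pull the current model before the next round."""
        return self.w.copy()

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

ps = ParameterServer(dim=4, lr=0.1, n_workers=4)
for step in range(200):
    w = ps.pull()                      # every worker starts from the same w
    idx = rng.integers(0, len(X), size=128)
    for shard in np.array_split(idx, 4):
        g = 2 * X[shard].T @ (X[shard] @ w - y[shard]) / len(shard)
        ps.push(g)                     # round completes on the 4th push
```

Every worker's full gradient crosses to the server each round, which is where the worker-to-parameter-server bandwidth requirement comes from.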
Multiple Parameter Servers
Examples:
- TensorFlow Distributed
- PaddlePaddle
Problems:
- Need to tune the ratio of parameter servers to workers.
- Again, even more complicated and difficult-to-maintain code.
Computation Happens:
- On all workers and parameter servers
Gradient transfers:
- Worker gradient shards to parameter servers
Model transfers:
- Parameter server model shards to workers
[Diagram: Gradients are split into N shards (here ∇1/2 and ∇2/2) and sent to their respective parameter servers; each server updates its model shard (w1/2, w2/2), and the workers join the model parameter shards back together.]
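Sharding spreads the bandwidth load across servers. An in-process sketch with two servers, each owning half of the weight vector (all names and numbers are illustrative):

```python
import numpy as np

n_servers, lr = 2, 0.1
server_shards = [np.zeros(2), np.zeros(2)]   # server i owns shard i

def round_trip(grad):
    """Split the gradient, update each server's shard, rejoin the model."""
    global server_shards
    grad_shards = np.array_split(grad, n_servers)
    # each server updates only its own shard (in parallel, in a real system)
    server_shards = [s - lr * g for s, g in zip(server_shards, grad_shards)]
    return np.concatenate(server_shards)     # workers rejoin the shards

rng = np.random.default_rng(0)
X = rng.normal(size=(512, 4))
true_w = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ true_w

w = np.concatenate(server_shards)
for step in range(200):
    idx = rng.integers(0, len(X), size=128)
    grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
    w = round_trip(grad)
```

Each server now sees only 1/N of the gradient traffic, but choosing N (the server-to-worker ratio) becomes one more thing to tune.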
Ring all-reduce Distributed Training
Examples:
- Horovod¹
- tensorflow-allreduce
¹ Horovod uses NCCL 2.0’s implementation of multi-node all-reduce.
Computation Happens:
- On all workers
Gradient transfers:
- Worker transfers gradient to peers during all-reduce
Model transfers:
- Model “update” happens at the end of the multi-node all-reduce operation
[Diagram: Workers A, B, and C pass gradients (∇) and weights (w) around a ring over time; the gradient average and the model update are both handled as part of the multi-node ring all-reduce.]
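The ring algorithm itself is small enough to sketch. This follows Patarasuk and Yuan's bandwidth-optimal scheme (cited at the end of the deck) in NumPy; it illustrates the algorithm, not NCCL's implementation. Each of the N workers splits its vector into N chunks; a reduce-scatter phase (N−1 steps) leaves each worker with the full sum of one chunk, then an all-gather phase (N−1 steps) circulates the reduced chunks until every worker holds the complete sum. Each worker sends only 1/N of the data per step, so link usage stays balanced.

```python
import numpy as np

def ring_allreduce(vectors):
    n = len(vectors)
    chunks = [list(np.array_split(v.astype(float), n)) for v in vectors]
    # reduce-scatter: at step s, worker i sends chunk (i - s) mod n to
    # worker i + 1, which adds it to its own copy of that chunk
    for s in range(n - 1):
        sends = [chunks[i][(i - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i - s) % n] += sends[i]
    # now worker i holds the fully reduced chunk (i + 1) mod n;
    # all-gather: circulate the reduced chunks around the same ring
    for s in range(n - 1):
        sends = [chunks[i][(i + 1 - s) % n].copy() for i in range(n)]
        for i in range(n):
            chunks[(i + 1) % n][(i + 1 - s) % n] = sends[i]
    return [np.concatenate(c) for c in chunks]

reduced = ring_allreduce([np.arange(8.0), np.ones(8), 2 * np.ones(8)])
# every worker now holds the elementwise sum of all three vectors
```

Because the averaged gradient arrives at every worker, each one applies the same update locally and no separate model broadcast is needed.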
Parameter Servers vs Multi-node Ring all-reduce
Multi-node Ring all-reduce:
- Good for communication intensive workloads. (Low arithmetic intensity.)
- High node-to-node communication.
Parameter Servers:
- Good for compute intensive workloads. (High arithmetic intensity.)
- High node-to-parameter-server communication.
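One way to see which regime a given model falls into is a back-of-the-envelope comparison of per-step compute time against gradient-communication time. Every number below is an illustrative assumption except the 42 GB/s figure, which is the measured inter-node bandwidth quoted later in this deck.

```python
params = 25_000_000          # a ResNet-50-sized parameter count (assumed)
flops_per_sample = 8e9       # forward + backward FLOPs per sample (assumed)
batch_per_gpu = 64           # assumed per-GPU batch size
gpu_flops = 14e12            # assumed sustained FP32 throughput
link_bytes_per_s = 42e9      # inter-node bandwidth from this deck

compute_s = batch_per_gpu * flops_per_sample / gpu_flops
comm_s = params * 4 / link_bytes_per_s   # one FP32 gradient copy per step

# With these assumptions compute dominates communication; shrink the batch
# or grow the parameter count relative to the FLOPs and the balance tips
# toward communication-bound.
print(f"compute {compute_s * 1e3:.1f} ms, communication {comm_s * 1e3:.1f} ms")
```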
GPU RDMA over InfiniBand
[Diagram: two data pathways from GPU 4 to the InfiniBand NIC. With RDMA, data goes directly out the door via the PCIe switch; without RDMA, an additional copy to CPU memory (through CPU 1) is required.]
Hardware Configuration for Multi-node all-reduce
Four InfiniBand NICs are placed alongside the NVLink fabric underneath a PCIe switch. They are used for GPU RDMA during the distributed all-reduce operation.
[Diagram: eight GPUs (GPU 0–7), four PCIe switches split between CPU 1 and CPU 2, and four InfiniBand NICs, one under each PCIe switch. Legend: arrow = 16x PCIe connection; green double arrow = NVLink; red dashed line = InfiniBand connection.]
Four Non-overlapping Pathways
Each GPU has six NVLink connections, which allows four non-overlapping pathways to be drawn through all eight GPUs on the system and out an InfiniBand card, optimizing both GPU-to-GPU and node-to-node communication during the distributed all-reduce. (See Sylvain Jeaugey’s “NCCL 2.0” presentation for more information.)
[Diagram: the same eight-GPU, four-PCIe-switch, four-NIC topology as above, used to draw the four non-overlapping pathways.]
Inter-node Bandwidth
42 GB/s
The GPU RDMA (Remote Direct Memory Access) capabilities of the InfiniBand cards and the V100 GPUs allow for an inter-node memory bandwidth of 42 GB/s: 84% of the 50 GB/s theoretical peak allowed by the four cards (50 GB/s = 4 cards × 100 Gb/s ÷ 8 bits/byte).
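The slide's arithmetic, reproduced step by step:

```python
cards = 4
gbit_per_card = 100                        # 100 Gb/s per InfiniBand card
peak_GBps = cards * gbit_per_card / 8      # 8 bits per byte -> 50 GB/s
measured_GBps = 42
efficiency = measured_GBps / peak_GBps     # -> 0.84, i.e. 84% of peak
```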
Citations
- Sergeev, Alexander and Mike Del Balso. Horovod: fast and easy distributed deep learning in TensorFlow. https://arxiv.org/pdf/1802.05799.pdf
- Patarasuk, Pitch and Xin Yuan. Bandwidth optimal all-reduce algorithms for clusters of workstations. J. Parallel Distrib. Comput., 69:117–124, 2009. https://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf
- Jeaugey, Sylvain. NCCL 2.0 (2017). http://on-demand.gputechconf.com/gtc/2017/presentation/s7155-jeaugey-nccl.pdf
- WikiChip: NVLink. https://fuse.wikichip.org/news/1224/a-look-at-nvidias-nvlink-interconnect-and-the-nvswitch/
Additional thanks to Chuan Li and Steve Clarkson.
About Me
● CEO of Lambda.
● Started using CNNs for face recognition in 2012.
● First employee at Perceptio. We developed image recognition CNNs that ran locally on the iPhone. Acquired by Apple in 2015.
● Published in SPIE and NeurIPS.