Transcript of "Distributed Training: A Gentle Introduction" · Stephen Balaban, Lambda · 2019-05-31

Page 1

Distributed Training: A Gentle Introduction

Stephen Balaban, Lambda

lambdalabs.com (650) 479-5530

Page 2

Single GPU Training

Examples:
- TensorFlow
- PyTorch
- Caffe / Caffe2
- MXNet
- etc.

Problems:
- Small batch size => noisier stochastic approximation of the gradient => lower learning rate => slower training.


Computation Happens:
- On one GPU

Gradient transfers:
- N/A

Model transfers:
- N/A

[Diagram: a single GPU on a timeline; gradient computation (∇) and the model update (w) both occur on the GPU.]
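As a reference point for the variants that follow, here is a minimal sketch of this single-GPU loop in PyTorch; the model, batch, and hyperparameters are placeholders, not from the talk:

    import torch
    import torch.nn as nn

    device = torch.device("cuda:0")
    model = nn.Linear(784, 10).to(device)              # hypothetical model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(32, 784, device=device)        # hypothetical batch
        y = torch.randint(0, 10, (32,), device=device)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()                # ∇ computed on the GPU
        optimizer.step()                               # w updated on the same GPU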

Page 3

Multi GPU Training (CPU Parameter Server)

Examples:
- TensorFlow w/ graph replication, AKA parameter server, AKA "towers"
- PyTorch
- Caffe

Problems:
- Not good for low arithmetic intensity models.
- Performance highly dependent on PCIe topology.


Computation Happens:
- On all GPUs and the CPU

Gradient transfers:
- From GPU to CPU (reduce)

Model transfers:
- From CPU to GPU (broadcast)

[Diagram: the CPU and several GPUs on a timeline; gradients flow from the GPUs to the CPU, where they are averaged (reduce) and the model is updated, and the new weights w are broadcast back to every GPU.]
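A minimal sketch of this pattern, assuming one Python process driving several GPUs (model and batches are placeholders): gradients reduce to the CPU, the update happens there, and the new weights broadcast back out.

    import copy
    import torch
    import torch.nn as nn

    devices = [torch.device(f"cuda:{i}") for i in range(torch.cuda.device_count())]
    cpu_model = nn.Linear(784, 10)                     # master weights live on the CPU
    optimizer = torch.optim.SGD(cpu_model.parameters(), lr=0.01)
    replicas = [copy.deepcopy(cpu_model).to(d) for d in devices]
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        for replica, d in zip(replicas, devices):      # each GPU trains its own slice
            x = torch.randn(32, 784, device=d)         # hypothetical per-GPU batch
            y = torch.randint(0, 10, (32,), device=d)
            replica.zero_grad()
            loss_fn(replica(x), y).backward()
        # Reduce: copy gradients to the CPU and average them there.
        for cpu_p, *gpu_ps in zip(cpu_model.parameters(),
                                  *(r.parameters() for r in replicas)):
            cpu_p.grad = torch.stack([p.grad.cpu() for p in gpu_ps]).mean(dim=0)
        optimizer.step()                               # model update on the CPU
        # Broadcast: push the updated weights back to every GPU.
        for replica in replicas:
            replica.load_state_dict(cpu_model.state_dict())

Every transfer here crosses PCIe, which is why the slide's caveat about PCIe topology matters.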

Page 4

Multi GPU Training (Multi GPU all-reduce)

Examples:
- TensorFlow + NCCL
- PyTorch + NCCL

Problems:
- Not good for high arithmetic intensity models.
- Performance highly dependent on PCIe topology.


Computation Happens:
- On all GPUs

Gradient transfers:
- GPU to GPU during NCCL all-reduce

Model transfers:
- GPU to GPU during NCCL all-reduce

[Diagram: several GPUs on a timeline exchanging (∇, w); gradient averaging and the updated-model communication occur in a distributed fashion within a single node.]
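A minimal sketch of the same step with NCCL doing the averaging GPU to GPU, assuming one process per GPU (e.g. launched with torchrun); model and data are placeholders:

    import torch
    import torch.distributed as dist
    import torch.nn as nn

    dist.init_process_group(backend="nccl")            # NCCL moves data GPU to GPU
    rank = dist.get_rank()
    torch.cuda.set_device(rank)                        # assumes a single-node launch
    model = nn.Linear(784, 10).cuda()
    for p in model.parameters():                       # start from identical weights
        dist.broadcast(p.data, src=0)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(32, 784, device="cuda")        # hypothetical per-GPU batch
        y = torch.randint(0, 10, (32,), device="cuda")
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        for p in model.parameters():                   # all-reduce: every GPU ends up
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)  # with the summed gradient,
            p.grad /= dist.get_world_size()            # then averages it locally
        optimizer.step()                               # identical update everywhere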

Page 5

Asynchronous Distributed SGD

Examples:
- Hogwild!
- Async SGD, AKA Downpour SGD

Problems:
- Stale gradients.
- Code that is difficult to write and maintain.
- Difficult to reason about order of operations.


Computation Happens:
- On all workers and parameter servers

Gradient transfers:
- Worker to parameter server (asynchronously)

Model transfers:
- Parameter server to worker (asynchronously)

[Diagram: workers A, B, and C send gradients ∇_A, ∇_B, ∇_C to the parameter server asynchronously; the server's parameters advance from w_0 (t=0) to w_1 (t=1) while a worker may still be training against the stale w_0.]
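A minimal Hogwild!-style sketch (CPU-only, placeholder model and data): the parameters sit in shared memory and several processes update them asynchronously with no locking, which is exactly where stale gradients come from.

    import torch
    import torch.multiprocessing as mp
    import torch.nn as nn

    def worker(model):
        optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
        loss_fn = nn.MSELoss()
        for step in range(1000):
            x = torch.randn(32, 10)                    # hypothetical batch
            y = torch.randn(32, 1)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()            # may use already-stale weights
            optimizer.step()                           # lock-free update of shared params

    if __name__ == "__main__":
        model = nn.Linear(10, 1)
        model.share_memory()                           # parameters shared by all workers
        procs = [mp.Process(target=worker, args=(model,)) for _ in range(4)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()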

Page 6

Synchronous Distributed SGD

Examples:
- TensorFlow Distributed
- torch.distributed

Problems:
- Needs lots of worker-to-parameter-server bandwidth.
- Requires extra code and hardware for the parameter server.


Computation Happens:
- On all workers and parameter servers

Gradient transfers:
- Worker to parameter server

Model transfers:
- Parameter server to worker

[Diagram: workers A, B, and C send gradients ∇ to the parameter server synchronously; the server updates the model and sends the new weights w back to all workers in lockstep.]
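A minimal sketch of the synchronous pattern using torch.distributed primitives, with rank 0 standing in for the parameter server (and also acting as a worker here; a real deployment runs the server on separate hardware, which is the slide's point about extra code and machines):

    import torch
    import torch.distributed as dist
    import torch.nn as nn

    dist.init_process_group(backend="gloo")            # CPU-friendly backend for the sketch
    rank, world = dist.get_rank(), dist.get_world_size()
    model = nn.Linear(784, 10)
    for p in model.parameters():                       # everyone starts from rank 0's weights
        dist.broadcast(p.data, src=0)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(32, 784)                       # hypothetical per-worker batch
        y = torch.randint(0, 10, (32,))
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for p in model.parameters():                   # gradients go worker -> server
            dist.reduce(p.grad, dst=0, op=dist.ReduceOp.SUM)
        if rank == 0:                                  # the "server" applies the update
            for p in model.parameters():
                p.grad /= world
            optimizer.step()
        for p in model.parameters():                   # new model goes server -> workers
            dist.broadcast(p.data, src=0)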

Page 7

Multiple Parameter Servers

Examples:
- TensorFlow Distributed
- PaddlePaddle

Problems:
- Need to tune the ratio of parameter servers to workers.
- Again, even more complicated and difficult-to-maintain code.


Computation Happens:
- On all workers and parameter servers

Gradient transfers:
- Worker gradient shards to parameter servers

Model transfers:
- Parameter server model shards to workers

[Diagram: gradients are split into N shards and sent to their respective parameter servers; workers A, B, and C each send shard ∇_{1/2} to Parameter Server 1 and ∇_{2/2} to Parameter Server 2, then receive the model shards w_{1/2} and w_{2/2}, which the workers join back into a full model.]
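A minimal sketch of the sharding idea itself (shapes and names are illustrative, not from the talk): the flattened weights are split into one shard per server, each server updates only its shard, and the workers join the shards back together.

    import torch

    NUM_SERVERS = 2
    w = torch.randn(1_000_000)                         # flattened model weights
    shards = list(torch.chunk(w, NUM_SERVERS))         # w_{1/2} on server 1, w_{2/2} on server 2

    def server_update(w_shard, grad_shard, lr=0.01):
        # Each parameter server applies SGD to its own shard only.
        return w_shard - lr * grad_shard

    grad = torch.randn_like(w)                         # hypothetical averaged gradient
    grad_shards = torch.chunk(grad, NUM_SERVERS)       # ∇_{1/2} and ∇_{2/2}
    shards = [server_update(s, g) for s, g in zip(shards, grad_shards)]
    w = torch.cat(shards)                              # workers join the model shards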

Page 8

Ring all-reduce Distributed Training

Examples:
- Horovod¹
- tensorflow-allreduce

¹ Horovod uses NCCL 2.0's implementation of multi-node all-reduce.


Computation Happens:
- On all workers

Gradient transfers:
- Worker transfers gradient to peers during all-reduce

Model transfers:
- Model "update" happens at the end of the multi-node all-reduce operation

[Diagram: workers A, B, and C shown at three points on a timeline, passing gradient chunks around a ring; gradient averaging and the model update w are both handled as part of the multi-node ring all-reduce.]
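A minimal Horovod sketch following the usage described in Sergeev & Del Balso (see Citations); model and data are placeholders, and the job would be launched with something like horovodrun -np 16 python train.py:

    import horovod.torch as hvd
    import torch
    import torch.nn as nn

    hvd.init()
    torch.cuda.set_device(hvd.local_rank())            # one process per GPU
    model = nn.Linear(784, 10).cuda()
    # Scale the learning rate by worker count, per the Horovod paper.
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01 * hvd.size())
    # The wrapper runs the multi-node ring all-reduce during backward/step.
    optimizer = hvd.DistributedOptimizer(
        optimizer, named_parameters=model.named_parameters())
    hvd.broadcast_parameters(model.state_dict(), root_rank=0)
    loss_fn = nn.CrossEntropyLoss()

    for step in range(100):
        x = torch.randn(32, 784, device="cuda")        # hypothetical per-worker batch
        y = torch.randint(0, 10, (32,), device="cuda")
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()                               # gradients averaged via ring all-reduce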

Page 9

Parameter Servers vs Multi-node Ring all-reduce

Multi-node ring all-reduce:
- Good for communication-intensive workloads (low arithmetic intensity).
- High node-to-node communication.

Parameter servers:
- Good for compute-intensive workloads (high arithmetic intensity).
- High node-to-parameter-server communication.
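For intuition, the idealized per-step traffic for N workers and model size |w|, following the bandwidth-optimal analysis of Patarasuk & Yuan (see Citations), can be written as:

    % Ring all-reduce: volume sent and received per worker
    V_{\text{ring}} = 2\,\frac{N-1}{N}\,\lvert w\rvert \approx 2\,\lvert w\rvert
    % Single parameter server: volume through the server (the bottleneck)
    V_{\text{PS}} = 2\,N\,\lvert w\rvert

The server-side term grows linearly with N while the ring term stays near 2|w|, which matches the slide's split: all-reduce for communication-bound (low arithmetic intensity) models, parameter servers for compute-bound ones.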


Page 10

GPU RDMA over InfiniBand


[Diagram: two copies of the same pathway, CPU 1 / PCIe switch / InfiniBand NIC / GPU 4, compared side by side.]

Data pathway with RDMA: directly out the door via the PCIe switch.

Data pathway without RDMA: an additional copy to CPU memory.
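Which pathway gets used is decided by the communication library, not the training script. As a hedged sketch, these NCCL environment variables are the usual knobs (the names are real NCCL settings, but exact behavior depends on the NCCL version, drivers, and topology):

    import os

    # Set before initializing NCCL (torch.distributed, Horovod, etc.).
    os.environ["NCCL_IB_DISABLE"] = "0"        # allow the InfiniBand transport
    os.environ["NCCL_NET_GDR_LEVEL"] = "PIX"   # use GPU RDMA when the GPU and NIC
                                               # share a PCIe switch, as drawn above
    os.environ["NCCL_DEBUG"] = "INFO"          # log which pathway NCCL selects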

Page 11

Hardware Configuration for Multi-node all-reduce


Four InfiniBand NICs are placed alongside the NVLink fabric, each underneath a PCIe switch. They are used for GPU RDMA during the distributed all-reduce operation.

[Diagram: CPU 1 and CPU 2 each connect to two PCIe switches; each of the four PCIe switches hosts two of the eight GPUs (GPUs 0-3 under CPU 1, GPUs 4-7 under CPU 2) and one InfiniBand NIC, with the GPUs also joined by the NVLink fabric.

Legend: arrow = 16x PCIe connection; green double arrow = NVLink; red dashed line = InfiniBand connection.]

Page 12

Four Non-overlapping Pathways


Each GPU has six NVLink connections, which allows four non-overlapping pathways to be drawn through all eight GPUs on the system and out an InfiniBand card, optimizing both GPU-to-GPU and node-to-node communication during the distributed all-reduce. (See Sylvain Jeaugey's "NCCL 2.0" presentation for more information.)

[Diagram: the same topology as the previous slide (two CPUs, four PCIe switches, eight GPUs, four InfiniBand NICs), with the four non-overlapping NVLink pathways drawn through the GPUs and out the NICs.]

Page 13

Inter-node Bandwidth

The GPU RDMA (Remote Direct Memory Access) capabilities of the InfiniBand cards and the V100 GPUs allow for an inter-node memory bandwidth of 42 GB/s: 84% of the 50 GB/s theoretical peak allowed by the four cards, where 50 GB/s = 4 cards × 100 Gb/s ÷ (8 bits/byte).

[Chart: measured inter-node bandwidth, 42 GB/s.]
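Checking the slide's arithmetic (all values from the slide itself):

    cards = 4
    link_Gbps = 100                            # Gb/s per InfiniBand card
    peak_GBps = cards * link_Gbps / 8          # 4 * 100 / 8 = 50 GB/s
    measured_GBps = 42
    print(measured_GBps / peak_GBps)           # 0.84, the 84% on the slide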

Page 14

Citations

- Sergeev, Alexander and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow." https://arxiv.org/pdf/1802.05799.pdf
- Patarasuk, Pitch and Xin Yuan. "Bandwidth optimal all-reduce algorithms for clusters of workstations." J. Parallel Distrib. Comput., 69:117-124, 2009. https://www.cs.fsu.edu/~xyuan/paper/09jpdc.pdf
- Jeaugey, Sylvain. "NCCL 2.0" (2017). http://on-demand.gputechconf.com/gtc/2017/presentation/s7155-jeaugey-nccl.pdf
- WikiChip Fuse. "A Look at NVIDIA's NVLink Interconnect and the NVSwitch." https://fuse.wikichip.org/news/1224/a-look-at-nvidias-nvlink-interconnect-and-the-nvswitch/

Additional thanks to Chuan Li and Steve Clarkson.

Page 15

Lambda Customers


Page 16

About Me

● CEO of Lambda.

● Started using CNNs for face recognition in 2012.

● First employee at Perceptio. We developed image recognition CNNs that ran locally on the iPhone. Acquired by Apple in 2015.

● Published in SPIE and NeurIPS.


Page 17

[email protected]

https://lambdalabs.com
