Sylvain Jeaugey
MULTI-GPU TRAINING WITH NCCL
MULTI-GPU COMPUTING: Harvesting the power of multiple GPUs
From 1 GPU, to multiple GPUs per system, to multiple systems connected.
NCCL: NVIDIA Collective Communication Library
MULTI-GPU DL TRAINING: Single-GPU
[Diagram: a batch (e.g. 256 images) is read from a database of GBs of input data (images, sound, …); a forward/backward pass computes gradients, which update the parameters.]
MULTI-GPU DL TRAINING: Data parallel
[Diagram: each GPU reads its own batch and computes local gradients; NCCL Allreduce sums the gradients across GPUs, so every GPU applies the same update to its parameters.]
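In code, the gradient reduction on each GPU amounts to a single allreduce. A minimal sketch (the buffer, count, communicator, and stream names are illustrative, not from the slides):

// Sum this GPU's local gradients with those of all other GPUs, in place
// (sendbuff == recvbuff is allowed). gradients, gradientCount, comm, and
// stream are assumed to be set up already.
ncclAllReduce(gradients, gradients, gradientCount, ncclFloat, ncclSum, comm, stream);
// The call is asynchronous on the stream; synchronize before the parameter
// update if the update does not run on the same stream.
cudaStreamSynchronize(stream);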
NCCL: A multi-GPU communication library
Between systems: sockets (Ethernet); InfiniBand, with GPU Direct RDMA.
Within a system: PCIe; NVLink; GPU Direct P2P.
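NCCL chooses among these transports automatically when communicators are created. A minimal single-process, multi-GPU setup could look like this sketch (the device count and IDs are illustrative):

#include <nccl.h>

int devs[4] = {0, 1, 2, 3};   // illustrative device IDs
ncclComm_t comms[4];
// Create one communicator per GPU from a single process; NCCL detects the
// topology and picks the fastest path (P2P/NVLink within a system,
// sockets or InfiniBand between systems).
ncclCommInitAll(comms, 4, devs);
// ... issue collectives on each communicator ...
for (int i = 0; i < 4; i++) ncclCommDestroy(comms[i]);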
NCCL: Architecture
[Diagram: deep learning frameworks (TensorFlow (+Horovod), PyTorch, MXNet, Caffe2, Caffe, CNTK) run on top of NCCL, cuDNN, and cuBLAS, which run on CUDA and NVIDIA GPUs.]
TIMELINE: NCCL history & roadmap
1.x: Intra-node communication
2.0: Inter-node communication
2.1: Improved latency
2.2: Aggregated operations
2.3: Large scale algorithms
2.4: Point-to-point primitives (Send/Recv)
NCCL 2.0: Provide best performance to DL apps
[Chart: Allreduce bandwidth (OMB, size=128MB), in GB/s, across system topologies: QPI, CPU, PCI switch, DGX-1 (Pascal), DGX-1 (Volta); reported values are 12, 58, 62, and 132 GB/s.]
NCCL 2.1: ResNet50 buffer size
Latency matters in some workloads, e.g. ResNet50, in particular when a reduction is performed for every layer.
[Histogram: occurrences of reduction buffer sizes, from 64 B to 2 MB, for ResNet50 / MXNet in FP32 and FP16, showing that many reductions use small buffers.]
NCCL 2.1: Latency improvement
[Chart: NCCL latency (in µs) vs number of GPUs (2, 4, 8, 16), comparing 2.0.5 and 2.1.0, on 1 node (NVLink) and 2 nodes (NVLink+InfiniBand); latency reductions of 7x and 5x are highlighted.]
NCCL 2.2: Aggregated operations: principle
[Diagram: for each ncclAllReduce call from the DL framework, NCCL launches a CUDA kernel via cudaLaunchKernel.]
Principle: merge multiple operations on the same CUDA device:
- Pay the launch overhead only once (more operations per second)
- Use multiple NVLinks simultaneously (more bandwidth)
NCCL 2.2: Aggregated operations: overhead
[Chart: per-operation time (µs) vs number of aggregated operations (1 to 256), for an 8-byte reduction on 8 GPUs; per-operation time drops as more operations are aggregated.]
NCCL 2.2: Aggregated operations: usage
Use ncclGroupStart() / ncclGroupEnd() around the NCCL operations you want to aggregate:

ncclGroupStart();
for (int op = 0; op < nops; op++) {
  // One allreduce per layer; all calls in the group are aggregated.
  ncclAllReduce(layers[op].localGradients,
                layers[op].globalGradients,
                layers[op].gradientSize,
                ncclFloat, ncclSum, ncclComm, ncclStream);
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the stream after ncclGroupEnd().
cudaStreamSynchronize(ncclStream);
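The slides omit error handling for brevity; each NCCL call returns an ncclResult_t. A common checking wrapper, sketched here (not part of the slides), uses ncclGetErrorString:

#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

// Abort with a readable message if an NCCL call fails.
#define NCCLCHECK(cmd) do {                                \
  ncclResult_t res = (cmd);                                \
  if (res != ncclSuccess) {                                \
    fprintf(stderr, "NCCL error %s:%d: %s\n",              \
            __FILE__, __LINE__, ncclGetErrorString(res));  \
    exit(EXIT_FAILURE);                                    \
  }                                                        \
} while (0)

// Usage: NCCLCHECK(ncclAllReduce(...));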
NCCL 2.2: Aggregated operations: usage
Aggregation can be combined/nested with multi-GPU grouping:

ncclGroupStart();
for (int op = 0; op < nops; op++) {
  for (int gpu = 0; gpu < ngpus; gpu++) {
    // Inner group: one operation per GPU managed by this thread.
    ncclGroupStart();
    ncclAllReduce(layers[op].localGradients[gpu],
                  layers[op].globalGradients[gpu],
                  layers[op].gradientSize,
                  ncclFloat, ncclSum, ncclComms[gpu], ncclStreams[gpu]);
    ncclGroupEnd();
  }
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the streams after the last ncclGroupEnd().
for (int gpu = 0; gpu < ngpus; gpu++)
  cudaStreamSynchronize(ncclStreams[gpu]);
NCCL 2.2: Aggregated operations: other uses
ReduceScatterV = aggregation of multiple reduce operations:

ncclGroupStart();
for (int rank = 0; rank < nranks; rank++) {
  // Reduce a different chunk of the buffer to each rank (rank is the root).
  ncclReduce(sendbuff + offsets[rank], recvbuff + offsets[rank],
             recvcounts[rank], datatype, redOp, rank, comm, stream);
}
ncclGroupEnd();

AllGatherV = aggregation of multiple broadcast operations:

ncclGroupStart();
for (int rank = 0; rank < nranks; rank++) {
  // Each rank broadcasts its own chunk to all other ranks.
  ncclBroadcast(sendbuff + offsets[rank], recvbuff + offsets[rank],
                recvcounts[rank], datatype, rank, comm, stream);
}
ncclGroupEnd();
NCCL 2.3: Large scale algorithms
[Chart: Allreduce latency vs number of GPUs (2 to 128), comparing NCCL 2.2 and NCCL 2.3 (projected); 2.3 is expected to reduce latency at scale.]
NCCL 2.4: Point-to-point primitives
Send/Receive, Scatter[v], Gather[v], Alltoall[v,w], neighbor collectives, …
[Diagram: scatter, gather, alltoall, and neighbor communication patterns.]
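As an illustration of what these primitives enable, an Alltoall can be built from send/receive pairs aggregated in one group, so all transfers progress concurrently without deadlocking. This is a sketch assuming the Send/Recv API in the form it eventually took (ncclSend/ncclRecv); the chunk sizes and buffer names are illustrative:

ncclGroupStart();
for (int peer = 0; peer < nranks; peer++) {
  // Send the chunk destined for 'peer' and receive the chunk 'peer' sends us.
  ncclSend((char*)sendbuff + peer * chunkBytes, chunkCount, ncclFloat, peer, comm, stream);
  ncclRecv((char*)recvbuff + peer * chunkBytes, chunkCount, ncclFloat, peer, comm, stream);
}
ncclGroupEnd();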
NCCL: Summary
- Optimized inter-GPU communication for DL and HPC.
- Optimized for all NVIDIA platforms, most OEMs, and Cloud.
- Scales to 100s of GPUs, targeting 10,000s in the near future.
- Aims at covering all communication needs for multi-GPU computing.
- Only relies on CUDA; no dependency on MPI or any parallel environment.
More questions? Connect with the Experts: NCCL, Wed 28, 3pm.