Sylvain Jeaugey
MULTI-GPU TRAINING WITH NCCL
MULTI-GPU COMPUTING: Harvesting the power of multiple GPUs
From 1 GPU, to multiple GPUs per system, to multiple systems connected.
NCCL: NVIDIA Collective Communication Library
MULTI-GPU DL TRAINING: Single-GPU
[Diagram: a batch (e.g. 256 images) is read from a database of GBs of input data (images, sound, …); a forward/backward pass computes gradients, which update the parameters.]
MULTI-GPU DL TRAINING: Data parallel
[Diagram: each GPU reads its own batch and computes local gradients; NCCL Allreduce sums the gradients across GPUs, so every GPU applies the same update to its parameters.]
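In code, the gradient reduction on each GPU amounts to a single allreduce. A minimal sketch (the buffer, count, communicator, and stream names are illustrative, not from the slides):

// Sum this GPU's local gradients with those of all other GPUs, in place
// (sendbuff == recvbuff is allowed). gradients, gradientCount, comm, and
// stream are assumed to be set up already.
ncclAllReduce(gradients, gradients, gradientCount, ncclFloat, ncclSum, comm, stream);
// The call is asynchronous on the stream; synchronize before the parameter
// update if the update does not run on the same stream.
cudaStreamSynchronize(stream);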
NCCL: A multi-GPU communication library
Between systems: sockets (Ethernet); InfiniBand, with GPU Direct RDMA.
Within a system: PCIe; NVLink; GPU Direct P2P.
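NCCL chooses among these transports automatically when communicators are created. A minimal single-process, multi-GPU setup could look like this sketch (the device count and IDs are illustrative):

#include <nccl.h>

int devs[4] = {0, 1, 2, 3};   // illustrative device IDs
ncclComm_t comms[4];
// Create one communicator per GPU from a single process; NCCL detects the
// topology and picks the fastest path (P2P/NVLink within a system,
// sockets or InfiniBand between systems).
ncclCommInitAll(comms, 4, devs);
// ... issue collectives on each communicator ...
for (int i = 0; i < 4; i++) ncclCommDestroy(comms[i]);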
NCCL: Architecture
[Diagram: deep learning frameworks (TensorFlow (+Horovod), PyTorch, MXNet, Caffe2, Caffe, CNTK) run on top of NCCL, cuDNN, and cuBLAS, which run on CUDA and NVIDIA GPUs.]
TIMELINE: NCCL history & roadmap
1.x: Intra-node communication
2.0: Inter-node communication
2.1: Improved latency
2.2: Aggregated operations
2.3: Large scale algorithms
2.4: Point-to-point primitives (Send/Recv)
NCCL 2.0: Provide best performance to DL apps
[Chart: Allreduce bandwidth (OMB, size=128MB), in GB/s, across system topologies: QPI, CPU, PCI switch, DGX-1 (Pascal), DGX-1 (Volta); reported values are 12, 58, 62, and 132 GB/s.]
NCCL 2.1: ResNet50 buffer size
Latency matters in some workloads, e.g. ResNet50, in particular when a reduction is performed for every layer.
[Histogram: occurrences of reduction buffer sizes, from 64 B to 2 MB, for ResNet50 / MXNet in FP32 and FP16, showing that many reductions use small buffers.]
NCCL 2.1: Latency improvement
[Chart: NCCL latency (in µs) vs number of GPUs (2, 4, 8, 16), comparing 2.0.5 and 2.1.0, on 1 node (NVLink) and 2 nodes (NVLink+InfiniBand); latency reductions of 7x and 5x are highlighted.]
NCCL 2.2: Aggregated operations: principle
[Diagram: for each ncclAllReduce call from the DL framework, NCCL launches a CUDA kernel via cudaLaunchKernel.]
Principle: merge multiple operations on the same CUDA device:
- Pay the launch overhead only once (more operations per second)
- Use multiple NVLinks simultaneously (more bandwidth)
NCCL 2.2: Aggregated operations: overhead
[Chart: per-operation time (µs) vs number of aggregated operations (1 to 256), for an 8-byte reduction on 8 GPUs; per-operation time drops as more operations are aggregated.]
NCCL 2.2: Aggregated operations: usage
Use ncclGroupStart() / ncclGroupEnd() around the NCCL operations you want to aggregate:

ncclGroupStart();
for (int op = 0; op < nops; op++) {
  // One allreduce per layer; all calls in the group are aggregated.
  ncclAllReduce(layers[op].localGradients,
                layers[op].globalGradients,
                layers[op].gradientSize,
                ncclFloat, ncclSum, ncclComm, ncclStream);
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the stream after ncclGroupEnd().
cudaStreamSynchronize(ncclStream);
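The slides omit error handling for brevity; each NCCL call returns an ncclResult_t. A common checking wrapper, sketched here (not part of the slides), uses ncclGetErrorString:

#include <stdio.h>
#include <stdlib.h>
#include <nccl.h>

// Abort with a readable message if an NCCL call fails.
#define NCCLCHECK(cmd) do {                                \
  ncclResult_t res = (cmd);                                \
  if (res != ncclSuccess) {                                \
    fprintf(stderr, "NCCL error %s:%d: %s\n",              \
            __FILE__, __LINE__, ncclGetErrorString(res));  \
    exit(EXIT_FAILURE);                                    \
  }                                                        \
} while (0)

// Usage: NCCLCHECK(ncclAllReduce(...));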
NCCL 2.2: Aggregated operations: usage
Aggregation can be combined/nested with multi-GPU grouping:

ncclGroupStart();
for (int op = 0; op < nops; op++) {
  for (int gpu = 0; gpu < ngpus; gpu++) {
    // Inner group: one operation per GPU managed by this thread.
    ncclGroupStart();
    ncclAllReduce(layers[op].localGradients[gpu],
                  layers[op].globalGradients[gpu],
                  layers[op].gradientSize,
                  ncclFloat, ncclSum, ncclComms[gpu], ncclStreams[gpu]);
    ncclGroupEnd();
  }
}
ncclGroupEnd();
// All operations are only guaranteed to be posted on the streams after the last ncclGroupEnd().
for (int gpu = 0; gpu < ngpus; gpu++)
  cudaStreamSynchronize(ncclStreams[gpu]);
NCCL 2.2: Aggregated operations: other uses
ReduceScatterV = aggregation of multiple reduce operations:

ncclGroupStart();
for (int rank = 0; rank < nranks; rank++) {
  // Reduce a different chunk of the buffer to each rank (rank is the root).
  ncclReduce(sendbuff + offsets[rank], recvbuff + offsets[rank],
             recvcounts[rank], datatype, redOp, rank, comm, stream);
}
ncclGroupEnd();

AllGatherV = aggregation of multiple broadcast operations:

ncclGroupStart();
for (int rank = 0; rank < nranks; rank++) {
  // Each rank broadcasts its own chunk to all other ranks.
  ncclBroadcast(sendbuff + offsets[rank], recvbuff + offsets[rank],
                recvcounts[rank], datatype, rank, comm, stream);
}
ncclGroupEnd();
NCCL 2.3: Large scale algorithms
[Chart: Allreduce latency vs number of GPUs (2 to 128), comparing NCCL 2.2 and NCCL 2.3 (projected); 2.3 is expected to reduce latency at scale.]
NCCL 2.4: Point-to-point primitives
Send/Receive, Scatter[v], Gather[v], Alltoall[v,w], neighbor collectives, …
[Diagram: scatter, gather, alltoall, and neighbor communication patterns.]
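As an illustration of what these primitives enable, an Alltoall can be built from send/receive pairs aggregated in one group, so all transfers progress concurrently without deadlocking. This is a sketch assuming the Send/Recv API in the form it eventually took (ncclSend/ncclRecv); the chunk sizes and buffer names are illustrative:

ncclGroupStart();
for (int peer = 0; peer < nranks; peer++) {
  // Send the chunk destined for 'peer' and receive the chunk 'peer' sends us.
  ncclSend((char*)sendbuff + peer * chunkBytes, chunkCount, ncclFloat, peer, comm, stream);
  ncclRecv((char*)recvbuff + peer * chunkBytes, chunkCount, ncclFloat, peer, comm, stream);
}
ncclGroupEnd();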
NCCL: Summary
- Optimized inter-GPU communication for DL and HPC.
- Optimized for all NVIDIA platforms, most OEMs, and Cloud.
- Scales to 100s of GPUs, targeting 10,000s in the near future.
- Aims at covering all communication needs for multi-GPU computing.
- Only relies on CUDA; no dependency on MPI or any parallel environment.
More questions? Connect with the Experts: NCCL, Wed 28, 3pm.