Scalable Distributed Training with Parameter Hub: a whirlwind tour

Transcript of "Scalable Distributed Training with Parameter Hub: a whirlwind tour" (73 pages)

Page 1: Scalable Distributed Training with Parameter Hub: a whirlwind tour

Page 2: Scalable Distributed Training with Parameter Hub: a ...

TVM Stack

[Diagram: AutoTVM and AutoVTA drive optimization over the High-Level Differentiable IR and the Tensor Expression IR, which lower via LLVM, CUDA, Metal, and VTA to edge FPGAs, cloud FPGAs, and an ASIC hardware fleet.]

Page 3: Scalable Distributed Training with Parameter Hub: a ...

[Diagram: the same TVM stack, now extended with Active Topology Probing for "Your Cloud".]

Groundwork for bringing TVM to the distributed world for training and inference, on a commercial cloud or in your own cluster.


Page 5: Scalable Distributed Training with Parameter Hub: a ...

Parameter Hub

Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee, and Arvind Krishnamurthy

An optimized, topology-aware, and dynamic mechanism for inter-machine communication*

* In the cloud-based training context

Page 6: Scalable Distributed Training with Parameter Hub: a ...

Deep learning constitutes an important workload in the cloud today.

All major cloud providers have an ecosystem for cloud-based learning.


Page 8: Scalable Distributed Training with Parameter Hub: a ...


Server demand for DL inference across data centers nearly quadrupled in less than 2 years. Source: Facebook


Page 10: Scalable Distributed Training with Parameter Hub: a ...

EC2 reclaims your GPU instances as they run out of capacity.


Page 12: Scalable Distributed Training with Parameter Hub: a ...

Distributed Training: INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE

(F)orward Pass  (B)ackward Pass  (A)ggregation  (O)ptimization

[Timeline diagram: Worker 1 and Worker 2 each run forward and backward passes (F1 B1, F2 B2, F3 ...) independently; between iterations the Parameter Server performs aggregation and optimization (A1 O1, A2 O2) before the workers begin the next forward pass.]
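To make the timeline concrete, here is a minimal single-process sketch of one synchronous step. The gradients are placeholder values standing in for real forward/backward passes, and none of the names reflect PHub's actual implementation.

// Minimal sketch (assumptions only, not PHub code) of one synchronous
// parameter-server iteration: workers push gradients, the server (A)ggregates
// them and runs the (O)ptimizer, and every worker starts the next (F)orward
// pass from the same updated weights.
#include <cstddef>
#include <vector>

struct ParameterServer {
    std::vector<float> weights;
    float lr = 0.1f;

    // (A)ggregation: average the gradients pushed by all workers.
    std::vector<float> aggregate(const std::vector<std::vector<float>>& grads) const {
        std::vector<float> avg(weights.size(), 0.0f);
        for (const auto& g : grads)
            for (std::size_t i = 0; i < avg.size(); ++i) avg[i] += g[i];
        for (auto& v : avg) v /= static_cast<float>(grads.size());
        return avg;
    }

    // (O)ptimization: plain SGD on the aggregated gradient.
    void optimize(const std::vector<float>& grad) {
        for (std::size_t i = 0; i < weights.size(); ++i) weights[i] -= lr * grad[i];
    }
};

int main() {
    ParameterServer ps{std::vector<float>(4, 1.0f)};
    // Each worker's (F)orward/(B)ackward pass would produce a gradient;
    // placeholder values are used here.
    std::vector<std::vector<float>> grads = {{0.1f, 0.2f, 0.3f, 0.4f},
                                             {0.3f, 0.2f, 0.1f, 0.0f}};
    ps.optimize(ps.aggregate(grads));   // one coordinated parameter exchange
    return 0;
}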


Page 14: Scalable Distributed Training with Parameter Hub: a ...

Distributed Training Today: IN THE CONTEXT OF THE CLOUD

[Diagram: a network core connects two racks; each rack's ToR switch attaches a machine with GPUs and a plain machine.]

Page 15: Scalable Distributed Training with Parameter Hub: a ...

Distributed Training Today: FORWARD AND BACKWARD PASSES IN WORKER

[Diagram: Worker 1 and PS 1 behind one ToR switch, Worker 2 and PS 2 behind another, joined by the network core; the forward and backward passes run on the workers.]

Page 16: Scalable Distributed Training with Parameter Hub: a ...

Distributed Training Today: AGGREGATION AND OPTIMIZATION IN PS

[Same diagram: gradient aggregation and optimization run on PS 1 and PS 2.]

Page 17: Scalable Distributed Training with Parameter Hub: a ...

Distributed training is communication bound

[Chart: per-iteration time in seconds (0 to 1.8) for ResNet 269 on GRID 520 (2012), K80 (2014), M60 (2015), and V100 (2017), split into "GPU and network active" and "GPU idle, waiting on network".]

- The problem gets worse over time: the bottleneck shifts from compute to communication.
- With modern GPUs, most of the time is spent on communication.
- Making GPUs faster will do little to increase throughput.
- Compute resources are wasted.

Page 18: Scalable Distributed Training with Parameter Hub: a ...

Distributed training is communication bound

[Chart: the same breakdown for Inception V3, AlexNet, GoogleNet, and ResNet 269.]

Page 19: Scalable Distributed Training with Parameter Hub: a ...

Bottlenecks in DDNN training

MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT.

[Diagram: Worker 1 and PS 1 behind one ToR switch, Worker 2 and PS 2 behind another, joined by the network core.]


Page 21: Scalable Distributed Training with Parameter Hub: a ...

Bottlenecks in DDNN training: FRAMEWORK BOTTLENECKS

[Diagram: within each machine, data flows from the GPU through the training framework to the network; the two-rack worker/PS topology is shown around it.]


Page 23: Scalable Distributed Training with Parameter Hub: a ...

Bottlenecks in DDNN training: FRAMEWORK BOTTLENECKS

[Chart: per-iteration time in seconds (0 to 1.6) for ResNet 269, Inception, GoogleNet, and AlexNet, broken down into Compute, Data Copy and Communication, Aggregator, Optimizer, and Synchronization and other Overheads.]


Page 25: Scalable Distributed Training with Parameter Hub: a ...

Bottlenecks in DDNN training: BANDWIDTH BOTTLENECK

[Diagram: the same two-rack topology; parameter-exchange traffic between workers and parameter servers crosses the ToR switches and the network core.]


Page 30: Scalable Distributed Training with Parameter Hub: a ...

Bottlenecks in Cloud-based DDNN training: INSUFFICIENT BANDWIDTH

What is the minimum bandwidth each popular NN needs so that communication does not bottleneck computation?

[Chart, 0 to 1300 Gbps: cloud bandwidth today is roughly 10-25 Gbps, while GoogleNet / Inception need about 40 Gbps, ResNet needs about 100 Gbps, and AlexNet needs about 1200 Gbps.]

Setup: 8 workers, GTX 1080 Ti, central parameter servers, MXNet.
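Where do numbers of this magnitude come from? A rough back-of-the-envelope sketch is below; the model size and per-iteration compute time are illustrative assumptions, not measurements from the talk.

// Per iteration each worker pushes its gradients and pulls the updated
// parameters, so it moves roughly 2x the model size over its link. To keep
// the GPU busy, that transfer must fit inside one iteration's compute time.
#include <cstdio>

int main() {
    const double model_bytes  = 100e6;   // assumed 100 MB of FP32 parameters
    const double compute_secs = 0.015;   // assumed GPU compute time per iteration
    const double bits_moved   = 2.0 * model_bytes * 8.0;         // push gradients + pull weights
    const double gbps_needed  = bits_moved / compute_secs / 1e9; // ~107 Gbps for these assumptions
    std::printf("Bandwidth needed to avoid stalling the GPU: %.0f Gbps\n", gbps_needed);
    // Compare with the ~10-25 Gbps available per VM in the cloud: communication dominates.
    return 0;
}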


Page 32: Scalable Distributed Training with Parameter Hub: a ...

Bottlenecks in Cloud-based DDNN training: DEPLOYMENT-RELATED OVERHEAD

[Diagram: the same two-rack topology; cross-rack links through the network core carry part of the parameter-exchange traffic.]

Page 33: Scalable Distributed Training with Parameter Hub: a ...

Bottlenecks in Cloud-based DDNN training: DEPLOYMENT-RELATED OVERHEAD

• Transient congestion, or oversubscription by design.
• Cross-rack communication cost is higher than intra-rack communication.

Measured pairwise bandwidth between hosts (Gbps; heatmap scale roughly 4 to 9 Gbps):

Host   1    2    3    4    5    6
 2    4.7
 3    8.9  4.7
 4    8.9  4.7  8.9
 5    8.9  4.7  8.9  8.9
 6    4.7  9.0  4.7  4.7  4.7
 7    8.9  4.7  9.0  8.9  9.0  4.7

Cluster 1: hosts 1, 3, 4, 5, 7.  Cluster 2: hosts 2, 6, 8.

Page 34: Scalable Distributed Training with Parameter Hub: a ...

Parameter Hub Optimizations: CODESIGNING SOFTWARE AND HARDWARE WITH THE CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING

[Diagram: the two-rack worker/PS topology.]

Page 35: Scalable Distributed Training with Parameter Hub: a ...

PHub Optimizations: streamlining the DDNN training pipeline

Eliminating framework bottlenecks.

[Diagram: within each machine, the GPU feeds a pipeline of data copy, aggregation, and optimization stages before the network; the two-rack worker/PS topology is shown around it.]


Page 37: Scalable Distributed Training with Parameter Hub: a ...

Software Optimizations

[Diagram: gradients flow from the workers across the network into the parameter server, where they traverse the CPU and memory for aggregation and optimization.]


Page 39: Scalable Distributed Training with Parameter Hub: a ...

Software Optimizations: GRADIENT AGGREGATION AND OPTIMIZATION

- Wide Aggregation (used in MXNet): for each input queue, launch a series of threads for aggregation; each core reads input queues from different workers and writes to different locations in the output queue. Requires synchronization.
- Tall Aggregation: each core sequentially aggregates the same portion of the gradients within every input queue. Great locality, no synchronization.
- NUMA-aware tree reduction: organize the processors into a hierarchy (NUMA 0 / NUMA 1) and perform a tree reduction. Too much coherence and synchronization traffic.



Page 44: Scalable Distributed Training with Parameter Hub: a ...

Software Optimizations: TALL AGGREGATION AND OPTIMIZATION

- Chunk a gradient into a series of virtual gradients deterministically.
- A virtual gradient is mapped to a particular core on the server.
- Virtual gradients are transferred independently.
- A chunk is only processed by a single core, maintaining maximum locality.

[Diagram: the gradient array for key 0 from 8 workers, its core mappings, and the aggregated result; a minimal code sketch follows.]
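The sketch below illustrates the tall-aggregation idea; the chunk size, the one-thread-per-chunk mapping, and all names are illustrative assumptions rather than PHub's implementation.

// Tall aggregation sketch: the gradient for one key is split into fixed-size
// chunks ("virtual gradients"); each chunk is owned by exactly one core, which
// sums that chunk across all workers. Output ranges are disjoint, so no locks
// or cross-core synchronization are needed.
#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

constexpr std::size_t kChunk = 1024;   // assumed virtual-gradient length

// Sum chunk `c` of every worker's gradient into `out`; run only by the owning core.
void aggregate_chunk(const std::vector<std::vector<float>>& worker_grads,
                     std::vector<float>& out, std::size_t c) {
    const std::size_t begin = c * kChunk;
    const std::size_t end   = std::min(begin + kChunk, out.size());
    for (const auto& g : worker_grads)
        for (std::size_t i = begin; i < end; ++i) out[i] += g[i];
}

int main() {
    const std::size_t len = 8 * kChunk;   // gradient array for one key, 8 chunks
    std::vector<std::vector<float>> grads(8, std::vector<float>(len, 1.0f));   // from 8 workers
    std::vector<float> aggregated(len, 0.0f);

    // Deterministic chunk -> core mapping: here chunk c is handled by thread c.
    std::vector<std::thread> cores;
    for (std::size_t c = 0; c * kChunk < len; ++c)
        cores.emplace_back(aggregate_chunk, std::cref(grads), std::ref(aggregated), c);
    for (auto& t : cores) t.join();       // no locks needed: disjoint output ranges
    return 0;
}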


Page 47: Scalable Distributed Training with Parameter Hub: a ...

Software Optimizations: TALL AGGREGATION AND OPTIMIZATION

When aggregation is done:
- PHub optimizes a chunk with the same core that aggregated that chunk.
- FP32-level streaming aggregation and optimization hides communication latency.

[Diagram: the gradient array for key 0 from 8 workers, shown first aggregated and then optimized, chunk by chunk.]
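Continuing the sketch above (again an assumption-laden illustration, not PHub's code): the core that finished aggregating a chunk immediately runs the optimizer on it, so updates to early chunks overlap with the aggregation and transfer of later ones.

#include <cstddef>
#include <vector>

// Optimizer step for one chunk, run by the core that owns that chunk.
void sgd_on_chunk(std::vector<float>& weights, const std::vector<float>& aggregated,
                  std::size_t begin, std::size_t end, float lr) {
    for (std::size_t i = begin; i < end; ++i) weights[i] -= lr * aggregated[i];
}

int main() {
    std::vector<float> weights(2048, 1.0f), aggregated(2048, 0.5f);
    // In the streaming scheme chunk 0 would be optimized while chunk 1 is still
    // arriving; here the two chunk updates simply run back to back.
    sgd_on_chunk(weights, aggregated, 0, 1024, 0.1f);
    sgd_on_chunk(weights, aggregated, 1024, 2048, 0.1f);
    return 0;
}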

Page 48: Scalable Distributed Training with Parameter Hub: a ...

Eliminating deployment bottlenecks: PHub hierarchical reduction, reducing cross-rack traffic

[Diagram: the two-rack worker/PS topology.]



Page 51: Scalable Distributed Training with Parameter Hub: a ...

Two-Phase Hierarchical Aggregation: RACK-SCALE PARAMETER SERVICE

[Diagram: racks of Worker/PS machines behind their ToR switches, connected by the cluster network; labels: CM, Aggregator, PBox.]


Page 54: Scalable Distributed Training with Parameter Hub: a ...

Two-Phase Hierarchical Aggregation: ADAPTING TO THE DATACENTER NETWORK TOPOLOGY

[Diagram: an aggregator in each rack sits between that rack's Worker/PS machines and the cluster network.]

1. Intra-rack central aggregation
2. Inter-rack aggregation

N times traffic reduction!
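The arithmetic behind the "N times" claim, under the simplifying assumption that without hierarchy every worker's full gradient would cross the network core (all numbers below are illustrative):

#include <cstdio>

int main() {
    const double grad_gb          = 0.25;  // assumed gradient size per iteration, in GB
    const int    workers_per_rack = 8;     // N
    const int    racks            = 4;

    // Flat: every worker's gradient leaves its rack. Hierarchical: each rack's
    // aggregator forwards a single combined gradient.
    const double flat_gb = grad_gb * workers_per_rack * racks;
    const double hier_gb = grad_gb * racks;
    std::printf("cross-rack traffic per iteration: %.1f GB flat vs %.1f GB hierarchical (%dx less)\n",
                flat_gb, hier_gb, workers_per_rack);
    return 0;
}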


Page 62: Scalable Distributed Training with Parameter Hub: a ...

Efficient DDNN Training in the Commercial Cloud: ACTIVE TOPOLOGY PROBING

VMs on Azure/EC2 → DPDK-based latency probe → distance matrix → clustering algorithms → inferred network topology* → automagic schedule generation → hierarchical reduction plan
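A minimal sketch of the "distance matrix → clustering" step follows; the latency values, the threshold, and the single-pass grouping are illustrative assumptions, not the probe or clustering algorithm PHub actually uses.

// Hosts whose probed round-trip latency is below a threshold are assumed to
// share a rack; the resulting clusters feed schedule generation (aggregate
// inside each cluster first, then across clusters).
#include <cstddef>
#include <cstdio>
#include <vector>

int main() {
    // Assumed pairwise latencies (microseconds) from a DPDK-style probe.
    const std::vector<std::vector<double>> lat = {
        {0, 90, 25, 95},
        {90, 0, 85, 30},
        {25, 85, 0, 92},
        {95, 30, 92, 0}};
    const double same_rack_us = 50.0;          // assumed intra-rack threshold

    std::vector<int> cluster(lat.size(), -1);  // cluster id per host
    int next_id = 0;
    for (std::size_t i = 0; i < lat.size(); ++i) {
        if (cluster[i] != -1) continue;
        cluster[i] = next_id++;
        for (std::size_t j = i + 1; j < lat.size(); ++j)
            if (cluster[j] == -1 && lat[i][j] < same_rack_us) cluster[j] = cluster[i];
    }
    for (std::size_t i = 0; i < cluster.size(); ++i)
        std::printf("host %zu -> cluster %d\n", i, cluster[i]);
    return 0;
}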

Page 63: Scalable Distributed Training with Parameter Hub: a ...

Performance in the commercial cloud with PHub

Speedup of PHub over existing systems:

                        vs Facebook Gloo   vs Ring Reduction
Azure (Standard NC6)         1.4x               1.8x
EC2 (P3.2xlarge)             2.4x               9.6x

Setup: Windows Azure and Amazon EC2, 32 instances, up to 10 Gbps. Standard_NC6: NVIDIA K80, batch size 512. P3.2xlarge: NVIDIA V100, batch size 512. Baseline: Facebook Caffe2/PyTorch, ResNet 50.

Page 64: Scalable Distributed Training with Parameter Hub: a ...

Framework Integration: support for MXNet / PyTorch / Caffe2.

auto pHub = std::make_shared<PHub>(cfg.redisIp, nMap, keySize, appAddrs,
                                   cntr, sizeof(float), cfg.rank, plp);
pHub->ToggleUseSchedule(pSchedule);
pHub->Reduce();
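Reading the snippet: cfg, nMap, keySize, appAddrs, cntr, and plp are configuration values supplied by the host framework. The pattern appears to be to construct a PHub instance, optionally attach a communication schedule via ToggleUseSchedule (for example, a hierarchical reduction plan produced by the topology-probing pipeline above), and then call Reduce() to perform the parameter exchange each iteration.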

Page 65: Scalable Distributed Training with Parameter Hub: a ...

[Closing slide: the TVM stack diagram again, now including Active Topology Probing for "Your Cloud" alongside Optimization, AutoTVM, AutoVTA, the High-Level Differentiable IR, the Tensor Expression IR, and the LLVM / CUDA / Metal / VTA backends targeting edge FPGAs, cloud FPGAs, and an ASIC hardware fleet.]

Groundwork for bringing TVM to the distributed world for training and inference, on a commercial cloud or in your own cluster.

Page 68: Scalable Distributed Training with Parameter Hub: a ...

Hardware Parameter Hub


Page 70: Scalable Distributed Training with Parameter Hub: a ...

Hardware Parameter Hub: balanced computation and communication resources.

- 10 ConnectX-3 cards
- 560+ Gbps network bandwidth
- 800 Gbps PCIe bandwidth
- Fully supported by the software

[Photo: Parameter Hub.]

Page 71: Scalable Distributed Training with Parameter Hub: a ...

Hardware Parameter Hub: 35 GB/s aggregation throughput; supports 100+ ResNet-50 training nodes with a single machine.

[Chart: aggregation throughput of Gloo HD, Gloo Ring, PS-Lite, and PHub SW (axis marks at roughly 2, 4.5, and 7).]


Page 73: Scalable Distributed Training with Parameter Hub: a ...

Hardware Parameter Hub: ResNet-50. See the paper for detailed estimates.

Better training throughput per dollar.

[Chart: training throughput per dollar.]