Scalable Distributed Training with Parameter Hub: a whirlwind tour
TVM Stack
Optimization
AutoTVM
AutoVTA
High-Level Differentiable IR
Tensor Expression IR
LLVM, CUDA, Metal VTA
Edge FPGA
Cloud FPGA
ASIC Hardware Fleet
Active Topology Probing
Your Cloud
Groundwork for bringing TVM to the distributed world for training and inference, on commercial cloud, or in your own cluster.
Parameter Hub
Liang Luo, Jacob Nelson, Luis Ceze, Amar Phanishayee and Arvind Krishnamurthy
An optimized, topology-aware, and dynamic mechanism for inter-machine communication, in the cloud-based training context.
Deep Learning constitutes an important workload in cloud today.
Major cloud providers all have an ecosystem for cloud learning.
Server demand for DL inference across data centers nearly quadrupled in less than 2 years. Source: Facebook
EC2 reclaims your GPU instances when it runs out of capacity.
Distributed Training
INDEPENDENT FORWARD/BACKWARD PASSES + COORDINATED PARAMETER EXCHANGE
[Timeline figure: Worker 1 and Worker 2 run their forward (F1, F2, …) and backward (B1, B2, …) passes independently; between iterations the parameter server performs aggregation (A1, A2, …) and optimization (O1, O2, …) before the workers proceed. Legend: (F)orward Pass, (B)ackward Pass, (A)ggregation, (O)ptimization.]
Distributed Training Today
IN THE CONTEXT OF THE CLOUD
[Topology figure: racks of machines, some with GPUs, each behind a ToR switch, connected through the network core.]
Distributed Training Today
FORWARD AND BACKWARD PASSES IN WORKER
[Topology figure: Worker 1 and Worker 2, in different racks, run the forward and backward passes; PS 1 and PS 2 hold the parameters.]
Distributed Training Today
AGGREGATION AND OPTIMIZATION IN PS
[Topology figure: gradients flow across racks to PS 1 and PS 2, which perform aggregation and optimization.]
Distributed training is communication bound
[Figure: per-iteration time (0–1.8 s) for ResNet 269 on GRID 520 (2012), K80 (2014), M60 (2015), and V100 (2017), split into "GPU and Network active" and "GPU idle, waiting on network". Similar plots for Inception V3, AlexNet, and GoogleNet.]
- The problem gets worse over time: the bottleneck shifts from compute to communication.
- With modern GPUs, most of the time is spent on communication.
- Making GPUs faster will do little to increase throughput, wasting compute resources.
Bottlenecks in DDNN training
MAPPING OF TRAINING WORKLOAD TO THE CLOUD IS INEFFICIENT.
[Topology figure: workers and parameter servers spread across racks behind ToR switches and the network core.]
Bottlenecks in DDNN training
FRAMEWORK BOTTLENECKS
[Figure: inside a worker, data flows GPU → training framework → network; the framework stages sit on the critical path.]
Bottlenecks in DDNN training
FRAMEWORK BOTTLENECKS
[Figure: per-iteration time breakdown (0–1.6 s) for ResNet 269, Inception, GoogleNet, and AlexNet, split into Compute, Data Copy and Communication, Aggregator, Optimizer, and Synchronization and other Overheads.]
Bottlenecks in DDNN training
BANDWIDTH BOTTLENECK
[Topology figure: gradient traffic from all workers converges on the parameter servers' links.]
Bottlenecks in Cloud-based DDNN training
INSUFFICIENT BANDWIDTH
Minimum bandwidth required for each popular NN so that communication does not bottleneck computation (8 workers, GTX 1080 Ti, central parameter servers, MxNet):
- GoogleNet / Inception: 40 Gbps
- ResNet: 100 Gbps
- AlexNet: 1200 Gbps
For comparison, cloud bandwidth today is 10–25 Gbps.
Bottlenecks in Cloud-based DDNN training
DEPLOYMENT-RELATED OVERHEAD
• Transient congestion, or oversubscription by design.
• Cross-rack communication cost is higher than intra-rack communication.
[Figure: measured pairwise bandwidth between 8 hosts, ranging from about 4 Gbps to 9 Gbps. Pairs in the same rack sustain ~8.9–9.0 Gbps, while cross-rack pairs sustain only ~4.7 Gbps. Clustering the measurements recovers the placement: Cluster 1 = {1, 3, 4, 5, 7}, Cluster 2 = {2, 6, 8}.]
Parameter Hub Optimizations
CODESIGNING SOFTWARE, HARDWARE WITH CLUSTER CONFIGURATION FOR EFFICIENT CLOUD-BASED DDNN TRAINING
PHub Optimizations: streamlining the DDNN training pipeline
Eliminating framework bottlenecks:
[Figure: within the server, the pipeline GPU → data copy → aggregation → optimization → network is streamlined end to end.]
Software Optimizations
[Figure: gradients arrive from the network and traverse the parameter server's CPU and memory.]
Software Optimizations
GRADIENT AGGREGATION AND OPTIMIZATION
- Wide aggregation (used in MxNet): for each input queue, launch a series of threads for aggregation; each core reads the input queue from different workers and writes to different locations in the output queue. Requires synchronization.
- Tall aggregation: each core sequentially aggregates the same portion of the gradients within each queue. Great locality, no synchronization.
- NUMA-aware tree reduction: organize processors into a hierarchy (NUMA 0 / NUMA 1) and perform tree reduction. Too much coherence and synchronization traffic.
Software Optimizations
TALL AGGREGATION AND OPTIMIZATION
- Chunk a gradient into a series of virtual gradients deterministically.
- Each virtual gradient is mapped to a particular core on the server.
- Virtual gradients are transferred independently.
- A chunk is processed by only a single core, maintaining maximum locality.
[Figure: gradient array for key 0 from 8 workers, with the core mappings and the aggregated result.]
Software Optimizations
TALL AGGREGATION AND OPTIMIZATION
When aggregation is done, PHub:
- optimizes a chunk with the same core that aggregated that chunk;
- uses FP32-level streaming aggregation and optimization to hide communication latency.
[Figure: chunks of the gradient array for key 0 move from aggregated to optimized as they arrive.]
Eliminating deployment bottlenecks:
PHub hierarchical reduction, reducing cross-rack traffic.
[Topology figure: workers and parameter servers across racks, with aggregation pulled toward the ToR switches.]
Two-Phase Hierarchical Aggregation
RACK SCALE PARAMETER SERVICE
[Figure: each rack of Worker/PS machines sits behind its ToR; a PBox aggregator per rack connects to the cluster network.]
Two-Phase Hierarchical Aggregation
ADAPTING TO THE DATACENTER NETWORK TOPOLOGY
1. Intra-rack central aggregation at each rack's aggregator.
2. Inter-rack aggregation across the aggregators.
N times traffic reduction on the cross-rack links!
Efficient DDNN Training in Commercial Cloud
ACTIVE TOPOLOGY PROBING
VMs on Azure/EC2 → DPDK-based latency probe → distance matrix → clustering algorithms → inferred network topology* → automatic schedule generation → hierarchical reduction plan
Performance in commercial cloud with PHub
[Figure: training throughput speedup on 32 instances. Versus Facebook Gloo: 1.4x on Azure (Standard NC6), 2.4x on EC2 (P3.2xlarge). Versus ring reduction: 1.8x on Azure, 9.6x on EC2.]
Windows Azure and Amazon EC2. 32 instances. Up to 10 Gbps. Standard_NC6: Nvidia K80, batch size = 512. P3.2xlarge: Nvidia V100, batch size = 512. Facebook Caffe2/PyTorch. ResNet 50.
Framework Integration
Support for MxNet/PyTorch/Caffe2.

    auto pHub = std::make_shared<PHub>(cfg.redisIp, nMap, keySize, appAddrs,
                                       cntr, sizeof(float), cfg.rank, plp);
    pHub->ToggleUseSchedule(pSchedule);
    pHub->Reduce();
Hardware Parameter Hub
Balanced computation and communication resources:
- 10 ConnectX-3 cards
- 560+ Gbps network bandwidth
- 800 Gbps PCIe
- Fully supported by software
35 GB/s aggregation throughput. Supports 100+ ResNet-50 training nodes with a single machine.
[Figure: aggregation throughput comparison of Gloo HD, Gloo Ring, PS-Lite, and PHub SW.]
Better training throughput/$, up to 25%. ResNet-50; see paper for detailed estimates.