Towards Optimizing the Network for Distributed Machine Learning · 2019-12-18

Optimizing Network Performance in Distributed Machine Learning · Luo Mai, Chuntao Hong, Paolo Costa

Transcript of Towards Optimizing the Network for Distributed Machine Learning · 2019-12-18

Page 1:

Optimizing Network Performance in Distributed Machine Learning

Luo Mai Chuntao Hong Paolo Costa

Page 2:

Machine Learning

• Successful in many fields:

• Online advertisement

• Spam filtering

• Fraud detection

• Image recognition

• …

• One of the most important workloads in data centers


Page 3:

Industry Scale Machine Learning

• More data, higher accuracy

• Scale of industry problems:

• 100 billion samples, 1 TB – 1 PB of data

• 10 billion parameters, 1 GB – 1 TB of data

• Distributed execution: 100s – 1000s of machines


Page 4:

Distributed Machine Learning

Data partitions

Model replicas

Workers

W1 W2 W3 W4

Page 5:

Distributed Machine Learning

Data partitions

Model replicas

Workers

W1 + 0.1  W2 + 0.2  W3 − 0.3  W4 + 1.2   |   W1 − 0.9  W2 + 0.5  W3 − 0.1  W4 − 0.5

gradient

Page 6:

Distributed Machine Learning

Data partitions

Model replicas

Workers

1. Push gradients

Parameter server


2. Aggregate gradient for each parameter

Page 7:

Distributed Machine Learning

Data partitions

Model replicas

Workers

Parameter server

3. Add gradients to parameters

4. Pull new parameters


W1 + g1 W2 + g2 W3 + g3 W4 + g4

Page 8:

Distributed Machine Learning

Data partitions

Model replicas

Workers

Parameter servers


W1 W2 | W3 W4

Use multiple parameter servers to avoid a bottleneck
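To make the four steps concrete, here is a minimal sketch of one synchronous training step with the model sharded across two parameter servers. All names (`ParameterServer`, `Worker`, `compute_gradients`) are illustrative assumptions, not the paper's API, and the update follows the slides' convention of adding the (already scaled) gradient to the weights.

```python
import numpy as np

class ParameterServer:
    """Hypothetical PS holding one shard of the model."""
    def __init__(self, weights):
        self.weights = weights        # the shard of W this PS owns
        self.pending = []             # gradients pushed this step

    def push(self, gradient):         # 1. workers push gradients
        self.pending.append(gradient)

    def update(self):                 # 2-3. aggregate, then add to weights
        self.weights += np.sum(self.pending, axis=0)
        self.pending = []

    def pull(self):                   # 4. workers pull new parameters
        return self.weights

class Worker:
    """Hypothetical worker holding a model replica and a data partition."""
    def __init__(self, data):
        self.data = data

    def compute_gradients(self):
        # Stand-in for forward/backward over this worker's data partition:
        # one gradient shard per parameter server.
        return [np.full(2, self.data), np.full(2, -self.data)]

# Model sharded across two PSs to avoid a single-machine bottleneck:
ps_shards = [ParameterServer(np.zeros(2)),   # owns W1, W2
             ParameterServer(np.zeros(2))]   # owns W3, W4

def training_step(workers):
    for worker in workers:
        for ps, grad in zip(ps_shards, worker.compute_gradients()):
            ps.push(grad)
    for ps in ps_shards:
        ps.update()
    return [ps.pull() for ps in ps_shards]

new_weights = training_step([Worker(0.1), Worker(-0.9)])
```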

Page 9:

Distributed Machine Learning

Data partitions

Model replicas

Workers

Parameter servers

Bottleneck


Page 10:

Inbound Congestion


Network Core

Inbound congestion

Page 11:

Outbound Congestion


Outbound congestion

Network Core

Page 12:

Network Core Congestion


Over-subscribed Network Core

Congestion in the core in the case of over-subscribed networks

Page 13:

Existing Approaches


• Over-provisioning the network

Expensive

Limited deployment scale

Not available in public clouds

[Diagram: the training algorithm running on top of fast network H/W, e.g., InfiniBand and RoCE]

Page 14:

Existing Approaches


• Over-provisioning the network

Expensive

Limited deployment scale

Not available in public clouds

• Asynchronous training algorithms

Lower training efficiency

Might not converge

[Diagram: the asynchronous training algorithm running on top of the network H/W]

Page 15:

Rethinking the Network Design

[Diagram: MLNet sits between the training algorithm and the network H/W]

MLNet is a communication layer designed for distributed machine learning systems

Improves communication efficiency

Orthogonal to existing approaches

Page 16:

Rethinking the Network Design

[Diagram: MLNet sits between the training algorithm and the network H/W]

MLNet is a communication layer designed for distributed machine learning systems

Improves communication efficiency

Orthogonal to existing approaches

Optimizations:

• Traffic reduction

• Flow prioritization

Page 17:

Traffic Reduction


Page 18:


Aggregate the gradients from 6 workers

$g_1 = g_1^1 + g_1^2 + g_1^3 + g_1^4 + g_1^5 + g_1^6$

Traffic Reduction: Key Insight

Workers

Parameter server

Aggregation is commutative and associative

Page 19:


$(g_1^1 + g_1^2 + g_1^3) + (g_1^4 + g_1^5 + g_1^6)$

Aggregate the gradients from 6 workers

Traffic Reduction: Key Insight

Aggregating gradients incrementally does not change the final result
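A tiny numeric check of this insight, using illustrative gradient values. One caveat worth a comment: floating-point addition is only approximately associative, so the comparison uses a tolerance.

```python
import numpy as np

# Six workers' gradients for the same parameter (illustrative values).
grads = [np.array([0.1]), np.array([0.2]), np.array([-0.3]),
         np.array([1.2]), np.array([-0.9]), np.array([0.5])]

flat = sum(grads)                           # g1^1 + g1^2 + ... + g1^6
partial = sum(grads[:3]) + sum(grads[3:])   # (g1^1+g1^2+g1^3) + (g1^4+g1^5+g1^6)

# Incremental aggregation yields the same result (up to float rounding).
assert np.allclose(flat, partial)
```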

Page 20:

Traffic Reduction: Design


Intercept the push message from the worker to the PS

Page 21:

Traffic Reduction: Design


Redirect the messages to a local worker for partial aggregation

Page 22:

Traffic Reduction: Design


Send the partial results to the PS for final aggregation
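A sketch of these three steps, assuming one designated aggregator process per rack. `LocalAggregator` and `RemotePS` are hypothetical names for illustration, not MLNet's actual interface.

```python
import numpy as np

class RemotePS:
    """Stub standing in for the connection to the remote parameter server."""
    def __init__(self):
        self.received = []

    def push(self, gradient):
        self.received.append(gradient)

class LocalAggregator:
    """Runs on one worker per rack and intercepts pushes bound for the PS."""
    def __init__(self, ps, num_local_workers):
        self.ps = ps
        self.expected = num_local_workers
        self.partial = None
        self.seen = 0

    def intercept_push(self, gradient):
        # Steps 1-2: redirect the push here and fold it into a partial sum.
        self.partial = gradient if self.partial is None else self.partial + gradient
        self.seen += 1
        if self.seen == self.expected:
            # Step 3: a single message crosses the rack uplink instead of N.
            self.ps.push(self.partial)
            self.partial, self.seen = None, 0

ps = RemotePS()
agg = LocalAggregator(ps, num_local_workers=15)
for i in range(15):
    agg.intercept_push(np.full(4, float(i)))   # 15 intercepted pushes...
assert len(ps.received) == 1                   # ...one message to the PS
```

With 15 workers in a rack, the rack uplink now carries one gradient-sized message per step instead of 15.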

Page 23:


More details in the paper:

1. Traffic reduction in pull requests

2. Asynchronous communication

Page 24:

Traffic Prioritization


Page 25:

Traffic Prioritization: Key Insight


Job 1 Job 2 Job 3 Job 4

These four TCP flows share a bottleneck link and each of them gets 25% of the bandwidth

Page 26:

Traffic Prioritization: Key Insight


[Timeline: flow completion time (FCT) from 0 to 4; with fair sharing, the four jobs' flows all finish at time 4]

Average completion time is 4

All flows are delayed! TCP per-flow fairness is harmful in distributed machine learning.

Page 27:

Traffic Prioritization: Key Insight


MLNet prioritizes the competing flows to minimize the average training time

Job 1 Job 2 Job 3 Job 4

Page 28:

Traffic Prioritization: Key Insight


[Timeline: flow completion time (FCT) from 0 to 4; with prioritization, the four jobs' flows finish at times 1, 2, 3, and 4]

Average completion time is 2.5

Shortening the average FCT can largely improve the average training time
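The arithmetic behind the two timelines, assuming four equal flows that each need one time unit at the full link rate (a worked check, not simulation code):

```python
flows = 4  # equal flows, each needs 1 time unit at the full link rate

# TCP fair sharing: every flow gets 1/4 of the link, so all finish at t=4.
fair_fct = [flows * 1.0 for _ in range(flows)]   # [4.0, 4.0, 4.0, 4.0]

# Strict prioritization: flows run one after another at the full rate.
serial_fct = [i + 1.0 for i in range(flows)]     # [1.0, 2.0, 3.0, 4.0]

print(sum(fair_fct) / flows)     # 4.0
print(sum(serial_fct) / flows)   # 2.5 -- and no flow finishes later than before
```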

Page 29:

Evaluation

• Simulate a common data center network topology

• Classic 10 Gbps 1024-node data center topology [Fat-Tree, SIGCOMM’08]

• Train large-scale logistic regression

• 65B parameters, 141 TB dataset [Parameter Server, OSDI’14]

• 800 workers [Parameter Server, OSDI’14]

• With a production trace

• Data processing rate: uniform(100, 200) MB/s

• Synchronize every 30 seconds


Page 30:


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline. Annotations: Better/Worse along the y-axis, Cost-effective/Expensive along the x-axis.]

Page 31:

Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline.]

Rack aggregation reduces completion time by 48%

Page 32:

Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline.]

Deploying more parameter servers resolves edge network bottlenecks

Page 33:


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline.]

Deploying more parameter servers to reduce training time (1) uses more machines and (2) is only possible with non-oversubscribed networks

Page 34:

Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline.]

MLNet reduces congestion in the network core, reducing training time by more than 70%

Page 35:

Traffic Prioritization

• 20 jobs running in the same cluster

[Chart: CDF of training time (hours, 0–14) across the 20 jobs; series: Baseline, Prioritization.]

Under the baseline, every job finishes at (almost) the same time

Page 36:

Traffic Prioritization

[Chart: CDF of training time (hours, 0–14); series: Baseline, Prioritization.]

Prioritization improves the median by 25% while delaying the tail by only 2%

Page 37:

Traffic Prioritization + Traffic Reduction

[Chart: CDF of training time (hours, 0–14); series: Baseline, Reduction, Prioritization + Reduction.]

Combining both optimizations improves the median by 60% and the tail by 54%

Page 38:


More details in the paper:

1. Binary tree aggregation

2. More analysis

Page 39:

Summary

• MLNet can significantly improve the network performance of distributed machine learning:

• Traffic reduction

• Flow prioritization

• Drop-in solution


Page 40:

Thanks!


Page 41:

Discussion

• Relaxed fault tolerance?

• When a worker fails, drop that portion of the data

• Adaptive communication

• Reduce synchronization when the network is busy?

• Hybrid network infrastructure?

• Some with 10GE, some with 40GE RoCE, etc.

• Degree of the tree?


Page 42:

Traffic Reduction: Design

Is the local aggregator a new bottleneck?


Example: 15 workers in a rack

Page 43:

Traffic Reduction: Design

Build a balanced aggregation structure such as a binary tree.


Example: 15 workers in a rack, aggregated via a binary tree
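A minimal sketch of such a balanced reduction, assuming in-memory gradients rather than real network transfers; `tree_aggregate` is an illustrative helper, not MLNet's implementation.

```python
import numpy as np

def tree_aggregate(gradients):
    """Reduce gradients pairwise so no node ever sums more than two inputs.

    With 15 workers this forms a 4-level binary tree: each level halves the
    number of in-flight gradients instead of one aggregator receiving all 15.
    """
    level = list(gradients)
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:            # an odd node passes through unchanged
            nxt.append(level[-1])
        level = nxt
    return level[0]

grads = [np.full(3, float(i)) for i in range(15)]    # 15 workers' gradients
assert np.allclose(tree_aggregate(grads), sum(grads))
```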

Page 44:


Traffic Reduction

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline. Annotations: Better/Worse along the y-axis, Cost-effective/Expensive along the x-axis.]

Page 45:


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline.]

Page 46:


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline.]

Binary Tree and Rack reduce completion time by 78% and 48%, respectively

Page 47:

Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline.]

Deploying more parameter servers resolves edge network bottlenecks

Page 48:


Traffic Reduction (Non-oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline.]

Deploying more parameter servers to reduce training time (1) needs more machines and (2) is only possible with non-oversubscribed networks

Page 49:


Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline.]

Page 50:

Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline.]

MLNet reduces congestion in the network core

Page 51:

Traffic Reduction (1:4 Oversubscribed Net.)

[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline.]

Binary consistently consumes more bandwidth than Rack

Page 52:

Example: Training a Neural Network


[Diagram: randomly initialized weights W: {w1, w2, w3, w4}; predictions are compared against the truth {cat, dog, cat, …} to calculate the error/gradient G: {g1, g2, g3, g4}; the weights are then updated to W': {w1', w2', w3', w4'}]
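The update from the diagram as a one-line sketch with illustrative values. Following the earlier slides, the learning rate and sign are assumed to be folded into the gradient G, so the update is simply W + G.

```python
import numpy as np

W = np.array([0.4, -0.2, 0.7, 0.1])   # current weights  W:  {w1, w2, w3, w4}
G = np.array([0.1, 0.2, -0.3, 1.2])   # error/gradient   G:  {g1, g2, g3, g4}

W_new = W + G                         # update weights   W': {w1', w2', w3', w4'}
```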

Page 53:

Example: Neural Network

[Diagram: a model with weights W1–W4 is trained, then applied to classify an image: Dog 99%, Cat 1%]

Page 54:

Model Training

[Diagram: model weights W1–W4 start from a random init, are iteratively refined, and converge to the final model]