Towards Optimizing the Network for Distributed Machine Learning · 2019-12-18
Optimizing Network Performance in Distributed Machine Learning
Luo Mai Chuntao Hong Paolo Costa
Machine Learning
• Successful in many fields
• Online advertisement
• Spam filtering
• Fraud detection
• Image recognition
• …
• One of the most important workloads in data centers
Industry Scale Machine Learning
• More data, higher accuracy
• Scale of industry problems
• 100 billion samples, 1 TB – 1 PB of data
• 10 billion parameters, 1 GB – 1 TB of data
• Distributed execution
• 100s – 1000s of machines
Distributed Machine Learning
Data partitions
Model replicas
Workers
W1, W2, W3, W4
Distributed Machine Learning
Data partitions
Model replicas
Workers
Each worker computes a gradient for every parameter, e.g.:
W1 + 0.1, W2 + 0.2, W3 − 0.3, W4 + 1.2 (one worker)
W1 − 0.9, W2 + 0.5, W3 − 0.1, W4 − 0.5 (another worker)
Distributed Machine Learning
Data partitions
Model replicas
Workers
1. Push gradients
Parameter server
2. Aggregate the gradients for each parameter
Distributed Machine Learning
Data partitions
Model replicas
Workers
Parameter server
3. Add gradients to parameters
4. Pull new parameters
W1 + g1, W2 + g2, W3 + g3, W4 + g4
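The push/aggregate/update/pull cycle above can be sketched in a few lines. This is a minimal illustration of the idea, not the actual Parameter Server or MLNet API; the class and method names (`push`, `apply`, `pull`) and the learning rate are assumptions.

```python
# Minimal sketch of one synchronous parameter-server step.
# Names and structure are illustrative, not a real PS API.

class ParameterServer:
    def __init__(self, params):
        self.params = dict(params)               # e.g. {"w1": 0.0, ...}
        self.pending = {k: 0.0 for k in params}  # gradient accumulator

    def push(self, gradients):
        # Steps 1-2: workers push gradients; the server sums them.
        for k, g in gradients.items():
            self.pending[k] += g

    def apply(self, lr=1.0):
        # Step 3: add the aggregated gradients to the parameters.
        for k in self.params:
            self.params[k] += lr * self.pending[k]
            self.pending[k] = 0.0

    def pull(self):
        # Step 4: workers pull the new parameters.
        return dict(self.params)

ps = ParameterServer({"w1": 0.0, "w2": 0.0})
ps.push({"w1": 0.1, "w2": 0.2})    # one worker's gradients
ps.push({"w1": -0.9, "w2": 0.5})   # another worker's gradients
ps.apply()
print(ps.pull())  # roughly {'w1': -0.8, 'w2': 0.7}
```

Every worker ends the round holding the same parameters, which is what makes the synchronization traffic so bursty and network-bound.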
Distributed Machine Learning
Data partitions
Model replicas
Workers
Parameter servers
W1, W2 | W3, W4
Use multiple PSs to avoid a single-server bottleneck
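Spreading the parameters over several servers is typically done by partitioning the key space. A minimal sketch follows; the modulo-hash placement is an assumption for illustration (real systems also use range partitioning or consistent hashing):

```python
# Sketch: shard parameters across multiple parameter servers by key.
# The crc32-modulo placement is illustrative only.
import zlib

NUM_SERVERS = 2

def server_for(key: str) -> int:
    # Stable hash so every worker maps a key to the same server.
    return zlib.crc32(key.encode()) % NUM_SERVERS

shards = {s: [] for s in range(NUM_SERVERS)}
for key in ["W1", "W2", "W3", "W4"]:
    shards[server_for(key)].append(key)

print(shards)  # each server owns a disjoint subset of the parameters
```

Each server then aggregates and serves only its own shard, splitting both the update load and the traffic.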
Distributed Machine Learning
Data partitions
Model replicas
Workers
Parameter servers
Bottleneck
Inbound Congestion
Network Core
Inbound congestion
Outbound Congestion
Outbound congestion
Network Core
Network Core Congestion
Over-subscribed Network Core
Congestion arises in the core when the network is over-subscribed
Existing Approaches
• Over-provisioning the network
Expensive
Limited deployment scale
Not available in public clouds
Training algorithm
Fast network H/W, e.g., InfiniBand and RoCE
Existing Approaches
• Over-provisioning the network
Expensive
Limited deployment scale
Not available in public clouds
• Asynchronous training algorithms
Reduced training efficiency
Might not converge
Asynchronous training algorithm
Network H/W
Rethinking the Network Design
Training algorithm
Network H/W
MLNet
MLNet is a communication layer designed for distributed machine learning systems
Improves communication efficiency
Orthogonal to existing approaches
Rethinking the Network Design
Training algorithm
Network H/W
MLNet
MLNet is a communication layer designed for distributed machine learning systems
Improves communication efficiency
Orthogonal to existing approaches
Optimizations:
• Traffic reduction
• Flow prioritization
Traffic Reduction
Aggregate the gradients from 6 workers
g1 = g11 + g12 + g13 + g14 + g15 + g16
Traffic Reduction: Key Insight
Workers
Parameterserver
Aggregation is commutative and associative
Partial sums: (g11 + g12 + g13) and (g14 + g15 + g16)
Aggregate the gradients from 6 workers
Traffic Reduction: Key Insight
Aggregating gradients incrementally does not change the final result
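Because summation is commutative and associative, aggregating per-group partial sums first yields the same total as summing all gradients at once. A quick check (gradient values are illustrative):

```python
# Gradient aggregation is a sum, so it can be done incrementally:
# aggregating partial sums first yields the same result.
g = [0.11, 0.12, 0.13, 0.14, 0.15, 0.16]  # g11..g16, illustrative values

direct = sum(g)                    # PS aggregates all six workers
partial = sum(g[:3]) + sum(g[3:])  # two local aggregators, then PS

assert abs(direct - partial) < 1e-12
print(direct)
```

With floating point, different aggregation orders can differ in the last few bits, which is normally negligible for stochastic gradient descent.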
Traffic Reduction: Design
Intercept the push message from the worker to the PS
Traffic Reduction: Design
Redirect the messages to a local worker for partial aggregation
Traffic Reduction: Design
Send the partial results to the PS for final aggregation
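The three design steps can be sketched as a small interposition layer. Everything here (`LocalAggregator`, the worker-count trigger) is a hypothetical illustration of the idea, not MLNet's actual interface:

```python
# Sketch of MLNet-style traffic reduction: intercept worker pushes,
# partially aggregate within the rack, and forward one combined
# message to the parameter server. Names are hypothetical.

class LocalAggregator:
    def __init__(self, ps_push, group_size):
        self.ps_push = ps_push        # callback that sends to the real PS
        self.group_size = group_size  # workers behind this aggregator
        self.buffer = {}
        self.seen = 0

    def intercept_push(self, gradients):
        # Steps 1-2: redirect the push and aggregate locally.
        self.seen += 1
        for k, v in gradients.items():
            self.buffer[k] = self.buffer.get(k, 0.0) + v
        if self.seen == self.group_size:
            # Step 3: one partial result crosses the network to the PS.
            self.ps_push(self.buffer)
            self.buffer, self.seen = {}, 0

received = []
agg = LocalAggregator(ps_push=received.append, group_size=3)
for g in ({"w1": 1.0}, {"w1": 2.0}, {"w1": 3.0}):
    agg.intercept_push(g)

print(received)  # [{'w1': 6.0}] -- one message instead of three
```

The aggregator sends one message per parameter per round regardless of how many workers sit behind it, which is exactly the traffic reduction.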
More details in the paper:
1. Traffic reduction in pull requests
2. Asynchronous communication
Traffic Prioritization
Traffic Prioritization: Key Insight
Job 1 Job 2 Job 3 Job 4
These four TCP flows share a bottleneck link, and each of them gets 25% of the link's bandwidth
Traffic Prioritization: Key Insight
[Chart: flow completion time (FCT) for Jobs 1–4 (Models 1–4) under fair sharing; all four flows finish at time 4]
Average completion time is 4
All flows are delayed! TCP per-flow fairness is harmful in distributed machine learning.
Traffic Prioritization: Key Insight
MLNet prioritizes the competing flows to minimize the average training time
Job 1 Job 2 Job 3 Job 4
Traffic Prioritization: Key Insight
[Chart: flow completion time (FCT) for Jobs 1–4 (Models 1–4) under prioritization; the flows finish one after another]
Average completion time is 2
Shortening the average FCT can largely improve the average training time
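The effect can be checked with a back-of-the-envelope computation; the flow sizes below are an illustrative assumption (with four equal unit-size flows, serializing cuts the average FCT from 4 to 2.5 — the exact averages depend on the flow sizes in the slide's chart):

```python
# Average flow completion time (FCT) on one bottleneck link of
# capacity 1: fair sharing finishes everything late, while running
# flows one at a time (prioritization) lowers the average.

def fct_fair_share(sizes):
    # Equal-size flows under fair sharing all finish together,
    # once the whole workload has drained through the link.
    total = sum(sizes)
    return [total] * len(sizes)  # exact for equal sizes

def fct_serialized(sizes):
    done, t = [], 0.0
    for s in sizes:  # one flow gets the full link at a time
        t += s
        done.append(t)
    return done

sizes = [1.0, 1.0, 1.0, 1.0]       # four unit-size flows (assumed)
fair = fct_fair_share(sizes)       # all finish at 4
serial = fct_serialized(sizes)     # finish at 1, 2, 3, 4
print(sum(fair) / 4, sum(serial) / 4)
```

No flow finishes later than under fair sharing, but all except the last finish earlier, so the average synchronization stall per job drops.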
Evaluation
• Simulate a common data center network topology
• Classic 10 Gbps 1024-node data center topology [Fat-Tree, SIGCOMM'08]
• Train large-scale logistic regression
• 65B parameters, 141 TB dataset [Parameter Server, OSDI'14]
• 800 workers [Parameter Server, OSDI'14]
• With a production trace
• Data processing rate: uniform(100, 200) MBps
• Synchronize every 30 seconds
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline. Lower is better; more servers is more expensive.]
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline]
Rack reduces completion time by 48%
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline]
Deploying more parameter servers resolves edge network bottlenecks
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline]
Deploying more parameter servers to reduce training time (1) uses more machines and (2) is only possible with non-oversubscribed networks
Traffic Reduction (1:4 Oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Baseline]
MLNet reduces congestion in the network core.
Reduces training time by >70%
Traffic Prioritization
• 20 jobs running in the same cluster
[Chart: CDF of training time (hours, 0–14) over the 20 jobs; series: Baseline, Prioritization]
All jobs finish at (almost) the same time
Traffic Prioritization
[Chart: CDF of training time (hours, 0–14); series: Baseline, Prioritization. Left is better.]
Improves the median by 25%
Delays the tail by 2%
Traffic Prioritization + Traffic Reduction
[Chart: CDF of training time (hours, 0–14); series: Baseline, Prioritization + Reduction, Reduction. Left is better.]
Improves the median by 60%
Improves the tail by 54%
More details in the paper:
1. Binary tree aggregation
2. More analysis
Summary
• MLNet can significantly improve the network performance of distributed machine learning
• Traffic reduction
• Flow prioritization
• Drop-in solution
Thanks!
Discussion
• Relaxed fault tolerance?
• When a worker fails, drop that portion of the data
• Adaptive communication
• Reduce synchronization when the network is busy?
• Hybrid network infrastructure?
• Some with 10GbE, some with 40GbE RoCE, etc.
• Degree of the tree?
Traffic Reduction: Design
Is the local aggregator a new bottleneck?
Example: 15 workers in a rack
Traffic Reduction: Design
Build a balanced aggregation structure such as a binary tree.
Example: 15 workers in a rack, aggregated via a binary tree
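One way to keep any single node from becoming the new bottleneck is to combine partial sums pairwise in rounds, i.e., a binary aggregation tree. A minimal sketch of the idea (illustrative, not MLNet's implementation):

```python
# Sketch of binary-tree aggregation: in each round, values pair up
# and each pair is combined, so no node ever aggregates more than
# two inputs at a time.

def tree_aggregate(values):
    level = list(values)
    while len(level) > 1:
        nxt = []
        for i in range(0, len(level) - 1, 2):
            nxt.append(level[i] + level[i + 1])  # pairwise combine
        if len(level) % 2:                       # odd one advances as-is
            nxt.append(level[-1])
        level = nxt
    return level[0]

grads = list(range(1, 16))     # 15 workers, illustrative gradients
print(tree_aggregate(grads))   # 120, same as sum(grads)
```

Each node handles at most two inputs per round and the tree has depth about log2 of the worker count, which is why, as the later slides show, Binary lowers per-node load but consumes more bandwidth than Rack (more hops per gradient).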
Traffic Reduction
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline. Lower is better; more servers is more expensive.]
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline]
Binary Tree and Rack reduce completion time by 78% and 48%, respectively
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline]
Deploying more parameter servers resolves edge network bottlenecks
Traffic Reduction (Non-oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline]
Deploying more parameter servers to reduce training time (1) needs more machines and (2) is only possible with non-oversubscribed networks
Traffic Reduction (1:4 Oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline]
MLNet reduces congestion in the network core
Traffic Reduction (1:4 Oversubscribed Net.)
[Chart: training time (hours, 0–14) vs. number of parameter servers (50–400); series: Rack, Binary, Baseline]
Binary consistently consumes more bandwidth than Rack
Example: Training a Neural Network
Randomly initialized weights
Truth: {cat, dog, cat,…}
Calculate error/gradient
W: {w1, w2, w3, w4}
G: {g1, g2, g3, g4}
W’: {w1’, w2’, w3’, w4’}
Update weights
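The loop sketched above (random init, compute error/gradient, update weights) is plain gradient descent. A minimal sketch, where the learning rate and gradient values are illustrative assumptions:

```python
# One training step: start from random weights, compute a gradient G,
# and update W' = W - lr * G (gradient descent). Values illustrative.
import random

random.seed(0)
W = [random.uniform(-1, 1) for _ in range(4)]  # random init: w1..w4
G = [0.1, 0.2, -0.3, 1.2]                      # gradients g1..g4
lr = 0.5                                       # learning rate (assumed)

W_new = [w - lr * g for w, g in zip(W, G)]     # update weights W -> W'
print(W_new)
```

Repeating this step until the error stops shrinking is the convergence loop shown on the next slides; in the distributed setting, the gradient G is the aggregate pushed through the parameter servers.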
Example: Neural Network
Dog : 99%
Cat : 1%
Model
W1
W4
W2
W3
Train Apply
Model Training
Model
W1
W4
W2
W3
Refine model
W1
W4
W2
W3
Random Init Final Model
W1
W4
W2
W3
Converge