Distributed Training Across the World
Ligeng Zhu[1], Yao Lu[2], Yujun Lin[1], Song Han[1]
[1] Massachusetts Institute of Technology  [2] Google
Traditional distributed training is performed inside a cluster because it requires high-end networking infrastructure. In many cases, however, the data are distributed across different geographical locations and cannot be centralized: for example, medical records or keyboard inputs. Over such long physical distances, high latency is unavoidable, and traditional algorithms scale poorly under these conditions. In this work, we aim to scale synchronous SGD across the world without loss of speed and accuracy!
Training inside a cluster vs. across the world

[Figure: inter-region latency/bandwidth map over London, Tokyo, California, and Ohio (e.g. 183 ms / 17 Mbps, 140 ms / 23 Mbps, 35 ms / 63 Mbps, and 13 Mbps links), and a second map over London, Tokyo, Oregon, and Ohio (102 ms, 210 ms, 97 ms, 70 ms).]

                    Latency (µs)   Bandwidth (Mbps)   Scalability (1024 GPUs)
Inside a cluster           3             25,600            819.2
Across the world     158,750                 29            8.192

                    Latency (µs)   Speed (img/s)
Inside a cluster           3           2,274
Across the world     479,000              23

Across the world, latency is 159,666x longer and training speed is 98x worse than inside a cluster.
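The headline ratios can be checked directly from the measured values (a quick sanity check using the µs and img/s numbers as reported):

```python
# Sanity check of the cluster-vs-world gap quoted in the tables above.
inside_latency_us = 3        # intra-cluster network latency (µs)
across_latency_us = 479_000  # cross-continent network latency (µs)
inside_speed = 2_274         # training throughput inside a cluster (img/s)
across_speed = 23            # training throughput across the world (img/s)

latency_ratio = across_latency_us // inside_latency_us  # 159,666x longer
speed_ratio = inside_speed // across_speed              # 98x worse

print(latency_ratio, speed_ratio)
```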
Theoretical Results

Convergence: O(1/√(NJ)) + O((t + d)² / (NJ)). When t + d < O(N^{1/4} J^{3/4}), this is no slower than SGD.

Empirical Results
[Figure: scalability vs. network latency (0–1000 ms) for SSGD, ASGD, ECD-PSGD, FedAvg, and DTS (ours). At high latency, SSGD collapses to a scalability of 0.008 while DTS (ours) maintains 0.78, a 90x gap over SSGD and 2x over the strongest baseline.]
[Figure: execution timeline of four workers (T1–T4) under vanilla synchronous SGD. Legend: computation; synchronization barrier; locally update; send gradients to other workers.]
Vanilla SGD — for iteration n = 0, 1, …, on the j-th worker:

w(n, j) = w(n−1) − η ∇w(n−1)
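To make the update concrete, here is a minimal pure-Python sketch of one synchronous-SGD step, with a toy scalar model and a hypothetical grad() standing in for the per-worker loss gradient (illustrative only, not the authors' implementation):

```python
# Minimal sketch of one synchronous-SGD step with J workers
# (toy scalar model; grad() is a hypothetical stand-in gradient).
def grad(w, data):
    # gradient of the toy loss f(w) = (w - data)^2 / 2
    return w - data

def sync_sgd_step(w, worker_data, lr=0.1):
    # every worker computes a gradient on its own shard...
    grads = [grad(w, d) for d in worker_data]
    # ...then all workers block at the synchronization barrier and average
    avg = sum(grads) / len(grads)
    return w - lr * avg  # every worker applies the identical update

w = 0.0
for _ in range(100):
    w = sync_sgd_step(w, worker_data=[1.0, 2.0, 3.0, 4.0])
# w approaches 2.5, the minimizer of the averaged toy loss
```

The barrier is the point that long-haul latency makes expensive: no worker can proceed until the slowest link delivers its gradients.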
[Figure: execution timeline of four workers (T1–T4) under the delayed update. Legend: computation; synchronization barrier; locally update; send gradients to other workers.]
Delayed Update — for iteration n = 0, 1, …, on the j-th worker, when n > t:

w(n, j) = w(n−1, j) − η (∇w(n−1, j) + ∇w(n−t) − ∇w(n−t, j))
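In the same toy setting, a sketch of this rule (scalar model, hypothetical grad(); t and the learning rate are illustrative): each worker applies its fresh local gradient immediately and, t iterations later, swaps the stale local term for the globally averaged one.

```python
# Pure-Python sketch of the delayed update (toy scalar model; grad() is a
# hypothetical stand-in gradient, not the authors' implementation).
def grad(w, data):
    # gradient of the toy loss f(w) = (w - data)^2 / 2
    return w - data

def delayed_sgd(worker_data, t=2, lr=0.1, steps=200):
    J = len(worker_data)
    w = [0.0] * J                        # per-worker weights w(n, j)
    local_hist = [[] for _ in range(J)]  # each worker's past local gradients
    avg_hist = []                        # past globally averaged gradients
    for n in range(steps):
        grads = [grad(w[j], worker_data[j]) for j in range(J)]
        for j in range(J):
            local_hist[j].append(grads[j])
        avg_hist.append(sum(grads) / J)  # all-reduce result, received t steps late
        for j in range(J):
            update = grads[j]            # fresh local gradient, applied now
            if n >= t:
                # the delayed global average replaces the stale local term
                update += avg_hist[n - t] - local_hist[j][n - t]
            w[j] -= lr * update
    return w

workers = delayed_sgd([1.0, 2.0, 3.0, 4.0])
# the worker average follows vanilla SGD exactly and approaches 2.5
```

Because the correction terms cancel on average, the mean trajectory matches vanilla SGD while no worker ever blocks waiting for a remote gradient.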
[Figure: execution timeline of four workers (T1–T4) under the temporally sparse update. Legend: computation; synchronization barrier; locally update; send gradients to other workers.]
Temporally Sparse Update — for iteration n = 0, 1, …, on the j-th worker, when n ≡ 0 (mod d):

w(n, j) = w(n−1, j) − η (∇w(n−1, j) + Σ_{k=1}^{d} ∇w(n−t+k) − Σ_{k=1}^{d} ∇w(n−t+k, j))
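Extending the toy sketch above (scalar model, hypothetical grad(); t, d, and the learning rate are illustrative values): workers take a local step every iteration and synchronize only every d-th iteration, folding in the d delayed corrections at once.

```python
# Pure-Python sketch of the temporally sparse update (toy scalar model;
# grad() is a hypothetical stand-in gradient, not the authors' code).
def grad(w, data):
    # gradient of the toy loss f(w) = (w - data)^2 / 2
    return w - data

def sparse_sgd(worker_data, t=4, d=2, lr=0.05, steps=400):
    J = len(worker_data)
    w = [0.0] * J
    local_hist = [[] for _ in range(J)]  # per-worker past local gradients
    avg_hist = []                        # past globally averaged gradients
    for n in range(steps):
        grads = [grad(w[j], worker_data[j]) for j in range(J)]
        for j in range(J):
            local_hist[j].append(grads[j])
        avg_hist.append(sum(grads) / J)
        for j in range(J):
            update = grads[j]             # local step every iteration
            if n >= t and n % d == 0:     # synchronize only every d-th step,
                for k in range(d):        # folding in d delayed corrections
                    m = n - t + k
                    update += avg_hist[m] - local_hist[j][m]
            w[j] -= lr * update
    return w

workers = sparse_sgd([1.0, 2.0, 3.0, 4.0])
# the worker average still follows vanilla SGD and approaches 2.5
```

Sparser synchronization cuts the number of cross-continent transfers by a factor of d, at the cost of the extra staleness term in the convergence bound above.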
On AWS, we use eight p3.16xlarge instances (8x V100 each) located in Oregon, Ohio, London, and Tokyo. Our algorithm achieves a scalability of 0.72, while the original SGD suffers at a scalability of 0.008 (a 90x gap).
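The quoted 90x gap follows directly from the two scalability numbers (a quick check, using the values as reported):

```python
# Check of the reported gap on the 4-region AWS setup:
# scalability 0.72 (ours) vs. 0.008 (original SGD).
ours, ssgd = 0.72, 0.008
speedup = ours / ssgd
print(f"{speedup:.0f}x")  # 90x
```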