Distributed Training Across the World




Ligeng Zhu [1], Yao Lu [2], Yujun Lin [1], Song Han [1]

[1] Massachusetts Institute of Technology   [2] Google

Traditional distributed training is performed inside a cluster because it requires high-end networking infrastructure. In many cases, however, the data are distributed across different geographical locations and cannot be centralized, for example medical records or keyboard inputs. Because of the long physical distance, high latency is unavoidable, and traditional algorithms scale poorly under such conditions. In this work, we aim to scale synchronous SGD across the world without losing speed or accuracy.

Training inside a cluster vs. across the world

[Figure: measured network conditions between AWS regions in London, Tokyo, California, and Ohio; round-trip latencies of 35 ms to 277 ms and bandwidths of 13 Mbps to 63 Mbps.]

Latency, bandwidth, and scalability: inside a cluster vs. across the world

                   Latency (µs)   Bandwidth (Mbps)   Scalability (1024 GPUs)
Inside a cluster   3              25,600             819.2
Across the world   158,750        29                 8.192
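Relative to ideal linear scaling on 1024 GPUs, these scalability figures correspond to efficiencies of 819.2 / 1024 = 0.8 inside a cluster versus 8.192 / 1024 = 0.008 across the world.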

[Figure: measured round-trip latencies between the AWS regions used in our experiments, Oregon, Ohio, London, and Tokyo: 70 ms to 210 ms.]

Latency and training speed: inside a cluster vs. across the world

                   Latency (µs)   Speed (img/s)
Inside a cluster   3              2,274
Across the world   479,000        23

Across the world, latency is 159,666x longer and training speed is 98x worse.
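These ratios follow directly from the measurements: 479,000 µs / 3 µs ≈ 159,666 and 2,274 img/s / 23 img/s ≈ 98.9.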

Theoretical Results

Convergence rate: $O\!\left(\tfrac{1}{\sqrt{NJ}}\right) + O\!\left(\tfrac{(t+d)^2 J}{N}\right)$

When $c < O(N^{1/4} J^{-3/4})$: no slower than SGD.


Empirical Results

[Figure: scalability vs. network latency (0 to 1000 ms), comparing SSGD, ASGD, ECD-PSGD, FedAvg, and DTS (ours). At 1000 ms latency, SSGD collapses to 0.008 scalability while DTS sustains 0.78, a 90x gap over SSGD and roughly 2x over the next-best method.]
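Scalability here is presumably achieved throughput relative to ideal linear scaling (1.0 = perfect scaling); the poster does not spell out the definition.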


Vanilla SGD

[Figure: timeline of workers T1-T4 under vanilla synchronous SGD. Legend: computation, synchronization barrier, local update, send gradients to the other workers.]

For iteration n = 0, 1, ..., on the j-th worker:

$w^{(n,j)} = w^{(n-1)} - \eta \nabla w^{(n-1)}$
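To make the update concrete, here is a minimal Python/NumPy sketch (not the authors' code) that simulates J workers running synchronous SGD: every iteration, all local gradients are averaged behind a synchronization barrier before any worker steps. The toy local_grad function and all constants are illustrative.

import numpy as np

J = 4             # number of workers
lr = 0.1          # learning rate (eta)
rng = np.random.default_rng(0)
w = np.zeros(8)   # model weights, kept identical on every worker

def local_grad(w, j, rng):
    # Stand-in for the gradient computed on worker j's local mini-batch.
    target = np.full_like(w, j)
    return (w - target) + 0.01 * rng.standard_normal(w.shape)

for n in range(1, 101):
    grads = [local_grad(w, j, rng) for j in range(J)]   # computed in parallel in reality
    avg = np.mean(grads, axis=0)                        # all-reduce: the synchronization barrier
    w = w - lr * avg                                    # w^(n) = w^(n-1) - eta * averaged gradient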


Delayed Update

[Figure: timeline of workers T1-T4 with delayed gradient averaging; same legend (computation, synchronization barrier, local update, send gradients to the other workers).]

For iteration n = 0, 1, ..., on the j-th worker, when n > t:

$w^{(n,j)} = w^{(n-1,j)} - \eta \left( \nabla w^{(n-1,j)} + \nabla w^{(n-t)} - \nabla w^{(n-t,j)} \right)$
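Below is a minimal Python/NumPy simulation of the delayed update, purely illustrative and not the authors' implementation: the all-reduce launched at iteration n completes t iterations later, so each worker applies its fresh local gradient immediately and folds in the stale global correction (the average minus its own contribution) once it arrives. The queue-based bookkeeping is one assumed way to realize the rule.

import numpy as np
from collections import deque

J, t, lr = 4, 3, 0.1                  # workers, delay (iterations), learning rate
rng = np.random.default_rng(0)
w = [np.zeros(8) for _ in range(J)]   # each worker keeps its own weight copy
in_flight = deque()                   # gradients whose all-reduce has not completed yet

def local_grad(w_j, j, rng):
    target = np.full_like(w_j, j)
    return (w_j - target) + 0.01 * rng.standard_normal(w_j.shape)

for n in range(1, 201):
    grads = [local_grad(w[j], j, rng) for j in range(J)]
    in_flight.append(grads)                      # launch the (non-blocking) all-reduce
    for j in range(J):
        w[j] = w[j] - lr * grads[j]              # apply the fresh local gradient right away
    if n > t:                                    # the all-reduce from iteration n - t has arrived
        old = in_flight.popleft()
        old_avg = np.mean(old, axis=0)
        for j in range(J):
            w[j] = w[j] - lr * (old_avg - old[j])   # stale global correction from iteration n - t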


Temporally Sparse Update

[Figure: timeline of workers T1-T4 with temporally sparse synchronization; same legend (computation, synchronization barrier, local update, send gradients to the other workers).]

For iteration n = 0, 1, ..., on the j-th worker, when n ≡ 0 (mod d):

$w^{(n,j)} = w^{(n-1,j)} - \eta \left( \nabla w^{(n-1,j)} + \sum_{k}^{d} \nabla w^{(n-t+k)} - \sum_{k}^{d} \nabla w^{(n-t+k,j)} \right)$
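An illustrative Python/NumPy sketch of the temporally sparse update follows, under the assumptions (not stated on the poster) that non-synchronizing iterations apply only the local gradient, that gradients are exchanged once every d iterations with the stale correction covering the whole d-step window, and that t is a multiple of d.

import numpy as np
from collections import deque

J, d, t, lr = 4, 3, 6, 0.1            # workers, sync interval, delay, learning rate
rng = np.random.default_rng(0)
w = [np.zeros(8) for _ in range(J)]
window = []                           # per-worker gradients accumulated since the last exchange
in_flight = deque()                   # summed window gradients with a pending all-reduce

def local_grad(w_j, j, rng):
    target = np.full_like(w_j, j)
    return (w_j - target) + 0.01 * rng.standard_normal(w_j.shape)

for n in range(1, 301):
    grads = [local_grad(w[j], j, rng) for j in range(J)]
    window.append(grads)
    for j in range(J):
        w[j] = w[j] - lr * grads[j]                      # local gradient is always applied
    if n % d == 0:                                       # gradients are exchanged only every d steps
        in_flight.append([np.sum([g[j] for g in window], axis=0) for j in range(J)])
        window = []
        if n > t:                                        # the window sent earlier has now arrived
            old = in_flight.popleft()                    # per-worker sums over that d-step window
            old_avg = np.mean(old, axis=0)
            for j in range(J):
                w[j] = w[j] - lr * (old_avg - old[j])    # delayed correction over the whole window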

On AWS, with eight 8-V100 p3.16xlarge instances located in Oregon, Ohio, London, and Tokyo, our algorithm achieves a scalability of 0.72 while vanilla SGD suffers at a scalability of 0.008 (a 90x gap).