Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin...
![Page 1: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/1.jpg)
Farzin Haddadpour
Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex Optimization
![Page 2: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/2.jpg)
Joint work with
Viveck Cadambe, Mehrdad Mahdavi, Mohammad Mahdi Kamani
![Page 3: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/3.jpg)
Goal: Solving
$$\min_x f(x) \triangleq \sum_i f_i(x)$$
![Page 4: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/4.jpg)
SGD:
$$x^{(t+1)} = x^{(t)} - \eta \frac{1}{|\xi^{(t)}|} \nabla f(x^{(t)}; \xi^{(t)})$$
![Page 5: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/5.jpg)
Parallelization due to computational cost
Distributed SGD:
$$x^{(t+1)} = x^{(t)} - \frac{\eta}{p} \sum_{j=1}^{p} \frac{1}{|\xi_j^{(t)}|} \nabla f(x^{(t)}; \xi_j^{(t)})$$
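The distributed SGD update can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the scalar least-squares loss, the toy data, and the names `grad`, `minibatch_grad`, `distributed_sgd_step`, and `run` are hypothetical and do not come from the slides.

```python
import random

def grad(x, sample):
    """Gradient of the assumed toy loss 0.5 * (a * x - b)**2 with respect to x."""
    a, b = sample
    return a * (a * x - b)

def minibatch_grad(x, batch):
    """(1 / |xi|) times the sum of per-sample gradients over the mini-batch xi."""
    return sum(grad(x, s) for s in batch) / len(batch)

def distributed_sgd_step(x, batches, eta):
    """One synchronous step: average the p workers' mini-batch gradients."""
    p = len(batches)
    return x - (eta / p) * sum(minibatch_grad(x, b) for b in batches)

def run(data, p=3, batch_size=4, eta=0.1, steps=200, seed=0):
    """Simulate p workers that each draw a fresh mini-batch at every step."""
    rng = random.Random(seed)
    x = 0.0
    for _ in range(steps):
        batches = [rng.sample(data, batch_size) for _ in range(p)]
        x = distributed_sgd_step(x, batches, eta)
    return x

# Toy data generated from b = 2 * a, so the minimizer is x = 2.
data = [(a, 2.0 * a) for a in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0)]
x_final = run(data)
```

Every step of this loop needs one exchange between the workers, so the number of communication rounds equals the number of iterations.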
![Page 6: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/6.jpg)
Communication is the bottleneck
![Page 7: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/7.jpg)
Communication
Number of bits per iteration
Gradient compression based techniques
![Page 8: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/8.jpg)
Number of rounds
Local SGD with periodic averaging
![Page 9: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/9.jpg)
Local SGD with periodic averaging
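$$x_j^{(t+1)} = \begin{cases}
\frac{1}{p} \sum_{j=1}^{p} \left[ x_j^{(t)} - \eta\, \tilde{g}_j^{(t)} \right] & \text{if } \tau \mid T \quad \text{(averaging step (a))} \\
x_j^{(t)} - \eta\, \tilde{g}_j^{(t)} & \text{otherwise} \quad \text{(local update (b))}
\end{cases}$$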
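A minimal sketch of local SGD with periodic averaging, assuming the divisibility condition means "average whenever the iteration counter is a multiple of τ". The scalar least-squares loss, the toy `chunks`, and the function name `local_sgd` are hypothetical choices for illustration.

```python
import random

def local_sgd(data_per_worker, tau, eta=0.05, steps=300, batch_size=1, seed=0):
    """Run p local models; average them every tau iterations.

    Returns the final local models and the number of communication rounds.
    """
    rng = random.Random(seed)
    p = len(data_per_worker)
    x = [0.0] * p  # one local model per worker
    rounds = 0
    for t in range(1, steps + 1):
        # Local update (b): each worker takes an SGD step on its own data.
        for j in range(p):
            batch = rng.sample(data_per_worker[j], batch_size)
            g = sum(a * (a * x[j] - b) for a, b in batch) / batch_size
            x[j] = x[j] - eta * g
        # Averaging step (a): assumed to happen when tau divides t.
        if t % tau == 0:
            avg = sum(x) / p
            x = [avg] * p
            rounds += 1
    return x, rounds

# Three workers, each holding its own chunk of toy data with b = 2 * a.
chunks = [
    [(0.5, 1.0), (1.0, 2.0)],
    [(1.5, 3.0), (2.0, 4.0)],
    [(2.5, 5.0), (3.0, 6.0)],
]
models, rounds = local_sgd(chunks, tau=3)
```

With `tau=1` the loop communicates at every iteration; with `tau=3` it uses a third as many rounds for the same number of iterations.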
![Page 10: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/10.jpg)
[Figure: workers W1, W2, W3 with p = 3, τ = 1; every iteration is an averaging step (a).]
![Page 11: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/11.jpg)
[Figure: workers W1, W2, W3 with p = 3, comparing τ = 1 (an averaging step (a) at every iteration) with τ = 3 (local updates (b) between averaging steps (a)).]
![Page 12: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/12.jpg)
Convergence Analysis of Local SGD with periodic averaging
Table 1: Comparison of different SGD based algorithms.

| Strategy | Convergence error | Assumptions | Communication rounds (T/τ) |
| --- | --- | --- | --- |
| SGD | $O(1/\sqrt{pT})$ | i.i.d. & b.g | $T$ |
| [Yu et al.] | $O(1/\sqrt{pT})$ | i.i.d. & b.g | $O(p^{3/4} T^{1/4})$ |
| [Wang & Joshi] | $O(1/\sqrt{pT})$ | i.i.d. | $O(p^{3/2} T^{1/2})$ |
| RI-SGD (τ, q) | $O(1/\sqrt{pT}) + O((1 - q/p)\beta)$ | non-i.i.d. & b.d. | $O(p^{3/2} T^{1/2})$ |

b.g: Bounded gradient, $\|g_i\|_2^2 \le G$
Unbiased gradient estimation: $\mathbb{E}[\tilde{g}_j] = g_j$
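To get a feel for the communication-round column of Table 1, the growth rates can be evaluated for concrete values. This is a rough sketch only: all constants hidden by the O(·) notation are set to 1, and the values p = 16 and T = 10^6 as well as the function name `communication_rounds` are assumptions chosen for illustration.

```python
def communication_rounds(p, T):
    """Growth rates of the communication-round column, constants set to 1."""
    return {
        "SGD": float(T),
        "Yu et al.": p ** 0.75 * T ** 0.25,
        "Wang & Joshi": p ** 1.5 * T ** 0.5,
        "RI-SGD": p ** 1.5 * T ** 0.5,
    }

rounds_by_method = communication_rounds(p=16, T=10**6)
for name, value in rounds_by_method.items():
    print(f"{name:>12}: ~{value:,.0f} rounds")
```

Because the constants are dropped, only the relative orders of growth are meaningful here, not the absolute numbers.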
![Page 13: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/13.jpg)
A. Residual error is observed in practice, but theoretical understanding is missing.
B. How can we capture this in convergence analysis?
C. Is there any solution to improve it?
![Page 14: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/14.jpg)
Insufficiency of convergence analysis
A. Residual error is observed in practice, but theoretical understanding is missing.
Unbiased gradient estimation does not hold
![Page 15: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/15.jpg)
B. How to capture this in convergence analysis?
Analysis based on biased gradients (our work)
![Page 16: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/16.jpg)
C. Any solution to improve it?
Redundancy (our work)
![Page 17: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/17.jpg)
Redundancy infused local SGD (RI-SGD)
$D = D_1 \cup D_2 \cup D_3$
[Figure: local SGD with p = 3, τ = 3; each of the workers W1, W2, W3 holds one of the data chunks D1, D2, D3.]
![Page 18: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/18.jpg)
[Figure: local SGD (p = 3, τ = 3) next to RI-SGD (q = 2, p = 3, τ = 3); in RI-SGD each of the workers W1, W2, W3 holds q = 2 of the data chunks D1, D2, D3.]
Explicit redundancy
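The explicit redundancy can be sketched as a chunk-placement rule. The cyclic placement below (worker j holds chunks j, j+1, ..., j+q-1 modulo p) is an assumption made for illustration; the slides only state that each worker holds q chunks. The names `assign_chunks` and `worker_datasets` are hypothetical.

```python
def assign_chunks(p, q):
    """Assumed cyclic placement: worker j gets chunk indices j, ..., j+q-1 (mod p)."""
    if not 1 <= q <= p:
        raise ValueError("q must satisfy 1 <= q <= p")
    return [[(j + k) % p for k in range(q)] for j in range(p)]

def worker_datasets(chunks, q):
    """Concatenate the q chunks assigned to each worker into its local dataset."""
    p = len(chunks)
    return [[s for c in ids for s in chunks[c]] for ids in assign_chunks(p, q)]

# p = 3 chunks, q = 2 chunks per worker, matching the example in the figure.
placement = assign_chunks(3, 2)
local_data = worker_datasets([["d1"], ["d2"], ["d3"]], q=2)
```

With `q=1` this reduces to plain local SGD where every worker sees only its own chunk; with `q=p` every worker holds the full dataset.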
![Page 19: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/19.jpg)
Comparing RI-SGD with other schemes
Assumption b.d: Bounded inner product of gradients, $\langle g_i, g_j \rangle \le \beta$
Redundancy
Biased gradients
q: Number of data chunks at each worker node
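As a quick check of how redundancy enters the RI-SGD bound, the residual term can be evaluated for a few values of q, writing β for the bound on the gradient inner products in assumption b.d. and ignoring the constants hidden by the O(·) notation:

```latex
\left(1 - \frac{q}{p}\right)\beta \;=\;
\begin{cases}
\left(1 - \frac{1}{p}\right)\beta & q = 1 \text{ (no redundancy)} \\[4pt]
\frac{1}{3}\,\beta & q = 2,\ p = 3 \\[4pt]
0 & q = p \text{ (every worker holds all chunks)}
\end{cases}
```

So the residual term shrinks linearly as q grows and vanishes at full redundancy.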
![Page 20: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/20.jpg)
![Page 21: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/21.jpg)
Advantages of RI-SGD:
1. Speed-up not only due to a larger effective mini-batch size, but also due to increased intra-gradient diversity.
2. Fault tolerance.
3. Extension to heterogeneous mini-batch sizes and possible application to federated optimization.
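One plausible reading of the fault-tolerance point can be checked by brute force: if chunks are placed cyclically (an assumption, not something the slides specify), every chunk is stored on q different workers, so the data stays fully covered after any q - 1 worker failures. The names `cyclic_placement` and `survives` are hypothetical.

```python
from itertools import combinations

def cyclic_placement(p, q):
    """Assumed placement: worker j holds chunks j, ..., j+q-1 (mod p)."""
    return [{(j + k) % p for k in range(q)} for j in range(p)]

def survives(p, q, failures):
    """True if all p chunks remain available after ANY `failures` workers drop out."""
    placement = cyclic_placement(p, q)
    for failed in combinations(range(p), failures):
        alive = [placement[j] for j in range(p) if j not in failed]
        covered = set().union(*alive)
        if covered != set(range(p)):
            return False
    return True

one_failure_ok = survives(p=3, q=2, failures=1)
```

This only checks data availability; it says nothing about how the convergence rate changes when workers fail.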
![Page 22: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/22.jpg)
Faster convergence: Experiments on ImageNet (top figures) and CIFAR-100 (bottom figures)
![Page 23: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/23.jpg)
Increasing intra-gradient diversity: Experiments on CIFAR-10
![Page 24: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/24.jpg)
Fault tolerance: Experiments on CIFAR-10
![Page 25: Trading Redundancy for Communication: Speeding up ...12-11-00)-12-11-35-5083... · Farzin Haddadpour Trading Redundancy for Communication: Speeding up Distributed SGD for Non-convex](https://reader034.fdocuments.in/reader034/viewer/2022042109/5e8a2d810464ac66702092c2/html5/thumbnails/25.jpg)
For more details please come to my poster session
Wed Jun 12th 06:30 -- 09:00 PM @ Pacific Ballroom #185
Thanks for your attention!