Transcript of: Consensus-Based Distributed Online Prediction and Stochastic Optimization
Michael Rabbat, joint work with Konstantinos Tsianos
(27 slides)

Page 1:

Consensus-Based Distributed Online Prediction and Stochastic Optimization

Michael Rabbat

Joint work with Konstantinos Tsianos

Page 2:

Stochastic Online Prediction

Observe:     x(1), x(2), x(3), ...
Predict:     w(1), w(2), w(3), ...
Suffer loss: f(w(1), x(1)), f(w(2), x(2)), f(w(3), x(3)), ...

Assume the x(t) are drawn i.i.d. from an unknown distribution.

Regret: R(m) = \sum_{t=1}^m f(w(t), x(t)) - \sum_{t=1}^m f(w^*, x(t))

where w^* = \arg\min_{w \in W} E_x[f(w, x)]

See, e.g., S. Shalev-Shwartz, "Online Learning and Online Convex Optimization," Foundations and Trends in Machine Learning, 2012.

Page 3:

Example Application: Internet Advertising

[Figure: a prediction based on w(t) is made for each incoming sample/data point x(t).]

Page 4:

Problem Formalization

Assume:
- |f(w, x) - f(w', x)| \le L \|w - w'\|  (Lipschitz continuous)
- \|\nabla f(w, x) - \nabla f(w', x)\| \le K \|w - w'\|  (Lipschitz continuous gradients)
- f(w, x) is convex in w for all x
- E[\|\nabla f(w, x) - E[\nabla f(w, x)]\|^2] \le \sigma^2  (bounded variance)

Regret: R(m) = \sum_{t=1}^m f(w(t), x(t)) - \sum_{t=1}^m f(w^*, x(t))

Best possible performance (Nemirovsky & Yudin '83): E[R(m)] = O(\sqrt{m}). Achieved by many algorithms, including Nesterov's dual averaging.

Page 5:

Dual Averaging Intuition

F(w) = E_x[f(w, x)]

[Figure: graph of F, with the point (w_1, F(w_1)) and the linear lower bound F(w_1) + \langle \nabla F(w_1), w - w_1 \rangle.]

Page 6:

Dual Averaging Intuition

F(w) = E_x[f(w, x)]

[Figure: the same graph of F, with linear lower bounds at several points.]

Page 7:

Dual Averaging Intuition

F(w) = E_x[f(w, x)]

Lower linear model:

\ell(w) = \frac{\sum_{i=1}^k \lambda_i (f(w_i) + \langle \nabla F(w_i), w - w_i \rangle)}{\sum_{i=1}^k \lambda_i}

Page 8:

The Dual Averaging Algorithm (Nesterov 2009; see also Zinkevich 2003)

Initialize: z(1) = 0 \in R^d, w(1) = 0
Repeat (for t \ge 1):
1. g(t) = \nabla f(w(t), x(t))
2. z(t+1) = z(t) + g(t)    (gradient aggregation)
3. w(t+1) = \arg\min_{w \in W} \{ \langle z(t+1), w \rangle + \frac{1}{a(t)} \|w\|^2 \}    ("projection")
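As a concrete illustration, the three steps above can be sketched in Python. This is a minimal sketch under one added assumption: the feasible set W is taken to be a Euclidean ball of radius `radius`, so the argmin in step 3 has a closed form followed by a rescaling; the names `grad_fn`, `samples`, and `radius` are illustrative.

```python
import numpy as np

def dual_averaging(grad_fn, samples, d, radius=1.0):
    """Sketch of dual averaging with step size a(t) = 1/sqrt(t).

    grad_fn(w, x) returns the gradient of f(w, x) with respect to w.
    The feasible set is assumed to be W = {w : ||w|| <= radius}.
    """
    z = np.zeros(d)              # dual variable: running sum of gradients
    w = np.zeros(d)
    iterates = []
    for t, x in enumerate(samples, start=1):
        iterates.append(w.copy())               # predict with w(t)
        z = z + grad_fn(w, x)                   # steps 1-2: aggregate gradient
        w = -(1.0 / np.sqrt(t)) / 2.0 * z       # argmin of <z, w> + sqrt(t) * ||w||^2
        norm = np.linalg.norm(w)
        if norm > radius:                       # step 3: "projection" back onto W
            w = w * (radius / norm)
    return iterates
```

For a fixed quadratic loss f(w, x) = \|w - x\|^2, the iterates drift toward the minimizer at the O(1/\sqrt{t}) rate the regret bound suggests.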

Page 9:

The Dual Averaging Algorithm (Nesterov 2009; see also Zinkevich 2003)

Initialize: z(1) = 0 \in R^d, w(1) = 0
Repeat (for t \ge 1):
1. g(t) = \nabla f(w(t), x(t))
2. z(t+1) = z(t) + g(t)    (gradient aggregation)
3. w(t+1) = \arg\min_{w \in W} \{ \langle z(t+1), w \rangle + \frac{1}{a(t)} \|w\|^2 \}    ("projection")

Theorem (Nesterov '09): For a(t) = 1/\sqrt{t} we have E[R(m)] = O(\sqrt{m}).

This is the best possible performance (Nemirovsky & Yudin '83, Cesa-Bianchi & Lugosi '06): E[R(m)] = O(\sqrt{m}).

Page 10:

Distributed Online Prediction

Each node i = 1, ..., n observes its own stream of samples x_i(1), x_i(2), x_i(3), ... and makes its own predictions w_i(1), w_i(2), w_i(3), ...

Communication topology given by a graph G = (V, E).

Regret: R_n(m) = \sum_{i=1}^n \sum_{t=1}^{m/n} [ f(w_i(t), x_i(t)) - f(w^*, x_i(t)) ]

Page 11:

Distributed Online Prediction

"No collaboration": R_n(m) = n R_1(m/n) = O(\sqrt{nm})

With collaboration… R_n(m) \le R_1(m) = O(\sqrt{m})

How to achieve this bound?

Page 12:

Mini-batch Updates

[O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, JMLR 2012]

Observe x(1), x(2), ..., x(b); predict with the same w(1) = w(2) = ... = w(b) on all of them, suffering losses f(w(1), x(1)), f(w(1), x(2)), ..., f(w(1), x(b)).

Update after each mini-batch of b samples, using the average gradient

\frac{1}{b} \sum_{t=1}^b \nabla f(w(1), x(t)),

then predict w(b+1) on x(b+1), and so on.

Regret: E[R_1(m)] = O(b + \sqrt{m + b})
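The average-gradient computation above can be sketched as a small helper. A minimal sketch: `grad_fn` is an assumed callable returning \nabla f(w, x), and w is held fixed for the whole batch, matching w(1) = ... = w(b).

```python
import numpy as np

def minibatch_gradient(grad_fn, w, batch):
    """Average gradient (1/b) * sum_t grad f(w(1), x(t)) over a
    mini-batch, with the predictor w held fixed for all b samples."""
    g = np.zeros_like(w, dtype=float)
    for x in batch:
        g += grad_fn(w, x)
    return g / len(batch)
```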

Page 13:

Distributed Mini-Batch Algorithm

[O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, JMLR 2012]

Distribute each mini-batch of b samples across the n nodes. Aggregate (synchronously) along a spanning tree (e.g., using ALLREDUCE), so that all nodes exactly compute

\frac{1}{b} \sum_{t=1}^b \nabla f(w(1), x(t))

Collaborating has a latency of \mu samples.

Regret: R_n(m) = \sum_{t=1}^{m/(b+\mu)} \sum_{i=1}^n \sum_{s=1}^{(b+\mu)/n} [ f(w_i(t), x_i(t, s)) - f(w^*, x_i(t, s)) ]

Page 14:

Distributed Mini-Batch Algorithm

[O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, JMLR 2012]

Distribute each mini-batch of b samples across the n nodes. Aggregate (synchronously) along a spanning tree (e.g., using ALLREDUCE), so that all nodes exactly compute \frac{1}{b} \sum_{t=1}^b \nabla f(w(1), x(t)).

With an appropriate choice of b, this achieves the optimal regret E[R_n(m)] = O(\sqrt{m}).

Is exact average-gradient computation necessary? Can we achieve the same rates with asynchronous distributed algorithms?

Page 15:

Approximate Distributed Averaging

ALLREDUCE is an example of an exact distributed averaging protocol

y_i^+ = \text{AllReduce}(y_i / n, i) \equiv \frac{1}{n} \sum_{i=1}^n y_i \quad \forall i

Page 16:

Approximate Distributed Averaging

ALLREDUCE is an example of an exact distributed averaging protocol:

y_i^+ = \text{AllReduce}(y_i / n, i) \equiv \frac{1}{n} \sum_{i=1}^n y_i \quad \forall i

More generally, consider approximate distributed averaging protocols which guarantee, for all i, with latency \mu:

y_i^+ = \text{DistributedAverage}(y_i, i), \qquad \| y_i^+ - \frac{1}{n} \sum_{i=1}^n y_i \| \le \delta

Page 17:

Gossip Algorithms

For a doubly stochastic matrix W with W_{i,j} > 0 \iff (i, j) \in E, consider the (synchronous) linear iterations

y_i(k+1) = W_{i,i} y_i(k) + \sum_{j \in N_i} W_{i,j} y_j(k)

Then y_i(k) \to \bar{y} := \frac{1}{n} \sum_{i=1}^n y_i(0), and \|y_i(k) - \bar{y}\| \le \delta if

k \ge \frac{\log( \delta^{-1} \sqrt{n} \max_j \|y_j(0) - \bar{y}\| )}{1 - \lambda_2(W)}

Spectral gap examples:
- ring: \frac{1}{1 - \lambda_2} = O(n^2)
- expander: \frac{1}{1 - \lambda_2} = O(1)
- random geometric graph: \frac{1}{1 - \lambda_2} = O(n / \log(n))
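The linear gossip iteration is easy to simulate. A minimal sketch: the weight matrix below is one illustrative doubly stochastic choice for a ring graph (weight 1/3 on self and on each of the two neighbours); after enough iterations every node's vector is close to the network-wide average, with the ring's O(n^2) mixing reflected in how large k must be.

```python
import numpy as np

def ring_weights(n):
    """Doubly stochastic gossip matrix for a ring graph (illustrative
    choice: weight 1/3 on self and on each of the two neighbours)."""
    W = np.zeros((n, n))
    for i in range(n):
        W[i, i] = 1.0 / 3.0
        W[i, (i - 1) % n] += 1.0 / 3.0
        W[i, (i + 1) % n] += 1.0 / 3.0
    return W

def gossip(Y, W, k):
    """Run k synchronous gossip iterations Y <- W Y.
    Y is (n, d): row i holds node i's current vector y_i."""
    for _ in range(k):
        Y = W @ Y
    return Y
```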

Page 18:

Gossip Algorithms

For a doubly stochastic matrix W with W_{i,j} > 0 \iff (i, j) \in E, consider the (synchronous) linear iterations y_i(k+1) = W_{i,i} y_i(k) + \sum_{j \in N_i} W_{i,j} y_j(k). Then y_i(k) \to \bar{y} := \frac{1}{n} \sum_{i=1}^n y_i(0), and \|y_i(k) - \bar{y}\| \le \delta if k \ge \frac{\log( \delta^{-1} \sqrt{n} \max_j \|y_j(0) - \bar{y}\| )}{1 - \lambda_2(W)}.

Related work:
•  Tsitsiklis, Bertsekas, & Athans 1986
•  Nedic & Ozdaglar 2009
•  Ram, Nedic, & Veeravalli 2010
•  Duchi, Agarwal, & Wainwright 2012

Page 19:

Distributed Dual Averaging with Approximate Mini-Batches (DDA-AMB)

Initialize z_i(1) = 0, w_i(1) = 0

For t = 1, ..., T := \lceil \frac{m}{b + \mu} \rceil:

  g_i(t) = \frac{n}{b} \sum_{s=1}^{b/n} \nabla f(w_i(t), x_i(t, s))

  z_i(t+1) = \text{DistributedAverage}(z_i(t) + g_i(t), i)

  w_i(t+1) = \arg\min_{w \in W} \{ \langle z_i(t+1), w \rangle + \beta(t) h(w) \}

where h is a strongly convex auxiliary function (e.g., h(w) = \|w\|_2^2) and the algorithm parameters satisfy 0 < \beta(t) \le \beta(t+1).
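One round of the loop above can be sketched as follows. This is a sketch under stated assumptions: gossip with a doubly stochastic matrix `P` stands in for DistributedAverage, h(w) = \|w\|_2^2, and the feasible set is taken to be a Euclidean ball of illustrative radius `radius` so the argmin has a closed form.

```python
import numpy as np

def dda_amb_round(Z, G, P, k, beta, radius=1.0):
    """One DDA-AMB round (sketch). Z, G: (n, d) arrays holding each
    node's dual variable z_i(t) and mini-batch gradient g_i(t);
    P: (n, n) doubly stochastic gossip matrix; beta: current beta(t).

    Returns the new duals z_i(t+1) and primal iterates w_i(t+1)."""
    Z = Z + G                      # z_i(t) + g_i(t)
    for _ in range(k):             # DistributedAverage via k gossip steps
        Z = P @ Z
    W = -Z / (2.0 * beta)          # argmin of <z, w> + beta * ||w||_2^2
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W = W * np.minimum(1.0, radius / np.maximum(norms, 1e-12))  # keep each w_i feasible
    return Z, W
```

With a complete-graph matrix P = (1/n) 11^T, a single gossip step averages exactly, so every node recovers the same iterate, mirroring the exact mini-batch algorithm.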

Page 20:

Distributed Dual Averaging with Approximate Mini-Batches (DDA-AMB)

Initialize z_i(1) = 0, w_i(1) = 0

For t = 1, ..., T := \lceil \frac{m}{b + \mu} \rceil:

  g_i(t) = \frac{n}{b} \sum_{s=1}^{b/n} \nabla f(w_i(t), x_i(t, s))

  z_i(t+1) = \text{DistributedAverage}(z_i(t) + g_i(t), i)

  w_i(t+1) = \arg\min_{w \in W} \{ \langle z_i(t+1), w \rangle + \beta(t) h(w) \}

Each round consumes b samples plus \mu samples of latency. The averaging step should give

z_i(t+1) \approx \frac{1}{n} \sum_{i=1}^n ( z_i(t) + g_i(t) ) = \bar{z}(t) + \frac{1}{b} \sum_{i=1}^n \sum_{s=1}^{b/n} \nabla f(w_i(t), x_i(t, s))

Page 21:

When do Approximate Mini-Batches Work?

Theorem (Tsianos & MR): Run DDA-AMB with

k = \frac{\log( b^2 \sqrt{n} (1 + 2Lb) )}{1 - \lambda_2(W)}

iterations of gossip per mini-batch, with \beta(t) = K + \sqrt{\frac{t}{b + \mu}}, and take b = m^\rho for \rho \in (0, \frac{1}{2}). Then E[R_n(m)] = O(\sqrt{m}).

If G is an expander, then k = \Theta(\log n) and so \mu = \Theta(\log n): the latency is the same (order-wise) as aggregating along a tree.

Page 22:

Stochastic Optimization

Consider the problem

minimize F(w) = E_x[f(w, x)]
subject to w \in W

It is well known that

F(\bar{w}(m)) - F(w^*) \le \frac{1}{m} E[R_1(m)], where \bar{w}(m) = \frac{1}{m} \sum_{t=1}^m w(t).

Page 23:

Distributed Stochastic Optimization

Corollary: Run DDA-AMB with \beta(t) = K + \sqrt{\frac{t}{b}} and

k = \frac{\log( b (1 + 2Lb) \sqrt{n} )}{1 - \lambda_2(W)}

gossip iterations per mini-batch of b gradients processed across the network. Then

F(\bar{w}_i(\lceil \frac{m}{b} \rceil)) - F(w^*) = O(\frac{1}{\sqrt{m}}) = O(\frac{1}{\sqrt{nT}}).

Accuracy F(\bar{w}_i(T)) - F(w^*) \le \epsilon is guaranteed if T \ge \frac{1}{n} \cdot \frac{1}{\epsilon^2}.

Total gossip iterations: O( \frac{1}{\epsilon^2} \cdot \frac{\log n}{n} \cdot \frac{1}{1 - \lambda_2(W)} )

Agarwal & Duchi (2011) obtain similar rates with an asynchronous master-worker architecture.

Page 24:

Experimental Evaluation

Asynchronous version of distributed dual averaging using Matlab's Distributed Computing Server (which wraps MPI) on a cluster with n = 64 processors.

Solve a multinomial logistic regression task.

Data used: MNIST digits, with x \in R^{784}, y \in \{0, 1, ..., 9\}, and w \in R^{7850}.

f(w, (x, y)) = \frac{1}{Z(w, (x, y))} \exp\{ w_{y,d+1} + \sum_{j=1}^d w_{y,j} x_j \}
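The per-sample loss optimized in such a task is the negative log of the class probability above. A minimal sketch, with the slide's w \in R^{7850} unpacked as a 10 x 784 weight matrix plus a 10-dimensional bias (an illustrative layout):

```python
import numpy as np

def multinomial_logistic_loss(W, b, x, y):
    """Negative log-likelihood -log p(y | x; w) for multinomial
    logistic regression. W: (10, 784) weights, b: (10,) biases,
    x: (784,) features, y: class label in {0, ..., 9}."""
    scores = W @ x + b                      # per-class score w_{y,d+1} + sum_j w_{y,j} x_j
    scores = scores - scores.max()          # shift for numerical stability
    log_Z = np.log(np.exp(scores).sum())    # log partition function
    return log_Z - scores[y]
```

With all-zero parameters every class is equally likely, so the loss is log(10) regardless of the input.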

Page 25:

Performance scaling with more nodes

[Figure: average regret \frac{1}{m} R_n(m) vs. run time (s), for n = 4, 8, 16, 32, 64.]

Page 26:

Performance scaling

[Figure: regret ratio R_4(T) / R_n(T) vs. T, for n = 8, 16, 32.]

For fixed b, we have E[R_n(T)] = O(\sqrt{T / n}).

Expect that \frac{R_n(T)}{R_{n'}(T)} \approx \sqrt{\frac{n'}{n}}: here \sqrt{2} \approx 1.4, \sqrt{4} = 2, \sqrt{8} \approx 2.8.

Page 27:

Conclusions

•  Exact averaging is not crucial for O(\sqrt{m}) regret with distributed mini-batches
  –  Just need to ensure nodes don't drift too far apart
•  Current gossip bounds are worst-case in the initial condition
  –  Potentially use an adaptive rule to gossip less
•  A fully asynchronous version is a straightforward extension
•  Open problems:
  –  Lower bounds for distributed optimization (communication + computation)
  –  Exploiting sparsity in distributed optimization