
Consensus-Based Distributed Online Prediction and Stochastic Optimization

Michael Rabbat

Joint work with Konstantinos Tsianos

Stochastic Online Prediction


Predict:      w(1), w(2), w(3), ...
Observe:      x(1), x(2), x(3), ...
Suffer loss:  f(w(1), x(1)), f(w(2), x(2)), f(w(3), x(3)), ...

Assume the x(t) are drawn i.i.d. from an unknown distribution.

Regret:
$$R(m) = \sum_{t=1}^{m} f(w(t), x(t)) - \sum_{t=1}^{m} f(w^*, x(t)), \qquad \text{where } w^* = \arg\min_{w \in \mathcal{W}} \mathbb{E}_x[f(w, x)]$$

See, e.g., S. Shalev-Shwartz, "Online Learning and Online Convex Optimization," FnT Machine Learning, 2012.

Example Application: Internet Advertising


[Figure: at each round, a prediction w(t) is made for each sample/data point x(t).]

Problem Formalization


Assume:
- $|f(w, x) - f(w', x)| \le L \|w - w'\|$ (Lipschitz continuous)
- $\|\nabla f(w, x) - \nabla f(w', x)\| \le K \|w - w'\|$ (Lipschitz continuous gradients)
- $f(w, x)$ is convex in $w$ for all $x$
- $\mathbb{E}\big[\|\nabla f(w, x) - \mathbb{E}[\nabla f(w, x)]\|^2\big] \le \sigma^2$ (bounded variance)

Regret: $$R(m) = \sum_{t=1}^{m} f(w(t), x(t)) - \sum_{t=1}^{m} f(w^*, x(t))$$

Best possible performance (Nemirovsky & Yudin '83): $\mathbb{E}[R(m)] = O(\sqrt{m})$, achieved by many algorithms including Nesterov's dual averaging.

Dual Averaging Intuition

We want to minimize $F(w) = \mathbb{E}_x[f(w, x)]$.

[Figure: the tangent plane $F(w_1) + \langle \nabla F(w_1), w - w_1 \rangle$ at the point $(w_1, F(w_1))$ lower-bounds the convex function $F$.]

Accumulating such first-order lower bounds gives the lower linear model
$$\ell(w) = \frac{\sum_{i=1}^{k} \beta_i \big( F(w_i) + \langle \nabla F(w_i), w - w_i \rangle \big)}{\sum_{i=1}^{k} \beta_i}$$

The Dual Averaging Algorithm [Nesterov (2009); see also Zinkevich (2003)]

Initialize: $z(1) = 0 \in \mathbb{R}^d$, $w(1) = 0$.
Repeat for $t \ge 1$:
1. $g(t) = \nabla f(w(t), x(t))$
2. $z(t+1) = z(t) + g(t)$   (gradient aggregation)
3. $w(t+1) = \arg\min_{w \in \mathcal{W}} \left\{ \langle z(t+1), w \rangle + \frac{1}{a(t)} \|w\|^2 \right\}$   ("projection")

Best possible performance (Nemirovsky & Yudin '83, Cesa-Bianchi & Lugosi '06): $\mathbb{E}[R(m)] = O(\sqrt{m})$.

Theorem (Nesterov '09): For $a(t) = 1/\sqrt{t}$ we have $\mathbb{E}[R(m)] = O(\sqrt{m})$.
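To make the updates concrete, here is a minimal Python sketch of dual averaging for the unconstrained case $\mathcal{W} = \mathbb{R}^d$, where step 3 has the closed form $w(t+1) = -\frac{a(t)}{2} z(t+1)$; the quadratic toy loss and synthetic data stream are assumptions made for illustration, not part of the talk.

```python
import numpy as np

def dual_averaging(sample_stream, grad_f, d, m):
    """Minimal dual averaging sketch (unconstrained, squared-norm prox).

    With W = R^d, step 3 reduces to the closed form
    w(t+1) = -(a(t)/2) * z(t+1); we use a(t) = 1/sqrt(t).
    """
    z = np.zeros(d)                      # running sum of gradients
    w = np.zeros(d)
    iterates = []
    for t in range(1, m + 1):
        x = next(sample_stream)          # observe x(t)
        z += grad_f(w, x)                # step 2: gradient aggregation
        w = -z / (2.0 * np.sqrt(t))      # step 3: "projection", a(t) = 1/sqrt(t)
        iterates.append(w.copy())
    return iterates

# Toy usage: f(w, x) = ||w - x||^2, so grad_f(w, x) = 2 * (w - x),
# and the minimizer of E_x[f(w, x)] is w* = E[x].
rng = np.random.default_rng(0)
stream = iter(rng.normal(loc=1.0, size=(1000, 5)))
ws = dual_averaging(stream, lambda w, x: 2 * (w - x), d=5, m=1000)
print(ws[-1])  # should approach w* = (1, ..., 1)
```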

Distributed Online Prediction


[Figure: $n$ networked nodes; node $i$ produces predictions $w_i(1), w_i(2), w_i(3), \ldots$ on its local data stream $x_i(1), x_i(2), x_i(3), \ldots$]

Regret: $$R_n(m) = \sum_{i=1}^{n} \sum_{t=1}^{m/n} \big[ f(w_i(t), x_i(t)) - f(w^*, x_i(t)) \big]$$

Communication topology is given by a graph $G = (V, E)$.

With "no collaboration", each node runs its own predictor on $m/n$ samples, giving regret $R_n(m) = n\, R_1(m/n) = O(\sqrt{nm})$.

With collaboration, we would like $R_n(m) \lesssim R_1(m) = O(\sqrt{m})$. How can this bound be achieved?

Mini-batch Updates


[O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, JMLR 2012]

Hold the prediction fixed over a mini-batch of $b$ samples: predict $w(1)$ on each of $x(1), \ldots, x(b)$ (so $w(b) = \cdots = w(1)$), suffering losses $f(w(1), x(1)), \ldots, f(w(1), x(b))$. Then update using the average gradient
$$\frac{1}{b} \sum_{t=1}^{b} \nabla f(w(1), x(t))$$
and continue with $w(b+1)$ on $x(b+1)$, and so on.

Regret: $\mathbb{E}[R_1(m)] = O(b + \sqrt{m + b})$
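The mini-batch variant only changes how gradients enter the aggregation step: hold $w$ fixed over $b$ samples, average the $b$ gradients, then take one dual-averaging step. A sketch, reusing the unconstrained closed-form update from the sketch above (an assumption of this example):

```python
import numpy as np

def minibatch_dual_averaging(sample_stream, grad_f, d, m, b):
    """Mini-batch dual averaging sketch: predict with a fixed w over each
    batch of b samples, then update once using the average gradient."""
    z = np.zeros(d)
    w = np.zeros(d)
    for t in range(1, m // b + 1):
        batch = [next(sample_stream) for _ in range(b)]           # b samples, same w
        avg_grad = np.mean([grad_f(w, x) for x in batch], axis=0)
        z += avg_grad                                             # aggregate the averaged gradient
        w = -z / (2.0 * np.sqrt(t))                               # unconstrained step, a(t) = 1/sqrt(t)
    return w
```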

Distributed Mini-Batch Algorithm

[O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, JMLR 2012]

Distribute each mini-batch of $b$ samples across the $n$ nodes and aggregate (synchronously) along a spanning tree (e.g., using ALLREDUCE), so that all nodes exactly compute
$$\frac{1}{b} \sum_{t=1}^{b} \nabla f(w(1), x(t))$$

Collaborating incurs a latency of $\mu$ samples.

Regret: $$R_n(m) = \sum_{t=1}^{\frac{m}{b+\mu}} \sum_{i=1}^{n} \sum_{s=1}^{\frac{b+\mu}{n}} \big[ f(w_i(t), x_i(t, s)) - f(w^*, x_i(t, s)) \big]$$

With an appropriate choice of $b$, this achieves the optimal regret $\mathbb{E}[R_n(m)] = O(\sqrt{m})$.

But is exact computation of the average gradient necessary? Can we achieve the same rates with asynchronous distributed algorithms?

Approximate Distributed Averaging

ALLREDUCE is an example of an exact distributed averaging protocol:
$$y_i^+ = \text{AllReduce}(y_i/n,\, i) \equiv \frac{1}{n} \sum_{j=1}^{n} y_j \qquad \forall i$$

More generally, consider approximate distributed averaging protocols which, with latency $\mu$, guarantee that for all $i$:
$$y_i^+ = \text{DistributedAverage}(y_i,\, i), \qquad \Big\| y_i^+ - \frac{1}{n} \sum_{j=1}^{n} y_j \Big\| \le \delta$$
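For reference, a minimal sketch of the exact protocol using an MPI ALLREDUCE via mpi4py (the talk's experiments wrap MPI through Matlab; the use of mpi4py here is an assumption for illustration):

```python
# Exact distributed averaging via ALLREDUCE (mpi4py sketch).
# Run with, e.g.: mpiexec -n 4 python allreduce_avg.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n = comm.Get_size()

y_local = np.random.rand(5)                      # this node's vector y_i
y_avg = np.empty_like(y_local)
comm.Allreduce(y_local / n, y_avg, op=MPI.SUM)   # every node receives (1/n) * sum_j y_j
print(comm.Get_rank(), y_avg)
```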

Gossip Algorithms

For a doubly stochastic matrix $W$ with $W_{i,j} > 0 \Leftrightarrow (i, j) \in E$, consider the (synchronous) linear iterations
$$y_i(k+1) = W_{i,i}\, y_i(k) + \sum_{j \in N(i)} W_{i,j}\, y_j(k)$$

Then $y_i(k) \to \bar{y} := \frac{1}{n} \sum_{i=1}^{n} y_i(0)$, and $\|y_i(k) - \bar{y}\| \le \delta$ if
$$k \ge \frac{\log\big( \delta^{-1} \cdot \sqrt{n} \cdot \max_j \|y_j(0) - \bar{y}\| \big)}{1 - \lambda_2(W)}$$

The dependence on the spectral gap varies with the topology:
- ring: $\frac{1}{1 - \lambda_2} = O(n^2)$
- random geometric graph: $\frac{1}{1 - \lambda_2} = O\big(\frac{n}{\log n}\big)$
- expander: $\frac{1}{1 - \lambda_2} = O(1)$
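A minimal numpy sketch of the synchronous gossip iteration. The slide does not specify how $W$ is built; Metropolis-Hastings weights are one standard choice satisfying the doubly stochastic requirement, and the ring topology below is just a toy example:

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic W with W[i, j] > 0 iff (i, j) in E (plus self-loops),
    using Metropolis-Hastings weights W_ij = 1 / (1 + max(deg_i, deg_j))."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()    # remaining mass goes on the self-loop
    return W

def gossip_average(W, y0, k):
    """k synchronous gossip iterations y(k+1) = W y(k); each row of the
    result approaches the average of the rows of y0."""
    y = y0.copy()
    for _ in range(k):
        y = W @ y
    return y

# Toy usage on a ring of n = 8 nodes.
n = 8
adj = np.zeros((n, n), dtype=bool)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = True
W = metropolis_weights(adj)
y0 = np.random.default_rng(1).normal(size=(n, 3))
print(np.abs(gossip_average(W, y0, 200) - y0.mean(axis=0)).max())  # near zero
```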


Related work:
•  Tsitsiklis, Bertsekas, & Athans 1986
•  Nedic & Ozdaglar 2009
•  Ram, Nedic, & Veeravalli 2010
•  Duchi, Agarwal, & Wainwright 2012

Distributed Dual Averaging with Approximate Mini-Batches (DDA-AMB)

Initialize $z_i(1) = 0$, $w_i(1) = 0$.

For $t = 1, \ldots, T := \lceil \frac{m}{b+\mu} \rceil$:
$$g_i(t) = \frac{n}{b} \sum_{s=1}^{b/n} \nabla f(w_i(t), x_i(t, s))$$
$$z_i(t+1) = \text{DistributedAverage}(z_i(t) + g_i(t),\, i)$$
$$w_i(t+1) = \arg\min_{w \in \mathcal{W}} \big\{ \langle z_i(t+1), w \rangle + \beta(t)\, h(w) \big\}$$

Here $h$ is a strongly convex auxiliary function (e.g., $\|w\|_2^2$) and the algorithm parameters satisfy $0 < \beta(t) \le \beta(t+1)$.

Each round consumes $b$ samples for gradient computation plus $\mu$ samples of communication latency. The averaging step should give
$$z_i(t+1) \approx \frac{1}{n} \sum_{i=1}^{n} \big( z_i(t) + g_i(t) \big) = \bar{z}(t) + \frac{1}{b} \sum_{i=1}^{n} \sum_{s=1}^{b/n} \nabla f(w_i(t), x_i(t, s))$$
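A single-process simulation of DDA-AMB as a sketch: gossip implements DistributedAverage, and we take the unconstrained case with $h(w) = \|w\|_2^2$, so the minimization has the closed form $w_i(t+1) = -z_i(t+1)/(2\beta(t))$. The gossip matrix, toy loss, and sampling routine are assumptions of the example (the `metropolis_weights` helper from the gossip sketch above can supply `W`):

```python
import numpy as np

def dda_amb(W, grad_f, sample, d, T, b, k, beta):
    """DDA-AMB sketch (single-process simulation).

    W      : n x n doubly stochastic gossip matrix
    grad_f : grad_f(w, x) -> gradient of the loss at w for sample x
    sample : sample(rng) -> one data point
    b      : mini-batch size per round, split evenly over the n nodes
    k      : gossip iterations implementing DistributedAverage
    beta   : beta(t) -> nondecreasing parameter sequence

    Unconstrained with h(w) = ||w||_2^2, so the update step is
    w_i(t+1) = -z_i(t+1) / (2 * beta(t)).
    """
    n = W.shape[0]
    rng = np.random.default_rng(0)
    z = np.zeros((n, d))
    w = np.zeros((n, d))
    for t in range(1, T + 1):
        g = np.zeros((n, d))
        for i in range(n):                 # local averaged gradient g_i(t)
            for _ in range(b // n):
                g[i] += grad_f(w[i], sample(rng))
            g[i] *= n / b
        y = z + g
        for _ in range(k):                 # approximate averaging via k gossip rounds
            y = W @ y
        z = y
        w = -z / (2.0 * beta(t))           # closed-form "projection" step
    return w
```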

When do Approximate Mini-Batches Work?

Theorem (Tsianos & MR): Run DDA-AMB with
$$k = \frac{\log\big( \sqrt{n}\, (1 + 2Lb) \big)}{1 - \lambda_2(W)}$$
iterations of gossip per mini-batch and $\beta(t) = K + \sqrt{\frac{t}{b+\mu}}$, and take $b = m^{\rho}$ for $\rho \in (0, \frac{1}{2})$. Then $\mathbb{E}[R_n(m)] = O(\sqrt{m})$.

If $G$ is an expander, then $k = \Theta(\log n)$ and so $\mu = \Theta(\log n)$: the latency is the same (order-wise) as aggregating along a spanning tree.

Stochastic Optimization

Consider the problem
$$\text{minimize } F(w) = \mathbb{E}_x[f(w, x)] \quad \text{subject to } w \in \mathcal{W}.$$

It is well known that
$$F(\bar{w}(m)) - F(w^*) \le \frac{1}{m}\, \mathbb{E}[R_1(m)], \qquad \text{where } \bar{w}(m) = \frac{1}{m} \sum_{t=1}^{m} w(t).$$
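The online-to-batch conversion is just a running mean of the iterates, e.g. over the list returned by the `dual_averaging` sketch above (that sketch being an assumption of this example):

```python
import numpy as np

def averaged_iterate(ws):
    """Online-to-batch conversion: w_bar(m) = (1/m) * sum_t w(t)."""
    return np.mean(ws, axis=0)
```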

Distributed Stochastic Optimization

Corollary: Run DDA-AMB with $\beta(t) = K + \sqrt{\frac{t}{b}}$ and
$$k = \frac{\log\big( (1 + 2Lb) \sqrt{n} \big)}{1 - \lambda_2(W)}$$
gossip iterations per mini-batch of $b$ gradients processed across the network. Then
$$F\big(w_i(\lceil \tfrac{m}{b} \rceil)\big) - F(w^*) = O\Big(\frac{1}{\sqrt{m}}\Big) = O\Big(\frac{1}{\sqrt{nT}}\Big).$$

In particular, accuracy $F(w_i(T)) - F(w^*) \le \epsilon$ is guaranteed if $T \ge \frac{1}{n} \cdot \frac{1}{\epsilon^2}$, for a total of
$$O\Big( \frac{1}{\epsilon^2} \cdot \frac{\log n}{n} \cdot \frac{1}{1 - \lambda_2(W)} \Big)$$
gossip iterations.

Agarwal & Duchi (2011) obtain similar rates with an asynchronous master-worker architecture.

Experimental Evaluation

Asynchronous version of distributed dual averaging, implemented with Matlab's Distributed Computing Server (which wraps MPI) on a cluster with $n = 64$ processors.

Task: multinomial logistic regression on the MNIST digits, with $x \in \mathbb{R}^{784}$, $y \in \{0, 1, \ldots, 9\}$, and $w \in \mathbb{R}^{7850}$:
$$f(w, (x, y)) = \frac{1}{Z(w, (x, y))} \exp\Big\{ w_{y, d+1} + \sum_{j=1}^{d} w_{y, j}\, x_j \Big\}$$
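The slide's $f$ is the softmax probability of the observed label; the quantity minimized is presumably its negative log-likelihood. A minimal numpy sketch under that assumption, with $w$ stored as a $10 \times 785$ matrix (per-class weights plus bias, matching $w \in \mathbb{R}^{7850}$):

```python
import numpy as np

def multinomial_logistic_loss(w, x, y):
    """Negative log-likelihood -log p(y | x; w) for multinomial logistic
    regression; w has shape (num_classes, d + 1), last column is the bias."""
    d = x.shape[0]
    scores = w[:, :d] @ x + w[:, d]            # w_{c,d+1} + sum_j w_{c,j} x_j per class c
    scores -= scores.max()                     # numerical stability
    log_Z = np.log(np.exp(scores).sum())       # log of the normalizer Z(w, (x, y))
    return log_Z - scores[y]

def multinomial_logistic_grad(w, x, y):
    """Gradient of the negative log-likelihood with respect to w."""
    d = x.shape[0]
    scores = w[:, :d] @ x + w[:, d]
    scores -= scores.max()
    p = np.exp(scores) / np.exp(scores).sum()  # predicted class probabilities
    xb = np.append(x, 1.0)                     # input with bias coordinate appended
    grad = np.outer(p, xb)                     # expectation term for each class
    grad[y] -= xb                              # subtract the observed-label term
    return grad
```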

Performance scaling with more nodes

[Figure: average regret $\frac{1}{m} R_n(m)$ versus run time (s), for $n = 4, 8, 16, 32, 64$.]

Performance scaling

[Figure: regret ratio $R_4(T)/R_n(T)$ versus $T$, for $n = 8, 16, 32$.]

For fixed $b$, we have $\mathbb{E}[R_n(T)] = O(\sqrt{Tn})$, so we expect
$$\frac{R_n(T)}{R_{n'}(T)} \approx \sqrt{\frac{n}{n'}}, \qquad \sqrt{2} \approx 1.4, \quad \sqrt{4} = 2, \quad \sqrt{8} \approx 2.8.$$

Conclusions

•  Exact averaging is not crucial for $O(\sqrt{m})$ regret with distributed mini-batches
   –  Just need to ensure nodes don't drift too far apart
•  Current gossip bounds are worst-case in the initial condition
   –  Potentially use an adaptive rule to gossip less
•  A fully asynchronous version is a straightforward extension
•  Open problems:
   –  Lower bounds for distributed optimization (communication + computation)
   –  Exploiting sparsity in distributed optimization