
Consensus-Based Distributed Online Prediction and Stochastic Optimization

Michael Rabbat

Joint work with Konstantinos Tsianos

Stochastic Online Prediction


Predict:      w(1), w(2), w(3), ...
Observe:      x(1), x(2), x(3), ...
Suffer loss:  f(w(1), x(1)), f(w(2), x(2)), f(w(3), x(3)), ...

Assume the x(t) are drawn i.i.d. from an unknown distribution.

Regret:
$$R(m) = \sum_{t=1}^{m} f(w(t), x(t)) - \sum_{t=1}^{m} f(w^*, x(t)), \qquad \text{where } w^* = \arg\min_{w \in \mathcal{W}} \mathbb{E}_x[f(w, x)]$$

See, e.g., S. Shalev-Shwartz, "Online Learning and Online Convex Optimization," FnT Machine Learning, 2012.

Example Application: Internet Advertising


[Figure: at each round, a prediction w(t) is made for each sample/data point x(t).]

Problem Formalization


Assume:
- $|f(w, x) - f(w', x)| \le L \|w - w'\|$ (Lipschitz continuous)
- $\|\nabla f(w, x) - \nabla f(w', x)\| \le K \|w - w'\|$ (Lipschitz continuous gradients)
- $f(w, x)$ is convex in $w$ for all $x$
- $\mathbb{E}\big[\|\nabla f(w, x) - \mathbb{E}[\nabla f(w, x)]\|^2\big] \le \sigma^2$ (bounded variance)

Regret: $$R(m) = \sum_{t=1}^{m} f(w(t), x(t)) - \sum_{t=1}^{m} f(w^*, x(t))$$

Best possible performance (Nemirovsky & Yudin '83): $\mathbb{E}[R(m)] = O(\sqrt{m})$, achieved by many algorithms including Nesterov's dual averaging.

Dual Averaging Intuition

We want to minimize $F(w) = \mathbb{E}_x[f(w, x)]$.

[Figure: the tangent plane $F(w_1) + \langle \nabla F(w_1), w - w_1 \rangle$ at the point $(w_1, F(w_1))$ lower-bounds the convex function $F$.]

Accumulating such first-order lower bounds gives the lower linear model
$$\ell(w) = \frac{\sum_{i=1}^{k} \beta_i \big( F(w_i) + \langle \nabla F(w_i), w - w_i \rangle \big)}{\sum_{i=1}^{k} \beta_i}$$

The Dual Averaging Algorithm [Nesterov (2009); see also Zinkevich (2003)]

Initialize: $z(1) = 0 \in \mathbb{R}^d$, $w(1) = 0$.
Repeat for $t \ge 1$:
1. $g(t) = \nabla f(w(t), x(t))$
2. $z(t+1) = z(t) + g(t)$   (gradient aggregation)
3. $w(t+1) = \arg\min_{w \in \mathcal{W}} \left\{ \langle z(t+1), w \rangle + \frac{1}{a(t)} \|w\|^2 \right\}$   ("projection")

Best possible performance (Nemirovsky & Yudin '83, Cesa-Bianchi & Lugosi '06): $\mathbb{E}[R(m)] = O(\sqrt{m})$.

Theorem (Nesterov '09): For $a(t) = 1/\sqrt{t}$ we have $\mathbb{E}[R(m)] = O(\sqrt{m})$.
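To make the updates concrete, here is a minimal Python sketch of dual averaging for the unconstrained case $\mathcal{W} = \mathbb{R}^d$, where step 3 has the closed form $w(t+1) = -\frac{a(t)}{2} z(t+1)$; the quadratic toy loss and synthetic data stream are assumptions made for illustration, not part of the talk.

```python
import numpy as np

def dual_averaging(sample_stream, grad_f, d, m):
    """Minimal dual averaging sketch (unconstrained, squared-norm prox).

    With W = R^d, step 3 reduces to the closed form
    w(t+1) = -(a(t)/2) * z(t+1); we use a(t) = 1/sqrt(t).
    """
    z = np.zeros(d)                      # running sum of gradients
    w = np.zeros(d)
    iterates = []
    for t in range(1, m + 1):
        x = next(sample_stream)          # observe x(t)
        z += grad_f(w, x)                # step 2: gradient aggregation
        w = -z / (2.0 * np.sqrt(t))      # step 3: "projection", a(t) = 1/sqrt(t)
        iterates.append(w.copy())
    return iterates

# Toy usage: f(w, x) = ||w - x||^2, so grad_f(w, x) = 2 * (w - x),
# and the minimizer of E_x[f(w, x)] is w* = E[x].
rng = np.random.default_rng(0)
stream = iter(rng.normal(loc=1.0, size=(1000, 5)))
ws = dual_averaging(stream, lambda w, x: 2 * (w - x), d=5, m=1000)
print(ws[-1])  # should approach w* = (1, ..., 1)
```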

Distributed Online Prediction


[Figure: $n$ networked nodes; node $i$ produces predictions $w_i(1), w_i(2), w_i(3), \ldots$ on its local data stream $x_i(1), x_i(2), x_i(3), \ldots$]

Regret: $$R_n(m) = \sum_{i=1}^{n} \sum_{t=1}^{m/n} \big[ f(w_i(t), x_i(t)) - f(w^*, x_i(t)) \big]$$

Communication topology is given by a graph $G = (V, E)$.

With "no collaboration", each node runs its own predictor on $m/n$ samples, giving regret $R_n(m) = n\, R_1(m/n) = O(\sqrt{nm})$.

With collaboration, we would like $R_n(m) \lesssim R_1(m) = O(\sqrt{m})$. How can this bound be achieved?

Mini-batch Updates


[O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, JMLR 2012]

Hold the prediction fixed over a mini-batch of $b$ samples: predict $w(1)$ on each of $x(1), \ldots, x(b)$ (so $w(b) = \cdots = w(1)$), suffering losses $f(w(1), x(1)), \ldots, f(w(1), x(b))$. Then update using the average gradient
$$\frac{1}{b} \sum_{t=1}^{b} \nabla f(w(1), x(t))$$
and continue with $w(b+1)$ on $x(b+1)$, and so on.

Regret: $\mathbb{E}[R_1(m)] = O(b + \sqrt{m + b})$
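The mini-batch variant only changes how gradients enter the aggregation step: hold $w$ fixed over $b$ samples, average the $b$ gradients, then take one dual-averaging step. A sketch, reusing the unconstrained closed-form update from the sketch above (an assumption of this example):

```python
import numpy as np

def minibatch_dual_averaging(sample_stream, grad_f, d, m, b):
    """Mini-batch dual averaging sketch: predict with a fixed w over each
    batch of b samples, then update once using the average gradient."""
    z = np.zeros(d)
    w = np.zeros(d)
    for t in range(1, m // b + 1):
        batch = [next(sample_stream) for _ in range(b)]           # b samples, same w
        avg_grad = np.mean([grad_f(w, x) for x in batch], axis=0)
        z += avg_grad                                             # aggregate the averaged gradient
        w = -z / (2.0 * np.sqrt(t))                               # unconstrained step, a(t) = 1/sqrt(t)
    return w
```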

Distributed Mini-Batch Algorithm

[O. Dekel, R. Gilad-Bachrach, O. Shamir, L. Xiao, JMLR 2012]

Distribute each mini-batch of $b$ samples across the $n$ nodes and aggregate (synchronously) along a spanning tree (e.g., using ALLREDUCE), so that all nodes exactly compute
$$\frac{1}{b} \sum_{t=1}^{b} \nabla f(w(1), x(t))$$

Collaborating incurs a latency of $\mu$ samples.

Regret: $$R_n(m) = \sum_{t=1}^{\frac{m}{b+\mu}} \sum_{i=1}^{n} \sum_{s=1}^{\frac{b+\mu}{n}} \big[ f(w_i(t), x_i(t, s)) - f(w^*, x_i(t, s)) \big]$$

With an appropriate choice of $b$, this achieves the optimal regret $\mathbb{E}[R_n(m)] = O(\sqrt{m})$.

But is exact computation of the average gradient necessary? Can we achieve the same rates with asynchronous distributed algorithms?

Approximate Distributed Averaging

ALLREDUCE is an example of an exact distributed averaging protocol:
$$y_i^+ = \text{AllReduce}(y_i/n,\, i) \equiv \frac{1}{n} \sum_{j=1}^{n} y_j \qquad \forall i$$

More generally, consider approximate distributed averaging protocols which, with latency $\mu$, guarantee that for all $i$:
$$y_i^+ = \text{DistributedAverage}(y_i,\, i), \qquad \Big\| y_i^+ - \frac{1}{n} \sum_{j=1}^{n} y_j \Big\| \le \delta$$
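For reference, a minimal sketch of the exact protocol using an MPI ALLREDUCE via mpi4py (the talk's experiments wrap MPI through Matlab; the use of mpi4py here is an assumption for illustration):

```python
# Exact distributed averaging via ALLREDUCE (mpi4py sketch).
# Run with, e.g.: mpiexec -n 4 python allreduce_avg.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
n = comm.Get_size()

y_local = np.random.rand(5)                      # this node's vector y_i
y_avg = np.empty_like(y_local)
comm.Allreduce(y_local / n, y_avg, op=MPI.SUM)   # every node receives (1/n) * sum_j y_j
print(comm.Get_rank(), y_avg)
```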

Gossip Algorithms

For a doubly stochastic matrix $W$ with $W_{i,j} > 0 \Leftrightarrow (i, j) \in E$, consider the (synchronous) linear iterations
$$y_i(k+1) = W_{i,i}\, y_i(k) + \sum_{j \in N(i)} W_{i,j}\, y_j(k)$$

Then $y_i(k) \to \bar{y} := \frac{1}{n} \sum_{i=1}^{n} y_i(0)$, and $\|y_i(k) - \bar{y}\| \le \delta$ if
$$k \ge \frac{\log\big( \delta^{-1} \cdot \sqrt{n} \cdot \max_j \|y_j(0) - \bar{y}\| \big)}{1 - \lambda_2(W)}$$

The dependence on the spectral gap varies with the topology:
- ring: $\frac{1}{1 - \lambda_2} = O(n^2)$
- random geometric graph: $\frac{1}{1 - \lambda_2} = O\big(\frac{n}{\log n}\big)$
- expander: $\frac{1}{1 - \lambda_2} = O(1)$
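A minimal numpy sketch of the synchronous gossip iteration. The slide does not specify how $W$ is built; Metropolis-Hastings weights are one standard choice satisfying the doubly stochastic requirement, and the ring topology below is just a toy example:

```python
import numpy as np

def metropolis_weights(adj):
    """Doubly stochastic W with W[i, j] > 0 iff (i, j) in E (plus self-loops),
    using Metropolis-Hastings weights W_ij = 1 / (1 + max(deg_i, deg_j))."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if adj[i, j]:
                W[i, j] = 1.0 / (1.0 + max(deg[i], deg[j]))
        W[i, i] = 1.0 - W[i].sum()    # remaining mass goes on the self-loop
    return W

def gossip_average(W, y0, k):
    """k synchronous gossip iterations y(k+1) = W y(k); each row of the
    result approaches the average of the rows of y0."""
    y = y0.copy()
    for _ in range(k):
        y = W @ y
    return y

# Toy usage on a ring of n = 8 nodes.
n = 8
adj = np.zeros((n, n), dtype=bool)
for i in range(n):
    adj[i, (i + 1) % n] = adj[(i + 1) % n, i] = True
W = metropolis_weights(adj)
y0 = np.random.default_rng(1).normal(size=(n, 3))
print(np.abs(gossip_average(W, y0, 200) - y0.mean(axis=0)).max())  # near zero
```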


Related work:
•  Tsitsiklis, Bertsekas, & Athans 1986
•  Nedic & Ozdaglar 2009
•  Ram, Nedic, & Veeravalli 2010
•  Duchi, Agarwal, & Wainwright 2012

Distributed Dual Averaging with Approximate Mini-Batches (DDA-AMB)

Initialize $z_i(1) = 0$, $w_i(1) = 0$.

For $t = 1, \ldots, T := \lceil \frac{m}{b+\mu} \rceil$:
$$g_i(t) = \frac{n}{b} \sum_{s=1}^{b/n} \nabla f(w_i(t), x_i(t, s))$$
$$z_i(t+1) = \text{DistributedAverage}(z_i(t) + g_i(t),\, i)$$
$$w_i(t+1) = \arg\min_{w \in \mathcal{W}} \big\{ \langle z_i(t+1), w \rangle + \beta(t)\, h(w) \big\}$$

Here $h$ is a strongly convex auxiliary function (e.g., $\|w\|_2^2$) and the algorithm parameters satisfy $0 < \beta(t) \le \beta(t+1)$.

Each round consumes $b$ samples for gradient computation plus $\mu$ samples of communication latency. The averaging step should give
$$z_i(t+1) \approx \frac{1}{n} \sum_{i=1}^{n} \big( z_i(t) + g_i(t) \big) = \bar{z}(t) + \frac{1}{b} \sum_{i=1}^{n} \sum_{s=1}^{b/n} \nabla f(w_i(t), x_i(t, s))$$
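A single-process simulation of DDA-AMB as a sketch: gossip implements DistributedAverage, and we take the unconstrained case with $h(w) = \|w\|_2^2$, so the minimization has the closed form $w_i(t+1) = -z_i(t+1)/(2\beta(t))$. The gossip matrix, toy loss, and sampling routine are assumptions of the example (the `metropolis_weights` helper from the gossip sketch above can supply `W`):

```python
import numpy as np

def dda_amb(W, grad_f, sample, d, T, b, k, beta):
    """DDA-AMB sketch (single-process simulation).

    W      : n x n doubly stochastic gossip matrix
    grad_f : grad_f(w, x) -> gradient of the loss at w for sample x
    sample : sample(rng) -> one data point
    b      : mini-batch size per round, split evenly over the n nodes
    k      : gossip iterations implementing DistributedAverage
    beta   : beta(t) -> nondecreasing parameter sequence

    Unconstrained with h(w) = ||w||_2^2, so the update step is
    w_i(t+1) = -z_i(t+1) / (2 * beta(t)).
    """
    n = W.shape[0]
    rng = np.random.default_rng(0)
    z = np.zeros((n, d))
    w = np.zeros((n, d))
    for t in range(1, T + 1):
        g = np.zeros((n, d))
        for i in range(n):                 # local averaged gradient g_i(t)
            for _ in range(b // n):
                g[i] += grad_f(w[i], sample(rng))
            g[i] *= n / b
        y = z + g
        for _ in range(k):                 # approximate averaging via k gossip rounds
            y = W @ y
        z = y
        w = -z / (2.0 * beta(t))           # closed-form "projection" step
    return w
```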

When do Approximate Mini-Batches Work?

Theorem (Tsianos & MR): Run DDA-AMB with
$$k = \frac{\log\big( \sqrt{n}\, (1 + 2Lb) \big)}{1 - \lambda_2(W)}$$
iterations of gossip per mini-batch and $\beta(t) = K + \sqrt{\frac{t}{b+\mu}}$, and take $b = m^{\rho}$ for $\rho \in (0, \frac{1}{2})$. Then $\mathbb{E}[R_n(m)] = O(\sqrt{m})$.

If $G$ is an expander, then $k = \Theta(\log n)$ and so $\mu = \Theta(\log n)$: the latency is the same (order-wise) as aggregating along a spanning tree.

Stochastic Optimization

Consider the problem
$$\text{minimize } F(w) = \mathbb{E}_x[f(w, x)] \quad \text{subject to } w \in \mathcal{W}.$$

It is well known that
$$F(\bar{w}(m)) - F(w^*) \le \frac{1}{m}\, \mathbb{E}[R_1(m)], \qquad \text{where } \bar{w}(m) = \frac{1}{m} \sum_{t=1}^{m} w(t).$$
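The online-to-batch conversion is just a running mean of the iterates, e.g. over the list returned by the `dual_averaging` sketch above (that sketch being an assumption of this example):

```python
import numpy as np

def averaged_iterate(ws):
    """Online-to-batch conversion: w_bar(m) = (1/m) * sum_t w(t)."""
    return np.mean(ws, axis=0)
```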

Distributed Stochastic Optimization

Corollary: Run DDA-AMB with $\beta(t) = K + \sqrt{\frac{t}{b}}$ and
$$k = \frac{\log\big( (1 + 2Lb) \sqrt{n} \big)}{1 - \lambda_2(W)}$$
gossip iterations per mini-batch of $b$ gradients processed across the network. Then
$$F\big(w_i(\lceil \tfrac{m}{b} \rceil)\big) - F(w^*) = O\Big(\frac{1}{\sqrt{m}}\Big) = O\Big(\frac{1}{\sqrt{nT}}\Big).$$

In particular, accuracy $F(w_i(T)) - F(w^*) \le \epsilon$ is guaranteed if $T \ge \frac{1}{n} \cdot \frac{1}{\epsilon^2}$, for a total of
$$O\Big( \frac{1}{\epsilon^2} \cdot \frac{\log n}{n} \cdot \frac{1}{1 - \lambda_2(W)} \Big)$$
gossip iterations.

Agarwal & Duchi (2011) obtain similar rates with an asynchronous master-worker architecture.

Experimental Evaluation

Asynchronous version of distributed dual averaging, implemented with Matlab's Distributed Computing Server (which wraps MPI) on a cluster with $n = 64$ processors.

Task: multinomial logistic regression on the MNIST digits, with $x \in \mathbb{R}^{784}$, $y \in \{0, 1, \ldots, 9\}$, and $w \in \mathbb{R}^{7850}$:
$$f(w, (x, y)) = \frac{1}{Z(w, (x, y))} \exp\Big\{ w_{y, d+1} + \sum_{j=1}^{d} w_{y, j}\, x_j \Big\}$$
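The slide's $f$ is the softmax probability of the observed label; the quantity minimized is presumably its negative log-likelihood. A minimal numpy sketch under that assumption, with $w$ stored as a $10 \times 785$ matrix (per-class weights plus bias, matching $w \in \mathbb{R}^{7850}$):

```python
import numpy as np

def multinomial_logistic_loss(w, x, y):
    """Negative log-likelihood -log p(y | x; w) for multinomial logistic
    regression; w has shape (num_classes, d + 1), last column is the bias."""
    d = x.shape[0]
    scores = w[:, :d] @ x + w[:, d]            # w_{c,d+1} + sum_j w_{c,j} x_j per class c
    scores -= scores.max()                     # numerical stability
    log_Z = np.log(np.exp(scores).sum())       # log of the normalizer Z(w, (x, y))
    return log_Z - scores[y]

def multinomial_logistic_grad(w, x, y):
    """Gradient of the negative log-likelihood with respect to w."""
    d = x.shape[0]
    scores = w[:, :d] @ x + w[:, d]
    scores -= scores.max()
    p = np.exp(scores) / np.exp(scores).sum()  # predicted class probabilities
    xb = np.append(x, 1.0)                     # input with bias coordinate appended
    grad = np.outer(p, xb)                     # expectation term for each class
    grad[y] -= xb                              # subtract the observed-label term
    return grad
```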

Performance scaling with more nodes

[Figure: average regret $\frac{1}{m} R_n(m)$ versus run time (s), for $n = 4, 8, 16, 32, 64$.]

Performance scaling

[Figure: regret ratio $R_4(T)/R_n(T)$ versus $T$, for $n = 8, 16, 32$.]

For fixed $b$, we have $\mathbb{E}[R_n(T)] = O(\sqrt{Tn})$, so we expect
$$\frac{R_n(T)}{R_{n'}(T)} \approx \sqrt{\frac{n}{n'}}, \qquad \sqrt{2} \approx 1.4, \quad \sqrt{4} = 2, \quad \sqrt{8} \approx 2.8.$$

Conclusions

•  Exact averaging is not crucial for $O(\sqrt{m})$ regret with distributed mini-batches
   –  Just need to ensure nodes don't drift too far apart
•  Current gossip bounds are worst-case in the initial condition
   –  Potentially use an adaptive rule to gossip less
•  A fully asynchronous version is a straightforward extension
•  Open problems:
   –  Lower bounds for distributed optimization (communication + computation)
   –  Exploiting sparsity in distributed optimization