
DISTRIBUTED METHODS WITH COMPRESSED COMMUNICATION FOR SOLVING VARIATIONAL INEQUALITIES, WITH THEORETICAL GUARANTEES∗

A. Beznosikov, MIPT†, HSE, Yandex, Russia

P. Richtárik, KAUST‡, Saudi Arabia

M. Diskin, HSE, Yandex, Russia

M. Ryabinin, HSE, Yandex, Russia

A. Gasnikov, MIPT, HSE§, Russia

ABSTRACT

Variational inequalities in general and saddle point problems in particular are increasingly relevant in machine learning applications, including adversarial learning, GANs, transport and robust optimization. With the increasing data and problem sizes necessary to train high-performing models across these and other applications, it is necessary to rely on parallel and distributed computing. However, in distributed training, communication among the compute nodes is a key bottleneck, and this problem is exacerbated for high-dimensional and over-parameterized models. Due to these considerations, it is important to equip existing methods with strategies that reduce the volume of transmitted information during training while obtaining a model of comparable quality. In this paper, we present the first theoretically grounded distributed methods for solving variational inequalities and saddle point problems using compressed communication: MASHA1 and MASHA2. Our theory and methods allow for the use of both unbiased (such as Randk; MASHA1) and contractive (such as Topk; MASHA2) compressors. We empirically validate our conclusions using two experimental setups: a standard bilinear min-max problem, and large-scale distributed adversarial training of transformers.

1 INTRODUCTION

1.1 THE EXPRESSIVE POWER OF VARIATIONAL INEQUALITIES

Due to their abstract mathematical nature and the associated flexibility they offer in modeling various practical problems of interest, variational inequalities (VIs) have been an active area of research in applied mathematics for more than half a century (Korpelevich, 1976; Harker & Pang, 1990; Facchinei & Pang, 2003). It is well known that VIs can be used to formulate and study convex optimization problems, convex-concave saddle point problems and games, for example, in an elegant unifying mathematical framework (Korpelevich, 1976; Bauschke & Combettes, 2017).

Recently, Gidel et al. (2019) pointed out that multi-player games can be cast as VIs, and proposed to study min-max or non-zero-sum game formulations of GANs (Goodfellow et al., 2014) in this fashion. This allowed them to successfully transfer established insights and well-known techniques from the vast literature on VIs to the study of GANs. In particular, the oscillatory behavior of optimization methods (such as SGD) not originally designed to solve VI problems is well understood in the VI literature, and established tools, such as averaging and extrapolation, can be successfully applied to the training of GANs. Besides their usefulness in studying GANs and alternative adversarial learning models (Madry et al., 2018), VIs have recently attracted considerable attention of the machine learning

∗ The research of Aleksandr Beznosikov and Alexander Gasnikov was supported by the Russian Science Foundation project No. 21-71-30005. The research of Peter Richtárik was supported by the KAUST Research Baseline Funding Scheme. † Moscow Institute of Physics and Technology, Dolgoprudny, Russia. ‡ King Abdullah University of Science and Technology, Thuwal, Saudi Arabia. § National Research University Higher School of Economics, Moscow, Russia.


community due to their ability to model other situations where the minimization of a single loss function does not suffice, such as auction theory (Syrgkanis et al., 2015) and robust and multi-agent reinforcement learning (Pinto et al., 2017).

In summary, VIs have recently become a potent tool enabling new advances in practical machine learning situations reaching beyond supervised learning, where optimization problems and techniques, which can be seen as special instances of VIs and methods for solving them, reign supreme.

1.2 TRAINING OF SUPERVISED MODELS VIA DISTRIBUTED OPTIMIZATION

On the other hand, in the domain of classical, and hence also much better understood, supervised machine learning, characterized by the fact that standard optimization techniques apply and work well, researchers and practitioners face other challenges that are currently beyond the reach of existing VI methods. Indeed, the training of modern supervised machine learning models in general, and deep neural networks in particular, is still extremely challenging. Due to their desire to improve the generalization of deployed models, machine learning engineers need to rely on training datasets of ever increasing sizes and on elaborate over-parameterized models (Arora et al., 2018). Supporting workloads of such unprecedented magnitudes would be impossible without combining the latest advances in hardware acceleration, distributed systems and distributed algorithm design (Verbraeken et al., 2019).

When training such modern supervised models in a distributed fashion, communication cost is often the bottleneck of the training system, and for this reason, a lot of effort has recently been targeted at the design of communication-efficient distributed optimization methods (Konecný et al., 2016; Smith et al., 2018; Ghosh et al., 2020; Gorbunov et al., 2021). A particularly successful technique for improving the communication efficiency of distributed first-order optimization methods is communication compression. The idea behind this technique is rooted in the observation that in practical implementations it is often advantageous to communicate messages compressed via (often randomized) lossy compression techniques instead of communicating the full messages (Seide et al., 2014; Alistarh et al., 2017). If the number of parallel workers is large enough, the noise introduced by compression is reduced, and training with compressed communication will often lead to comparable test error while reducing the amount of communicated bits, which results in faster training, both in theory and practice (Mishchenko et al., 2019; Gorbunov et al., 2021).

1.3 TWO CLASSES OF COMPRESSION OPERATORS

We say that a (possibly) stochastic mapping Q : R^d → R^d is an unbiased compression operator if there exists a constant q ≥ 1 such that

E[Q(z)] = z,   E‖Q(z)‖² ≤ q‖z‖²,   ∀z ∈ R^d.   (1)

Further, we say that a stochastic mapping C : R^d → R^d is a contractive compression operator if there exists a constant δ ≥ 1 such that

E‖C(z) − z‖² ≤ (1 − 1/δ)‖z‖²,   ∀z ∈ R^d.   (2)

If b is the number of bits needed to represent a single float (e.g., b = 32 or b = 64), then the number of bits needed to represent a generic vector z ∈ R^d is ‖z‖_bits := bd. To describe how much a compression operator reduces its input vector on average, we introduce the notion of expected density, defined via β := (1/(bd)) E‖Q(z)‖_bits, where ‖Q(z)‖_bits denotes the number of bits needed to represent the quantized vector Q(z). Note that β ≤ 1. For the Randk operator (Alistarh et al., 2018; Beznosikov et al., 2020) we have q = d/k and β = k/d.
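To make these two definitions concrete, here is a small illustration (ours, not part of the paper) of the two canonical examples mentioned throughout the text: the unbiased Randk operator with q = d/k and the contractive Topk operator with δ = d/k.

```python
import numpy as np

def rand_k(z, k, rng=None):
    """Unbiased Rand-k compressor: keep k random coordinates, rescale by d/k.
    Satisfies E[Q(z)] = z and E||Q(z)||^2 <= (d/k)||z||^2, i.e. q = d/k."""
    rng = rng or np.random.default_rng()
    d = z.shape[0]
    idx = rng.choice(d, size=k, replace=False)
    out = np.zeros_like(z)
    out[idx] = (d / k) * z[idx]
    return out

def top_k(z, k):
    """Contractive Top-k compressor: keep the k largest-magnitude coordinates.
    Satisfies ||C(z) - z||^2 <= (1 - k/d)||z||^2, i.e. delta = d/k."""
    out = np.zeros_like(z)
    idx = np.argsort(np.abs(z))[-k:]
    out[idx] = z[idx]
    return out
```

Both operators transmit only k nonzero entries, so their expected density is β = k/d.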

1.4 TOWARDS COMMUNICATION-EFFICIENT DISTRIBUTED METHODS FOR VIS

While classical VI algorithms, such as the extragradient method originally proposed by Korpelevich (1976) and later studied by many authors (Nemirovski, 2004; Juditsky et al., 2008), were not designed to work in a distributed environment, virtually all methods that were (Yuan et al., 2014; Hou et al., 2021; Deng & Mahdavi, 2021; Beznosikov et al., 2021b;c) do not consider the general VI problem, but tackle the special case of saddle-point problems only. Moreover, none of these distributed methods support communication compression, with the exception of the work of Yuan et al. (2014), which


Table 1: An overview of existing methods and their high-level properties. We develop the first provably communication-efficient (via communication compression) algorithms for solving distributed VI problems.

Reference                 | Solves general VIs? | Supports distributed setup? | Supports compressed communication? | Has strong theory?
Korpelevich (1976)        | yes    | no      | no     | yes
Nemirovski (2004)         | yes    | no      | no     | yes
Goodfellow et al. (2014)  | no (8) | no      | no     | no (9)
Yuan et al. (2014)        | no (1) | yes (6) | no (7) | no (4)
Hou et al. (2021)         | no (1) | yes     | no (3) | no (5)
Deng & Mahdavi (2021)     | no (1) | yes     | no (3) | no (5)
Beznosikov et al. (2021b) | no (1) | yes     | no (2) | yes
Beznosikov et al. (2021c) | no (1) | yes     | no (2) | yes
Beznosikov et al. (2021a) | yes    | yes (6) | no (2) | yes
This work                 | yes    | yes     | yes    | yes

(1) Studies saddle-point problems (SPPs) only. (2) Achieves communication efficiency without compression but by assuming data similarity/homogeneity, i.e., by dramatically restricting the problem class. (3) Tries to achieve communication efficiency via local steps, as popular in Federated Learning (Konecný et al., 2016; McMahan et al., 2016; Li et al., 2020); employs local methods (such as GD and SCAFFOLD (Karimireddy et al., 2020)). (4) Does not achieve convergence to the solution, and the convergence criterion may not work for simple bilinear saddle point problems. (5) Builds upon GD, which diverges for simple bilinear problems. (6) Uses a decentralized architecture. (7) Compression only by rounding to the nearest integer multiple of some constant. (8) Applies to GANs only. (9) No theory.

relies on rounding to the nearest integer multiple of a certain quantity. This compression mechanism does not offer theoretical benefits, and does not even lead to convergence to the solution, since the errors introduced through rounding persist and prevent the method from solving the problem.

2 SUMMARY OF CONTRIBUTIONS

In this paper, we investigate whether it is possible to design communication-efficient algorithms for solving distributed VI problems by borrowing generic communication compression techniques (1) and (2) from the optimization literature (Seide et al., 2014; Alistarh et al., 2017; Mishchenko et al., 2019; Gorbunov et al., 2021; Richtárik et al., 2021) and adapting and embedding them into established and efficient methods for solving VIs (Korpelevich, 1976; Nemirovski, 2004; Juditsky et al., 2008; Alacaoglu & Malitsky, 2021). Whether or not this is possible is an open problem.

In summary,

we design the first provably communication-efficient algorithms for solving general distributed VI problems (see Section 3, Equation 3) in the deterministic (see (4)) and stochastic (see (5)) regimes, supporting both unbiased (MASHA1 = Algorithm 1) and contractive (MASHA2 = Algorithm 2) compressors. Our methods are explicitly designed to be variance reduced to achieve better theoretical properties and better practical performance.

In Table 1 we give a high-level overview of existing methods for VIs, and contrast them with our methods and results. We now elaborate a bit more:

2.1 TWO DISTRIBUTED PROBLEMS: DETERMINISTIC AND STOCHASTIC

We study two distributed VI problems: i) deterministic, where the monotone operator F : R^d → R^d featured in the VI is the average of M operators {F_m}_{m=1}^M, where M is the number of devices/machines, which can be evaluated in each communication round, and ii) stochastic, where each monotone operator F_m : R^d → R^d has a finite-sum structure on its own, and only a single


Table 2: Summary of our iteration complexity results for finding an ε-solution of problem (3) in the deterministic (i.e., (4)) and stochastic (i.e., (4)–(5)) setups. In the strongly convex - strongly concave case, convergence is measured by the distance to the solution. In the convex-concave case, convergence is measured in terms of the gap function. Notation: µ = sum of the coefficients µ_F and µ_h of strong monotonicity of the operator F and strong convexity of h; L = maximum of the local Lipschitz constants L_m; R = diameter (in the Euclidean norm) of the optimization set; R_0 = initial distance to the solution; q = the variance parameter associated with an unbiased compressor (see (1)); δ = the variance parameter associated with a contractive compressor (see (2)); M = the number of parallel clients/nodes; r = the size of the local dataset (see (5)). To simplify the bounds, we assume that the expected density β of the compression operators satisfies q = 1/β and δ = 1/β (e.g., this holds for Randk and Topk).

Problem | Algorithm | Strongly convex - strongly concave case
(3)–(4) | MASHA1 | [ 1 + q + (L/µ)·√(1 + q + q²/M) ] · log(R_0²/ε)
(3)–(4) | MASHA2 | [ 1 + δ + δ^{3/2} L/µ + δ³ L²/µ² ] · log(R_0²/ε)
(3)–(5) | MASHA1 | [ max(r, q+1) + (L/µ)·√( max(r, q+1)·(1 + q/M) ) ] · log(R_0²/ε)
(3)–(5) | MASHA2 | [ max(r, δ+1) + √(max(r, q+1))·δL/µ + max(r, q+1)·δ²L²/µ² ] · log(R_0²/ε)

Problem | Algorithm | Convex-concave case
(3)–(4) | MASHA1 | √(1 + q + q²/M) · LR²/ε
(3)–(4) | MASHA2 | –
(3)–(5) | MASHA1 | max(r, δ+1)·√(1 + q/M) · LR²/ε
(3)–(5) | MASHA2 | –

operator in the sum can be evaluated in each iteration. In contrast to previous works, we study general constrained VIs in the distributed setup (see Section 3), and not merely saddle point problems.

2.2 TWO NEW METHODS WITH COMPRESSED COMMUNICATION: MASHA1 AND MASHA2

We develop two extensions of the extragradient / extra-step method of Korpelevich (1976) to distributed VIs, depending on whether we use unbiased (1) or contractive (2) compressors, since each type of compressor demands a different algorithmic design and a different analysis. In particular, contractive compressors are notoriously hard to analyze even for optimization problems (Karimireddy et al., 2019; Richtárik et al., 2021). Our method based on unbiased compressors is called MASHA1 (Algorithm 1), and our method based on contractive compressors is called MASHA2 (Algorithm 2). Both are designed to handle the deterministic and also the stochastic setting, and both are enhanced with bespoke variance-reduction techniques for better theoretical and practical performance. Due to space restrictions, we only describe MASHA1 in the main body of the paper, and relegate MASHA2 and the associated theory to Appendix B.

2.3 THEORETICAL COMPLEXITY RESULTS

We establish a number of theoretical complexity results for our methods, which we summarize in Table 2. We consider the strongly convex - strongly concave regime as well as the more general convex - concave regime. In the first case we obtain linear convergence results (O(log 1/ε)) in terms of the distance to the solution, and in the latter we obtain fast sublinear convergence results (O(1/ε)) in terms of the gap function. The rates for MASHA1 improve with the number of clients M and have a weak dependence on the conditioning of the problem.

3 PROBLEM FORMULATION AND ASSUMPTIONS

Let us first introduce basic notation. We write 〈x, y〉 := Σ_{i=1}^d x_i y_i to denote the standard Euclidean inner product of vectors x, y ∈ R^d. This induces the ℓ2-norm in R^d as usual: ‖x‖ := √〈x, x〉. We also


introduce the proximal operator, defined as prox_g(z) := arg min_{u∈Z} { g(u) + (1/2)‖u − z‖² }, which is well defined for proper lower semicontinuous convex functions g : R^d → R ∪ {+∞}.
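As a concrete illustration (ours, not from the paper): for the unconstrained case Z = R^d with g(u) = λ‖u‖₁, the proximal operator is coordinate-wise soft thresholding, and for the indicator function of a Euclidean ball it reduces to the projection onto that ball.

```python
import numpy as np

def prox_l1(z, gamma, lam):
    """prox_{gamma*g}(z) for g(u) = lam*||u||_1 on Z = R^d: soft thresholding."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma * lam, 0.0)

def prox_ball(z, radius):
    """prox of the indicator of {u : ||u|| <= radius}: Euclidean projection."""
    nrm = np.linalg.norm(z)
    return z if nrm <= radius else (radius / nrm) * z
```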

3.1 PROBLEM FORMULATION

We study distributed variational inequality (VI) problems of the form

Find z∗ ∈ Z such that 〈F (z∗), z − z∗〉+ h(z)− h(z∗) ≥ 0, ∀z ∈ Z, (3)

where Z is a nonempty closed convex subset of R^d, F : R^d → R^d is an operator with certain favorable properties (e.g., Lipschitzness and monotonicity), and h : R^d → R ∪ {+∞} is a proper lower semicontinuous convex function. We assume that the training data describing F is distributed across M workers/nodes/clients:

F(z) := (1/M) Σ_{m=1}^M F_m(z),   (4)

where F_m : R^d → R^d for all m ∈ {1, 2, . . . , M}, and for some results we further assume that F_m is of a finite-sum structure as well:

F_m(z) := (1/r) Σ_{i=1}^r F_{m,i}(z).   (5)

3.2 ASSUMPTIONS

Next, we list two key assumptions - both are standard in the literature on VIs.

Assumption 1 (Lipschitzness in mean) For all clients m = 1, 2, . . . , M, the operator F_m : R^d → R^d is Lipschitz in mean with constant L_m ≥ 0 on Z. That is,

(1/r) Σ_{i=1}^r ‖F_{m,i}(z1) − F_{m,i}(z2)‖² ≤ L_m² ‖z1 − z2‖²,   ∀z1, z2 ∈ Z.   (6)

For problem (4), Assumption 1 is to be interpreted to hold with r equal to 1, i.e., with F_m = F_{m,1}.

Assumption 2 (Monotonicity and convexity) (SM) Strong monotonicity/strong convexity. There exist non-negative constants µ_F, µ_h such that µ_h + µ_F > 0, and the following statements hold:

〈F(z1) − F(z2), z1 − z2〉 ≥ µ_F ‖z1 − z2‖²,   ∀z1, z2 ∈ Z,   (7)
h(z1) − h(z2) − 〈∇h(z2), z1 − z2〉 ≥ (µ_h/2)‖z1 − z2‖²,   ∀z1, z2 ∈ Z.   (8)

(M) Monotonicity/convexity.

〈F (z1)− F (z2), z1 − z2〉 ≥ 0, h(z1)− h(z2)− 〈∇h(z2), z1 − z2〉 ≥ 0, ∀z1, z2 ∈ Z. (9)

4 MASHA1: HANDLING UNBIASED COMPRESSORS

In this section we present only one of our two new algorithms: MASHA1 (Algorithm 1), the method that relies on unbiased compressors. Due to lack of space, we include MASHA2 (Algorithm 2), the method that relies on contractive compressors, in the appendix. Both algorithms are presented for two modes: deterministic (4) and stochastic (4)–(5). Lines specific to the deterministic regime and to the stochastic regime are marked accordingly in the listing; unmarked lines are common to both modes.

4.1 THE ALGORITHM

MASHA1 is a modification of the extra-step method. At the beginning of each iteration of Algorithm 1, each device knows the value of F(w^k), hence it can calculate ẑ^k and z^{k+1/2} locally. Then it sends the compressed version of the difference F_m(z^{k+1/2}) − F_m(w^k) to the server, and the server does a reverse compressed broadcast. As a result, an unbiased estimate of F(z^{k+1/2}) − F(w^k) appears on each node. Then all devices make the final update to obtain z^{k+1}, and either set w^{k+1} = z^{k+1} if b_k = 1 or keep the point from the previous iteration, w^{k+1} = w^k. In the case when w^{k+1} = z^{k+1}, we need to exchange the full values of F_m(w^{k+1}) so that at the beginning of the next iteration the value of F(w^{k+1}) is known to all nodes. In the stochastic case, Algorithm 1 has the same form; the only thing that changes is that one needs to generate a function number (batch number) from 1 to r. We allow a possibly different compressor on each device and also on the server. To distinguish between them, we use the following notation: Q_m^dev, q_m^dev, β_m^dev and Q^serv, q^serv, β^serv. Note that if Q is the identity quantization, i.e., Q(x) = x, then MASHA1 is a distributed analogue of the method from Alacaoglu & Malitsky (2021).

Algorithm 1 MASHA1 (handling unbiased compressors)

Parameters: stepsize γ > 0, number of iterations K.
Initialization: choose z^0 = w^0 ∈ Z. The server sends z^0 = w^0 to the devices; the devices compute F_m(w^0), send it to the server, and get F(w^0).
for k = 0, 1, 2, . . . , K − 1 do
    for each device m in parallel do
        ẑ^k = τ z^k + (1 − τ) w^k
        z^{k+1/2} = prox_{γh}(ẑ^k − γ F(w^k))
        (stochastic) generate π_m^k from {1, . . . , r} independently
        (deterministic) compute F_m(z^{k+1/2}) and send Q_m^dev( F_m(z^{k+1/2}) − F_m(w^k) ) to the server
        (stochastic) compute F_{m,π_m^k}(z^{k+1/2}) and send Q_m^dev( F_{m,π_m^k}(z^{k+1/2}) − F_{m,π_m^k}(w^k) ) to the server
    end for
    on the server:
        (deterministic) compute Q^serv[ (1/M) Σ_{m=1}^M Q_m^dev( F_m(z^{k+1/2}) − F_m(w^k) ) ] and send it to the devices
        (stochastic) compute Q^serv[ (1/M) Σ_{m=1}^M Q_m^dev( F_{m,π_m^k}(z^{k+1/2}) − F_{m,π_m^k}(w^k) ) ] and send it to the devices
        send to the devices one bit b_k: 1 with probability 1 − τ, 0 with probability τ
    for all devices in parallel do
        (deterministic) z^{k+1} = prox_{γh}( ẑ^k − γ Q^serv[ (1/M) Σ_{m=1}^M Q_m^dev( F_m(z^{k+1/2}) − F_m(w^k) ) ] − γ F(w^k) )
        (stochastic) z^{k+1} = prox_{γh}( ẑ^k − γ Q^serv[ (1/M) Σ_{m=1}^M Q_m^dev( F_{m,π_m^k}(z^{k+1/2}) − F_{m,π_m^k}(w^k) ) ] − γ F(w^k) )
        if b_k = 1 then
            w^{k+1} = z^{k+1}; compute F_m(w^{k+1}), send it to the server, and get F(w^{k+1})
        else
            w^{k+1} = w^k
        end if
    end for
end for
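To make the communication pattern concrete, the following is a minimal single-process simulation of one deterministic MASHA1 iteration (our own sketch, not the authors' implementation). It assumes h ≡ 0, so prox_{γh} is the identity, the M local operators are given as Python callables, and the caller supplies an unbiased compressor (e.g., rand_k from the sketch in Section 1.3); by default no compression is applied.

```python
import numpy as np

def masha1_step(z, w, F_w, F_list, gamma, tau, compress=lambda v: v, rng=None):
    """One deterministic MASHA1 iteration, simulated in a single process.

    F_list   : list of M callables F_m,
    F_w      : F(w), already known to every node,
    compress : an unbiased compressor Q (identity by default),
    h == 0, so the prox reduces to the identity."""
    rng = rng or np.random.default_rng()
    z_hat = tau * z + (1 - tau) * w                        # convex combination
    z_half = z_hat - gamma * F_w                           # extrapolation step
    # every "device" compresses its difference F_m(z_half) - F_m(w)
    deltas = [compress(F(z_half) - F(w)) for F in F_list]
    # the "server" averages, compresses again and broadcasts
    g_half = compress(np.mean(deltas, axis=0)) + F_w
    z_new = z_hat - gamma * g_half                         # final update
    if rng.random() < 1 - tau:                             # b_k = 1: full sync of w
        w_new = z_new
        F_w_new = np.mean([F(w_new) for F in F_list], axis=0)
    else:
        w_new, F_w_new = w, F_w
    return z_new, w_new, F_w_new
```

With probability τ the full vectors F_m(w) are not exchanged, which is exactly where the communication savings come from.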

It is important to note the main differences from minimization problems. For minimization problems, compressed gradient methods are built on top of gradient descent or accelerated gradient descent. Here, the extragradient method is taken as a basis. In the experiments we will see the importance of this fact: gradient-descent-type methods diverge even on simple problems. The second key difference is that we compress not F_m(z^{k+1/2}) itself, but the difference F_m(z^{k+1/2}) − F_m(w^k). We will also see the importance of this approach in the experiments. The latter idea is similar to the approach used in the DIANA, VR-DIANA (Mishchenko et al., 2019; Horváth et al., 2019), MARINA (Gorbunov et al., 2021) and EF21 (Richtárik et al., 2021) methods in optimization.

4.2 THEORY

We now establish convergence of MASHA1 in both regimes. Our analysis relies on the following Lyapunov function:

V_k := τ‖z^k − z∗‖² + ‖w^k − z∗‖².   (10)


This criterion is used in the strongly monotone case. For the general monotone case, another convergence criterion is used, the gap function:

Gap(z) := sup_{u∈C} [ 〈F(u), z − u〉 + h(z) − h(u) ].   (11)

Here we do not take the maximum over the entire set Z (as in the classical version), but over C, a compact subset of Z. Thus, we can also consider unbounded sets Z. This is permissible, since such a version of the criterion is valid if the solution z∗ lies in C; for details see the work of Nesterov (2007).
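As a small illustration of how this restricted gap can be monitored in practice (our own sketch, not part of the paper): maximizing over any finite sample of points from C gives a valid lower bound on Gap(z), since the supremum over C dominates the maximum over a subset.

```python
import numpy as np

def gap_lower_bound(z, F, h, candidates):
    """Lower bound on Gap(z) = sup_{u in C} [<F(u), z - u> + h(z) - h(u)],
    obtained by maximizing over a finite list of candidate points u from C."""
    return max(float(F(u) @ (z - u) + h(z) - h(u)) for u in candidates)

# Tiny usage example with h == 0 and a linear monotone operator F(u) = A @ u:
# A = np.array([[0.0, 1.0], [-1.0, 0.0]])
# print(gap_lower_bound(np.ones(2), lambda u: A @ u, lambda u: 0.0,
#                       [np.zeros(2), np.array([0.5, -0.5])]))
```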

Theorem 1 (Convergence of MASHA1) Let Assumption 1 be satisfied. Then, if one of the cases from Assumption 2 is additionally fulfilled, the following estimates hold for the iterates of MASHA1:

• for the strongly-monotone/convex case, with stepsize 0 < γ ≤ min{ √(1 − τ)/(4C_q); (1 − τ)/(4(µ_F + µ_h)) } and C_q := √( (q^serv/M²) Σ_{m=1}^M (q_m^dev + M − 1) L_m² ):

E[V_K] ≤ ( 1 − γ·(µ_F + µ_h)/16 )^{K−1} · V_0,

• for the monotone/convex case, with γ ≤ √(1 − τ)/(4C_q):

E[Gap(z̄^K)] ≤ 8 max_{u∈C}[‖z^0 − u‖²]/(γK),   where z̄^K = (1/K) Σ_{k=0}^{K−1} z^{k+1/2}.

For the proof, see Appendix A.

An important issue for convergence is the choice of τ. If τ = 0 we have the fastest convergence rate; on the other hand, this means that we must send the full F_m in each iteration (because we update w^k), which is disadvantageous from a communication point of view. The next corollaries give the rules for the right choice of τ, as well as the iteration (in the deterministic case), oracle (in the stochastic case) and communication complexities of Algorithm 1 in both cases. Since a method without quantization (for example, the ordinary distributed extragradient method) transmits an amount of information proportional to Ω(bdM) bits per iteration, we measure the communication complexity in terms of Ω(bdM).

Corollary 1 (Convergence of MASHA1 in the deterministic case) Let the deterministic problem (3)–(4) be solved by MASHA1 with precision ε. Let Assumption 1, and one of the two cases from Assumption 2, be satisfied. If we choose

τ = 1 − β := 1 − (1/2)( (1/M) Σ_{m=1}^M β_m^dev + β^serv ),   (12)

then we have the following estimates for the total # of iterations and the total # of transferred bits:

• in the strongly-monotone/convex case:
O( [ 1/β + C_q/(√β (µ_F + µ_h)) ] log( 2‖z^0 − z∗‖²/ε ) ) iterations,   O( [ 1 + √β C_q/(µ_F + µ_h) ] log( 2‖z^0 − z∗‖²/ε ) ) bits,

• in the monotone/convex case:
O( C_q max_{u∈C}[‖z^0 − u‖²]/(√β ε) ) iterations,   O( √β C_q max_{u∈C}[‖z^0 − u‖²]/ε ) bits.

One can see that our Algorithm 1 can outperform the uncompressed extragradient method. Let us compare them in the monotone case. The communication complexity of the extragradient method is O( L max_{u∈C}[‖z^0 − u‖²] ε^{−1} ). Let us consider the case when L_m = L for all m ∈ {1, 2, . . . , M} and q_m^dev = q. Then MASHA1 has communication complexity O( L max_{u∈C}[‖z^0 − u‖²] ε^{−1} √( β q^serv (1 + q/M) ) ). If β q^serv (1 + q/M) < 1, we outperform the standard uncompressed extragradient method.
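For a quick sanity check of this condition, here is an illustrative calculation with assumed parameters (Randk with k = d/10 on the devices, an uncompressed server broadcast, and M = 16 workers); the specific numbers are our own example, not from the paper.

```python
# Device compressor Rand-k with k = d/10: q = 10, beta_dev = 0.1.
# Server sends uncompressed: q_serv = 1, beta_serv = 1.0.  M = 16 workers.
q, q_serv, beta_dev, beta_serv, M = 10, 1, 0.1, 1.0, 16
beta = 0.5 * (beta_dev + beta_serv)       # expected density as in (12): 0.55
factor = beta * q_serv * (1 + q / M)      # ~0.89 < 1, so the bound improves
print(beta, factor, factor < 1)
```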


Corollary 2 (Convergence of MASHA1 in the stochastic case) Let the stochastic problem (3)–(5) be solved by MASHA1 with precision ε. Let Assumption 1 and one of the two cases from Assumption 2 be satisfied. If we choose

τ = 1 − min(1/r; β),

then we have the following estimates for the total # of oracle calls and the total # of transferred bits:

• in the strongly-monotone/convex case:
O( [ max(r; 1/β) + max(√r; 1/√β)·C_q/(µ_F + µ_h) ] log( 2‖z^0 − z∗‖²/ε ) ) calls,
O( [ max(1; rβ) + max(√(rβ²); √β)·C_q/(µ_F + µ_h) ] log( 2‖z^0 − z∗‖²/ε ) ) bits,

• in the monotone/convex case:
O( max(r; 1/β)·C_q max_{u∈C}[‖z^0 − u‖²]/ε ) calls,   O( max(√(rβ²); √β)·C_q max_{u∈C}[‖z^0 − u‖²]/ε ) bits.

In the case of a finite sum, in addition to the amount of transmitted information, what matters is not the number of iterations but the number of calls to the oracle for F_{m,i}. This is because computing the full F_m is r times more expensive, and such computations should be avoided. For the deterministic case (when r = 1), the estimates from Corollary 2 coincide with those in Corollary 1.

5 EXPERIMENTS

5.1 BILINEAR SADDLE POINT PROBLEM

We start our experiments with a distributed bilinear problem:

min_{x∈R^d} max_{y∈R^d} f(x, y) := (1/M) Σ_{m=1}^M f_m(x, y),   f_m(x, y) := xᵀA_m y + a_mᵀx + b_mᵀy + (λ/2)‖x‖² − (λ/2)‖y‖²,   (13)

where A_m ∈ R^{d×d}, a_m, b_m ∈ R^d. This is a saddle point problem, and F is written as follows:

F(x, y) := (∇_x f(x, y); −∇_y f(x, y)).

This operator is λ-strongly monotone and, moreover, all the operators F_m are (‖A_m‖_2 + λ)-Lipschitz. Therefore, such a distributed problem is well suited for a primary comparison of our methods. We take d = 100 and generate positive definite matrices A_m and vectors a_m, b_m randomly.
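As an illustration (ours, not from the paper), the sketch below builds one local operator F_m for problem (13); the helper name and the particular way the positive definite matrices are generated are our own assumptions.

```python
import numpy as np

def make_local_operator(d=100, lam=0.1, seed=0):
    """Build F_m(x, y) = (grad_x f_m; -grad_y f_m) for the bilinear problem (13)."""
    rng = np.random.default_rng(seed)
    B = rng.standard_normal((d, d))
    A = B @ B.T + np.eye(d)              # positive definite A_m
    a = rng.standard_normal(d)
    b = rng.standard_normal(d)

    def F(z):
        x, y = z[:d], z[d:]
        gx = A @ y + a + lam * x         # grad_x f_m
        gy = x @ A + b - lam * y         # grad_y f_m  (x @ A == A^T x)
        return np.concatenate([gx, -gy]) # monotone operator (grad_x; -grad_y)
    return F
```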

The purpose of the experiment is to understand whether the MASHA1 and MASHA2 methods are superior to those in the literature. For comparison, we take QGD with natural dithering (Horvath et al., 2019), classical error feedback with Top30% compression, as well as an extra-step method in which each step uses natural rounding. In MASHA1 (Algorithm 1) we also used natural dithering, and in MASHA2 (Algorithm 2) we used Top30%. See the results in Figure 1.

Figure 1: Comparison of MASHA1 (Algorithm 1) and MASHA2 (Algorithm 2) with existing methods, in iterations and in Mbytes. [Two plots of solution accuracy versus the number of iterations and versus transmitted data (Mbytes); curves: Algorithm 1, Algorithm 2, SGD/QGD, Error feedback, Extra step.]

We see that the methods based on gradient descent (QGD and error feedback) diverge. This confirms that one needs to use methods specifically designed for saddle point problems (for example, the extragradient method), and not classical optimization methods. The much slower convergence of the quantized extragradient method shows the efficiency of our approach, in which we compress the differences F_m(z^{k+1/2}) − F_m(w^k).


5.2 ADVERSARIAL TRAINING OF TRANSFORMERS

We now evaluate how compression performs for variational inequalities (and for saddle point problems as a special case) in a more practically motivated scenario. Indeed, saddle point problems have ample applications in machine learning (see Appendix D), including adversarial training. We consider one of these tasks and train a transformer-based masked language model (Vaswani et al., 2017; Devlin et al., 2019; Liu et al., 2019) using a fleet of 16 low-cost preemptible workers with T4 GPUs and low-bandwidth interconnect. For this task, we use the compute-efficient adversarial training regimen proposed for transformers by Zhu et al. (2019); Liu et al. (2020). Formally, the adversarial formulation of the problem is the min-max problem

min_w max_{‖ρ_1‖≤e, . . . , ‖ρ_N‖≤e} (1/N) Σ_{n=1}^N ℓ( f(w, x_n + ρ_n), y_n ) + (λ/2)‖w‖² − (β/2)‖ρ‖²,

where w are the weights of the model, (x_n, y_n)_{n=1}^N are the pairs of training data, ρ is the so-called adversarial noise which introduces a perturbation into the data, and λ and β are the regularization parameters. To make our setup more realistic, we train ALBERT-large with layer sharing (Lan et al., 2020), which was recently shown to be much more communication-efficient during training (Ryabinin et al., 2021; Diskin et al., 2021). We train our model on a combination of the BookCorpus and Wikipedia datasets with the same optimizer (LAMB) and parameters as in the original paper (Lan et al., 2020), use the adversarial training configuration of Zhu et al. (2019), and follow the system design considerations for preemptible instances (Ryabinin et al., 2021).

In terms of communication, we consider 4 different setups for gradient compression: the "baseline" strategy with uncompressed gradients, full 8-bit quantization (Dettmers, 2015; Lin et al., 2018), mixed 8-bit quantization, and Power compression (Vogels et al., 2019) with rank r=8. For mixed 8-bit quantization and Power we only apply compression to gradient tensors with more than 2^16 elements, sending smaller ones uncompressed. These small tensors represent layer biases and LayerNorm scales (Ba et al., 2016) that collectively amount to ≤ 1% of the total gradient, but can be more difficult to compress than regular weight tensors. Finally, since Power is a biased compression algorithm, we use error feedback (Karimireddy et al., 2019; Richtárik et al., 2021) with a modified formulation proposed by Vogels et al. (2019). For all experimental setups, we report learning curves in terms of the model training objective, similarly to Fedus et al. (2021); Ryabinin et al. (2021). To better quantify the differences in training loss, we also evaluate the downstream performance of each model on several popular tasks from Wang et al. (2018) after each model was trained on approximately 80 billion tokens. Finally, we measure the communication efficiency of each proposed strategy by measuring the average wall time per communication round when all 16 workers are active.
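The size-threshold rule for the mixed setup can be sketched as follows (our own illustration; compress_8bit is a toy stand-in for whatever quantizer is actually used, not the authors' implementation).

```python
import numpy as np

SIZE_THRESHOLD = 2 ** 16  # tensors at or below this size are sent uncompressed

def compress_8bit(t):
    """Toy linear 8-bit quantizer: a stand-in for the real codec."""
    scale = np.max(np.abs(t)) / 127.0 + 1e-12
    return np.round(t / scale).astype(np.int8), scale

def pack_gradients(grads):
    """Mixed strategy: compress only large tensors, send small ones as-is."""
    packed = []
    for g in grads:
        if g.size > SIZE_THRESHOLD:
            packed.append(("int8", compress_8bit(g)))
        else:
            packed.append(("fp32", g))   # biases, LayerNorm scales, etc.
    return packed
```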

The learning curves in Figure 2 (upper) follow a predictable pattern, with more extreme compression techniques demonstrating slower per-iteration convergence. One curious exception is full 8-bit quantization, which was unable to achieve competitive training loss. The remaining three setups converge to similar loss values below 2. Both the baseline and mixed 8-bit compression show similar values in terms of downstream performance, with Power compression showing mild degradation. But in terms of information transfer time, the methods using compression (especially Power) are significantly superior to the method without compression. This makes it possible to use such techniques to speed up training without sacrificing quality.


Figure 2: (upper left) ALBERT training objective convergence rate with different compression algorithms; (upper right) the same curves, zoomed in; (lower) average wall time per communication round with standard deviation over 5 repetitions, and downstream evaluation scores on GLUE benchmark tasks after 80 billion training tokens (≈10^4 optimizer steps).

[Two plots: training objective versus billions of tokens for the uncompressed, pure 8-bit, mixed 8-bit and Power setups; the right panel shows the same curves zoomed in.]

Setup       | Averaging time | CoLA | MNLI | MRPC | QNLI | QQP  | RTE  | SST2 | STS-B | WNLI
Baseline    | 8.789 ± 0.032  | 45.2 | 81.1 | 83.0 | 88.3 | 89.0 | 67.8 | 85.5 | 89.4  | 18.3
Full 8-bit  | 4.422 ± 0.069  | N/A  | N/A  | N/A  | N/A  | N/A  | N/A  | N/A  | N/A   | N/A
Mixed 8-bit | 4.614 ± 0.08   | 48.8 | 81.3 | 88.7 | 88.1 | 85.2 | 64.3 | 88.3 | 87.5  | 16.9
Power       | 1.571 ± 0.049  | 43.9 | 80.5 | 85.6 | 88.6 | 86.0 | 47.2 | 88.5 | 88.5  | 16.9

REFERENCES

Ahmet Alacaoglu and Yura Malitsky. Stochastic variance reduction for variational inequality methods. arXiv preprint arXiv:2102.08352, 2021.

Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, pp. 1709–1720, 2017.

Dan Alistarh, Torsten Hoefler, Mikael Johansson, Sarit Khirirat, Nikola Konstantinov, and Cédric Renggli. The convergence of sparsified gradient methods. In Advances in Neural Information Processing Systems, 2018.

Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton. Layer normalization, 2016.

Heinz H. Bauschke and Patrick L. Combettes. Convex Analysis and Monotone Operator Theory in Hilbert Spaces. Springer, second edition, 2017.

Aleksandr Beznosikov, Samuel Horváth, Peter Richtárik, and Mher Safaryan. On biased compression for distributed learning. arXiv preprint arXiv:2002.12410, 2020.

Aleksandr Beznosikov, Pavel Dvurechensky, Anastasia Koloskova, Valentin Samokhin, Sebastian U Stich, and Alexander Gasnikov. Decentralized local stochastic extra-gradient for variational inequalities. arXiv preprint arXiv:2106.08315, 2021a.

Aleksandr Beznosikov, Valentin Samokhin, and Alexander Gasnikov. Distributed saddle-point problems: Lower bounds, optimal algorithms and federated GANs. arXiv preprint arXiv:2010.13112, 2021b.

Aleksandr Beznosikov, Gesualdo Scutari, Alexander Rogozin, and Alexander Gasnikov. Distributed saddle-point problems under similarity. arXiv preprint arXiv:2107.10706, 2021c.

Yuyang Deng and Mehrdad Mahdavi. Local stochastic gradient descent ascent: Convergence analysis and communication efficiency. In International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 1387–1395. PMLR, 2021.

Tim Dettmers. 8-bit approximations for parallelism in deep learning. ICLR, 2015.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT, 2019.


Michael Diskin, Alexey Bukhtiyarov, Max Ryabinin, Lucile Saulnier, Quentin Lhoest, Anton Sinitsin, Dmitriy Popov, Dmitry Pyrkin, Maxim Kashirin, Alexander Borzunov, Albert Villanova del Moral, Denis Mazur, Ilia Kobelev, Yacine Jernite, Thomas Wolf, and Gennady Pekhimenko. Distributed deep learning in open collaborations. CoRR, abs/2106.10207, 2021. URL https://arxiv.org/abs/2106.10207.

Darina Dvinskikh and Daniil Tiapkin. Improved complexity bounds in Wasserstein barycenter problem. In International Conference on Artificial Intelligence and Statistics, pp. 1738–1746. PMLR, 2021.

Francisco Facchinei and Jong-Shi Pang. Finite-Dimensional Variational Inequalities and Complementarity Problems. Springer Series in Operations Research. Springer, 2003.

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.

Avishek Ghosh, Raj Kumar Maity, Arya Mazumdar, and Kannan Ramchandran. Communication efficient distributed approximate Newton method. In IEEE International Symposium on Information Theory (ISIT), 2020. doi: 10.1109/ISIT44484.2020.9174216.

Gauthier Gidel, Hugo Berard, Gaëtan Vignoud, Pascal Vincent, and Simon Lacoste-Julien. A variational inequality perspective on generative adversarial networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=r1laEnA5Ym.

Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial networks. In Neural Information Processing Systems, 2014.

Eduard Gorbunov, Konstantin Burlachenko, Zhize Li, and Peter Richtárik. MARINA: Faster non-convex distributed learning with compression. In 38th International Conference on Machine Learning, 2021.

P. T. Harker and J.-S. Pang. Finite-dimensional variational inequality and nonlinear complementarity problems: a survey of theory, algorithms and applications. Mathematical Programming, 1990.

Samuel Horvath, Chen-Yu Ho, Ludovit Horvath, Atal Narayan Sahu, Marco Canini, and Peter Richtárik. Natural compression for distributed deep learning. arXiv preprint arXiv:1905.10988, 2019.

Samuel Horváth, Dmitry Kovalev, Konstantin Mishchenko, Sebastian Stich, and Peter Richtárik. Stochastic distributed learning with gradient quantization and variance reduction. arXiv preprint arXiv:1904.05115, 2019.

Charlie Hou, Kiran K Thekumparampil, Giulia Fanti, and Sewoong Oh. Efficient algorithms for federated saddle point optimization. arXiv preprint arXiv:2102.06333, 2021.

Anatoli Juditsky, Arkadii S. Nemirovskii, and Claire Tauvel. Solving variational inequalities with stochastic mirror-prox algorithm, 2008.

Sai Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and Ananda Suresh. SCAFFOLD: Stochastic controlled averaging for on-device federated learning. In International Conference on Machine Learning (ICML), 2020.

Sai Praneeth Karimireddy, Quentin Rebjock, Sebastian Stich, and Martin Jaggi. Error feedback fixes SignSGD and other gradient compression schemes. In International Conference on Machine Learning, pp. 3252–3261. PMLR, 2019.

Jakub Konecný, H. Brendan McMahan, Felix Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: strategies for improving communication efficiency. In NIPS Private Multi-Party Machine Learning Workshop, 2016.

G. M. Korpelevich. The extragradient method for finding saddle points and other problems. Matecon, 12:747–756, 1976.

Zhen-Zhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations, 2020.

Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated learning: challenges, methods, and future directions. IEEE Signal Processing Magazine, 37(3):50–60, 2020. doi: 10.1109/MSP.2020.2975749.

Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep gradient compression: Reducing the communication bandwidth for distributed training. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=SkhQHMW0W.


Xiaodong Liu, Hao Cheng, Pengcheng He, Weizhu Chen, Yu Wang, Hoifung Poon, and Jianfeng Gao. Adversarial training for large neural language models. arXiv preprint arXiv:2004.08994, 2020.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. RoBERTa: A robustly optimized BERT pretraining approach. ArXiv, abs/1907.11692, 2019.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017.

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rJzIBfZAb.

Brendan McMahan, Eider Moore, Daniel Ramage, and Blaise Agüera y Arcas. Federated learning of deep networks using model averaging. arXiv preprint arXiv:1602.05629, 2016.

Konstantin Mishchenko, Eduard Gorbunov, Martin Takác, and Peter Richtárik. Distributed learning with compressed gradient differences. arXiv preprint arXiv:1901.09269, 2019.

Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on Optimization, 15:229–251, 2004. doi: 10.1137/S1052623403425629.

Yurii Nesterov. Dual extrapolation and its applications to solving variational inequalities and related problems. Mathematical Programming, 109(2):319–344, 2007.

Maher Nouiehed, Maziar Sanjabi, Tianjian Huang, Jason D Lee, and Meisam Razaviyayn. Solving a class of non-convex min-max games using iterative first order methods. arXiv preprint arXiv:1902.08297, 2019.

Lerrel Pinto, James Davidson, Rahul Sukthankar, and Abhinav Gupta. Robust adversarial reinforcement learning. In International Conference on Machine Learning, 2017.

Peter Richtárik, Igor Sokolov, and Ilyas Fatkhullin. EF21: A new, simpler, theoretically better, and practically faster error feedback. arXiv preprint arXiv:2106.05203, 2021.

Max Ryabinin, Eduard Gorbunov, Vsevolod Plokhotnyuk, and Gennady Pekhimenko. Moshpit SGD: Communication-efficient decentralized training on heterogeneous unreliable devices, 2021.

Frank Seide, Hao Fu, Jasha Droppo, Gang Li, and Dong Yu. 1-bit stochastic gradient descent and its application to data-parallel distributed training of speech DNNs. In Fifteenth Annual Conference of the International Speech Communication Association, 2014.

Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

V. Smith, S. Forte, C. Ma, M. Takác, M. I. Jordan, and M. Jaggi. CoCoA: A general framework for communication-efficient distributed optimization. Journal of Machine Learning Research, 18:1–49, 2018.

Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E. Schapire. Fast convergence of regularized learning in games. In Neural Information Processing Systems, 2015.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 5998–6008. Curran Associates, Inc., 2017. URL http://papers.nips.cc/paper/7181-attention-is-all-you-need.pdf.

Joost Verbraeken, Matthijs Wolting, Jonathan Katzy, Jeroen Kloppenburg, Tim Verbelen, and Jan S Rellermeyer. A survey on distributed machine learning. ACM Computing Surveys, 2019.

Thijs Vogels, Sai Praneeth Karimireddy, and Martin Jaggi. PowerSGD: Practical low-rank gradient compression for distributed optimization. In Advances in Neural Information Processing Systems 32, 2019.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.


Deming Yuan, Qian Ma, and Zhen Wang. Dual averaging method for solving multi-agent saddle-point problems with quantized information. Transactions of the Institute of Measurement and Control, 36(1):38–46, 2014. doi: 10.1177/0142331213487545. URL https://doi.org/10.1177/0142331213487545.

Chen Zhu, Yu Cheng, Zhe Gan, Siqi Sun, Tom Goldstein, and Jingjing Liu. FreeLB: Enhanced adversarial training for natural language understanding. arXiv preprint arXiv:1909.11764, 2019.


APPENDIX

CONTENTS

1 Introduction
1.1 The expressive power of variational inequalities
1.2 Training of supervised models via distributed optimization
1.3 Two classes of compression operators
1.4 Towards communication-efficient distributed methods for VIs
2 Summary of Contributions
2.1 Two distributed problems: deterministic and stochastic
2.2 Two new methods with compressed communication: MASHA1 and MASHA2
2.3 Theoretical complexity results
3 Problem Formulation and Assumptions
3.1 Problem formulation
3.2 Assumptions
4 MASHA1: Handling Unbiased Compressors
4.1 The Algorithm
4.2 Theory
5 Experiments
5.1 Bilinear Saddle Point Problem
5.2 Adversarial Training of Transformers
A Proof of Theorem 1
A.1 A lemma
A.2 Deterministic case: Theorem 1
A.3 Strongly monotone/convex case
A.4 Monotone/convex case (µh = 0, µF = 0)
A.5 Stochastic case
B MASHA2: Handling Contractive Compressors
C Proof of Theorem 2
C.1 Strongly monotone case
D Motivating Examples


A PROOF OF THEOREM 1

A.1 A LEMMA

Lemma 1 Let h be µ_h-strongly convex and z⁺ = prox_{γh}(z). Then for all x ∈ R^d the following inequality holds:

〈z⁺ − z, x − z⁺〉 ≥ γ( h(z⁺) − h(x) + (µ_h/2)‖z⁺ − x‖² ).   (14)

Proof: We use the γµ_h-strong convexity of the function γh (see (8)):

γ( h(x) − h(z⁺) ) − 〈γ∇h(z⁺), x − z⁺〉 ≥ (γµ_h/2)‖x − z⁺‖².

Together with the definition of the prox and the necessary optimality condition γ∇h(z⁺) = z − z⁺, this completes the proof.

In the next theorem we will use the following notation:

g^k = F(w^k),   g^{k+1/2} = Q^serv[ (1/M) Σ_{m=1}^M Q_m^dev( F_m(z^{k+1/2}) − F_m(w^k) ) ] + F(w^k).

A.2 DETERMINISTIC CASE: THEOREM 1

Proof of Theorem 1: By Lemma 1, applied to z^{k+1/2} = prox_{γh}(ẑ^k − γg^k) and z^{k+1} = prox_{γh}(ẑ^k − γg^{k+1/2}) with x = u, we get

〈z^{k+1} − ẑ^k + γg^{k+1/2}, u − z^{k+1}〉 ≥ γ( h(z^{k+1}) − h(u) + (µ_h/2)‖z^{k+1} − u‖² ),

〈z^{k+1/2} − ẑ^k + γg^k, z^{k+1} − z^{k+1/2}〉 ≥ γ( h(z^{k+1/2}) − h(z^{k+1}) + (µ_h/2)‖z^{k+1} − z^{k+1/2}‖² ).

Then we sum the two inequalities and rearrange:

〈z^{k+1} − ẑ^k, u − z^{k+1}〉 + 〈z^{k+1/2} − ẑ^k, z^{k+1} − z^{k+1/2}〉 + γ〈g^{k+1/2} − g^k, z^{k+1/2} − z^{k+1}〉 + γ〈g^{k+1/2}, u − z^{k+1/2}〉
≥ γ( h(z^{k+1/2}) − h(u) + (µ_h/2)‖z^{k+1} − z^{k+1/2}‖² + (µ_h/2)‖z^{k+1} − u‖² ).

Multiplying by 2 and using the definition of ẑ^k, we have

2τ〈z^{k+1} − z^k, u − z^{k+1}〉 + 2(1 − τ)〈z^{k+1} − w^k, u − z^{k+1}〉 + 2τ〈z^{k+1/2} − z^k, z^{k+1} − z^{k+1/2}〉 + 2(1 − τ)〈z^{k+1/2} − w^k, z^{k+1} − z^{k+1/2}〉 + 2γ〈g^{k+1/2} − g^k, z^{k+1/2} − z^{k+1}〉 + 2γ〈g^{k+1/2}, u − z^{k+1/2}〉
≥ 2γ( h(z^{k+1/2}) − h(u) + (µ_h/2)‖z^{k+1} − z^{k+1/2}‖² + (µ_h/2)‖z^{k+1} − u‖² ).

For the first and second lines we use the identity 2〈a, b〉 = ‖a + b‖² − ‖a‖² − ‖b‖², and get

τ( ‖z^k − u‖² − ‖z^{k+1} − z^k‖² − ‖z^{k+1} − u‖² ) + (1 − τ)( ‖w^k − u‖² − ‖z^{k+1} − w^k‖² − ‖z^{k+1} − u‖² )
+ τ( ‖z^{k+1} − z^k‖² − ‖z^{k+1/2} − z^k‖² − ‖z^{k+1} − z^{k+1/2}‖² ) + (1 − τ)( ‖z^{k+1} − w^k‖² − ‖z^{k+1/2} − w^k‖² − ‖z^{k+1} − z^{k+1/2}‖² )
+ 2γ〈g^{k+1/2} − g^k, z^{k+1/2} − z^{k+1}〉 + 2γ〈g^{k+1/2}, u − z^{k+1/2}〉
≥ 2γ( h(z^{k+1/2}) − h(u) + (µ_h/2)‖z^{k+1} − z^{k+1/2}‖² + (µ_h/2)‖z^{k+1} − u‖² ).


A small rearrangement gives

(1 + γµ_h)‖z^{k+1} − u‖² ≤ τ‖z^k − u‖² + (1 − τ)‖w^k − u‖² − τ‖z^{k+1/2} − z^k‖² − (1 − τ)‖z^{k+1/2} − w^k‖² − (1 + γµ_h)‖z^{k+1} − z^{k+1/2}‖²
+ 2γ〈g^{k+1/2} − g^k, z^{k+1/2} − z^{k+1}〉 − 2γ〈g^{k+1/2}, z^{k+1/2} − u〉 − 2γ( h(z^{k+1/2}) − h(u) ).

By the simple fact 2〈a, b〉 ≤ η‖a‖² + (1/η)‖b‖², with a = g^{k+1/2} − g^k, b = z^{k+1/2} − z^{k+1} and η = 2γ, we get

(1 + γµ_h)‖z^{k+1} − u‖² ≤ τ‖z^k − u‖² + (1 − τ)‖w^k − u‖² − τ‖z^{k+1/2} − z^k‖² − (1 − τ)‖z^{k+1/2} − w^k‖² − (1/2 + γµ_h)‖z^{k+1} − z^{k+1/2}‖²
+ 2γ²‖g^{k+1/2} − g^k‖² − 2γ〈g^{k+1/2}, z^{k+1/2} − u〉 − 2γ( h(z^{k+1/2}) − h(u) ).   (15)

We now consider the two cases of the theorem separately.

A.3 STRONGLY MONOTONE/CONVEX CASE

Let us substitute u = z∗, take the full mathematical expectation, and get

(1 + γµ_h)E[‖z^{k+1} − z∗‖²] ≤ τE[‖z^k − z∗‖²] + (1 − τ)E[‖w^k − z∗‖²] − τE[‖z^{k+1/2} − z^k‖²] − (1 − τ)E[‖z^{k+1/2} − w^k‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
+ 2γ²E[‖g^{k+1/2} − g^k‖²] − 2γE[ 〈g^{k+1/2}, z^{k+1/2} − z∗〉 + h(z^{k+1/2}) − h(z∗) ]

= τE[‖z^k − z∗‖²] + (1 − τ)E[‖w^k − z∗‖²] − τE[‖z^{k+1/2} − z^k‖²] − (1 − τ)E[‖z^{k+1/2} − w^k‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
+ 2γ²E[‖g^{k+1/2} − g^k‖²] − 2γE[ 〈E[g^{k+1/2} | z^{k+1/2}], z^{k+1/2} − z∗〉 + h(z^{k+1/2}) − h(z∗) ].   (16)

Let us work with E[‖g^{k+1/2} − g^k‖²]. Using (1), we get

E[‖g^{k+1/2} − g^k‖²] = E‖ Q^serv[ (1/M) Σ_{m=1}^M Q_m^dev( F_m(z^{k+1/2}) − F_m(w^k) ) ] ‖²
≤ (q^serv/M²) E‖ Σ_{m=1}^M Q_m^dev( F_m(z^{k+1/2}) − F_m(w^k) ) ‖²
= (q^serv/M²) Σ_{m=1}^M E[ ‖Q_m^dev( F_m(z^{k+1/2}) − F_m(w^k) )‖² ]
+ (q^serv/M²) Σ_{m≠l} E[ 〈Q_m^dev( F_m(z^{k+1/2}) − F_m(w^k) ), Q_l^dev( F_l(z^{k+1/2}) − F_l(w^k) )〉 ].


Next we apply (1) and Assumption 1 to the first term, and the independence and unbiasedness of the compressors to the second term:

E[‖g^{k+1/2} − g^k‖²] ≤ (q^serv/M²) Σ_{m=1}^M q_m^dev L_m² E[‖z^{k+1/2} − w^k‖²]
+ (q^serv/M²) Σ_{m≠l} E[ 〈F_m(z^{k+1/2}) − F_m(w^k), F_l(z^{k+1/2}) − F_l(w^k)〉 ]

≤ (q^serv/M²) Σ_{m=1}^M q_m^dev L_m² E[‖z^{k+1/2} − w^k‖²]
+ (q^serv/(2M²)) Σ_{m≠l} E[ ‖F_m(z^{k+1/2}) − F_m(w^k)‖² + ‖F_l(z^{k+1/2}) − F_l(w^k)‖² ]

≤ (q^serv/M²) Σ_{m=1}^M q_m^dev L_m² E[‖z^{k+1/2} − w^k‖²]
+ (q^serv/(2M²)) Σ_{m≠l} E[ L_m²‖z^{k+1/2} − w^k‖² + L_l²‖z^{k+1/2} − w^k‖² ]

= (q^serv/M²) Σ_{m=1}^M q_m^dev L_m² E[‖z^{k+1/2} − w^k‖²] + (q^serv(M − 1)/M²) Σ_{m=1}^M L_m² E[‖z^{k+1/2} − w^k‖²]

= (q^serv/M²) E[‖z^{k+1/2} − w^k‖²] · Σ_{m=1}^M (q_m^dev + M − 1) L_m².   (17)

Let us define the new constant C_q = √( (q^serv/M²) Σ_{m=1}^M (q_m^dev + M − 1) L_m² ) and then connect (16) and (17):

(1 + γµ_h)E[‖z^{k+1} − z∗‖²] ≤ τE[‖z^k − z∗‖²] + (1 − τ)E[‖w^k − z∗‖²] − τE[‖z^{k+1/2} − z^k‖²] − (1 − τ)E[‖z^{k+1/2} − w^k‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
+ 2γ²C_q² E[‖z^{k+1/2} − w^k‖²] − 2γE[ 〈F(z^{k+1/2}), z^{k+1/2} − z∗〉 + h(z^{k+1/2}) − h(z∗) ]

= τE[‖z^k − z∗‖²] + (1 − τ)E[‖w^k − z∗‖²] − τE[‖z^{k+1/2} − z^k‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
− ( (1 − τ) − 2γ²C_q² )E[‖z^{k+1/2} − w^k‖²] − 2γE[ 〈F(z^{k+1/2}), z^{k+1/2} − z∗〉 + h(z^{k+1/2}) − h(z∗) ].

The property of the solution (3) gives

(1 + γµ_h)E[‖z^{k+1} − z∗‖²] ≤ τE[‖z^k − z∗‖²] + (1 − τ)E[‖w^k − z∗‖²] − τE[‖z^{k+1/2} − z^k‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
− ( (1 − τ) − 2γ²C_q² )E[‖z^{k+1/2} − w^k‖²] − 2γE[ 〈F(z^{k+1/2}) − F(z∗), z^{k+1/2} − z∗〉 ].

And by Assumption 2, in the strongly monotone case we have

(1 + γµ_h)E[‖z^{k+1} − z∗‖²] ≤ τE[‖z^k − z∗‖²] + (1 − τ)E[‖w^k − z∗‖²] − τE[‖z^{k+1/2} − z^k‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
− ( (1 − τ) − 2γ²C_q² )E[‖z^{k+1/2} − w^k‖²] − 2γµ_F E[‖z^{k+1/2} − z∗‖²].


On the other hand, we get

E[‖w^{k+1} − z∗‖²] = (1 − τ)E[‖z^{k+1} − z∗‖²] + τE[‖w^k − z∗‖²].

Summing the two previous inequalities:

τE[‖z^{k+1} − z∗‖²] + E[‖w^{k+1} − z∗‖²] ≤ τE[‖z^k − z∗‖²] + E[‖w^k − z∗‖²] − τE[‖z^{k+1/2} − z^k‖²] − γµ_h E[‖z^{k+1} − z∗‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
− ( (1 − τ) − 2γ²C_q² )E[‖z^{k+1/2} − w^k‖²] − 2γµ_F E[‖z^{k+1/2} − z∗‖²].

We have the Lyapunov function on the left-hand side:

E[V_{k+1}] = τE[‖z^{k+1} − z∗‖²] + E[‖w^{k+1} − z∗‖²]
≤ τE[‖z^k − z∗‖²] + E[‖w^k − z∗‖²] − τE[‖z^{k+1/2} − z^k‖²] − 2γµ_F E[‖z^{k+1/2} − z∗‖²] − γµ_h E[‖z^{k+1} − z∗‖²]
− (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²] − ( (1 − τ) − 2γ²C_q² )E[‖z^{k+1/2} − w^k‖²].

With −‖z^{k+1} − z∗‖² ≤ −(1/2)‖z^{k+1/2} − z∗‖² + ‖z^{k+1} − z^{k+1/2}‖² we deduce:

E[V_{k+1}] ≤ τE[‖z^k − z∗‖²] + E[‖w^k − z∗‖²] − τE[‖z^{k+1/2} − z^k‖²] − ( (1 − τ) − 2γ²C_q² )E[‖z^{k+1/2} − w^k‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
− γ( 2µ_F + µ_h/2 )·τ·E[‖z^{k+1/2} − z∗‖²] − γ( 2µ_F + µ_h/2 )·(1 − τ)·E[‖z^{k+1/2} − z∗‖²].

The simple facts ‖z^{k+1/2} − z∗‖² ≥ (1/2)‖z^k − z∗‖² − ‖z^{k+1/2} − z^k‖² and ‖z^{k+1/2} − z∗‖² ≥ (1/2)‖w^k − z∗‖² − ‖z^{k+1/2} − w^k‖² give

E[V_{k+1}] ≤ τE[‖z^k − z∗‖²] + E[‖w^k − z∗‖²]
− ( (1 − τ) − 2γ²C_q² − γ( 2µ_F + µ_h/2 )·(1 − τ) )E[‖z^{k+1/2} − w^k‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
− ( 1 − γ( 2µ_F + µ_h/2 ) )τE[‖z^{k+1/2} − z^k‖²] − γ( µ_F + µ_h/4 )τE[‖z^k − z∗‖²] − γ( µ_F + µ_h/4 )·(1 − τ)E[‖w^k − z∗‖²].   (18)

Next we work with the last line of (18):

−γ( µ_F + µ_h/4 )τE[‖z^k − z∗‖²] − γ( µ_F + µ_h/4 )·(1 − τ)E[‖w^k − z∗‖²]
= −(γ/2)( µ_F + µ_h/4 )τE[‖z^k − z∗‖²] − (γ/2)( µ_F + µ_h/4 )τE[‖z^k − z∗‖²] − γ( µ_F + µ_h/4 )·(1 − τ)E[‖w^k − z∗‖²]
≤ −(γ/2)( µ_F + µ_h/4 )τE[‖z^k − z∗‖²] − (γ/4)( µ_F + µ_h/4 )τE[‖w^k − z∗‖²] + (γ/2)( µ_F + µ_h/4 )τE[‖z^k − w^k‖²] − γ( µ_F + µ_h/4 )·(1 − τ)E[‖w^k − z∗‖²]
≤ −(γ/4)( µ_F + µ_h/4 )τE[‖z^k − z∗‖²] − (γ/4)( µ_F + µ_h/4 )E[‖w^k − z∗‖²] + (γ/2)( µ_F + µ_h/4 )τE[‖z^k − w^k‖²]
≤ −(γ/4)( µ_F + µ_h/4 )τE[‖z^k − z∗‖²] − (γ/4)( µ_F + µ_h/4 )E[‖w^k − z∗‖²] + γ( µ_F + µ_h/4 )τE[‖z^{k+1/2} − z^k‖²] + γ( µ_F + µ_h/4 )τE[‖z^{k+1/2} − w^k‖²].


Substituting this into (18), we get

E[V_{k+1}] ≤ τE[‖z^k − z∗‖²] + E[‖w^k − z∗‖²]
− ( (1 − τ) − 2γ²C_q² − γ( 2µ_F + µ_h/2 ) )E[‖z^{k+1/2} − w^k‖²] − (1/2 + γµ_h)E[‖z^{k+1} − z^{k+1/2}‖²]
− ( 1 − 3γ( µ_F + µ_h/4 ) )τE[‖z^{k+1/2} − z^k‖²] − (γ/4)( µ_F + µ_h/4 )τE[‖z^k − z∗‖²] − (γ/4)( µ_F + µ_h/4 )E[‖w^k − z∗‖²].   (19)

It remains only to choose γ ≤ min{ √(1 − τ)/(4C_q); (1 − τ)/(4(µ_F + µ_h)) } and get

E[V_{k+1}] ≤ ( 1 − γ·(µ_F + µ_h)/16 )·E[V_k].

Running the recursion completes the proof.

A.4 MONOTONE/CONVEX CASE (µ_h = 0, µ_F = 0)

We start from (15) with the additional notation gap(z^{k+1/2}, u) = ⟨F(z^{k+1/2}), z^{k+1/2} − u⟩ + h(z^{k+1/2}) − h(u):

2γ·gap(z^{k+1/2}, u) + ‖z^{k+1} − u‖² ≤ τ‖z^k − u‖² + (1 − τ)‖w^k − u‖² − τ‖z^{k+1/2} − z^k‖² − (1 − τ)‖z^{k+1/2} − w^k‖² + 2γ²‖g^{k+1/2} − g^k‖² − 2γ⟨g^{k+1/2} − F(z^{k+1/2}), z^{k+1/2} − u⟩.

Adding ‖w^{k+1} − u‖² to both sides and rearranging slightly, we have

2γ·gap(z^{k+1/2}, u) ≤ [τ‖z^k − u‖² + ‖w^k − u‖²] − [τ‖z^{k+1} − u‖² + ‖w^{k+1} − u‖²] − τ‖w^k − u‖² − (1 − τ)‖z^{k+1} − u‖² + ‖w^{k+1} − u‖² − τ‖z^{k+1/2} − z^k‖² − (1 − τ)‖z^{k+1/2} − w^k‖² + 2γ²‖g^{k+1/2} − g^k‖² − 2γ⟨g^{k+1/2} − F(z^{k+1/2}), z^{k+1/2} − u⟩.

Then we sum over k = 0, . . . , K − 1, take the maximum of both sides over u ∈ C, then take the expectation and get

2γ·E[max_{u∈C} Σ_{k=0}^{K−1} gap(z^{k+1/2}, u)] ≤ max_{u∈C}[τ‖z^0 − u‖² + ‖w^0 − u‖²] + E[max_{u∈C} Σ_{k=0}^{K−1}[−τ‖w^k − u‖² − (1 − τ)‖z^{k+1} − u‖² + ‖w^{k+1} − u‖²]]
− Σ_{k=0}^{K−1}[τE[‖z^{k+1/2} − z^k‖²] + (1 − τ)E[‖z^{k+1/2} − w^k‖²] − 2γ²E[‖g^{k+1/2} − g^k‖²]] + 2γE[max_{u∈C} Σ_{k=0}^{K−1}[⟨g^{k+1/2} − F(z^{k+1/2}), u − z^{k+1/2}⟩]].

Applying (17) for E[‖g^{k+1/2} − g^k‖²], we get

2γ·E[max_{u∈C} Σ_{k=0}^{K−1} gap(z^{k+1/2}, u)] ≤ max_{u∈C}[τ‖z^0 − u‖² + ‖w^0 − u‖²] + E[max_{u∈C} Σ_{k=0}^{K−1}[−τ‖w^k − u‖² − (1 − τ)‖z^{k+1} − u‖² + ‖w^{k+1} − u‖²]]
− Σ_{k=0}^{K−1}[τE[‖z^{k+1/2} − z^k‖²] + (1 − τ)E[‖z^{k+1/2} − w^k‖²] − 2γ²C_q² E[‖z^{k+1/2} − w^k‖²]] + 2γE[max_{u∈C} Σ_{k=0}^{K−1}[⟨g^{k+1/2} − F(z^{k+1/2}), u − z^{k+1/2}⟩]].

With γ ≤ √(1 − τ)/(2C_q) we get

2γ·E[max_{u∈C} Σ_{k=0}^{K−1} gap(z^{k+1/2}, u)] ≤ max_{u∈C}[τ‖z^0 − u‖² + ‖w^0 − u‖²] + E[max_{u∈C} Σ_{k=0}^{K−1}[−τ‖w^k − u‖² − (1 − τ)‖z^{k+1} − u‖² + ‖w^{k+1} − u‖²]] + 2γE[max_{u∈C} Σ_{k=0}^{K−1}[⟨g^{k+1/2} − F(z^{k+1/2}), u − z^{k+1/2}⟩]].   (20)

To finish the proof, we need to estimate the terms in the last two lines. We begin with E[max_{u∈C} Σ_{k=0}^{K−1}⟨F(z^{k+1/2}) − g^{k+1/2}, z^{k+1/2} − u⟩]. Let us define the sequence v: v^0 = z^0, v^{k+1} = prox_{γh}(v^k − γδ^k) with δ^k = F(z^{k+1/2}) − g^{k+1/2}. Then we have

Σ_{k=0}^{K−1}⟨δ^k, z^{k+1/2} − u⟩ = Σ_{k=0}^{K−1}⟨δ^k, z^{k+1/2} − v^k⟩ + Σ_{k=0}^{K−1}⟨δ^k, v^k − u⟩.   (21)

By the definition of v^{k+1} (property of the prox operator), we have for all z ∈ Z

⟨v^{k+1} − v^k + γδ^k, z − v^{k+1}⟩ ≥ 0.

Rewriting this inequality, we get

⟨γδ^k, v^k − z⟩ ≤ ⟨γδ^k, v^k − v^{k+1}⟩ + ⟨v^{k+1} − v^k, z − v^{k+1}⟩
≤ ⟨γδ^k, v^k − v^{k+1}⟩ + (1/2)‖v^k − z‖² − (1/2)‖v^{k+1} − z‖² − (1/2)‖v^k − v^{k+1}‖²
≤ (γ²/2)‖δ^k‖² + (1/2)‖v^k − v^{k+1}‖² + (1/2)‖v^k − z‖² − (1/2)‖v^{k+1} − z‖² − (1/2)‖v^k − v^{k+1}‖²
= (γ²/2)‖δ^k‖² + (1/2)‖v^k − z‖² − (1/2)‖v^{k+1} − z‖².

With (21) it gives

Σ_{k=0}^{K−1}⟨δ^k, z^{k+1/2} − u⟩ ≤ Σ_{k=0}^{K−1}⟨δ^k, z^{k+1/2} − v^k⟩ + (1/γ)Σ_{k=0}^{K−1}((γ²/2)‖δ^k‖² + (1/2)‖v^k − u‖² − (1/2)‖v^{k+1} − u‖²)
≤ Σ_{k=0}^{K−1}⟨δ^k, z^{k+1/2} − v^k⟩ + (γ/2)Σ_{k=0}^{K−1}‖δ^k‖² + (1/(2γ))‖v^0 − u‖².

Taking the maximum over u, we get

max_{u∈C} Σ_{k=0}^{K−1}⟨δ^k, z^{k+1/2} − u⟩ ≤ Σ_{k=0}^{K−1}⟨δ^k, z^{k+1/2} − v^k⟩ + (γ/2)Σ_{k=0}^{K−1}‖F(z^{k+1/2}) − g^{k+1/2}‖² + (1/(2γ))max_{u∈C}‖v^0 − u‖².

Taking the full expectation, we get

E[max_{u∈C} Σ_{k=0}^{K−1}⟨δ^k, z^{k+1/2} − u⟩] ≤ E[Σ_{k=0}^{K−1}⟨δ^k, z^{k+1/2} − v^k⟩] + (γ/2)Σ_{k=0}^{K−1}E[‖F(z^{k+1/2}) − g^{k+1/2}‖²] + (1/(2γ))max_{u∈C}‖v^0 − u‖²
= E[Σ_{k=0}^{K−1}⟨E[F(z^{k+1/2}) − g^{k+1/2} | z^{k+1/2} − v^k], z^{k+1/2} − v^k⟩] + (γ/2)Σ_{k=0}^{K−1}E[‖F(z^{k+1/2}) − g^{k+1/2}‖²] + (1/(2γ))max_{u∈C}‖v^0 − u‖²
= (γ/2)Σ_{k=0}^{K−1}E[‖F(z^{k+1/2}) − g^{k+1/2}‖²] + (1/(2γ))max_{u∈C}‖v^0 − u‖²,   (22)

where the last equality holds because g^{k+1/2} is an unbiased estimate of F(z^{k+1/2}), so the conditional expectation equals zero.

Now let us estimate E[max_{u∈C} Σ_{k=0}^{K−1}[−τ‖w^k − u‖² − (1 − τ)‖z^{k+1} − u‖² + ‖w^{k+1} − u‖²]]; for this we note that

E[max_{u∈C} Σ_{k=0}^{K−1}[−τ‖w^k − u‖² − (1 − τ)‖z^{k+1} − u‖² + ‖w^{k+1} − u‖²]]
= E[max_{u∈C} Σ_{k=0}^{K−1}[−2⟨(1 − τ)z^{k+1} + τw^k − w^{k+1}, u⟩ − (1 − τ)‖z^{k+1}‖² − τ‖w^k‖² + ‖w^{k+1}‖²]]
= E[max_{u∈C} Σ_{k=0}^{K−1}[−2⟨(1 − τ)z^{k+1} + τw^k − w^{k+1}, u⟩]] + E[Σ_{k=0}^{K−1}[−(1 − τ)‖z^{k+1}‖² − τ‖w^k‖² + ‖w^{k+1}‖²]].

One can note that, by the definition of w^{k+1}, E[(1 − τ)‖z^{k+1}‖² + τ‖w^k‖² − ‖w^{k+1}‖²] = 0; then

E[max_{u∈C} Σ_{k=0}^{K−1}[−τ‖w^k − u‖² − (1 − τ)‖z^{k+1} − u‖² + ‖w^{k+1} − u‖²]]
= 2E[max_{u∈C} Σ_{k=0}^{K−1}⟨(1 − τ)z^{k+1} + τw^k − w^{k+1}, −u⟩]
= 2E[max_{u∈C} Σ_{k=0}^{K−1}⟨(1 − τ)z^{k+1} + τw^k − w^{k+1}, u⟩].

Further, one can carry out the reasoning similarly to the chain for (22):

E[max_{u∈C} Σ_{k=0}^{K−1}[τ‖w^k − u‖² + (1 − τ)‖z^{k+1} − u‖² − ‖w^{k+1} − u‖²]]
≤ Σ_{k=0}^{K−1}E[‖(1 − τ)z^{k+1} + τw^k − w^{k+1}‖²] + max_{u∈C}‖v^0 − u‖²
= Σ_{k=0}^{K−1}E[‖E_{w^{k+1}}[w^{k+1}] − w^{k+1}‖²] + max_{u∈C}‖v^0 − u‖²
= Σ_{k=0}^{K−1}E[−‖E_{w^{k+1}}[w^{k+1}]‖² + E_{w^{k+1}}‖w^{k+1}‖²] + max_{u∈C}‖v^0 − u‖²
= Σ_{k=0}^{K−1}E[−‖(1 − τ)z^{k+1} + τw^k‖² + (1 − τ)‖z^{k+1}‖² + τ‖w^k‖²] + max_{u∈C}‖v^0 − u‖²
= Σ_{k=0}^{K−1}τ(1 − τ)E[‖z^{k+1} − w^k‖²] + max_{u∈C}‖v^0 − u‖².   (23)

Substituting (22) and (23) into (20), we get

2γ·E[max_{u∈C} Σ_{k=0}^{K−1} gap(z^{k+1/2}, u)] ≤ max_{u∈C}[(2 + τ)‖z^0 − u‖² + ‖w^0 − u‖²] + Σ_{k=0}^{K−1}[τ(1 − τ)E[‖z^{k+1} − w^k‖²] + γ²E[‖F(z^{k+1/2}) − g^{k+1/2}‖²]].   (24)

Next we work separately with E[‖F(z^{k+1/2}) − g^{k+1/2}‖²]:

E[‖F(z^{k+1/2}) − g^{k+1/2}‖²] = E‖Q^serv[(1/M)Σ_{m=1}^M Q_m^dev(F_m(z^{k+1/2}) − F_m(w^k))] + F(w^k) − F(z^{k+1/2})‖²
= E‖Q^serv[(1/M)Σ_{m=1}^M Q_m^dev(F_m(z^{k+1/2}) − F_m(w^k))]‖² + E[‖F(z^{k+1/2}) − F(w^k)‖²]
+ E[⟨Q^serv[(1/M)Σ_{m=1}^M Q_m^dev(F_m(z^{k+1/2}) − F_m(w^k))]; F(z^{k+1/2}) − F(w^k)⟩].

With (17) we get

E[‖F(z^{k+1/2}) − g^{k+1/2}‖²] ≤ C_q² E[‖z^{k+1/2} − w^k‖²] + E[‖F(z^{k+1/2}) − F(w^k)‖²] + E[⟨(1/M)Σ_{m=1}^M Q_m^dev(F_m(z^{k+1/2}) − F_m(w^k)); F(z^{k+1/2}) − F(w^k)⟩]
= C_q² E[‖z^{k+1/2} − w^k‖²] + 2E[‖F(z^{k+1/2}) − F(w^k)‖²]
≤ C_q² E[‖z^{k+1/2} − w^k‖²] + (2/M)Σ_{m=1}^M L_m²·E[‖z^{k+1/2} − w^k‖²].   (25)

With the new notation L² = (1/M)Σ_{m=1}^M L_m², from (24) and (25) we have

2γ·E[max_{u∈C} Σ_{k=0}^{K−1} gap(z^{k+1/2}, u)] ≤ max_{u∈C}[(2 + τ)‖z^0 − u‖² + ‖w^0 − u‖²] + Σ_{k=0}^{K−1}[τ(1 − τ)E[‖z^{k+1} − w^k‖²] + γ²(C_q² + 2L²)E[‖z^{k+1/2} − w^k‖²]].

With γ ≤ √(1 − τ)/(2√(C_q² + 2L²)) we deduce

2γ·E[max_{u∈C} Σ_{k=0}^{K−1} gap(z^{k+1/2}, u)] ≤ max_{u∈C}[(2 + τ)‖z^0 − u‖² + ‖w^0 − u‖²] + (1 − τ)Σ_{k=0}^{K−1}[E[‖z^{k+1} − w^k‖²] + E[‖z^{k+1/2} − w^k‖²]]
≤ max_{u∈C}[(2 + τ)‖z^0 − u‖² + ‖w^0 − u‖²] + 3(1 − τ)Σ_{k=0}^{K−1}[E[‖z^{k+1} − z^{k+1/2}‖²] + E[‖z^{k+1/2} − w^k‖²]].

Let us go back to (19) with µ_h = 0, µ_F = 0 and γ ≤ √(1 − τ)/(4C_q), and get that

E[V_{k+1}] ≤ E[V_k] − ((1 − τ) − 2γ²C_q²)E[‖z^{k+1/2} − w^k‖²] − (1/2)E[‖z^{k+1} − z^{k+1/2}‖²]
≤ E[V_k] − ((1 − τ)/2)E[‖z^{k+1/2} − w^k‖²] − ((1 − τ)/2)E[‖z^{k+1} − z^{k+1/2}‖²].

Hence, substituting this, we arrive at the end of the proof:

2γ·E[max_{u∈C} Σ_{k=0}^{K−1} gap(z^{k+1/2}, u)] ≤ max_{u∈C}[(2 + τ)‖z^0 − u‖² + ‖w^0 − u‖²] + 6Σ_{k=0}^{K−1}[E[V_k] − E[V_{k+1}]]
≤ max_{u∈C}[(2 + 7τ)‖z^0 − u‖² + 7‖w^0 − u‖²] ≤ max_{u∈C}[16‖z^0 − u‖²].

It remains only to adjust the convergence criterion, using the monotonicity of F and Jensen's inequality for convex functions:

E[max_{u∈C} Σ_{k=0}^{K−1} gap(z^{k+1/2}, u)] = E[max_{u∈C} Σ_{k=0}^{K−1}[⟨F(z^{k+1/2}), z^{k+1/2} − u⟩ + h(z^{k+1/2}) − h(u)]]
≥ E[max_{u∈C} Σ_{k=0}^{K−1}[⟨F(u), z^{k+1/2} − u⟩ + h(z^{k+1/2}) − h(u)]]
≥ E[K·max_{u∈C}[⟨F(u), z^K − u⟩ + h(z^K) − h(u)]] = K·E[Gap(z^K)],

where we additionally use z^K = (1/K)Σ_{k=0}^{K−1} z^{k+1/2}. This brings us to

E[Gap(z^K)] ≤ 8 max_{u∈C}[‖z^0 − u‖²] / (γK).

Theorem 1 is completely proved in the deterministic case.

A.5 STOCHASTIC CASE

The finite-sum (stochastic) case is proved in a similar way; it suffices to replace F_m with F_{m,π_m^k}.

B MASHA2: HANDLING CONTRACTIVE COMPRESSORS

Now we present a method for working with biased compression operators, MASHA2. Algorithm 1 and Algorithm 2 are similar. The key difference is the error feedback technique (Karimireddy et al., 2019), which is classic for working with biased compressors. To this end, we introduce additional sequences e_m. The purpose of these sequences is to accumulate the error, i.e., the information that was not communicated in previous iterations. Additionally, for Algorithm 2 we consider a simpler setting than for Algorithm 1, namely Z = R^d and h = 0; in this case prox_{γh}(z) = z. Also, in this case we compress the information in one direction only: the server makes a full broadcast.
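For intuition, the device-side error-feedback update described above can be sketched in Python. This is only an illustrative sketch: the names error_feedback_step and compress are ours, and compress stands in for an arbitrary contractive compressor C_m^dev.

    def error_feedback_step(e_m, gamma, F_m_z, F_m_w, compress):
        # e_m:    current error accumulator e_m^k of device m
        # F_m_z:  local value F_m(z^{k+1/2});  F_m_w: local value F_m(w^k)
        # compress: a (possibly biased) contractive compressor C_m^dev
        delta = gamma * F_m_z - gamma * F_m_w + e_m   # what the device would like to send
        msg = compress(delta)                         # what is actually transmitted
        e_next = delta - msg                          # accumulate the part that was not sent
        return msg, e_next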

Similarly to Theorem 1, we use the Lyapunov function

V_k = τ‖ẑ^k − z^*‖² + ‖ŵ^k − z^*‖²,

where ẑ^k = z^k − (1/M)Σ_{m=1}^M e_m^k and ŵ^k = w^k − (1/M)Σ_{m=1}^M e_m^k.

Theorem 2 Let Assumptions 1 and 2 (SM) be satisfied. Then, if γ ≤ min{(1 − τ)/(4µ_F), √(1 − τ)/(60δL), µ_F(1 − τ)/(105τ²δ²L²)}, the following estimate holds:

E[V_K] ≤ (1 − γ·µ_F/4)^{K−1}·V_0.

The proof of the above theorem can be found in Appendix C.

C PROOF OF THEOREM 2

We first introduce useful notation:

ẑ^k = z^k − (1/M)Σ_{m=1}^M e_m^k,   ẑ^{k+1/2} = z^{k+1/2} − (1/M)Σ_{m=1}^M e_m^k,   ŵ^k = w^k − (1/M)Σ_{m=1}^M e_m^k.

It is easy to verify that

ẑ^{k+1} = z^{k+1} − (1/M)Σ_{m=1}^M e_m^{k+1}
= z^{k+1/2} − (1/M)Σ_{m=1}^M C_m^dev(γ·F_m(z^{k+1/2}) − γ·F_m(w^k) + e_m^k) − (1/M)Σ_{m=1}^M[e_m^k + γ·F_m(z^{k+1/2}) − γ·F_m(w^k) − C_m^dev(γ·F_m(z^{k+1/2}) − γ·F_m(w^k) + e_m^k)]
= z^{k+1/2} − (1/M)Σ_{m=1}^M e_m^k − γ·F(z^{k+1/2}) + γ·F(w^k)
= ẑ^{k+1/2} − γ·(F(z^{k+1/2}) − F(w^k)).   (26)
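The cancellation in (26) does not depend on the specific compressors: the error terms telescope exactly. The following self-contained Python sketch (with an arbitrary made-up biased compressor and random linear operators F_m; all names are ours, not the paper's code) checks the identity numerically.

    import numpy as np

    rng = np.random.default_rng(0)
    M, d, gamma = 5, 20, 0.1

    def compressor(x):
        # an arbitrary biased compressor: keep a random half of the coordinates
        mask = rng.random(x.shape[0]) < 0.5
        return x * mask

    mats = [rng.standard_normal((d, d)) for _ in range(M)]
    F_ops = [(lambda z, B=B: B @ z) for B in mats]            # local operators F_m

    z_half = rng.standard_normal(d)                           # z^{k+1/2}
    w = rng.standard_normal(d)                                # w^k
    errs = [rng.standard_normal(d) for _ in range(M)]         # e_m^k

    msgs = [compressor(gamma * F(z_half) - gamma * F(w) + e) for F, e in zip(F_ops, errs)]
    errs_new = [e + gamma * F(z_half) - gamma * F(w) - m for F, e, m in zip(F_ops, errs, msgs)]
    z_new = z_half - sum(msgs) / M                            # z^{k+1}

    F_z = sum(F(z_half) for F in F_ops) / M                   # F(z^{k+1/2})
    F_w = sum(F(w) for F in F_ops) / M                        # F(w^k)
    lhs = z_new - sum(errs_new) / M                           # hat z^{k+1}
    rhs = (z_half - sum(errs) / M) - gamma * (F_z - F_w)      # hat z^{k+1/2} - gamma (F(z^{k+1/2}) - F(w^k))
    assert np.allclose(lhs, rhs)                              # identity (26)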

Algorithm 2 MASHA2 (handling contractive compressors)

Parameters: stepsize γ > 0, number of iterations K.
Initialization: choose z^0 = w^0 ∈ Z, e_m^0 = 0. The server sends z^0 = w^0 to the devices; the devices compute F_m(w^0), send it to the server and receive F(w^0).
for k = 0, 1, 2, . . . , K − 1 do
  for all devices in parallel do
    z^k = τz^k + (1 − τ)w^k
    z^{k+1/2} = z^k − γ·F(w^k)
    Generate π_m^k from 1, . . . , r independently
    Compute F_m(z^{k+1/2}) and send to the server C_m^dev(γ·F_m(z^{k+1/2}) − γ·F_m(w^k) + e_m^k)
    e_m^{k+1} = e_m^k + γ·F_m(z^{k+1/2}) − γ·F_m(w^k) − C_m^dev(γ·F_m(z^{k+1/2}) − γ·F_m(w^k) + e_m^k)
    (in the finite-sum case: compute F_{m,π_m^k}(z^{k+1/2}) and send to the server C_m^dev(γ·F_{m,π_m^k}(z^{k+1/2}) − γ·F_{m,π_m^k}(w^k) + e_m^k);
     e_m^{k+1} = e_m^k + γ·F_{m,π_m^k}(z^{k+1/2}) − γ·F_{m,π_m^k}(w^k) − C_m^dev(γ·F_{m,π_m^k}(z^{k+1/2}) − γ·F_{m,π_m^k}(w^k) + e_m^k))
  end for
  for server do
    Compute Q^serv[(1/M)Σ_{m=1}^M C_m^dev(γF_m(z^{k+1/2}) − γF_m(w^k) + e_m^k)] and send to the devices
    (in the finite-sum case: compute Q^serv[(1/M)Σ_{m=1}^M C_m^dev(γF_{m,π_m^k}(z^{k+1/2}) − γF_{m,π_m^k}(w^k) + e_m^k)] and send to the devices)
    Send to the devices one bit b_k: 1 with probability 1 − τ, 0 with probability τ
  end for
  for all devices in parallel do
    z^{k+1} = z^{k+1/2} − (1/M)Σ_{m=1}^M C_m^dev(γ·F_m(z^{k+1/2}) − γ·F_m(w^k) + e_m^k)
    (in the finite-sum case: z^{k+1} = z^{k+1/2} − (1/M)Σ_{m=1}^M C_m^dev(γ·F_{m,π_m^k}(z^{k+1/2}) − γ·F_{m,π_m^k}(w^k) + e_m^k))
    if b_k = 1 then
      w^{k+1} = z^{k+1}
      Compute F_m(w^{k+1}) and send it to the server; receive F(w^{k+1})
    else
      w^{k+1} = w^k
    end if
  end for
end for
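A standard example of a contractive compressor C_m^dev suitable for Algorithm 2 is the Top-k operator, which transmits only the k largest-magnitude coordinates and satisfies ‖C(x) − x‖² ≤ (1 − 1/δ)‖x‖² with δ = d/k. The sketch below is our own illustration of this property, not the paper's implementation.

    import numpy as np

    def top_k(x, k):
        # keep the k largest-magnitude coordinates of x, zero out the rest
        out = np.zeros_like(x)
        idx = np.argsort(np.abs(x))[-k:]
        out[idx] = x[idx]
        return out

    d, k = 1000, 50                       # so delta = d / k = 20
    x = np.random.randn(d)
    lhs = np.linalg.norm(top_k(x, k) - x) ** 2
    rhs = (1 - k / d) * np.linalg.norm(x) ** 2
    assert lhs <= rhs + 1e-9              # contraction: ||C(x) - x||^2 <= (1 - 1/delta) ||x||^2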

Because of this convenient property of the hat sequences, we use them in the proof.

Proof of Theorem 2: We start from these two equalities:

‖z^{k+1} − z^*‖² = ‖z^{k+1/2} − z^*‖² + 2⟨z^{k+1} − z^{k+1/2}, z^{k+1/2} − z^*⟩ + ‖z^{k+1} − z^{k+1/2}‖²,
‖z^{k+1/2} − z^*‖² = ‖z^k − z^*‖² + 2⟨z^{k+1/2} − z^k, z^{k+1/2} − z^*⟩ − ‖z^{k+1/2} − z^k‖².

Summing up, we obtain

‖z^{k+1} − z^*‖² = ‖z^k − z^*‖² + 2⟨z^{k+1} − z^k, z^{k+1/2} − z^*⟩ + ‖z^{k+1} − z^{k+1/2}‖² − ‖z^{k+1/2} − z^k‖².   (27)

Using that ‖a + b‖² ≤ 2‖a‖² + 2‖b‖² and (26), we get

‖ẑ^{k+1} − z^{k+1/2}‖² ≤ 2‖ẑ^{k+1} − ẑ^{k+1/2}‖² + 2‖ẑ^{k+1/2} − z^{k+1/2}‖² ≤ 2γ²L²·‖z^{k+1/2} − w^k‖² + (2/M)Σ_{m=1}^M ‖e_m^k‖².   (28)

Additionally, here we use that F is L-Lipschitz, where L² = (1/M)Σ_{m=1}^M L_m².

Next, (27) with (28) gives

‖z^{k+1} − z^*‖² ≤ ‖z^k − z^*‖² + 2⟨z^{k+1} − z^k, z^{k+1/2} − z^*⟩ + 2γ²L²·‖z^{k+1/2} − w^k‖² + (2/M)Σ_{m=1}^M ‖e_m^k‖² − ‖z^{k+1/2} − z^k‖².   (29)

Now we consider the inner product ⟨z^{k+1} − z^k, z^{k+1/2} − z^*⟩. Using that

z^{k+1} − z^k = z^{k+1} − z^{k+1/2} + z^{k+1/2} − z^k = −γ·(F(z^{k+1/2}) − F(w^k)) + z^{k+1/2} − z^k = −γ·F(z^{k+1/2}) + z̄^k − z^k

(here z̄^k = τz^k + (1 − τ)w^k denotes the point produced by the mixing step of Algorithm 2) and the optimality condition ⟨F(z^*), z^{k+1/2} − z^*⟩ ≥ 0, we get

2⟨z^{k+1} − z^k, z^{k+1/2} − z^*⟩ = 2⟨−γ·F(z^{k+1/2}) + z̄^k − z^k, z^{k+1/2} − z^*⟩
≤ 2⟨γ·(F(z^*) − F(z^{k+1/2})), z^{k+1/2} − z^*⟩ + 2⟨z̄^k − z^k, z^{k+1/2} − z^*⟩.

C.1 STRONGLY MONOTONE CASE

With Assumption 2 (SM) we obtain

2⟨z^{k+1} − z^k, z^{k+1/2} − z^*⟩ ≤ 2⟨γ·(F(z^*) − F(z^{k+1/2})), z^{k+1/2} − z^*⟩ + 2⟨z̄^k − z^k, z^{k+1/2} − z^*⟩
≤ −2µ_Fγ‖z^{k+1/2} − z^*‖² + 2(1 − τ)⟨w^k − z^k, z^{k+1/2} − z^*⟩.   (30)

Additionally, here we use the definition of z̄^k and the fact that ŵ^k − ẑ^k = w^k − z^k. Combining the obtained inequalities (29) and (30), we have

‖z^{k+1} − z^*‖² ≤ ‖z^k − z^*‖² − 2µ_Fγ‖z^{k+1/2} − z^*‖² + 2(1 − τ)⟨w^k − z^k, z^{k+1/2} − z^*⟩ + 2γ²L²·‖z^{k+1/2} − w^k‖² + (2/M)Σ_{m=1}^M ‖e_m^k‖² − ‖z^{k+1/2} − z^k‖².

The identity 2⟨a, b⟩ = ‖a + b‖² − ‖a‖² − ‖b‖² gives

‖z^{k+1} − z^*‖² ≤ ‖z^k − z^*‖² − 2µ_Fγ‖z^{k+1/2} − z^*‖²
+ 2(1 − τ)⟨w^k − z^{k+1/2}, z^{k+1/2} − z^*⟩ + 2(1 − τ)⟨z^{k+1/2} − z^k, z^{k+1/2} − z^*⟩
+ 2γ²L²·‖z^{k+1/2} − w^k‖² + (2/M)Σ_{m=1}^M ‖e_m^k‖² − ‖z^{k+1/2} − z^k‖²
= ‖z^k − z^*‖² − 2µ_Fγ‖z^{k+1/2} − z^*‖²
+ (1 − τ)‖w^k − z^*‖² − (1 − τ)‖w^k − z^{k+1/2}‖² − (1 − τ)‖z^{k+1/2} − z^*‖²
+ (1 − τ)‖z^{k+1/2} − z^k‖² + (1 − τ)‖z^{k+1/2} − z^*‖² − (1 − τ)‖z^k − z^*‖²
+ 2γ²L²·‖z^{k+1/2} − w^k‖² + (2/M)Σ_{m=1}^M ‖e_m^k‖² − ‖z^{k+1/2} − z^k‖²
≤ τ‖z^k − z^*‖² + (1 − τ)‖w^k − z^*‖² − 2µ_Fγ‖z^{k+1/2} − z^*‖² − (1 − τ)‖w^k − z^{k+1/2}‖²
+ 4γ²L²‖w^k − z^{k+1/2}‖² + 4γ²L²‖ŵ^k − w^k‖² + (2/M)Σ_{m=1}^M ‖e_m^k‖² − τ‖z^{k+1/2} − z^k‖²
≤ τ‖z^k − z^*‖² + (1 − τ)‖w^k − z^*‖² − 2µ_Fγ‖z^{k+1/2} − z^*‖²
− (1 − τ − 4γ²L²)‖w^k − z^{k+1/2}‖² + (2 + 4γ²L²)·(1/M)Σ_{m=1}^M ‖e_m^k‖² − τ‖z^{k+1/2} − z^k‖².   (31)

We can weight (31) by p_k and get

Σ_{k=0}^K p_k‖z^{k+1} − z^*‖² ≤ τΣ_{k=0}^K p_k‖z^k − z^*‖² + (1 − τ)Σ_{k=0}^K p_k‖w^k − z^*‖² − 2µ_FγΣ_{k=0}^K p_k‖z^{k+1/2} − z^*‖²
− (1 − τ − 4γ²L²)Σ_{k=0}^K p_k‖w^k − z^{k+1/2}‖²
+ (2 + 4γ²L²)·Σ_{k=0}^K p_k·(1/M)Σ_{m=1}^M ‖e_m^k‖² − τΣ_{k=0}^K p_k‖z^{k+1/2} − z^k‖².   (32)

Next, we take the full expectation of both sides of the previous inequality. Since w^{k+1} is chosen randomly,

E[‖w^{k+1} − z^*‖²] = τE[‖w^k + (1/M)Σ_{m=1}^M(e_m^{k+1} − e_m^k) − z^*‖²] + (1 − τ)E[‖z^{k+1} − z^*‖²]
≤ (1 + η)τE[‖w^k − z^*‖²] + (1 + 1/η)τE[‖(1/M)Σ_{m=1}^M(e_m^{k+1} − e_m^k)‖²] + (1 − τ)E[‖z^{k+1} − z^*‖²],

and with the weights this gives

Σ_{k=0}^K p_k E[‖w^{k+1} − z^*‖²] ≤ (1 + η)τΣ_{k=0}^K p_k E[‖w^k − z^*‖²] + (1 − τ)Σ_{k=0}^K p_k E[‖z^{k+1} − z^*‖²] + (1 + 1/η)τΣ_{k=0}^K p_k E[‖(1/M)Σ_{m=1}^M(e_m^{k+1} − e_m^k)‖²]
≤ (1 + η)τΣ_{k=0}^K p_k E[‖w^k − z^*‖²] + (1 − τ)Σ_{k=0}^K p_k E[‖z^{k+1} − z^*‖²] + 2(1 + 1/η)τΣ_{k=0}^K p_k·(1/M)Σ_{m=1}^M E[‖e_m^{k+1}‖²] + 2(1 + 1/η)τΣ_{k=0}^K p_k·(1/M)Σ_{m=1}^M E[‖e_m^k‖²].

Summing this inequality and (32), we get

Σ_{k=0}^K p_k(τ‖z^{k+1} − z^*‖² + ‖w^{k+1} − z^*‖²) ≤ Σ_{k=0}^K p_k(τ‖z^k − z^*‖² + (1 + τη)‖w^k − z^*‖²) − 2µ_FγΣ_{k=0}^K p_k‖z^{k+1/2} − z^*‖²
− (1 − τ − 4γ²L²)Σ_{k=0}^K p_k‖w^k − z^{k+1/2}‖² + (2(1 + 1/η)τ/p)Σ_{k=1}^{K+1} p_k·(1/M)Σ_{m=1}^M ‖e_m^k‖²
+ (2 + 4γ²L² + 2(1 + 1/η)τ)·Σ_{k=0}^K p_k·(1/M)Σ_{m=1}^M ‖e_m^k‖² − τΣ_{k=0}^K p_k‖z^{k+1/2} − z^k‖²
≤ Σ_{k=0}^K p_k(τ‖z^k − z^*‖² + (1 + τη)‖w^k − z^*‖²) − 2µ_FγΣ_{k=0}^K p_k‖z^{k+1/2} − z^*‖²
− (1 − τ − 4γ²L²)Σ_{k=0}^K p_k‖w^k − z^{k+1/2}‖² − τΣ_{k=0}^K p_k‖z^{k+1/2} − z^k‖²
+ (2 + 4γ²L² + 2(1 + 1/η)τ + 2(1 + 1/η)τ/p)·Σ_{k=0}^K p_k·(1/M)Σ_{m=1}^M ‖e_m^k‖².

Next, we estimate the error term:

(1/M)Σ_{m=1}^M ‖e_m^{k+1}‖² = (1/M)Σ_{m=1}^M ‖e_m^k + γ·F_m(z^{k+1/2}) − γ·F_m(w^k) − C_m^dev(γ·F_m(z^{k+1/2}) − γ·F_m(w^k) + e_m^k)‖²
≤ ((1 − 1/δ)/M)Σ_{m=1}^M ‖e_m^k + γ·F_m(z^{k+1/2}) − γ·F_m(w^k)‖²
≤ ((1 − 1/δ)/M)Σ_{m=1}^M [(1 + ξ)‖e_m^k‖² + (1 + 1/ξ)γ²·‖F_m(z^{k+1/2}) − F_m(w^k)‖²].

Here we use the definition of a contractive (biased) compressor and the inequality ‖a + b‖² ≤ (1 + ξ)‖a‖² + (1 + 1/ξ)‖b‖² (for ξ > 0). With ξ = 1/(2(δ − 1)) and γ ≤ 1/(4δL), we get

(1/M)Σ_{m=1}^M ‖e_m^{k+1}‖² ≤ (1/M)Σ_{m=1}^M [(1 − 1/(2δ))‖e_m^k‖² + 2δγ²·‖F_m(z^{k+1/2}) − F_m(w^k)‖²]
≤ (1 − 1/(2δ))·(1/M)Σ_{m=1}^M ‖e_m^k‖² + 2δγ²L²·‖z^{k+1/2} − w^k‖²
≤ (1 − 1/(2δ))·(1/M)Σ_{m=1}^M ‖e_m^k‖² + 4δγ²L²·‖z^{k+1/2} − w^k‖² + 4δγ²L²·(1/M)Σ_{m=1}^M ‖e_m^k‖²
≤ (1 − 1/(2δ) + 4δγ²L²)·(1/M)Σ_{m=1}^M ‖e_m^k‖² + 4δγ²L²·‖z^{k+1/2} − w^k‖²
≤ (1 − 1/(4δ))·(1/M)Σ_{m=1}^M ‖e_m^k‖² + 4δγ²L²·‖z^{k+1/2} − w^k‖²
≤ 4δγ²L² Σ_{j=0}^k (1 − 1/(4δ))^{k−j}·‖z^{j+1/2} − w^j‖².

We weigh the sequence as Σ_{k=0}^K p_k·(1/M)Σ_{m=1}^M ‖e_m^k‖², where p is such that p_k ≤ p_j(1 + 1/(8δ))^{k−j}.

Then

Σ_{k=0}^K p_k·(1/M)Σ_{m=1}^M ‖e_m^k‖² ≤ 4δγ²L² Σ_{k=0}^K p_k Σ_{j=0}^{k−1}(1 − 1/(4δ))^{k−j−1}·‖z^{j+1/2} − w^j‖²
≤ (4δγ²L²/(1 − 1/(4δ))) Σ_{k=0}^K Σ_{j=0}^{k−1} p_j(1 + 1/(8δ))^{k−j}(1 − 1/(4δ))^{k−j}·‖z^{j+1/2} − w^j‖²
≤ (4δγ²L²/(1 − 1/(4δ))) Σ_{k=0}^K Σ_{j=0}^{k−1} p_j(1 − 1/(8δ))^{k−j}·‖z^{j+1/2} − w^j‖²
≤ (4δγ²L²/(1 − 1/(4δ))) Σ_{k=0}^K p_k·‖z^{k+1/2} − w^k‖²·Σ_{j=0}^∞(1 − 1/(8δ))^j
≤ 128δ²γ²L² Σ_{k=0}^K p_k·‖z^{k+1/2} − w^k‖².   (33)

Inequality (33) brings us to the finish line of the proof:

Σ_{k=0}^K p_k(τ‖z^{k+1} − z^*‖² + ‖w^{k+1} − z^*‖²) ≤ Σ_{k=0}^K p_k(τ‖z^k − z^*‖² + (1 + τη)‖w^k − z^*‖²) − 2µ_FγΣ_{k=0}^K p_k‖z^{k+1/2} − z^*‖²
− (1 − τ − 4γ²L²)Σ_{k=0}^K p_k‖w^k − z^{k+1/2}‖² − τΣ_{k=0}^K p_k‖z^{k+1/2} − z^k‖²
+ (2 + 4γ²L² + 2(1 + 1/η)τ + 2(1 + 1/η)τ/p)·128δ²γ²L² Σ_{k=0}^K p_k·‖z^{k+1/2} − w^k‖².

With η = µ_Fγ/(4τ), γ ≤ 1/µ_F and p ≥ 1 we have

Σ_{k=0}^K p_k(τ‖z^{k+1} − z^*‖² + ‖w^{k+1} − z^*‖²) ≤ (1 − µ_Fγ/4)Σ_{k=0}^K p_k(τ‖z^k − z^*‖² + ‖w^k − z^*‖²)
− (1 − τ − µ_Fγ − 800δ²γ²L² − 512δ⁴γ⁴L⁴ − 2048δ²τ²γL²/µ_F)·Σ_{k=0}^K p_k·‖z^{k+1/2} − w^k‖².

The choice γ ≤ min{(1 − τ)/(4µ_F), √(1 − τ)/(60δL), µ_F(1 − τ)/(105τ²δ²L²)} gives

Σ_{k=0}^K p_k(τ‖z^{k+1} − z^*‖² + ‖w^{k+1} − z^*‖²) ≤ (1 − µ_Fγ/4)Σ_{k=0}^K p_k(τ‖z^k − z^*‖² + ‖w^k − z^*‖²).

Then we just need to take p = 1/(1 − µ_Fγ/4) (it is easy to check that p_k ≤ p_j(1 + 1/(8δ))^{k−j} holds) and get

(τ‖z^{K+1} − z^*‖² + ‖w^{K+1} − z^*‖²) ≤ (1 − µ_Fγ/4)^{K+1}(τ‖z^0 − z^*‖² + ‖w^0 − z^*‖²).

This ends our proof.

D MOTIVATING EXAMPLES

Let us motivate the utility of considering VIs in machine learning on a handful of examples.

Lagrangian multipliers and SVM. Lagrange multipliers are a standard approach to solving constrained optimization problems. This technique reduces the original problem to a saddle point problem. The approach is one of the basic and classic ones for SVM (Shalev-Shwartz & Ben-David, 2014):

min_{w,b} max_λ (1/N)Σ_{n=1}^N λ_n(y_n(⟨w, x_n⟩ + b) − 1) + (β/2)‖w‖²,   (34)

where w are the weights of the model, b is the bias term, (x_n, y_n)_{n=1}^N are pairs of training data and labels, and β ≥ 0 is a regularization parameter.
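As a small, purely illustrative sketch (not from the paper), the saddle objective in (34) can be evaluated directly; X is an assumed data matrix of shape (N, d) and y a vector of ±1 labels.

    import numpy as np

    def svm_lagrangian(w, b, lam, X, y, beta):
        # value of the objective in (34): minimized over (w, b), maximized over lam
        margins = y * (X @ w + b) - 1.0           # y_n (<w, x_n> + b) - 1
        return np.mean(lam * margins) + 0.5 * beta * np.dot(w, w)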

GANs. A simple GAN setup consists of two parts: the discriminator D, aimed at distinguishing real samples x from adversarial ones by outputting the probability that a sample is real, and the generator G, trying to fool the discriminator by generating realistic samples from random noise z. Following Goodfellow et al. (2014), the value function V(G, D) used in such a minimax game can be expressed in saddle point form as

min_G max_D V(D, G) := E_{x∼p_data(x)}[log D(x)] + E_{z∼p_z(z)}[log(1 − D(G(z)))].   (35)
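For illustration only, a Monte Carlo estimate of the value function in (35) on one batch might look as follows; D and G are assumed to be callables returning probabilities and generated samples, respectively (the names are ours).

    import numpy as np

    def gan_value_estimate(D, G, real_batch, noise_batch, eps=1e-12):
        # empirical estimate of V(D, G) from (35) on a single batch
        real_term = np.mean(np.log(D(real_batch) + eps))            # E_x[log D(x)]
        fake_term = np.mean(np.log(1.0 - D(G(noise_batch)) + eps))  # E_z[log(1 - D(G(z)))]
        return real_term + fake_term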

Adversarial loss. To force a model to be more stable and robust, it can be trained in a constructive way, for example, by introducing adversarial noise (Madry et al., 2017; Nouiehed et al., 2019). For example, the approach of Liu et al. (2020); Zhu et al. (2019) works well in NLP. From the point of view of theory, this latter approach reduces to the saddle point problem

min_w max_{‖ρ_1‖≤e,...,‖ρ_N‖≤e} (1/N)Σ_{n=1}^N l(f(w, x_n + ρ_n), y_n) + (λ/2)‖w‖² − (β/2)‖ρ‖²,   (36)

where w are the weights of the model, (x_n, y_n)_{n=1}^N are pairs of training data and labels, ρ is the so-called adversarial noise which introduces a perturbation into the data, and λ > 0 and β > 0 are regularization parameters. The main difference from the standard approach is the explicit training of ρ, so that the noise it produces is harmful, and of w, so that the model adapts to this noise.
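As a sketch of how the inner maximization over the noise in (36) is commonly approximated, one projected gradient ascent step on a single ρ_n could look like this; grad_rho_loss is a hypothetical oracle for the gradient of the corresponding summand with respect to ρ_n.

    import numpy as np

    def ascent_step_on_noise(rho_n, step, radius, grad_rho_loss):
        # one projected gradient ascent step for the adversarial noise rho_n in (36)
        rho_new = rho_n + step * grad_rho_loss(rho_n)   # ascend on the inner objective
        norm = np.linalg.norm(rho_new)
        if norm > radius:                               # project back onto {||rho_n|| <= e}
            rho_new = rho_new * (radius / norm)
        return rho_new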

Online transport and Wasserstein barycenters. The online transport or Wasserstein barycenter (WB) problem can be rewritten as a saddle point problem (Dvinskikh & Tiapkin, 2021). This representation comes from the dual view of the transportation polytope.
