
Toward Efficient Federated Learning in Multi-Channeled Mobile Edge Network with Layered Gradient Compression

Haizhou Du1, Xiaojie Feng1, Qiao Xiang2*, Haoyu Liu2

1 Shanghai University of Electric Power, 2 Xiamen University

* These authors contributed equally.

Abstract

A fundamental issue for federated learning (FL) is how to achieve optimal model performance under highly dynamic communication environments. This issue can be alleviated by the fact that modern edge devices usually can connect to the edge FL server via multiple communication channels (e.g., 4G, LTE, and 5G). However, having an edge device send copies of local models to the FL server along multiple channels is redundant, time-consuming, and wastes resources (e.g., bandwidth, battery life, and monetary cost). In this paper, motivated by the layered coding techniques in video streaming, we propose a novel FL framework called layered gradient compression (LGC). Specifically, in LGC, local gradients from a device are coded into several layers, and each layer is sent to the FL server along a different channel. The FL server aggregates the received layers of local gradients from devices to update the global model and sends the result back to the devices. We prove the convergence of LGC and formally define the problem of resource-efficient federated learning with LGC. We then propose a learning-based algorithm for each device to dynamically adjust its local computation (i.e., the number of local stochastic gradient descent steps) and communication decisions (i.e., the compression level of different layers and the layer-to-channel mapping) in each iteration. Results from extensive experiments show that, using our algorithm, LGC significantly reduces the training time and improves resource utilization, while achieving a similar accuracy, compared with well-known FL mechanisms.

1 Introduction

Federated learning (FL) has emerged as an efficient solution to analyze and process distributed data for data-driven tasks (e.g., autonomous driving, virtual reality, image classification, etc.) in Mobile Edge Computing (MEC) (Niknam, Dhillon, and Reed 2020; Li et al. 2021; Verma, Julier, and Cirincione 2018; Wang et al. 2018; Yang et al. 2019). By performing training tasks at edge devices (e.g., mobile phones and tablets) and aggregating the learned parameters at edge servers, FL significantly reduces the network bandwidth usage of machine learning applications and protects the data privacy of edge devices (Bonawitz et al. 2019).

However, practically deploying FL in edge networks still faces several difficulties. 1) The communication between devices and the server in dynamic edge networks may be frequently unavailable, slow, and expensive. 2) Resources (e.g., bandwidth and battery life) are always limited in the MEC system.

These issues can be alleviated by the fact that modern edge devices usually can connect to the edge FL server via multiple communication channels (e.g., 4G, LTE, and 5G). However, having an edge device send copies of local models to the FL server along multiple channels is redundant, time-consuming, and wastes resources (e.g., bandwidth, battery life, and monetary cost).

Several pioneering works have been proposed to manage system resources for efficient FL in edge networks (Wang et al. 2019; Tran et al. 2019; Chen et al. 2020). However, these studies focus only on reducing resource consumption, leaving potential gains in resource utilization and training efficiency untapped. A promising solution suggested in recent works is to incorporate gradient compression strategies into FL algorithms, which can considerably reduce the communication cost with little impact on learning outcomes (Stich, Cordonnier, and Jaggi 2018; Basu et al. 2019). However, these compression techniques are not tuned to the underlying communication channels and may not utilize the channel resources to the fullest.

In this paper, to address the problem of how to efficiently utilize the limited resources at edge devices for optimal learning performance, we propose a novel FL framework called layered gradient compression (LGC). Motivated by the layered coding techniques in video streaming, in LGC, local gradients from a device are coded into several layers, and each layer is sent to the FL server along a different channel. The FL server aggregates the received layers of local gradients from devices to update the global model and sends the result back to the devices. We integrate gradient compression and multi-channel transmission into FL to alleviate the communication and energy bottlenecks. We prove the convergence of LGC and formally define the problem of resource-efficient FL with LGC. To deploy LGC in dynamic networks and resource-constrained MEC systems, we then propose a learning-based algorithm for each device to dynamically adjust its local computation (i.e., the number of local stochastic gradient descent steps) and communication decisions (i.e., the compression level of different layers and the layer-to-channel mapping) in each iteration.

The main contributions of this paper are as follows:



• To efficiently utilize the limited resources at edge devices for optimal learning performance in dynamic edge networks, motivated by the layered coding techniques in video streaming, we propose a novel FL framework called layered gradient compression (LGC). To the best of our knowledge, we are the first to propose such a layered gradient compression FL framework.

• We provide a convergence guarantee for LGC from a theoretical perspective, and formally define the problem of resource-efficient FL with LGC.

• We propose a learning-based control algorithm for each device to dynamically adjust its local computation and communication decisions in each iteration, subject to dynamic edge network and resource constraints.

• We evaluate the performance of LGC with the proposed learning-based control algorithm. Results show that, using our algorithm, LGC significantly reduces the training time and improves resource utilization, while achieving a similar accuracy, compared with the baselines.

The rest of this paper is organized as follows. In Section 2, we describe the framework of LGC, prove the convergence of LGC, and define the problem of resource-efficient FL with LGC. In Section 3, we describe the design and implementation details of the learning-based control algorithm. We show the experimental results in Section 4, summarize the related work in Section 5, and conclude this work in Section 6.

2 Framework Design

This section first reviews the typical framework of FL. Then, we describe our proposed LGC mechanism and prove its convergence. Finally, we present the problem formulation of resource-efficient FL with LGC.

2.1 Framework Overview

The framework of LGC follows the typical FL pattern and consists of two parts: an edge server and $M$ devices. In LGC, $M$ edge devices, denoted by $\mathcal{M} = \{1, 2, \ldots, m, \ldots, M\}$, collaboratively train a learning model with an edge server through iterative computation and communication.

To alleviate the communication bottleneck, LGC compresses the locally computed gradients before transmission and sends them through multiple channels. Figure 1 gives an overview of LGC. In LGC, each device computes the local gradients (1), compresses the gradients with the LGC compressor (2), and sends encoded layers of the compressed gradients to the edge server through multiple channels (3). The server waits until the gradients from all the clients are received. It then adds them up (4) and dispatches the result to all devices (5). Devices then use it to update their local models. Multiple channels are indicated by different colors.

To compress the gradients, we consider the $\mathrm{Top}_k$ operator, an important example of sparsification operators in distributed training, and extend it to $\mathrm{LGC}_k$ for multiple communication channels. For any $x \in \mathbb{R}^D$, $\mathrm{Top}_k(x) \in \mathbb{R}^D$ is a $D$-length vector with at most $k$ non-zero components, whose indices correspond to the indices of the largest $k$ components (in absolute value) of $x$.

[Figure 1: edge devices compute raw gradients, compress them with the LGC compressor, and send encoded layers over multiple channels (CH1-CH4, shown in different colors) to the edge server, which aggregates the layers and dispatches the result; the numbered steps (1)-(5) match the workflow above.]

Figure 1: An overview of LGC.

Before giving the definition of $\mathrm{LGC}_k$, we extend the $\mathrm{Top}_k$ compressor to the $\mathrm{Top}_{\alpha,\beta}$ ($1 \le \alpha < \beta \le D$) compressor, which takes the sparsified top-$(\alpha, \beta)$ gradients. Specifically, for a vector $x \in \mathbb{R}^D$, $\mathrm{Top}_{\alpha,\beta}(x) \in \mathbb{R}^D$, and the $i$-th ($i = 1, 2, \ldots, D$) element of $\mathrm{Top}_{\alpha,\beta}(x)$ is defined as

$$\mathrm{Top}_{\alpha,\beta}(x)_i = \begin{cases} x_i, & \text{if } \mathrm{thr}_\alpha \ge |x_i| > \mathrm{thr}_\beta, \\ 0, & \text{otherwise,} \end{cases} \qquad (1)$$

where $x_i$ is the $i$-th element of $x$, $\mathrm{thr}_\alpha$ is the $\alpha$-th largest absolute value of the elements in $x$, and $\mathrm{thr}_\beta$ is the $\beta$-th largest absolute value of the elements in $x$.

Modern edge devices usually can connect to multiple communication channels. Consider a device with $C$ channels connected to it; the traffic allocation among these channels is denoted by a vector $k \in \mathbb{R}^C$. The device codes gradient elements into different layers with the $\mathrm{Top}_{\alpha,\beta}$ compressor, obtaining $\{\mathrm{Top}_{\sum_{i=1}^{c-1} k_i,\, \sum_{i=1}^{c} k_i}(x)\}_{c=1}^{C}$. Each layer is then sent to the server through a different channel. The server collects the layers from all the channels and decodes them to obtain $\mathrm{LGC}_k(x)$. For a vector $x \in \mathbb{R}^D$, $\mathrm{LGC}_k(x) \in \mathbb{R}^D$ is defined as

$$\mathrm{LGC}_k(x) = \sum_{c=1}^{C} \mathrm{Top}_{\sum_{i=1}^{c-1} k_i,\, \sum_{i=1}^{c} k_i}(x). \qquad (2)$$

Unlike previous studies that require an identical number of local computations and an identical compression level across all participants, we propose and analyze a form of asynchronous operation in which the devices synchronize with the server at arbitrary times. We also allow the participating devices to perform gradient sparsification with different compression coefficients. This helps to accommodate stragglers with poor channel conditions and thus mitigates the impact of stale updates. By design, we also allow devices to be equipped with different numbers and types of communication channels.

Let $I_m \subseteq \mathcal{T} := \{1, \ldots, T\}$ with $T \in I_m$ denote the set of indices at which device $m \in \mathcal{M}$ synchronizes with the server. In our asynchronous setting, the $I_m$'s may differ across devices. However, we assume that $\mathrm{gap}(I_m) \le H$ holds for every $m \in \mathcal{M}$, which means that there is a uniform bound on the maximum delay between each device's update times. Every device $m \in \mathcal{M}$ maintains a local parameter vector $w_m^{(t)}$, which is updated in each iteration $t$. If $t \in I_m$, the error-compensated update $g_m^{(t)}$, computed on the net progress made since the last synchronization, is sent to the server over multiple channels, and the device updates its local memory $e_m^{(t)}$. Upon receiving $g_m^{(t)}$ from every device $m \in \mathcal{M}$ that sent its gradients, the server aggregates them, updates the global parameter vector, and sends the new model $\bar{w}^{(t+1)}$ to all the workers; upon receiving it, they set their local parameter vectors $w_m^{(t+1)}$ equal to the global parameter vector. Our algorithm is summarized in Algorithm 1.


Algorithm 1: FL with Layered Gradient Compression

1: Initialize $w^{(0)}$, $\bar{w}^{(0)}$, $w_m^{(0)}$, $\hat{w}_m^{(0)}$, $e_m^{(0)}$, $\forall m \in \mathcal{M}$. Suppose $\eta^{(t)}$ follows a certain learning rate schedule.
2: for $t = 0$ to $T - 1$ do
3:   On Edge Devices:
4:   for $m \in \mathcal{M}$ in parallel do
5:     Sample a mini-batch $\mathcal{D}_m^{(t)}$ of size $b$ from $\mathcal{D}_m$
6:     $w_m^{(t+\frac{1}{2})} \leftarrow w_m^{(t)} - \eta^{(t)} \nabla f_m(w_m^{(t)}; \mathcal{D}_m^{(t)})$
7:     if $t + 1 \in I_m$ then
8:       $u_m^{(t)} \leftarrow e_m^{(t)} + \hat{w}_m^{(t)} - w_m^{(t+\frac{1}{2})}$
9:       $g_m^{(t)} = \mathrm{LGC}_m^{(t)}(u_m^{(t)})$
10:      Upload $g_m^{(t)}$ over multiple channels
11:      $e_m^{(t+1)} \leftarrow e_m^{(t)} + \hat{w}_m^{(t)} - w_m^{(t+\frac{1}{2})} - g_m^{(t)}$
12:      Receive $\bar{w}^{(t+1)}$
13:      $w_m^{(t+1)} \leftarrow \bar{w}^{(t+1)}$ and $\hat{w}_m^{(t+1)} \leftarrow \bar{w}^{(t+1)}$
14:    else
15:      $w_m^{(t+1)} \leftarrow w_m^{(t+\frac{1}{2})}$
16:      $\hat{w}_m^{(t+1)} \leftarrow \hat{w}_m^{(t)}$
17:      $e_m^{(t+1)} \leftarrow e_m^{(t)}$
18:   At Central Server:
19:   if $t + 1 \in I_m, \forall m \in \mathcal{M}$ then
20:     Collect $g_m^{(t)}, \forall m \in \mathcal{M}$ and $g^{(t)} \leftarrow \frac{1}{M}\sum_{m=1}^{M} g_m^{(t)}$
21:     $\bar{w}^{(t+1)} \leftarrow \bar{w}^{(t)} - g^{(t)}$ and broadcast $\bar{w}^{(t+1)}$
22:   else
23:     $\bar{w}^{(t+1)} \leftarrow \bar{w}^{(t)}$
24: Comment: $w_m^{(t+\frac{1}{2})}$ denotes an intermediate variable between iterations $t$ and $t+1$.
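As a sketch of the device-side loop of Algorithm 1 (lines 5-17), the snippet below combines local SGD, error feedback, and the layered compressor from the earlier sketch; `stochastic_grad`, `send`, `recv`, and the schedule `I_m` are placeholders rather than the authors' implementation.

def device_step(t, w, w_hat, e, eta, I_m, k, stochastic_grad, send, recv):
    """One iteration t on device m; returns the updated (w, w_hat, e)."""
    w_half = w - eta(t) * stochastic_grad(w)       # local SGD step (line 6)
    if t + 1 in I_m:                               # synchronization round
        u = e + w_hat - w_half                     # error-compensated net progress (line 8)
        layers = lgc_encode(u, k)                  # layered compression (line 9)
        send(layers)                               # one layer per channel (line 10)
        e = u - lgc_decode(layers)                 # keep the compression residual (line 11)
        w_bar = recv()                             # updated global model (line 12)
        w, w_hat = w_bar.copy(), w_bar.copy()      # reset local state (line 13)
    else:
        w = w_half                                 # keep training locally (lines 15-17)
    return w, w_hat, e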

2.2 Convergence Analysis

We consider the following two standard assumptions on the local loss functions $f_m : \mathbb{R}^D \to \mathbb{R}$, $\forall m \in \mathcal{M}$.

Assumption 1 (Smoothness). $f_m(\cdot)$ is $L$-smooth, i.e., for every $w, w' \in \mathbb{R}^D$, we have

$$f_m(w') \le f_m(w) + \langle \nabla f_m(w), w' - w \rangle + \frac{L}{2}\|w' - w\|^2. \qquad (3)$$

Assumption 2 (Bounded variance and second moment). For every $w_m^{(t)} \in \mathbb{R}^D$ and $t \in \mathbb{Z}_+$, there exist constants $\sigma > 0$ and $G \ge \sigma$ such that

$$\mathbb{E}_{\mathcal{D}_m^{(t)} \subset \mathcal{D}_m}\big[\|\nabla f_m(w_m^{(t)}; \mathcal{D}_m^{(t)}) - \nabla f_m(w_m^{(t)})\|^2\big] \le \sigma^2, \quad \forall m, \qquad (4a)$$

$$\mathbb{E}_{\mathcal{D}_m^{(t)} \subset \mathcal{D}_m}\big[\|\nabla f_m(w_m^{(t)}; \mathcal{D}_m^{(t)})\|^2\big] \le G^2, \quad \forall m. \qquad (4b)$$

To state our results, we need the following definition from (Stich 2018).

Definition 1 (Gap). Let $I = \{t_0, t_1, \ldots, t_k\}$, where $t_i < t_{i+1}$ for $i = 0, 1, \ldots, k-1$. The gap of $I$ is defined as $\mathrm{gap}(I) := \max_{i \in \{1, \ldots, k\}}(t_i - t_{i-1})$, which is equal to the maximum difference between any two consecutive synchronization indices.
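For instance, if a device synchronizes at the indices $I = \{2, 5, 10\}$, then $\mathrm{gap}(I) = \max\{5 - 2,\, 10 - 5\} = 5$.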

We extend Lemma 4 in (Basu et al. 2019) and obtain the following lemma.

Lemma 1 (Memory contraction). Let $\mathrm{gap}(I_m) \le H$, $\forall m \in \mathcal{M}$, and $\eta^{(t)} = \frac{\xi}{a+t}$, where $\xi$ is a constant and $a > \frac{4H}{\gamma}$. Then there exists a constant $C \ge \frac{4a\gamma_m(1-\gamma_m^2)}{a\gamma_m - 4H}$ such that the following holds for every $t \in \mathbb{Z}_+$ and $m \in \mathcal{M}$:

$$\mathbb{E}\big\|e_m^{(t)}\big\|_2^2 \le 4\frac{(\eta^{(t)})^2}{\gamma_m^2} C H^2 G^2. \qquad (5)$$

We leverage the perturbed iterate analysis as in (Mania et al. 2015; Stich, Cordonnier, and Jaggi 2018) to provide convergence guarantees for LGC. Under the above assumptions, the following theorems hold for Algorithm 1.

Theorem 1 (Smooth and strongly convex case with a decaying learning rate). Let $f_m(w)$ be $L$-smooth and $\mu$-strongly convex, $\forall m \in \mathcal{M}$. Let $\{w_m^{(t)}\}_{t=0}^{T-1}$ be generated according to Algorithm 1 with compressors $C_m^{(t)}$, for step sizes $\eta^{(t)} = 8/\mu(a+t)$ with $\mathrm{gap}(I_m) \le H$, where $a > 1$ is such that $a > \max\{4H/\gamma, 32\kappa, H\}$, $\kappa = L/\mu$. The following holds:

$$\mathbb{E}[f(w^{(T)})] - f^* \le \frac{L a^3}{4S}\|w^{(0)} - w^*\|_2^2 + \frac{8LT(T+2a)}{\mu^2 S}A + \frac{128LT}{\mu^3 S}B, \qquad (6)$$

where

$$C = \min_{m \in \mathcal{M}} \frac{4a\gamma_m(1-\gamma_m^2)}{a\gamma_m - 4H}, \qquad (7a)$$
$$C_1 = \frac{192}{M}\sum_{m=1}^{M}(4-2\gamma_m)\Big(1+\frac{C}{\gamma_m^2}\Big), \qquad (7b)$$
$$C_2 = \frac{8}{M}\sum_{m=1}^{M}(4-2\gamma_m)\Big(1+\frac{C}{\gamma_m^2}\Big), \qquad (7c)$$
$$A = \frac{\sum_{m=1}^{M}\sigma_m^2}{bM^2}, \qquad (7d)$$
$$B = \Big(\frac{3\mu}{2}+3L\Big)\Big(\frac{12CG^2H^2}{\gamma^2} + C_1(\eta^{(t)})^2H^4G^2\Big) + 24(1+C_2H^2)LG^2H^2, \qquad (7e)$$
$$w^{(T)} = \frac{1}{S}\sum_{t=0}^{T-1}\Big[s^{(t)}\Big(\frac{1}{M}\sum_{m=1}^{M}w_m^{(t)}\Big)\Big] = \frac{1}{S}\sum_{t=0}^{T-1}s^{(t)}w^{(t)}, \qquad (7f)$$
$$s^{(t)} = (a+t)^2, \qquad (7g)$$
$$S = \sum_{t=0}^{T-1}s^{(t)} \ge \frac{T^3}{3}. \qquad (7h)$$

Corollary 1. For $\mathrm{gap}(I_m) \le H$, $a > \max\{4H/\gamma, 32\kappa, H\}$, and $\sigma_{\max} = \max_{m\in\mathcal{M}}\sigma_m$, if $\{w_m^{(t)}\}_{t=0}^{T-1}$ is generated according to Algorithm 1, then using $\mathbb{E}\|w^{(0)} - w^*\|_2^2 \le \frac{4G^2}{\mu^2}$ from Lemma 2 in (Rakhlin, Shamir, and Sridharan 2011), we have

$$\mathbb{E}[f(w^{(T)})] - f^* \le O\Big(\frac{G^2H^3}{\mu^2\gamma^3T^3}\Big) + O\Big(\frac{\sigma_{\max}^2}{\mu^2 bRT} + \frac{H\sigma_{\max}^2}{\mu^2 bR\gamma T^2}\Big) + O\Big(\frac{G^2}{\mu^3\gamma^2T^2}(H^2 + H^4)\Big). \qquad (8)$$

2.3 Problem Formulation

In this part, we define resource-efficient FL with LGC. Considering that the resources of different mobile devices vary, we formulate the optimization problem of minimizing the global loss function under resource constraints as follows:

$$\min_{\{T,\, H_m^{(t)},\, D_{m,n}^{(t)}\}} f(w^{(T)}), \qquad (9)$$

subject to

$$\sum_{t=1}^{T}\Big(E_{m,r,\mathrm{comp}}^{(t)} H_m^{(t)} + \sum_{n=1}^{N} E_{m,r,\mathrm{comm}}^{(t)} D_{m,n}^{(t)}\Big) \le B_{m,r}, \quad \forall m \in \mathcal{M}, \forall r \in \mathcal{R}, \qquad (10a)$$
$$\sum_{n=1}^{N} D_{m,n}^{(t)} \le D, \quad \forall m \in \mathcal{M}, \forall t \in \mathcal{T}, \qquad (10b)$$
$$H_m^{(t)} \le H, \quad \forall m \in \mathcal{M}, \forall t \in \mathcal{T}, \qquad (10c)$$

where $E_{m,r,\mathrm{comp}}^{(t)}$ is the total resource consumption for local computation of device $m$ for resource $r$ in round $t$, and $E_{m,r,\mathrm{comm}}^{(t)}$ is the resource consumption factor for communication of device $m$ for resource $r$ in round $t$. $H_m^{(t)}$ represents the number of local update steps at device $m$ in round $t$, $D_{m,n}^{(t)}$ indicates the traffic allocation for channel $n$ at device $m$ in round $t$, and $B_{m,r}$ represents the total budget for resource $r$ at device $m$.

Since FL is typically deployed in highly dynamic edge networks, a learning-based method can adaptively adjust the local computation and communication decisions while satisfying the resource constraints at each epoch in MEC.
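As an illustration, here is a minimal sketch of checking constraints (10a)-(10c) for a single device; the argument names mirror the symbols above and are otherwise hypothetical.

def within_budget(E_comp, E_comm, H, D_alloc, B, D_max, H_max):
    """E_comp[t][r], E_comm[t][r]: per-round consumption terms; H[t]: local steps;
    D_alloc[t][n]: traffic on channel n in round t; B[r]: budget for resource r."""
    T = len(H)
    for r in range(len(B)):                               # constraint (10a)
        spend = sum(E_comp[t][r] * H[t] +
                    E_comm[t][r] * sum(D_alloc[t])
                    for t in range(T))
        if spend > B[r]:
            return False
    if any(sum(D_alloc[t]) > D_max for t in range(T)):    # constraint (10b)
        return False
    return all(H[t] <= H_max for t in range(T))           # constraint (10c)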

3 Control Algorithm Design

In this section, we propose a learning-based control algorithm for LGC to achieve resource-efficient FL. We first introduce the workflow of the deep reinforcement learning (DRL) algorithm and then describe how to transform the formulated problem into a DRL process.

3.1 Deep Reinforcement Learning Mechanism

Different from traditional approaches that use predefined rules or model-based heuristics, the DRL-based method aims to learn a general action policy based on the current system state and the given reward. This is critical for deploying LGC in a highly dynamic environment.

The workflow of the DRL method is illustrated in Figure 2. At each epoch $t$, each device $m$ measures its state $s_m^{(t)}$, computes the corresponding reward $r_m^{(t)}$, and chooses its action $a_m^{(t)}$ based on its policy $\pi_m^{(t)}$. After device $m$ updates its state to $s_m^{(t+1)}$ at the next epoch $t+1$, it puts the tuple $(s_m^{(t)}, a_m^{(t)}, r_m^{(t)}, s_m^{(t+1)})$ into a replay buffer for experience accumulation. A critic network then reads from the replay buffer and, together with the optimizer, updates the policy to $\pi_m^{(t+1)}$. In particular, $\pi_m^{(t+1)}$ is updated with the goal of maximizing the accumulated reward $R_m^{(t)} = \sum_{t=0}^{\infty} \gamma_m^{(t)} r_m^{(t)}$, where $\gamma \in (0, 1]$ is a discount factor on future rewards.
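The interaction loop just described can be sketched generically as follows; `env`, `policy`, and `update_policy` are placeholders standing in for the MEC environment and the actor-critic update, not the authors' code.

import random
from collections import deque

replay_buffer = deque(maxlen=100_000)       # experience replay buffer

def run_episode(env, policy, update_policy, batch_size=64):
    s = env.reset()
    done = False
    while not done:
        a = policy(s)                                # choose action from the current policy
        s_next, r, done = env.step(a)
        replay_buffer.append((s, a, r, s_next))      # accumulate experience
        if len(replay_buffer) >= batch_size:
            batch = random.sample(replay_buffer, batch_size)
            update_policy(batch)                     # critic/actor update on a minibatch
        s = s_next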

[Figure 2: the agent contains an actor (online and target policy networks with an optimizer) and a critic (online and target Q-networks with an optimizer), connected to the environment through an experience replay buffer.]

Figure 2: The workflow of the DRL algorithm.

3.2 Model Design

To implement the formulated problem using DRL techniques, we first specify the state space, the action space, and the reward function as below.

State Space. The state of each agent contains the current resource consumption of each type of resource. We denote the state space by $\mathcal{S}_m = \{s_m^{(t)}, \forall t \in \mathcal{T}\}$ and define $s_m^{(t)}$ as follows:

$$s_m^{(t)} = (E_{m,\mathrm{comm}}^{(t)}, E_{m,\mathrm{comp}}^{(t)}), \qquad (11)$$

where

$$E_{m,\mathrm{comm}}^{(t)} = (E_{m,1,\mathrm{comm}}^{(t)}, \cdots, E_{m,r,\mathrm{comm}}^{(t)}, \cdots, E_{m,R,\mathrm{comm}}^{(t)}), \qquad (12a)$$
$$E_{m,\mathrm{comp}}^{(t)} = (E_{m,1,\mathrm{comp}}^{(t)}, \cdots, E_{m,r,\mathrm{comp}}^{(t)}, \cdots, E_{m,R,\mathrm{comp}}^{(t)}). \qquad (12b)$$

The state variables are described as follows.

• $E_{m,r,\mathrm{comm}}^{(t)}$ represents the consumption factor for communication of resource $r$ at device $m$ in round $t$.

• $E_{m,r,\mathrm{comp}}^{(t)}$ represents the total consumption for local computation of resource $r$ at device $m$ in round $t$.

Action Space. Each device $m$ has an action space denoted by $\mathcal{A}_m = \{a_m^{(t)}, \forall t \in \mathcal{T}\}$. On receiving state $s_m^{(t)}$, agent $m$ needs to choose its local computation and communication decisions. Specifically, an action can be represented as

$$a_m^{(t)} = (H_m^{(t)}, D_m^{(t)}), \qquad (13)$$

where $D_m^{(t)} = (D_{m,1}^{(t)}, \cdots, D_{m,n}^{(t)}, \cdots, D_{m,N}^{(t)})$. The action variables are described as follows.

• $H_m^{(t)}$ represents the number of local iterations at device $m$ in round $t$.

• $D_{m,n}^{(t)}$ represents the number of gradient entries sent through channel $n$ at device $m$ in round $t$.

Reward Function. At each training epoch $t$, agent $m$ receives a reward $r(s_m^{(t)}, a_m^{(t)}, s_m^{(t+1)})$ for executing action $a_m^{(t)}$ in state $s_m^{(t)}$. The objective of this work is to minimize the global loss function $\varepsilon^{(T)} = \sum_{m=1}^{M} \varepsilon_m^{(T)}$ under resource constraints. Hence, we minimize $\varepsilon_m^{(T)}$ for each device $m$ under its resource constraints. We first define the utility function over resource $r$ at device $m$ in iteration $t$ as follows:

$$U_{m,r}^{(t)} = \frac{\delta_m^{(t)}}{\varepsilon_{m,r}^{(t)}}, \qquad (14)$$

where

$$\delta_m^{(t)} = \varepsilon_m^{(t)} - \varepsilon_m^{(t-1)}, \qquad (15a)$$
$$\varepsilon_{m,r}^{(t)} = E_{m,r,\mathrm{comp}}^{(t)} H_m^{(t)} + \sum_{n=1}^{N} E_{m,r,\mathrm{comm}}^{(t)} D_{m,n}^{(t)}. \qquad (15b)$$

Then we define the reward function as the weighted average of the utility functions over the $R$ types of resources at device $m$ in iteration $t$:

$$r(s_m^{(t)}, a_m^{(t)}, s_m^{(t+1)}) = \sum_{r=1}^{R} \alpha_r \frac{U_{m,r}^{(t+1)}}{U_{m,r}^{(t)}}, \qquad (16)$$

where $\alpha_r$ is the weight of the utility function $U_{m,r}^{(t)}$.
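Read operationally, Eqs. (14)-(16) can be transcribed as the short sketch below; `delta` stands for $\delta_m^{(t)}$ and `alpha` for the per-resource weights (the names are illustrative).

def utility(delta, E_comp_r, E_comm_r, H, D_alloc):
    """Eqs. (14)-(15b): loss improvement per unit of resource r consumed."""
    eps_r = E_comp_r * H + E_comm_r * sum(D_alloc)
    return delta / eps_r

def reward(U_prev, U_next, alpha):
    """Eq. (16): weighted ratio of successive utilities over the R resources."""
    return sum(a * u1 / u0 for a, u1, u0 in zip(alpha, U_next, U_prev))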

3.3 DRL Algorithm Details

In our framework, each device dynamically decides its number of local iterations, gradient compression ratio, and traffic allocation among different channels based on the state-of-the-art Deep Deterministic Policy Gradient (DDPG) algorithm (Lillicrap et al. 2015). Specifically, the algorithm maintains a parameterized critic function and actor function. As shown in Figure 2, the critic function $Q(s_m^{(t)}, a_m^{(t)} | \theta_m^Q)$ is implemented by a Deep Q-Network (DQN), where $\theta_m^Q$ denotes the weight vector of the DQN. The actor function $\pi(s_m^{(t)} | \theta_m^\pi)$ is implemented by a DNN, where $\theta_m^\pi$ is the weight vector of the DNN. If the agent in state $s_m^{(t)}$ takes action $a_m^{(t)}$ at epoch $t$, the $Q$-value of the critic function is returned as follows:

$$Q(s_m^{(t)}, a_m^{(t)}) = \mathbb{E}\big[R_m \mid s_m^{(t)}, a_m^{(t)}\big], \qquad (17)$$

where $R_m = \sum_{k=t}^{T} \gamma_m^{(k)} r(s_m^{(k)}, a_m^{(k)})$. Let $y_m^{(t)}$ be the target value at epoch $t$. It can be evaluated as

$$y_m^{(t)} = r(s_m^{(t)}, a_m^{(t)}) + \gamma_m^{(t)} Q(s_m^{(t+1)}, \pi(s_m^{(t+1)} | \theta_m^\pi) | \theta_m^Q), \qquad (18)$$

where $\gamma_m^{(t)}$ denotes the discount factor for future rewards at edge device $m$ at epoch $t$.
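A PyTorch-style sketch of the target value in Eq. (18) is given below; `actor_target` and `critic_target` are the target networks from Figure 2, and fitting $Q$ to $y_m^{(t)}$ with a mean-squared loss follows standard DDPG (Lillicrap et al. 2015) rather than the authors' code.

import torch

def ddpg_target(critic_target, actor_target, r, s_next, gamma):
    """Eq. (18): y = r + gamma * Q'(s', pi'(s'))."""
    with torch.no_grad():                    # targets are not backpropagated through
        a_next = actor_target(s_next)        # pi(s^{(t+1)} | theta^pi)
        return r + gamma * critic_target(s_next, a_next)

def critic_loss(critic, s, a, y):
    """Fit the online Q-network to the target value."""
    return torch.nn.functional.mse_loss(critic(s, a), y)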

4 Evaluation

We describe the implementation of LGC and verify its performance in this section. We first clarify our environment settings, and then show the experimental results.

4.1 Experiment Settings

Baselines. To illustrate the effectiveness of LGC, we implement LGC with the learning-based resource-efficient control algorithm and compare it with the following baseline FL mechanisms.

• FedAvg (McMahan et al. 2017) performs a fixed number of local computations in each round and aggregates the models in a centralized and synchronous paradigm.

• LGC without DRL performs a fixed number of local computations and makes the same communication decisions in each round. We use this as a baseline to show the benefits of the learning-based control algorithm.

Datasets and Models. The experiments are conducted over three different models (i.e., LR, CNN, and RNN, implemented with the open-source FedML framework (He et al. 2020)) and two real datasets (i.e., MNIST and Shakespeare).

• LR (Gortmaker 1994) and CNN (Albawi, Mohammed,and Al-Zawi 2017) are trained over MNIST (LeCun et al.1998), which is composed of 60,000 handwritten digitsfor training and 10,000 for testing.

• RNN is trained over Shakespeare. Shakespeare includes40,000 lines from a variety of Shakespeare’s plays.


Performance Metrics. In our experiments, we mainly adopt the following metrics to evaluate the performance of our proposed framework.

• Training loss measures the difference between the predicted values and the actual values. It is evaluated for both the DRL agent and the training models.

• Reward of DRL is the return of the reward function during one DRL training episode.

• Model accuracy is the proportion of correctly classified samples to all samples in the dataset.

• Energy consumption is the energy consumed by local computation and communication, which indicates the battery usage.

• Money cost denotes the money spent on the training procedure.

Hyperparameter Settings. For all experiments, we set the learning rate and batch size to 0.01 and 64, respectively. By default, we employ 3 devices and consider 3 different communication channels for FL. To quantify the energy cost of different channels, we adopt a Gaussian distribution whose mean and standard deviation values follow (Wang et al. 2019). The parameters of this distribution are given in Table 1.

Table 1: Energy consumption for different communication channels.

Channel Type    Mean (J/MB)          Standard Deviation
3G              1296                 0.00033
4G              2.2 × 1296           0.00033
5G              2.5 × 2.2 × 1296     0.00033
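For reference, a small sketch of drawing a per-channel energy cost (J/MB) from the Gaussian model in Table 1; the dictionary layout is an illustrative assumption, while the parameter values are taken from the table.

import numpy as np

ENERGY_MODEL = {          # channel -> (mean in J/MB, standard deviation)
    "3G": (1296.0, 0.00033),
    "4G": (2.2 * 1296.0, 0.00033),
    "5G": (2.5 * 2.2 * 1296.0, 0.00033),
}

def sample_energy_cost(channel, rng=None):
    mean, std = ENERGY_MODEL[channel]
    rng = rng or np.random.default_rng()
    return rng.normal(mean, std)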

4.2 Experiment Results

Results of DRL Training. The DRL training is conducted simultaneously with the FL procedure. In Figure 5(a), we first observe how the loss changes as the DRL episodes increase. The loss decreases quickly in the early stages of DRL training, because the DRL agent has no information about the network conditions and the FL training incurs a large training loss; thanks to the efficient exploration and the experience replay, the loss rapidly decreases as training proceeds. Figure 5(b) shows the change of the reward. Specifically, the reward value increases with the epochs, because the DRL model learns a better policy that achieves a higher reward.

Results of Performance. We compare LGC to the baselines with different datasets and models. The convergence curves of loss and model accuracy are shown in the first two plots of Figure 3, Figure 4, and Figure 6. We find that LGC converges at a rate similar to the baselines and has very little impact on the best model accuracy. We also compare LGC to the baselines under energy and money budgets. From the results in the last two plots of Figure 3, Figure 4, and Figure 6, LGC greatly reduces the energy and monetary cost required to achieve the target accuracy. The reason for the significant performance improvement of LGC under the resource budgets is that LGC performs communication compression and employs multi-channel communication between edge nodes and the edge server, and the DRL-based control algorithm dynamically adjusts the local computation and communication decisions.

5 Related Works

The unique characteristics of FL lead to two main practical issues in FL implementation: (i) communication cost and (ii) resource allocation. In this section, we review related works that address each of these issues.

5.1 Communication Cost

Local Computation. Some recent works propose to perform more computation on edge nodes before each global aggregation to reduce the number of communication rounds needed for the model training (McMahan et al. 2017; Yao, Huang, and Sun 2018; Liu et al. 2020). However, these approaches may increase the computation cost and delay convergence if global aggregation is too infrequent. The tradeoff between these sacrifices and the communication cost reduction thus has to be well managed.

Gradient Compression. To reduce the traffic per communication round instead of the number of communication rounds, other works let each participant communicate compressed gradients rather than raw gradients for every global synchronization, via quantization (Wen et al. 2017; Alistarh et al. 2017) or sparsification (Wangni et al. 2017; Stich, Cordonnier, and Jaggi 2018; Basu et al. 2019). However, these studies often ignore the heterogeneity among mobile devices (e.g., in computing capabilities and communication bandwidth) and require identical compression levels across all the participants, thereby exhibiting less flexibility.

5.2 Resource Allocation

Adaptive Aggregation. In recent works, adaptive adjustment of the global aggregation frequency has been investigated to increase training efficiency subject to resource constraints (Sprague et al. 2018; Wang et al. 2019). (Sprague et al. 2018) proposed asynchronous FL, where model aggregation occurs whenever local updates are received by the FL server. (Wang et al. 2019) proposed to adapt the global aggregation frequency based on resource constraints. While properly managing the system resources to enable FL in mobile edge networks, these studies overlook reducing resource consumption within the learning algorithm itself, thus hindering a substantial boost in training efficiency and resource utilization.

Joint Communication Techniques and Resource Management. Even though the computation capabilities of mobile devices have grown rapidly, many devices still face a scarcity of radio resources (Jordan, Lee, and Yang 2018). Given that local model transmission is an integral part of FL, a growing number of studies focus on developing novel wireless communication techniques for efficient FL (Amiri and Gunduz 2020; Yang et al. 2020). However, signal distortion can lead to a drop in accuracy, and scalability is also an issue when large heterogeneous networks are involved.


[Figure 3: (a) Loss vs. epochs; (b) Accuracy vs. epochs; (c) Accuracy vs. energy consumption; (d) Accuracy vs. money cost. Curves: LGC with DRL (Proposed), LGC without DRL (Baseline), FedSGD (Baseline).]

Figure 3: Convergence curves of different mechanisms with LR on MNIST.

[Figure 4: (a) Loss vs. epochs; (b) Accuracy vs. epochs; (c) Accuracy vs. energy consumption; (d) Accuracy vs. money cost. Curves: LGC with DRL (Proposed), LGC without DRL (Baseline), FedSGD (Baseline).]

Figure 4: Convergence curves of different mechanisms with CNN on MNIST.

[Figure 5: (a) Loss vs. episode; (b) Reward vs. episode.]

Figure 5: Convergence curves of DRL training.

On a higher level, wireless technologies, such as IEEE 802.11a and 5G, provide multiple non-overlapping channels. The available network capacity can be increased by using multiple channels, and nodes can be equipped with multiple interfaces to utilize the available channels.

6 Conclusion

We tackle the resource utilization issue of FL by proposing LGC, a framework that co-designs multi-channel transmission and communication compression for severely resource-limited MEC scenarios. We analyze a convergence upper bound on LGC's results and design a learning-based control algorithm for each device to dynamically decide its local computation and communication decisions in each epoch. The experimental results demonstrate that the LGC framework performs better than the baselines.

References

Albawi, S.; Mohammed, T. A.; and Al-Zawi, S. 2017. Understanding of a convolutional neural network. In 2017 International Conference on Engineering and Technology (ICET), 1–6. IEEE.

Alistarh, D.; Grubic, D.; Li, J.; Tomioka, R.; and Vojnovic, M. 2017. QSGD: Communication-efficient SGD via gradient quantization and encoding. In Advances in Neural Information Processing Systems, 1709–1720.

Amiri, M. M.; and Gunduz, D. 2020. Federated learning over wireless fading channels. IEEE Transactions on Wireless Communications, 19(5): 3546–3557.

Basu, D.; Data, D.; Karakus, C.; and Diggavi, S. 2019. Qsparse-local-SGD: Distributed SGD with Quantization, Sparsification and Local Computations. In Advances in Neural Information Processing Systems, 14668–14679.


[Figure 6: (a) Loss vs. epochs; (b) Accuracy vs. epochs; (c) Accuracy vs. energy consumption; (d) Accuracy vs. money cost. Curves: LGC with DRL (Proposed), LGC without DRL (Baseline), FedSGD (Baseline).]

Figure 6: Convergence curves of different mechanisms with RNN on Shakespeare.

Bonawitz, K.; Eichner, H.; Grieskamp, W.; Huba, D.; Ingerman, A.; Ivanov, V.; Kiddon, C.; Konecny, J.; Mazzocchi, S.; McMahan, H. B.; et al. 2019. Towards federated learning at scale: System design. arXiv preprint arXiv:1902.01046.

Chen, M.; Poor, H. V.; Saad, W.; and Cui, S. 2020. Convergence time optimization for federated learning over wireless networks. IEEE Transactions on Wireless Communications, 20(4): 2457–2471.

Gortmaker, S. L. 1994. Theory and methods–Applied Logistic Regression by David W. Hosmer Jr and Stanley Lemeshow. Contemporary Sociology, 23(1): 159.

He, C.; Li, S.; So, J.; Zeng, X.; Zhang, M.; Wang, H.; Wang, X.; Vepakomma, P.; Singh, A.; Qiu, H.; et al. 2020. FedML: A research library and benchmark for federated machine learning. arXiv preprint arXiv:2007.13518.

Jordan, M. I.; Lee, J. D.; and Yang, Y. 2018. Communication-efficient distributed statistical inference. Journal of the American Statistical Association.

LeCun, Y.; Bottou, L.; Bengio, Y.; and Haffner, P. 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11): 2278–2324.

Li, Y.; Tao, X.; Zhang, X.; Liu, J.; and Xu, J. 2021. Privacy-Preserved Federated Learning for Autonomous Driving. IEEE Transactions on Intelligent Transportation Systems.

Lillicrap, T. P.; Hunt, J. J.; Pritzel, A.; Heess, N.; Erez, T.; Tassa, Y.; Silver, D.; and Wierstra, D. 2015. Continuous control with deep reinforcement learning. arXiv preprint arXiv:1509.02971.

Liu, L.; Zhang, J.; Song, S.; and Letaief, K. B. 2020. Client-edge-cloud hierarchical federated learning. In ICC 2020 - 2020 IEEE International Conference on Communications (ICC), 1–6. IEEE.

Mania, H.; Pan, X.; Papailiopoulos, D.; Recht, B.; Ramchandran, K.; and Jordan, M. I. 2015. Perturbed iterate analysis for asynchronous stochastic optimization. arXiv preprint arXiv:1507.06970.

McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; and y Arcas, B. A. 2017. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, 1273–1282. PMLR.

Niknam, S.; Dhillon, H. S.; and Reed, J. H. 2020. Federated learning for wireless communications: Motivation, opportunities, and challenges. IEEE Communications Magazine, 58(6): 46–51.

Rakhlin, A.; Shamir, O.; and Sridharan, K. 2011. Making gradient descent optimal for strongly convex stochastic optimization. arXiv preprint arXiv:1109.5647.

Sprague, M. R.; Jalalirad, A.; Scavuzzo, M.; Capota, C.; Neun, M.; Do, L.; and Kopp, M. 2018. Asynchronous federated learning for geospatial applications. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 21–28. Springer.

Stich, S. U. 2018. Local SGD converges fast and communicates little. arXiv preprint arXiv:1805.09767.

Stich, S. U.; Cordonnier, J.-B.; and Jaggi, M. 2018. Sparsified SGD with memory. In Advances in Neural Information Processing Systems, 4447–4458.

Tran, N. H.; Bao, W.; Zomaya, A.; Nguyen, M. N.; and Hong, C. S. 2019. Federated learning over wireless networks: Optimization model design and analysis. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, 1387–1395. IEEE.

Verma, D.; Julier, S.; and Cirincione, G. 2018. Federated AI for building AI solutions across multiple agencies. arXiv preprint arXiv:1809.10036.

Wang, S.; Tuor, T.; Salonidis, T.; Leung, K. K.; Makaya, C.; He, T.; and Chan, K. 2018. When edge meets learning: Adaptive control for resource-constrained distributed machine learning. In IEEE INFOCOM 2018 - IEEE Conference on Computer Communications, 63–71. IEEE.

Wang, S.; Tuor, T.; Salonidis, T.; Leung, K. K.; Makaya, C.; He, T.; and Chan, K. 2019. Adaptive federated learning in resource constrained edge computing systems. IEEE Journal on Selected Areas in Communications, 37(6): 1205–1221.

Wangni, J.; Wang, J.; Liu, J.; and Zhang, T. 2017. Gradient sparsification for communication-efficient distributed optimization. arXiv preprint arXiv:1710.09854.


Wen, W.; Xu, C.; Yan, F.; Wu, C.; Wang, Y.; Chen, Y.; and Li, H. 2017. TernGrad: Ternary gradients to reduce communication in distributed deep learning. In Advances in Neural Information Processing Systems, 1509–1519.

Yang, K.; Jiang, T.; Shi, Y.; and Ding, Z. 2020. Federated learning via over-the-air computation. IEEE Transactions on Wireless Communications, 19(3): 2022–2035.

Yang, Q.; Liu, Y.; Chen, T.; and Tong, Y. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST), 10(2): 1–19.

Yao, X.; Huang, C.; and Sun, L. 2018. Two-stream federated learning: Reduce the communication costs. In 2018 IEEE Visual Communications and Image Processing (VCIP), 1–4. IEEE.

7 Appendix

Inspired by the perturbed iterate analysis framework (Mania et al. 2015), we define virtual sequences for every device $m \in \mathcal{M}$ and for all $t \ge 0$ as follows:

$$\tilde{w}_m^{(0)} := w_m^{(0)}, \qquad \tilde{w}_m^{(t+1)} := \tilde{w}_m^{(t)} - \eta^{(t)} \nabla f_m(w_m^{(t)}; \mathcal{D}_m^{(t)}). \qquad (19)$$

We also define

$$q^{(t)} := \frac{1}{M}\sum_{m=1}^{M} \nabla f_m(w_m^{(t)}; \mathcal{D}_m^{(t)}),$$
$$\bar{q}^{(t)} := \mathbb{E}_{\mathcal{D}_M^{(t)}}[q^{(t)}] = \frac{1}{M}\sum_{m=1}^{M} \nabla f_m(w_m^{(t)}),$$
$$\tilde{w}^{(t+1)} := \frac{1}{M}\sum_{m=1}^{M} \tilde{w}_m^{(t+1)} = \tilde{w}^{(t)} - \eta^{(t)} q^{(t)},$$
$$w^{(t)} := \frac{1}{M}\sum_{m=1}^{M} w_m^{(t)},$$
$$I_m = \big\{t_m^{(i)} : i \in \mathbb{Z}_+,\, t_m^{(i)} \in \mathcal{T},\, |t_m^{(i)} - t_m^{(j)}| \le H,\ \forall |i-j| \le 1\big\}. \qquad (20)$$

7.1 Proof of Theorem 1

Proof. Let $w^*$ be the minimizer of $f(w)$; therefore we have $\nabla f(w^*) = 0$. We denote $f(w^*)$ by $f^*$. Taking the average of the virtual sequences $\tilde{w}_m^{(t+1)} = \tilde{w}_m^{(t)} - \eta^{(t)}\nabla f_m(w_m^{(t)}; \mathcal{D}_m^{(t)})$ over the workers $m \in \mathcal{M}$ and defining $q^{(t)} := \frac{1}{M}\sum_{m=1}^{M}\nabla f_m(w_m^{(t)}; \mathcal{D}_m^{(t)})$, we get

$$\tilde{w}^{(t+1)} = \tilde{w}^{(t)} - \eta^{(t)} q^{(t)}. \qquad (21)$$

Define $\mathcal{D}_M^{(t)}$ as the set of mini-batches sampled at the workers, $\{\mathcal{D}_1^{(t)}, \mathcal{D}_2^{(t)}, \ldots, \mathcal{D}_M^{(t)}\}$, and let $\bar{q}^{(t)} = \mathbb{E}_{\mathcal{D}_M^{(t)}}[q^{(t)}] = \frac{1}{M}\sum_{m=1}^{M}\nabla f_m(w_m^{(t)})$. From (21) we can get

$$\|\tilde{w}^{(t+1)} - w^*\|^2 = \|\tilde{w}^{(t)} - \eta^{(t)}q^{(t)} - w^* - \eta^{(t)}\bar{q}^{(t)} + \eta^{(t)}\bar{q}^{(t)}\|^2$$
$$= \|\tilde{w}^{(t)} - w^* - \eta^{(t)}\bar{q}^{(t)}\|^2 + (\eta^{(t)})^2\|\bar{q}^{(t)} - q^{(t)}\|^2 - 2\eta^{(t)}\big\langle \tilde{w}^{(t)} - w^* - \eta^{(t)}\bar{q}^{(t)},\, q^{(t)} - \bar{q}^{(t)}\big\rangle. \qquad (22)$$

Taking the expectation w.r.t. the sampling $\mathcal{D}_M^{(t)}$ at time $t$ (conditioning on the past) and noting that the last term in (22) becomes zero gives:

$$\mathbb{E}_{\mathcal{D}_M^{(t)}}\|\tilde{w}^{(t+1)} - w^*\|^2 = \|\tilde{w}^{(t)} - w^* - \eta^{(t)}\bar{q}^{(t)}\|^2 + (\eta^{(t)})^2\,\mathbb{E}\|\bar{q}^{(t)} - q^{(t)}\|^2. \qquad (23)$$

It follows from Jensen's inequality and independence that $\mathbb{E}_{\mathcal{D}_M^{(t)}}\|q^{(t)} - \bar{q}^{(t)}\|^2 \le \frac{\sum_{m=1}^{M}\sigma_m^2}{bM^2}$. This gives

$$\mathbb{E}_{\mathcal{D}_M^{(t)}}\|\tilde{w}^{(t+1)} - w^*\|^2 \le \|\tilde{w}^{(t)} - w^* - \eta^{(t)}\bar{q}^{(t)}\|^2 + (\eta^{(t)})^2\frac{\sum_{m=1}^{M}\sigma_m^2}{bM^2}. \qquad (24)$$

Now we bound the first term on the RHS. Using $\mu$-strong convexity and $L$-smoothness of $f$, together with some algebraic manipulations provided in Lemma 14 in (Basu et al. 2019), we arrive at

$$\mathbb{E}\|\tilde{w}^{(t+1)} - w^*\|_2^2 \le \Big(1 - \frac{\mu\eta^{(t)}}{2}\Big)\mathbb{E}\|\tilde{w}^{(t)} - w^*\|_2^2 - \frac{\eta^{(t)}\mu}{2L}\big(\mathbb{E}[f(w^{(t)})] - f^*\big) + \eta^{(t)}\Big(\frac{3\mu}{2} + 3L\Big)\mathbb{E}\|\tilde{w}^{(t)} - w^{(t)}\|_2^2 + \frac{3\eta^{(t)}L}{M}\sum_{m=1}^{M}\mathbb{E}\|w^{(t)} - w_m^{(t)}\|_2^2 + (\eta^{(t)})^2\frac{\sum_{m=1}^{M}\sigma_m^2}{bM^2}. \qquad (25)$$

Now we have to bound the deviation of the local sequences $\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\|w^{(t)} - w_m^{(t)}\|_2^2$ and the difference between the virtual and true sequences $\mathbb{E}\|\tilde{w}^{(t)} - w^{(t)}\|_2^2$. We show these below in Lemma 2 and Lemma 3.

Lemma 2 (Contracting local sequence deviation). Let $\mathrm{gap}(I_m) \le H$ hold for every $m \in \mathcal{M}$. For $w_m^{(t)}$ generated according to Algorithm 1 with a decaying learning rate $\eta^{(t)}$, and letting $w^{(t)} = \frac{1}{M}\sum_{m=1}^{M}w_m^{(t)}$, we have the following bound on the deviation of the local sequences:

$$\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\|w^{(t)} - w_m^{(t)}\|_2^2 \le 8\big(1 + C''H^2\big)(\eta^{(t)})^2G^2H^2, \qquad (26)$$

where $C'' = \frac{8}{M}\sum_{m=1}^{M}(4 - 2\gamma_m)\big(1 + \frac{C}{\gamma_m^2}\big)$ and $C$ is a constant satisfying $C \ge \min_{m\in\mathcal{M}}\frac{4a\gamma_m(1-\gamma_m^2)}{a\gamma_m - 4H}$.

Lemma 3 (Contracting distance between virtual and true sequences). Let $\mathrm{gap}(I_m) \le H$ hold for every $m \in \mathcal{M}$. If we run Algorithm 1 with a decaying learning rate $\eta^{(t)}$, then we have the following bound on the difference between the true and virtual sequences:

$$\mathbb{E}\|\tilde{w}^{(t)} - w^{(t)}\|_2^2 \le C'(\eta^{(t)})^2H^4G^2 + 12C\frac{(\eta^{(t)})^2}{\gamma^2}G^2H^2, \qquad (27)$$

where $C' = 24C'' = \frac{192}{M}\sum_{m=1}^{M}(4 - 2\gamma_m)\big(1 + \frac{C}{\gamma_m^2}\big)$ and $C$ is a constant satisfying $C \ge \min_{m\in\mathcal{M}}\frac{4a\gamma_m(1-\gamma_m^2)}{a\gamma_m - 4H}$.

Substituting the bounds from (26) and (27) into (25) yields

$$\mathbb{E}\|\tilde{w}^{(t+1)} - w^*\|_2^2 \le \Big(1 - \frac{\mu\eta^{(t)}}{2}\Big)\mathbb{E}\|\tilde{w}^{(t)} - w^*\|_2^2 - \frac{\eta^{(t)}\mu}{2L}\varepsilon^{(t)} + \eta^{(t)}\Big(\frac{3\mu}{2} + 3L\Big)\Big[C'(\eta^{(t)})^2H^4G^2 + 12C\frac{(\eta^{(t)})^2}{\gamma^2}G^2H^2\Big] + 3\eta^{(t)}L\Big[8\big(1 + C''H^2\big)(\eta^{(t)})^2G^2H^2\Big] + (\eta^{(t)})^2\frac{\sum_{m=1}^{M}\sigma_m^2}{bM^2}, \qquad (28)$$

where $\varepsilon^{(t)} := \mathbb{E}[f(w^{(t)})] - f^*$. Employing a slightly modified version of Lemma 3.3 in (Stich, Cordonnier, and Jaggi 2018) with $a^{(t)} = \mathbb{E}\|\tilde{w}^{(t)} - w^*\|_2^2$, $A = \frac{\sum_{m=1}^{M}\sigma_m^2}{bM^2}$, and $B = (\frac{3\mu}{2} + 3L)(\frac{12CG^2H^2}{\gamma^2} + C_1(\eta^{(t)})^2H^4G^2) + 24(1 + C_2H^2)LG^2H^2$, we have

$$a^{(t+1)} \le \Big(1 - \frac{\mu\eta^{(t)}}{2}\Big)a^{(t)} - \frac{\mu\eta^{(t)}}{2L}\varepsilon^{(t)} + (\eta^{(t)})^2A + (\eta^{(t)})^3B.$$

For $\eta^{(t)} = \frac{8}{\mu(a+t)}$ and $s^{(t)} = (a+t)^2$, $S = \sum_{t=0}^{T-1}s^{(t)} \ge \frac{T^3}{3}$, we have

$$\frac{\mu}{2LS}\sum_{t=0}^{T-1}s^{(t)}\varepsilon^{(t)} \le \frac{\mu a^3}{8S}a^{(0)} + \frac{4T(T+2a)}{\mu S}A + \frac{64T}{\mu^2 S}B.$$

From convexity, we can finally write

$$\mathbb{E}f(w^{(T)}) - f^* \le \frac{La^3}{4S}a^{(0)} + \frac{8LT(T+2a)}{\mu^2 S}A + \frac{128LT}{\mu^3 S}B,$$

where $w^{(T)} := \frac{1}{S}\sum_{t=0}^{T-1}\big[s^{(t)}\big(\frac{1}{M}\sum_{m=1}^{M}w_m^{(t)}\big)\big] = \frac{1}{S}\sum_{t=0}^{T-1}s^{(t)}w^{(t)}$. This completes the proof of Theorem 1.

7.2 Proof of Lemma 2

Proof. Fix a time $t$ and consider any worker $m \in \mathcal{M}$. Let $t_m \in I_m$ denote the last synchronization step until time $t$ for the $m$'th worker. Define $t_0' := \min_{m\in\mathcal{M}} t_m$. We need to upper-bound $\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\|w^{(t)} - w_m^{(t)}\|^2$. Note that for any $M$ vectors $u_1, \ldots, u_M$, if we let $\bar{u} = \frac{1}{M}\sum_{i=1}^{M}u_i$, then $\sum_{i=1}^{M}\|u_i - \bar{u}\|^2 \le \sum_{i=1}^{M}\|u_i\|^2$. We use this in the first inequality below:

$$\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\|w^{(t)} - w_m^{(t)}\|^2 = \frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\|w_m^{(t)} - \bar{w}^{(t_0')} - (w^{(t)} - \bar{w}^{(t_0')})\|^2$$
$$\le \frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\|w_m^{(t)} - \bar{w}^{(t_0')}\|^2$$
$$\le \frac{2}{M}\sum_{m=1}^{M}\mathbb{E}\|w_m^{(t)} - w_m^{(t_m)}\|^2 + \frac{2}{M}\sum_{m=1}^{M}\mathbb{E}\|w_m^{(t_m)} - \bar{w}^{(t_0')}\|^2. \qquad (29)$$

We bound both terms separately. For the first term:

$$\mathbb{E}\|w_m^{(t)} - w_m^{(t_m)}\|^2 = \mathbb{E}\Big\|\sum_{j=t_m}^{t-1}\eta^{(j)}\nabla f_{\mathcal{D}_m^{(j)}}(w_m^{(j)})\Big\|^2 \le (t - t_m)\sum_{j=t_m}^{t-1}\mathbb{E}\|\eta^{(j)}\nabla f_{\mathcal{D}_m^{(j)}}(w_m^{(j)})\|^2 \le (t - t_m)^2(\eta^{(t_m)})^2G^2 \le 4(\eta^{(t)})^2H^2G^2. \qquad (30)$$

The last inequality in (30) uses $\eta^{(t_m)} \le 2\eta^{(t_m+H)} \le 2\eta^{(t)}$ and $t - t_m \le H$. To bound the second term of (29), note that we have

$$\bar{w}^{(t_m)} = \bar{w}^{(t_0')} - \frac{1}{M}\sum_{s=1}^{M}\sum_{j=t_0'}^{t_m-1}\mathbb{1}\{j+1 \in I_s\}\, g_s^{(j)}. \qquad (31)$$

Note that $w_m^{(t_m)} = \bar{w}^{(t_m)}$, because at synchronization steps the local parameter vector becomes equal to the global parameter vector. Using this, Jensen's inequality, and $\|\mathbb{1}\{j+1 \in I_s\}\,g_s^{(j)}\|^2 \le \|g_s^{(j)}\|^2$, we can upper-bound the second term via (31) as

$$\mathbb{E}\|w_m^{(t_m)} - \bar{w}^{(t_0')}\|^2 \le \frac{t_m - t_0'}{M}\sum_{s=1}^{M}\sum_{j=t_0'}^{t_m}\mathbb{E}\|g_s^{(j)}\|^2. \qquad (32)$$

Now we bound $\mathbb{E}\|g_s^{(j)}\|^2$ for any $j \in \{t_0', \ldots, t_m\}$ and $s \in \mathcal{M}$. Since $\mathbb{E}\|C_m^{(t)}(u)\|^2 \le B\|u\|^2$ holds for every $u$, with $B = (4 - 2\gamma_m)$ (this can be seen as follows: $\mathbb{E}\|C_m^{(t)}(u)\|^2 \le 2\mathbb{E}\|u - C_m^{(t)}(u)\|^2 + 2\|u\|^2 \le 2(1-\gamma_m)\|u\|^2 + 2\|u\|^2$), we have for any $s \in \mathcal{M}$ that

$$\mathbb{E}\|g_s^{(j)}\|^2 \le B\,\mathbb{E}\|e_s^{(j)} + \hat{w}_s^{(j)} - w_s^{(j+\frac{1}{2})}\|^2 \le 2B\,\mathbb{E}\|e_s^{(j)}\|^2 + 2B\,\mathbb{E}\|\hat{w}_s^{(j)} - w_s^{(j+\frac{1}{2})}\|^2. \qquad (33)$$

We can directly use Lemma 1 to bound the first term in (33) as $\mathbb{E}\|e_s^{(j)}\|^2 \le 4C\frac{(\eta^{(j)})^2}{\gamma_s^2}H^2G^2$. In order to bound the second term of (33), note that $\hat{w}_s^{(j)} = w_s^{(t_s)}$, which implies that $\|\hat{w}_s^{(j)} - w_s^{(j+\frac{1}{2})}\|^2 = \|\sum_{l=t_s}^{j}\eta^{(l)}\nabla f_{\mathcal{D}_s^{(l)}}(w_s^{(l)})\|^2$. Taking expectation yields $\mathbb{E}\|\hat{w}_s^{(j)} - w_s^{(j+\frac{1}{2})}\|^2 \le 4(\eta^{(t_s)})^2H^2G^2 \le 4(\eta^{(t_0')})^2H^2G^2$, where in the last inequality we used that $t_0' \le t_s$. Using these in (33) gives

$$\mathbb{E}\|g_s^{(j)}\|^2 \le 8B\Big(1 + \frac{C}{\gamma_s^2}\Big)(\eta^{(t_0')})^2H^2G^2. \qquad (34)$$

Since $t_0' \le t \le t_0' + H$, we have $\eta^{(t_0')} \le 2\eta^{(t_0'+H)} \le 2\eta^{(t)}$. Putting the bound on $\mathbb{E}\|g_s^{(j)}\|^2$ (after substituting $\eta^{(t_0')} \le 2\eta^{(t)}$ in (34)) in (32) gives

$$\mathbb{E}\|w_m^{(t_m)} - \bar{w}^{(t_0')}\|^2 \le 32\Big[\frac{1}{M}\sum_{m=1}^{M}(4 - 2\gamma_m)\Big(1 + \frac{C}{\gamma_m^2}\Big)\Big](\eta^{(t)})^2H^4G^2. \qquad (35)$$

Putting this and the bound from (30) back in (29) gives

$$\frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\|w^{(t)} - w_m^{(t)}\|^2 \le 8(\eta^{(t)})^2H^2G^2 + 64\Big[\frac{1}{M}\sum_{m=1}^{M}(4 - 2\gamma_m)\Big(1 + \frac{C}{\gamma_m^2}\Big)\Big](\eta^{(t)})^2H^4G^2$$
$$\le 8\Big[1 + 8\Big[\frac{1}{M}\sum_{m=1}^{M}(4 - 2\gamma_m)\Big(1 + \frac{C}{\gamma_m^2}\Big)\Big]H^2\Big](\eta^{(t)})^2H^2G^2. \qquad (36)$$

This completes the proof of Lemma 2.

7.3 Proof of Lemma 3

Proof. Fix a time $t$ and consider any device $m \in \mathcal{M}$. Let $t_m \in I_m$ denote the last synchronization step until time $t$ for the $m$'th device. Define $t_0' := \min_{m\in\mathcal{M}} t_m$. We want to bound $\mathbb{E}\|\tilde{w}^{(t)} - w^{(t)}\|^2$. By definition, $\tilde{w}^{(t)} - w^{(t)} = \frac{1}{M}\sum_{m=1}^{M}(\tilde{w}_m^{(t)} - w_m^{(t)})$. By the definition of the virtual sequences and the update rule for $w_m^{(t)}$, we also have $\tilde{w}^{(t)} - w^{(t)} = \frac{1}{M}\sum_{m=1}^{M}(\tilde{w}_m^{(t_m)} - w_m^{(t_m)})$. This can be written as

$$w^{(t)} - \tilde{w}^{(t)} = \Big[\frac{1}{M}\sum_{m=1}^{M}w_m^{(t_m)} - \bar{w}^{(t_0')}\Big] + \big[\bar{w}^{(t_0')} - \bar{w}^{(t)}\big] + \Big[\bar{w}^{(t)} - \frac{1}{M}\sum_{m=1}^{M}\tilde{w}_m^{(t_m)}\Big]. \qquad (37)$$

Applying Jensen's inequality and taking expectation gives

$$\mathbb{E}\|\tilde{w}^{(t)} - w^{(t)}\|^2 \le \Big[\frac{3}{M}\sum_{m=1}^{M}\mathbb{E}\|w_m^{(t_m)} - \bar{w}^{(t_0')}\|^2\Big] + \Big[3\,\mathbb{E}\|\bar{w}^{(t_0')} - \bar{w}^{(t)}\|^2\Big] + 3\,\mathbb{E}\Big\|\bar{w}^{(t)} - \frac{1}{M}\sum_{m=1}^{M}\tilde{w}_m^{(t_m)}\Big\|^2. \qquad (38)$$

We bound each of the three terms of (38) separately. We have upper-bounded the first term earlier in (35), which gives

$$\mathbb{E}\|w_m^{(t_m)} - \bar{w}^{(t_0')}\|^2 \le 32B\Big(1 + \frac{C}{\gamma^2}\Big)(\eta^{(t)})^2H^4G^2, \qquad (39)$$

where $B = (4 - 2\gamma)$. To bound the second term of (38), note that

$$\bar{w}^{(t)} = \bar{w}^{(0)} - \frac{1}{M}\sum_{m=1}^{M}\sum_{j=0}^{t_m-1}\mathbb{1}\{j+1 \in I_m\}\,g_m^{(j)} = \bar{w}^{(t_0')} - \frac{1}{M}\sum_{m=1}^{M}\sum_{j=t_0'}^{t_m-1}\mathbb{1}\{j+1 \in I_m\}\,g_m^{(j)}. \qquad (40)$$

By applying Jensen's inequality, using $\|\mathbb{1}\{j+1 \in I_m\}\,g_m^{(j)}\|^2 \le \|g_m^{(j)}\|^2$, and taking expectation, we can upper-bound (40) as

$$\mathbb{E}\|\bar{w}^{(t_0')} - \bar{w}^{(t)}\|^2 \le \frac{t_m - t_0'}{M}\sum_{m=1}^{M}\sum_{j=t_0'}^{t_m}\mathbb{E}\|g_m^{(j)}\|^2. \qquad (41)$$

Using the bound on the $\mathbb{E}\|g_m^{(j)}\|^2$'s from (34) gives

$$\mathbb{E}\|\bar{w}^{(t_0')} - \bar{w}^{(t)}\|^2 \le 32B\Big(1 + \frac{C}{\gamma^2}\Big)(\eta^{(t)})^2H^4G^2. \qquad (42)$$

To bound the last term of (38), note that

$$\tilde{w}_m^{(t_m)} = \bar{w}^{(0)} - \sum_{j=0}^{t_m-1}\eta^{(j)}\nabla f_{\mathcal{D}_m^{(j)}}(w_m^{(j)}). \qquad (43)$$

From (40) and (43), we can write

$$\bar{w}^{(t)} - \frac{1}{M}\sum_{m=1}^{M}\tilde{w}_m^{(t_m)} = \frac{1}{M}\sum_{m=1}^{M}\Big[\sum_{j=0}^{t_m-1}\eta^{(j)}\nabla f_{\mathcal{D}_m^{(j)}}(w_m^{(j)}) - \sum_{j=0}^{t_m-1}\mathbb{1}\{j+1 \in I_m\}\,g_m^{(j)}\Big]. \qquad (44)$$

Let $t_m^{(1)}$ and $t_m^{(2)}$ be two consecutive synchronization steps in $I_m$. Then, by the update rule of $w_m^{(t)}$, we have $w_m^{(t_m^{(1)})} - w_m^{(t_m^{(2)} - \frac{1}{2})} = \sum_{j=t_m^{(1)}}^{t_m^{(2)}-1}\eta^{(j)}\nabla f_{\mathcal{D}_m^{(j)}}(w_m^{(j)})$. Since $\hat{w}_m^{(t_m^{(1)})} = w_m^{(t_m^{(1)})}$ and the devices do not modify their local $\hat{w}_m^{(t)}$'s in between the synchronization steps, we have $\hat{w}_m^{(t_m^{(2)}-1)} = \hat{w}_m^{(t_m^{(1)})} = w_m^{(t_m^{(1)})}$. Therefore, we can write

$$\hat{w}_m^{(t_m^{(2)}-1)} - w_m^{(t_m^{(2)} - \frac{1}{2})} = \sum_{j=t_m^{(1)}}^{t_m^{(2)}-1}\eta^{(j)}\nabla f_{\mathcal{D}_m^{(j)}}(w_m^{(j)}). \qquad (45)$$

Using (45) for every pair of consecutive synchronization steps, we can equivalently write (44) as

$$\bar{w}^{(t)} - \frac{1}{M}\sum_{m=1}^{M}\tilde{w}_m^{(t_m)} = \frac{1}{M}\sum_{m=1}^{M}\sum_{j:\, j+1\in I_m,\, j \le t_m-1}\big(\hat{w}_m^{(j)} - w_m^{(j+\frac{1}{2})} - g_m^{(j)}\big) = \frac{1}{M}\sum_{m=1}^{M}e_m^{(t_m)} = \frac{1}{M}\sum_{m=1}^{M}e_m^{(t)}. \qquad (46)$$

In the last equality, we used the fact that the devices do not update their local memory in between the synchronization steps. For the reasons given in the proof of Lemma 2, we can directly apply Lemma 4 in (Basu et al. 2019) to bound the local memories and obtain $\mathbb{E}\|\frac{1}{M}\sum_{m=1}^{M}e_m^{(t)}\|^2 \le \frac{1}{M}\sum_{m=1}^{M}\mathbb{E}\|e_m^{(t)}\|^2 \le 4C\frac{(\eta^{(t)})^2}{\gamma^2}G^2H^2$. This implies

$$\mathbb{E}\Big\|\bar{w}^{(t)} - \frac{1}{M}\sum_{m=1}^{M}\tilde{w}_m^{(t_m)}\Big\|^2 \le 4C\frac{(\eta^{(t)})^2}{\gamma^2}G^2H^2. \qquad (47)$$

Putting the bounds from (39), (42), and (47) in (38) and using $B = (4 - 2\gamma)$ gives

$$\mathbb{E}\|\tilde{w}^{(t)} - w^{(t)}\|^2 \le 192(4 - 2\gamma)\Big(1 + \frac{C}{\gamma^2}\Big)(\eta^{(t)})^2H^4G^2 + 12C\frac{(\eta^{(t)})^2}{\gamma^2}G^2H^2. \qquad (48)$$

This completes the proof of Lemma 3.
