
Federated Learning

Min Du

Postdoc, UC Berkeley

Outline

● Preliminary: deep learning and SGD

● Federated learning: FedSGD and FedAvg

● Related research in federated learning

● Open problems


• Find a function which produces a desired output given a particular input.

Example task           | Given input                        | Desired output
Image classification   | (an image of a handwritten digit)  | 8
Playing Go             | (the current board position)       | the next move
Next-word prediction   | "Looking forward to your ?"        | "reply"

w is the set of parameters contained in the function.

The goal of deep learning

• Given one input-output pair (x_0, y_0), the goal of deep learning model training is to find a set of parameters w that maximizes the probability of outputting y_0 given x_0.

Given input: x_0        Maximize: p(5 | x_0, w)

Finding the function: model training

• Given a training dataset containing n input-output pairs (x_i, y_i), i ∈ [1, n], the goal of deep learning model training is to find a set of parameters w such that the average of p(y_i | x_i, w) over the training set is maximized.


Finding the function: model training

• That is,

maximize  (1/n) Σ_{i=1}^{n} p(y_i | x_i, w)

which is equivalent to

minimize  (1/n) Σ_{i=1}^{n} −log p(y_i | x_i, w)

−log p(y_i | x_i, w) is the basic component of the loss function l(x_i, y_i, w) for sample (x_i, y_i).

Let f_i(w) = l(x_i, y_i, w) denote the loss function.
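As a concrete illustration (not from the slides), a minimal NumPy sketch of the per-sample loss f_i(w) = −log p(y_i | x_i, w) for a linear softmax classifier; the model form, shapes, and names here are assumptions for the example.

```python
import numpy as np

def per_sample_loss(w, x_i, y_i):
    """f_i(w) = -log p(y_i | x_i, w) for a linear softmax classifier.

    w   : (num_features, num_classes) weight matrix
    x_i : (num_features,) input vector
    y_i : integer class label
    """
    logits = x_i @ w                               # unnormalized class scores
    logits = logits - logits.max()                 # for numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()  # softmax: p(. | x_i, w)
    return -np.log(probs[y_i])                     # negative log-likelihood

# f(w) is then the average of per_sample_loss over the n training samples.
```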

• Given a training dataset containing n input-output pairs (x_i, y_i), i ∈ [1, n], the goal of deep learning model training is to find a set of parameters w such that the average of p(y_i | x_i, w) over the training set is maximized.

Finding the function: model training

Deep learning model training

For a training dataset containing n samples (x_i, y_i), 1 ≤ i ≤ n, the training objective is:

min_{w ∈ ℝ^d} f(w)   where   f(w) ≝ (1/n) Σ_{i=1}^{n} f_i(w)

f_i(w) = l(x_i, y_i, w) is the loss of the prediction on example (x_i, y_i).

No closed-form solution: in a typical deep learning model, w may contain millions of parameters.
Non-convex: multiple local minima exist.

[Figure: the loss f(w) plotted over w, showing a non-convex surface with multiple local minima]

Solution: Gradient Descent

[Figure: gradient descent on the loss f(w), starting from a randomly initialized weight w]

Compute the gradient ∇f(w) and update:

w_{t+1} = w_t − η ∇f(w_t)    (Gradient Descent)

At a local minimum, ∇f(w) is close to 0.

The learning rate η controls the step size.

How to stop? When the update is small enough, the iterates have converged:

‖w_{t+1} − w_t‖ ≤ ε    or    ‖∇f(w_t)‖ ≤ ε
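A minimal sketch of this loop (not from the slides): grad_f is a hypothetical helper returning the full-batch gradient, and the toy quadratic at the end only keeps the example self-contained.

```python
import numpy as np

def gradient_descent(grad_f, w0, eta=0.1, eps=1e-6, max_iters=10_000):
    """Plain gradient descent: w_{t+1} = w_t - eta * grad_f(w_t)."""
    w = w0
    for _ in range(max_iters):
        g = grad_f(w)                              # gradient over ALL n training samples
        w_next = w - eta * g
        if np.linalg.norm(w_next - w) <= eps:      # convergence test from the slide
            return w_next
        w = w_next
    return w

# Example on a toy quadratic f(w) = ||w||^2 / 2, whose gradient is w:
w_star = gradient_descent(grad_f=lambda w: w, w0=np.array([3.0, -2.0]))
```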

Problem: Usually the number of training samples n is large – slow convergence

Solution: Stochastic Gradient Descent (SGD)

● At each step of gradient descent, instead of computing the gradient over all training samples, randomly pick a small subset (mini-batch) of training samples (x_k, y_k).

● Compared to gradient descent, SGD takes more steps to converge, but each step is much faster.

w_{t+1} ← w_t − η ∇f(w_t; x_k, y_k)
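The same loop, with the full gradient replaced by a mini-batch estimate; again a rough sketch, where grad_on_batch and the data layout are assumptions for illustration.

```python
import numpy as np

def sgd(grad_on_batch, w0, X, Y, eta=0.1, batch_size=32, num_steps=5_000):
    """Mini-batch SGD: w_{t+1} = w_t - eta * grad f(w_t; x_b, y_b).

    grad_on_batch : function (w, X_batch, Y_batch) -> gradient estimate
    """
    rng = np.random.default_rng(0)
    w, n = w0, len(X)
    for _ in range(num_steps):
        idx = rng.choice(n, size=batch_size, replace=False)   # random mini-batch
        w = w - eta * grad_on_batch(w, X[idx], Y[idx])
    return w
```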

Outline

● Preliminary: deep learning and SGD

● Federated learning: FedSGD and FedAvg

● Related research in federated learning

● Open problems

“The biggest obstacle to using advanced data analysis isn’t skill base or technology; it’s plain old access to the data.”

- Edd Wilder-James, Harvard Business Review

The importance of data for ML

“Data is the New Oil”

Google, Apple, ......

[Diagram: users' devices upload their private data, which is used to train a central ML model]

Private data: all the photos a user takes and everything they type on their mobile keyboard, including passwords, URLs, messages, etc.

Image classification: e.g. to predict which photos are most likely to be viewed multiple times in the future.

Language models: e.g. voice recognition, next-word prediction, and auto-reply in Gmail.

Google, Apple, ......

Instead of uploading the raw data, train a model locally and upload the model.

Addressing privacy: model parameters will never contain more information than the raw training data.

Addressing network overhead: the size of the model is generally smaller than the size of the raw training data.

[Diagram: each device trains a local ML model; the uploaded models are combined via model aggregation into a shared ML model]

Federated optimization

● Characteristics (Major challenges)

○ Non-IID

■ The data generated by each user are quite different (a small simulation sketch follows this list)

○ Unbalanced

■ Some users produce significantly more data than others

○ Massively distributed

■ # mobile device owners >> avg # training samples on each device

○ Limited communication

■ Unstable mobile network connections
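To make the non-IID / unbalanced setting concrete, a small sketch (an illustration added here, not part of the slides) that assigns a labeled dataset to simulated clients so each client holds only a couple of classes and an uneven number of samples, similar in spirit to the MNIST partition used in the evaluation later.

```python
import numpy as np

def non_iid_partition(labels, num_clients=100, classes_per_client=2, seed=0):
    """Assign sample indices to simulated clients so that each client only sees
    a few classes (non-IID) and holds an uneven amount of data (unbalanced)."""
    rng = np.random.default_rng(seed)
    by_class = {int(c): np.where(labels == c)[0] for c in np.unique(labels)}
    classes = list(by_class)
    clients = []
    for _ in range(num_clients):
        chosen = rng.choice(classes, size=classes_per_client, replace=False)
        idx = []
        for c in chosen:
            take = int(rng.integers(10, 300))       # unbalanced: 10-299 samples per class
            idx.extend(rng.choice(by_class[c], size=take, replace=False))
        clients.append(np.array(idx))               # note: samples may repeat across clients
    return clients
```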

A new paradigm – Federated Learning

a synchronous update scheme that proceeds in rounds of communication

McMahan, H. Brendan, Eider Moore, Daniel Ramage, and Seth Hampson. "Communication-efficient learning of deep networks from decentralized data." AISTATS, 2017.

[Diagram] In round number i: the central server holds the global model M(i) and sends model M(i) to each client; each client, using its own local data, computes gradient updates for M(i) and sends them back to the server.

Deployed by Google, Apple, etc.

Federated learning – overview

[Diagram] In round number i: each client sends its gradient updates for M(i) to the central server, which performs model aggregation to produce M(i+1).

Federated learning – overview

[Diagram] In round number i+1: the central server sends the new global model M(i+1) to each client, and the process continues.

Federated learning – overview

Federated learning – detail

For efficiency, at the beginning of each round a random fraction C of clients is selected, and the server sends the current model parameters to each of these clients.
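A two-line sketch of this selection step (variable names are hypothetical; the max(C·K, 1) floor follows the FedAvg paper's convention of always selecting at least one client):

```python
import numpy as np

def select_clients(K, C, rng=np.random.default_rng(0)):
    """Sample a random fraction C of the K clients for this round (at least one)."""
    m = max(int(C * K), 1)
    return rng.choice(K, size=m, replace=False)
```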

Federated learning – detail

● Recall: in traditional deep learning model training,

○ For a training dataset containing n samples (x_i, y_i), 1 ≤ i ≤ n, the training objective is:

min_{w ∈ ℝ^d} f(w)   where   f(w) ≝ (1/n) Σ_{i=1}^{n} f_i(w)

f_i(w) = l(x_i, y_i, w) is the loss of the prediction on example (x_i, y_i).

○ Deep learning optimization relies on SGD and its variants, applied through mini-batches:

w_{t+1} ← w_t − η ∇f(w_t; x_k, y_k)

Federated learning – detail

● In federated learning

○ Suppose the n training samples are distributed over K clients, where P_k is the set of indices of the data points on client k, and n_k = |P_k|.

○ The training objective is:

min_{w ∈ ℝ^d} f(w)   where   f(w) = Σ_{k=1}^{K} (n_k / n) F_k(w)   and   F_k(w) ≝ (1/n_k) Σ_{i ∈ P_k} f_i(w)
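The weighted form above is just the original average over all n samples regrouped by client; a one-line derivation (added here for clarity):

```latex
f(w) = \frac{1}{n}\sum_{i=1}^{n} f_i(w)
     = \sum_{k=1}^{K} \frac{1}{n} \sum_{i \in P_k} f_i(w)
     = \sum_{k=1}^{K} \frac{n_k}{n} \cdot \frac{1}{n_k} \sum_{i \in P_k} f_i(w)
     = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w)
```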

A baseline – FederatedSGD (FedSGD)

● A randomly selected client that has n_k training data samples in federated learning ≈ a randomly selected sample in traditional deep learning.

● Federated SGD (FedSGD): a single step of gradient descent is done per round

● Recall in federated learning, a C-fraction of clients are selected at each round.

○ C=1: full-batch (non-stochastic) gradient descent

○ C<1: stochastic gradient descent (SGD)

A baseline – FederatedSGD (FedSGD)

Learning rate: η; total #samples: n; total #clients: K; #samples on client k: n_k; client fraction C = 1.

● In round t:

○ The central server broadcasts the current model w_t to each client; each client k computes the gradient g_k = ∇F_k(w_t) on its local data.

■ Approach 1: each client k submits g_k; the central server aggregates the gradients to generate the new model:

● w_{t+1} ← w_t − η ∇f(w_t) = w_t − η Σ_{k=1}^{K} (n_k / n) g_k

■ Approach 2: each client k computes w_{t+1}^k ← w_t − η g_k; the central server performs the aggregation:

● w_{t+1} ← Σ_{k=1}^{K} (n_k / n) w_{t+1}^k

Repeating the local step of Approach 2 multiple times before aggregating ⟹ FederatedAveraging (FedAvg).

Recall f(w) = Σ_{k=1}^{K} (n_k / n) F_k(w).
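A minimal NumPy sketch of one FedSGD round with C = 1 (grad_Fk is a hypothetical helper returning each client's local gradient ∇F_k(w_t)); it also checks that Approach 1 and Approach 2 give the same w_{t+1} when only one local step is taken.

```python
import numpy as np

def fedsgd_round(w_t, client_data, grad_Fk, eta=0.1):
    """One FedSGD round with all K clients participating (C = 1)."""
    n_k = np.array([len(d) for d in client_data], dtype=float)
    n = n_k.sum()

    # Approach 1: clients send gradients, the server aggregates and then steps.
    grads = [grad_Fk(w_t, d) for d in client_data]            # g_k = grad F_k(w_t)
    w_next_1 = w_t - eta * sum((nk / n) * g for nk, g in zip(n_k, grads))

    # Approach 2: clients take one local step, the server averages the models.
    local_models = [w_t - eta * g for g in grads]             # w_{t+1}^k
    w_next_2 = sum((nk / n) * wk for nk, wk in zip(n_k, local_models))

    assert np.allclose(w_next_1, w_next_2)                    # identical for a single step
    return w_next_1
```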

Federated learning – deal with limited communication

● Increase computation

○ Select more clients for training between each communication round

○ Increase computation on each client

Federated learning – FederatedAveraging (FedAvg)

Learning rate: η; total #samples: n; total #clients: K; #samples on client k: n_k; client fraction C.

● In round t:

○ The central server broadcasts the current model w_t to each client; each client k computes the gradient g_k = ∇F_k(w_t) on its local data.

■ Approach 2:

● Each client k computes, for E local epochs: w_{t+1}^k ← w_t − η g_k

● The central server performs the aggregation: w_{t+1} ← Σ_{k=1}^{K} (n_k / n) w_{t+1}^k

● With B the local mini-batch size, the number of local updates on client k in each round is u_k = E · n_k / B.
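A sketch of the FedAvg client update and server aggregation under these definitions; grad_on_batch and the data layout are assumptions carried over from the earlier SGD sketch.

```python
import numpy as np

def client_update(w_t, X, Y, grad_on_batch, eta=0.1, E=5, B=10, seed=0):
    """Run E local epochs of mini-batch SGD starting from the global model w_t.
    Performs u_k = E * n_k / B updates, where n_k = len(X)."""
    rng = np.random.default_rng(seed)
    w = w_t.copy()
    for _ in range(E):
        order = rng.permutation(len(X))
        for start in range(0, len(X), B):
            batch = order[start:start + B]
            w = w - eta * grad_on_batch(w, X[batch], Y[batch])
    return w

def server_aggregate(local_models, n_k):
    """w_{t+1} = sum_k (n_k / n) * w_{t+1}^k."""
    n = float(sum(n_k))
    return sum((nk / n) * wk for nk, wk in zip(n_k, local_models))
```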

Federated learning – FederatedAveraging (FedAvg)

Model initialization

● Two choices:

○ On the central server

○ On each client

[Figure: the loss on the full MNIST training set for models generated by θ·w + (1 − θ)·w′, mixing two models w and w′]

Shared initialization works better in practice.
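A small sketch of the experiment behind this figure (the two trained parameter vectors w, w_prime and the loss function are hypothetical stand-ins):

```python
import numpy as np

def interpolation_losses(w, w_prime, loss_fn, num_points=21):
    """Evaluate the training loss of theta*w + (1-theta)*w_prime for theta in [0, 1]."""
    thetas = np.linspace(0.0, 1.0, num_points)
    return [(t, loss_fn(t * w + (1 - t) * w_prime)) for t in thetas]

# With a shared random initialization the averaged model (theta = 0.5) tends to
# keep a low loss; with independent initializations it tends to be much worse.
```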

Federated learning – FederatedAveraging (FedAvg)

Model averaging

● As shown in the right figure:

[Figure: the loss on the full MNIST training set for models generated by θ·w + (1 − θ)·w′, mixing two models w and w′]

In practice, naïve parameter averaging works surprisingly well.

Federated learning – FederatedAveraging (FedAvg)

1. At first, a model is randomly initialized on the central server.

2. For each round t:
   i. A random set of clients is chosen;
   ii. Each client performs local gradient descent steps;
   iii. The server aggregates the model parameters submitted by the clients.
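Putting the pieces together, a compact sketch of the whole server loop, reusing the hypothetical select_clients, client_update, and server_aggregate helpers sketched earlier:

```python
def federated_averaging(init_model, clients_X, clients_Y, grad_on_batch,
                        num_rounds=100, C=0.1, eta=0.1, E=5, B=10):
    """FedAvg as summarized on this slide: sample clients, train locally, aggregate."""
    w = init_model                                   # 1. random init on the server
    K = len(clients_X)
    for t in range(num_rounds):                      # 2. for each round t
        chosen = select_clients(K, C)                #    i.  choose a random client set
        locals_, n_k = [], []
        for k in chosen:                             #    ii. local gradient descent
            locals_.append(client_update(w, clients_X[k], clients_Y[k],
                                         grad_on_batch, eta=eta, E=E, B=B, seed=t))
            n_k.append(len(clients_X[k]))
        w = server_aggregate(locals_, n_k)           #    iii. aggregate the parameters
    return w
```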

Federated learning – Evaluation

● #clients: 100
● Dataset: MNIST
   ○ IID: random partition
   ○ Non-IID: each client only contains two digits
   ○ Balanced

[Table: #rounds required by FedSGD and FedAvg to achieve a target accuracy on the test dataset, under IID and non-IID partitions; rows include a single-client baseline (1 client) and various client fractions C]

Image classification

Impact of varying C

In general, the higher C, the fewer rounds needed to reach the target accuracy.

Federated learning – Evaluation

● Dataset from: The Complete Works of Shakespeare
   ○ #clients: 1146, each corresponding to a speaking role
   ○ Unbalanced: different #lines for each role
   ○ Train-test split ratio: 80% / 20%
   ○ A balanced and IID dataset with 1146 clients is also constructed

● Task: next character prediction

● Model: character-level LSTM language model

Language modeling

Federated learning – Evaluation

● The effect of increasing computation in each round (decrease B / increase E)

● Fix C=0.1

Image classification

In general, the more computation in each round, the faster the model trains. FedAvg also converges to a higher test accuracy (B=10, E=20).

Federated learning – Evaluation Language modeling

● The effect of increasing computation in each round (decrease B / increase E)

● Fix C=0.1

In general, the more computation in each round, the faster the model trains. FedAvg also converges to a higher test accuracy (B=10, E=5).

Federated learning – Evaluation

● What if we maximize the computation on each client? E → ∞

The best performance may be reached in earlier rounds; adding more rounds does not improve it.

Best practice: decay the amount of local computation (smaller E or larger B) as the model gets close to convergence.

Federated learning – Evaluation

● #clients: 100
● Dataset: CIFAR-10
   ○ IID: random partition
   ○ Non-IID: each client only contains two classes
   ○ Balanced

Image classification

Federated learning – Evaluation

● Dataset from: 10 million public posts from a large social network
   ○ #clients: 500,000, each corresponding to an author

● Task: next word prediction

● Model: word-level LSTM language model

Language modeling: a real-world problem

200 clients per round; B=8, E=1

Outline

● Preliminary: deep learning and SGD

● Federated learning: FedSGD and FedAvg

● Related research in federated learning

● Open problems

Federated learning – related research

● Data poisoning attacks: How To Backdoor Federated Learning, arXiv:1807.00459.

● Secure aggregation: Practical Secure Aggregation for Privacy-Preserving Machine Learning, CCS '17.

● Client-level differential privacy: Differentially Private Federated Learning: A Client-Level Perspective, ICLR '19.

● Decentralizing the central server via blockchain.

Google FL Workshop: https://sites.google.com/view/federated-learning-2019/home

HiveMind: Decentralized Federated Learning

[Diagram] Clients holding local data interact with a HiveMind smart contract running on the Oasis Blockchain Platform, in place of a central server. Building blocks: differential privacy, secure aggregation, model encryption.

In round number i…
1. The HiveMind smart contract distributes the global model M(i) to each client.
2. Each client computes gradient updates for M(i) on its local data and adds DP noise.
3. Secure aggregation combines the clients' noisy updates, so only the aggregate, carrying a single aggregated noise term, is revealed.
4. The result is a differentially private global model M(i+1).

Round number i+1 and continue…
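As a rough illustration only (not HiveMind's actual protocol), a sketch of the distributed-DP idea the diagram suggests: each client clips its update and adds Gaussian noise locally, and the aggregator only ever works with the weighted sum, in which the per-client noise terms combine into one aggregated noise term. The clipping norm and noise scale are arbitrary placeholders.

```python
import numpy as np

def noisy_client_update(update, clip_norm=1.0, sigma=0.1, rng=np.random.default_rng()):
    """Clip the local update and add Gaussian DP noise before it leaves the device."""
    scale = min(1.0, clip_norm / (np.linalg.norm(update) + 1e-12))
    return update * scale + rng.normal(0.0, sigma, size=update.shape)

def aggregate(noisy_updates, n_k):
    """Weighted average of the noisy updates; in a real deployment this sum would be
    computed under secure aggregation / encryption so individual updates stay hidden."""
    n = float(sum(n_k))
    return sum((nk / n) * u for nk, u in zip(n_k, noisy_updates))
```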

Outline

● Preliminary: deep learning and SGD

● Federated learning: FedSGD and FedAvg

● Related research in federated learning

● Open problems

Federated learning – open problems

● Detecting data poisoning attacks while secure aggregation is in use.

● Asynchronous model updates in federated learning, and their co-existence with secure aggregation.

● Further reducing communication overhead, e.g. through quantization.

● The use of differential privacy in each of the above settings.

● ……

Min Du
min.du@berkeley.edu

Thank you!