IEEE ICASSP 2020 Tutorial on Distributed and Efficient Deep Learning
Wojciech Samek (Fraunhofer HHI)
Felix Sattler (Fraunhofer HHI)

Outline of tutorial
1. Introduction (WS)
2. Model Compression & Efficient Deep Learning (WS)
3. (Virtual) Coffee Break
4. Distributed & Federated Learning (FS)
Part III: Distributed & Federated Learning (Wojciech Samek & Felix Sattler)
Distributed Learning
[Figure: several clients, each holding local data, connected to a central server]
Traditional Centralized (“Cloud”) ML
[Figure: the clients send their local data to the central server, which trains the model]
→ Data is gathered centrally
Problems:
– Privacy
– Ownership (→ who owns the data?)
– Security (→ single point of failure)
– Efficiency (→ need to move data around)
Distributed (/ “Embedded”) ML
[Figure: each client trains the model locally on its own data; only model updates are exchanged with the server]
Problems:
– Privacy
– Ownership (→ who owns the data?)
– Security (→ single point of failure)
– Efficiency (→ need to move data around)
→ Data never leaves the local devices
→ Instead, model updates are communicated
Distributed (/ “Embedded”) ML: Settings
Detailed Comparison: Sattler, Wiegand, Samek. "Trends and Advancements in Deep Neural Network Communication."
Federated Learning
“Federated Learning is a machine learning setting where multiple entities collaborate in solving a learning problem, without directly exchanging data. The federated training process is coordinated by a central server.”
Kairouz, Peter, et al. "Advances and open problems in federated learning."
Federated Learning
[Figure sequence: (1) the server distributes the current model to the clients, each holding local data; (2) every client runs SGD locally on its own data; (3) the server averages the received client models; (4) the updated model is sent back to the clients and the process repeats]
Federated Learning - Settings
Cross-Device
– Large number of clients
– Only a fraction of the clients is available at any given time
– Few data points per client
– Limited computational resources
Cross-Silo
– Small number of clients
– Clients are always available
– Large local data sets
– Strong computational resources
Federated Learning - Challenges
Challenges in Federated Learning: Communication, Convergence, Heterogeneity, Privacy, Robustness, Personalization
Federated Learning - Communication
[Figure: in every communication round the clients download the current model from the server and upload their local updates]
Federated Learning - Communication
Total Communication = [#Communication Rounds] x [#Parameters] x [Avg. Codeword length]
Case Study: VGG16 on ImageNet
– Number of iterations until convergence: 900,000
– Number of parameters: 138,000,000
– Bits per parameter: 32
→ Total communication = 496.8 terabytes (upload + download)
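The 496.8 TB figure follows directly from the formula above; a quick back-of-the-envelope check (assuming decimal terabytes, i.e. 10^12 bytes, and one full 32-bit model transfer per iteration):

```python
# Back-of-the-envelope check of the VGG16/ImageNet case study above.
# Assumes every iteration transmits the full parameter vector once at 32 bit/parameter
# and uses decimal units (1 TB = 1e12 bytes).
iterations = 900_000          # communication rounds until convergence
parameters = 138_000_000      # VGG16 parameter count
bits_per_param = 32

total_bits = iterations * parameters * bits_per_param
total_terabytes = total_bits / 8 / 1e12
print(f"Total communication: {total_terabytes:.1f} TB")   # -> 496.8 TB
```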
Federated Learning – Compression Methods
Total Communication = [#Communication Rounds] x [#Parameters] x [Avg. Codeword length]
Compression Methods
– Communication Delay
– Lossy Compression: Unbiased
– Lossy Compression: Biased
– Efficient Encoding
Communication Delay
Distributed SGD:
For t = 1, ..., [Communication Rounds]:
    For i = 1, ..., [Participating Clients]:
        Client does: compute a stochastic gradient g_i of its local loss at the current model θ and send it to the server
    Server does: average the gradients and update the model, θ ← θ − η · (1/n) Σ_i g_i

Federated Averaging:
For t = 1, ..., [Communication Rounds]:
    For i = 1, ..., [Participating Clients]:
        Client does: run several local SGD steps on its own data, starting from θ, and send the resulting model θ_i to the server
    Server does: average the client models, θ ← (1/n) Σ_i θ_i
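To make the difference between the two loops concrete, here is a minimal, self-contained sketch on a toy least-squares problem (the synthetic data, step sizes and variable names are illustrative choices, not taken from the tutorial):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative): 3 clients, each with local linear-regression data.
d, n_clients = 10, 3
w_true = rng.normal(size=d)
clients = []
for _ in range(n_clients):
    X = rng.normal(size=(50, d))
    y = X @ w_true + 0.1 * rng.normal(size=50)
    clients.append((X, y))

def grad(w, X, y):
    """(Full-batch) gradient of the mean-squared error on one client's data."""
    return 2.0 * X.T @ (X @ w - y) / len(y)

def fedavg(rounds=50, local_steps=10, lr=0.05):
    """Federated Averaging: several local gradient steps, then average the models."""
    w = np.zeros(d)
    for _ in range(rounds):
        local_models = []
        for X, y in clients:
            w_i = w.copy()
            for _ in range(local_steps):          # local training, no communication
                w_i -= lr * grad(w_i, X, y)
            local_models.append(w_i)
        w = np.mean(local_models, axis=0)         # server averages the client models
    return w

def distributed_sgd(rounds=500, lr=0.05):
    """Distributed SGD: every gradient step requires one communication round."""
    w = np.zeros(d)
    for _ in range(rounds):
        g = np.mean([grad(w, X, y) for X, y in clients], axis=0)  # server averages gradients
        w -= lr * g
    return w

print("FedAvg error: ", np.linalg.norm(fedavg() - w_true))
print("DistSGD error:", np.linalg.norm(distributed_sgd() - w_true))
```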
Communication Delay
Advantages:
● Simple
● Reduces Communication Frequency (advantageous in on-device FL)
● Reduces both Upstream and Downstream communication
● Easy to integrate with Privacy mechanisms
Disadvantages: (discussed below)
Communication Delay
Convergence analysis for convex objectives (under the IID assumption). The bound depends on:
– the number of clients participating per round
– the total number of communication rounds
– the number of local iterations per round
– the Lipschitz parameter of the loss function
– the bound on the variance of the stochastic gradients
Lan, Guanghui. "An optimal method for stochastic composite optimization."
Statistical Heterogeneity
Convergence speed drastically decreases with increasing heterogeneity in the data.
→ This effect is aggravated if the number of participating clients (the “reporting fraction”) is low.
Hsu, Tzu-Ming Harry, Hang Qi, and Matthew Brown. "Measuring the effects of non-identical data distribution for federated visual classification."
[Figure: accuracy as a function of the Dirichlet parameter that controls the degree of non-IID-ness of the client data]
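The Dirichlet parameter on the x-axis of that experiment controls how non-identical the client label distributions are. A small sketch of how such a split is typically generated (function and variable names are illustrative; the exact protocol of Hsu et al. may differ in detail):

```python
import numpy as np

def dirichlet_split(labels, n_clients, alpha, rng=np.random.default_rng(0)):
    """Partition sample indices so that each client's label distribution
    is drawn from Dirichlet(alpha); small alpha -> highly non-IID clients."""
    n_classes = labels.max() + 1
    client_indices = [[] for _ in range(n_clients)]
    for c in range(n_classes):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        splits = np.split(idx, (np.cumsum(proportions)[:-1] * len(idx)).astype(int))
        for i, s in enumerate(splits):
            client_indices[i].extend(s.tolist())
    return client_indices

# Example: 10 classes, 1000 samples, 5 clients, strongly non-IID (alpha = 0.1)
labels = np.random.default_rng(1).integers(0, 10, size=1000)
parts = dirichlet_split(labels, n_clients=5, alpha=0.1)
print([len(p) for p in parts])
```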
Communication Delay
Kairouz, Peter, et al. "Advances and open problems in federated learning."
Communication Delay
Advantages:
● Simple
● Reduces Communication Frequency (more practical in on-device FL)
● Reduces Upstream + Downstream communication
● Easy to integrate with Privacy mechanisms
Disadvantages:
● Bad performance on non-iid data
● Low sample efficiency
Federated Learning – Compression Methods
Total Communication = [#Communication Rounds] x [#Parameters] x [Avg. Codeword length]
Compression Methods
– Communication Delay
– Lossy Compression: Unbiased
– Lossy Compression: Biased
– Efficient Encoding
Update Compression
Distributed SGD:
For t = 1, ..., [Communication Rounds]:
    For i = 1, ..., [Participating Clients]:
        Client does: compute a stochastic gradient g_i and send it to the server
    Server does: θ ← θ − η · (1/n) Σ_i g_i

Distributed SGD with Compression:
For t = 1, ..., [Communication Rounds]:
    For i = 1, ..., [Participating Clients]:
        Client does: compute a stochastic gradient g_i and send the compressed update C(g_i) to the server
    Server does: θ ← θ − η · (1/n) Σ_i C(g_i)
Update Compression
[Figure: examples of compression operators, grouped into unbiased and biased methods]
Unbiased Compression
Definition: A compression operator C is called unbiased iff E[C(x)] = x.
Pros:
➔ “Straightforward” convergence analysis (the compressed gradients are still unbiased stochastic gradients, only with an increased variance of the gradient estimator)
➔ Variance reduction through averaging across clients (uncorrelated noise)
Strongly convex bound: the dominant term of the convergence bound scales with the variance of the gradient estimator.
Unbiased Compression
● Pros: “Straightforward” convergence analysis (stochastic gradients with increased variance)
● Cons: The variance blow-up leads to poor empirical performance
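As a concrete example of an unbiased compressor, here is a sketch of stochastic (random) rounding to a uniform grid, in the spirit of QSGD-style quantizers; the particular operator and parameters are illustrative, not the ones on the slide:

```python
import numpy as np

def stochastic_quantize(x, levels=4, rng=np.random.default_rng(0)):
    """Unbiased stochastic quantization: round each coordinate of |x| / ||x||
    randomly up or down to one of `levels` uniform levels.
    E[Q(x)] = x, but the variance of the estimate grows as `levels` shrinks."""
    norm = np.linalg.norm(x)
    if norm == 0:
        return x
    scaled = np.abs(x) / norm * levels
    lower = np.floor(scaled)
    prob_up = scaled - lower                       # rounding up with this prob keeps the mean exact
    quantized = lower + (rng.random(x.shape) < prob_up)
    return np.sign(x) * quantized * norm / levels

x = np.random.default_rng(1).normal(size=5)
avg = np.mean([stochastic_quantize(x, rng=np.random.default_rng(s)) for s in range(5000)], axis=0)
print(x)
print(avg)    # close to x: the quantizer is unbiased
```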
Biased Compression
Definition: A compression operator C is called biased iff E[C(x)] ≠ x.
Biased compression methods do not necessarily converge!
→ They can be turned into convergent methods via error accumulation.
Karimireddy, et al. "Error Feedback Fixes SignSGD and other Gradient Compression Schemes."
Stich, Cordonnier, Jaggi. "Sparsified SGD with memory."
Error Accumulation
Distributed SGD:
For t = 1, ..., [Communication Rounds]:
    For i = 1, ..., [Participating Clients]:
        Client does: compute a stochastic gradient g_i and send it to the server
    Server does: θ ← θ − η · (1/n) Σ_i g_i

Distributed SGD with Error Accumulation:
For t = 1, ..., [Communication Rounds]:
    For i = 1, ..., [Participating Clients]:
        Client does: form the corrected update a_i = η·g_i + r_i (accumulated residual), send the compressed update C(a_i), and keep the new residual r_i = a_i − C(a_i)
    Server does: θ ← θ − (1/n) Σ_i C(a_i)
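A minimal sketch of the error-accumulation (error-feedback) idea, using top-k sparsification as the biased compressor on a toy quadratic objective; the operator, step size, and objective are illustrative assumptions:

```python
import numpy as np

def top_k(x, k):
    """Biased compressor: keep the k largest-magnitude entries, zero the rest."""
    out = np.zeros_like(x)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    out[idx] = x[idx]
    return out

def sgd_with_error_feedback(grad_fn, w, steps=200, lr=0.1, k=2):
    residual = np.zeros_like(w)           # accumulated compression error ("memory")
    for _ in range(steps):
        g = grad_fn(w)
        corrected = lr * g + residual     # add back what was dropped previously
        update = top_k(corrected, k)      # transmit only the compressed update
        residual = corrected - update     # remember what was left out
        w = w - update
    return w

# Toy quadratic objective f(w) = 0.5 * ||w - w_star||^2 (illustrative)
w_star = np.array([1.0, -2.0, 3.0, 0.5, -1.5])
w = sgd_with_error_feedback(lambda w: w - w_star, np.zeros(5))
print(w)   # approaches w_star despite the heavily biased compression
```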
Error Accumulation
For a parameter α ∈ (0, 1], an α-contraction operator is a (possibly randomized) operator C that satisfies the contraction property E‖C(x) − x‖² ≤ (1 − α)·‖x‖².
Theorem (Stich et al.): For any contraction operator, compressed SGD with error accumulation achieves, for large T, the same asymptotic convergence rate O(1/T) as uncompressed SGD on μ-strongly convex objective functions.
→ The asymptotic rate is independent of α!
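For intuition: top-k sparsification (as in the sketch above) satisfies this contraction property with α = k/d, even deterministically, which the following illustrative check confirms on random vectors:

```python
import numpy as np

d, k = 100, 10
rng = np.random.default_rng(0)
for _ in range(5):
    x = rng.normal(size=d)
    compressed = np.zeros(d)
    idx = np.argpartition(np.abs(x), -k)[-k:]
    compressed[idx] = x[idx]
    lhs = np.linalg.norm(compressed - x) ** 2
    rhs = (1 - k / d) * np.linalg.norm(x) ** 2
    print(lhs <= rhs)   # True: top-k is a (k/d)-contraction operator
```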
Federated Learning – Recap Compression
Unbiased methods: TernGrad, QSGD, Atomo. Convergence proofs: via the bounded-variance assumption.
Biased methods: Gradient Dropping, Deep Gradient Compression, signSGD, PowerSGD. Convergence proofs: via the k-contraction framework (Stich et al. 2018).
Combining Methods: Sparse Binary Compression
Sattler, et al. "Sparse binary compression: Towards distributed deep learning with minimal communication." 2019 International Joint Conference on Neural Networks (IJCNN).
Sparse Binary Compression
Sparse Binary Compression for non-iid Data
Federated Learning – Combining Methods
Sattler, Wiedemann, Müller, Samek. "Robust and communication-efficient federated learning from non-iid data." IEEE TNNLS (2019).
Efficient Encoding – DeepCABAC
● CABAC (context-adaptive binary arithmetic coding) is the best-performing encoder for quantized parameter tensors
● Plug & play
● Can be used as a final lossless compression stage for all compression methods presented in this tutorial
Wiedemann et al. "DeepCABAC: A Universal Compression Algorithm for Deep Neural Networks."
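To see why such a lossless entropy-coding stage shrinks the [Avg. Codeword length] factor, the following illustrative estimate uses the ideal entropy of a heavily sparsified, ternarized update as a stand-in (this is not DeepCABAC itself, just a lower-bound-style estimate):

```python
import numpy as np

# Illustrative only: estimate how many bits an ideal entropy coder (a stand-in
# for CABAC, not DeepCABAC itself) would need for a sparsified, ternarized update.
rng = np.random.default_rng(0)
update = rng.normal(size=100_000)
update[np.abs(update) < 2.0] = 0.0               # heavily sparsified update
symbols = np.sign(update)                        # ternary values in {-1, 0, +1}
_, counts = np.unique(symbols, return_counts=True)
p = counts / counts.sum()
entropy = -(p * np.log2(p)).sum()                # ideal bits per parameter
print(f"{entropy:.3f} bits/parameter vs. 32 bits/parameter uncompressed")
```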
Federated Learning - Challenges
Challenges in Federated Learning: Communication, Convergence, Heterogeneity, Privacy, Robustness, Personalization
[Recap figure: the challenges addressed so far are marked with check marks]
Federated Learning – Privacy
Hitaj, Briland, Giuseppe Ateniese, and Fernando Perez-Cruz. "Deep models under the GAN: information leakage from collaborative deep learning."
Federated Learning – Privacy
[Figure: the clients keep their data local, but the model updates exchanged with the server can leak information about that data]
Federated Learning – Privacy
Privacy Protection Mechanisms:
- Secure Multi-Party Computation
- Homomorphic Encryption
- Trusted Execution Environments
- Differential Privacy
Dwork, Cynthia, and Aaron Roth. "The algorithmic foundations of differential privacy."
Differential Privacy
A mechanism A is called differentially private with parameter ε iff
P[A(D) = t] ≤ e^ε · P[A(D′) = t]
for every outcome t and for any two data sets D and D′ which differ in only one element.
Differential Privacy – Mechanisms
Global sensitivity: the maximum amount Δf by which the query result f(D) can change when a single element of the data set is exchanged.
The Laplace mechanism, which releases f(D) + Lap(Δf/ε), is ε-differentially private.
[Figure: Laplace noise is added to the query result computed on the data before it is released]
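A minimal sketch of the Laplace mechanism for a counting query with sensitivity 1 (the query and the value of ε are illustrative):

```python
import numpy as np

def laplace_mechanism(query_result, sensitivity, epsilon, rng=np.random.default_rng(0)):
    """Release query_result with epsilon-differential privacy by adding
    Laplace noise with scale sensitivity / epsilon."""
    return query_result + rng.laplace(scale=sensitivity / epsilon)

# Illustrative counting query: "how many records satisfy some predicate?"
# Adding or removing one record changes the count by at most 1 -> sensitivity = 1.
true_count = 42
private_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(private_count)
```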
Differential Privacy – Post Processing
Data released via an ε-differentially private mechanism remains ε-differentially private under arbitrary post-processing!
[Figure: data → differentially private algorithm → privacy barrier → non-private post-processing]
Differential Privacy – Basic Composition
→ Privacy loss is additive!
After applying R algorithms with privacy parameters ε_1, ..., ε_R to the data, the total privacy loss is ε = ε_1 + ... + ε_R.
This is a worst-case analysis – better bounds can be found using more elaborate accounting mechanisms (e.g. the moments accountant).
Privacy and Communication
We need methods which are both communication-efficient and privacy-preserving!
Differential Privacy:
➔ adds artificial noise to the parameter updates to obfuscate them
Compression Methods:
➔ add quantization noise to the parameter updates to reduce the bitwidth
→ combine the two approaches!
Li, Tian, et al. "Privacy for Free: Communication-Efficient Learning with Differential Privacy Using Sketches."
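As a toy illustration of combining the two (note: this is not the sketching approach of Li et al., just clipping plus Laplace noise followed by quantization; by the post-processing property above, quantizing the noisy update does not weaken the privacy guarantee):

```python
import numpy as np

rng = np.random.default_rng(0)

def dp_quantized_update(update, clip_norm=1.0, epsilon=1.0, levels=16):
    """Illustrative combination: clip the update, add Laplace noise for
    differential privacy, then quantize the noisy update to few levels
    to reduce the bit-width. Not the sketching method of Li et al."""
    # 1) Clip the L1 norm to bound the sensitivity of the update
    norm = np.linalg.norm(update, ord=1)
    clipped = update * min(1.0, clip_norm / max(norm, 1e-12))
    # 2) Privatize: per-coordinate Laplace noise with scale sensitivity / epsilon
    noisy = clipped + rng.laplace(scale=clip_norm / epsilon, size=update.shape)
    # 3) Compress: uniform quantization to `levels` values per coordinate
    scale = np.max(np.abs(noisy))
    return np.round(noisy / scale * (levels // 2)) * scale / (levels // 2)

update = np.random.default_rng(1).normal(size=8)
print(dp_quantized_update(update))
```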
Federated Learning - Challenges
Challenges in Federated Learning: Communication, Convergence, Heterogeneity, Privacy, Robustness, Personalization
[Recap figure: the challenges addressed so far, now including privacy, are marked with check marks]
Federated Learning – Meta- and Multi-Task Learning
Federated Learning Environments are characterized by a high degree of statistical heterogeneity of the client data
→ In many situations, learning one single central model is suboptimal or even undesirable
Federated Learning – Meta- and Multi-Task Learning
Client data:
→ IID data: one model can be learned
→ non-IID data: no single model can fit the data of all clients
Federated Learning – Meta- and Multi-Task Learning
I like ...
I like ice cream.
I like cats.
I like Beyoncé.
A linear classifier can correctly separate the data of every single client, but not simultaneously for all clients
Clustered Federated Learning
[Figure: the client population, each client holding local data, is grouped by the server into clusters of similar data distributions]
→ Clustered Federated Learning groups the client population into clusters with jointly trainable data distributions and trains a separate model for every cluster
How to identify the clusters?
Clustered Federated Learning
[Figure: the same client population, now partitioned by the server into clusters]
How to identify the clusters?
→ via the model updates!
Clustered Federated Learning
[Figure: pairwise cosine similarity of the client updates (color scale from -1 to 1) over the communication rounds]
At every stationary solution of the Federated Learning objective, the angle between the parameter updates of the different clients is highly indicative of their distribution similarity!
Clustered Federated Learning - Algorithm
1.) Run Federated Learning until convergence to a stationary solution
2.) Compute the pairwise cosine similarity between the latest parameter updates from all clients
3.) If there exists a client whose local empirical risk is not sufficiently minimized by the federated learning solution ...
4.) … then bi-partition the client population into two groups of minimal pairwise similarity (see the sketch after this list)
5.) Repeat everything for the two groups, starting from 1.)
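A minimal sketch of the similarity-based bi-partitioning used in steps 2.) and 4.) above, assuming the latest client updates are available as vectors; complete-linkage clustering on 1 − cosine similarity is one simple way to realize the split and may differ in detail from the paper's procedure:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def bipartition_by_update_similarity(updates):
    """Split clients into two groups based on the pairwise cosine similarity
    of their latest parameter updates (steps 2 and 4 of the algorithm above)."""
    U = np.stack([u / np.linalg.norm(u) for u in updates])
    similarity = U @ U.T                      # pairwise cosine similarities in [-1, 1]
    distance = 1.0 - similarity
    np.fill_diagonal(distance, 0.0)
    Z = linkage(squareform(distance, checks=False), method="complete")
    labels = fcluster(Z, t=2, criterion="maxclust")
    return [np.flatnonzero(labels == c) for c in (1, 2)]

# Illustrative example: two latent groups of clients with different update directions
rng = np.random.default_rng(0)
base_a, base_b = rng.normal(size=20), rng.normal(size=20)
updates = [base_a + 0.1 * rng.normal(size=20) for _ in range(4)] + \
          [base_b + 0.1 * rng.normal(size=20) for _ in range(4)]
print(bipartition_by_update_similarity(updates))   # -> first four vs. last four clients
```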
Clustered Federated Learning – Clustering Guarantees
[Theorem (clustering guarantee): a separation condition under which the proposed mechanism will correctly separate the clients; the condition depends on the number of clusters]
Federated Learning – Clustered Federated Learning
Clustered Federated Learning
[Slide: definition of a cluster-quality measure and the splitting criterion derived from it]
Clustered Federated Learning
– Few data points are sufficient to obtain a correct clustering
– Few communication rounds are sufficient to obtain a correct clustering
– Works also with weight updates
Clustered Federated Learning
[Figure: the client population is recursively bi-partitioned over the course of training: Federated Learning, 1st split, 2nd split, 3rd split]
Clustered Federated Learning
1) Federated Learning has converged to a stationary solution
2) After the 1st split, accuracy drastically increases for the group of clients that was separated out
3) After the third round of splitting, g(a) has dropped below zero for all remaining clusters
Sattler, Müller, Samek. "Clustered Federated Learning: Model-Agnostic Distributed Multi-Task Optimization under Privacy Constraints."
Federated Learning – Clustered Federated Learning
Sattler, Müller, Samek. “On the Byzantine Robustness of Clustered Federated Learning” (ICASSP 2020)
Federated Learning - Challenges
Challenges in Federated Learning: Communication, Convergence, Heterogeneity, Privacy, Robustness, Personalization
[Recap figure: check marks indicate the challenges addressed over the course of this tutorial]
Neural Network Compression
References
S Wiedemann, H Kirchhoffer, S Matlage, P Haase, A Marban, T Marinc, D Neumann, T Nguyen, A Osman, H Schwarz, D Marpe, T Wiegand, W Samek. DeepCABAC: A Universal Compression Algorithm for Deep Neural Networks. IEEE Journal of Selected Topics in Signal Processing, 14(3):1-15, 2020. http://dx.doi.org/10.1109/JSTSP.2020.2969554
S Yeom, P Seegerer, S Lapuschkin, S Wiedemann, KR Müller, W Samek. Pruning by Explaining: A Novel Criterion for Deep Neural Network Pruning. arXiv:1912.08881, 2019. https://arxiv.org/abs/1912.08881
S Wiedemann, H Kirchhoffer, S Matlage, P Haase, A Marban, T Marinc, D Neumann, A Osman, D Marpe, H Schwarz, T Wiegand, W Samek. DeepCABAC: Context-adaptive binary arithmetic coding for deep neural network compression. Joint ICML'19 Workshop on On-Device Machine Learning & Compact Deep Neural Network Representations (ODML-CDNNR), 1-4, 2019. *** Best paper award *** https://arxiv.org/abs/1905.08318
S Wiedemann, A Marban, KR Müller, W Samek. Entropy-Constrained Training of Deep Neural Networks. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), 1-8, 2019. http://dx.doi.org/10.1109/IJCNN.2019.8852119
Federated Learning
References
F Sattler, T Wiegand, W Samek. Trends and Advancements in Deep Neural Network Communication. ITU Journal: ICT Discoveries, 2020. https://arxiv.org/abs/2003.03320
F Sattler, KR Müller, W Samek. Clustered Federated Learning: Model-Agnostic Distributed Multi-Task Optimization under Privacy Constraints. arXiv:1910.01991, 2019. https://arxiv.org/abs/1910.01991
F Sattler, S Wiedemann, KR Müller, W Samek. Robust and Communication-Efficient Federated Learning from Non-IID Data. IEEE Transactions on Neural Networks and Learning Systems, 2019. http://dx.doi.org/10.1109/TNNLS.2019.2944481
F Sattler, KR Müller, W Samek. Clustered Federated Learning. Proceedings of the NeurIPS'19 Workshop on Federated Learning for Data Privacy and Confidentiality, 1-5, 2019.
Federated Learning
References
F Sattler, KR Müller, T Wiegand, W Samek. On the Byzantine Robustness of Clustered Federated Learning. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 8861-8865, 2020. http://dx.doi.org/10.1109/ICASSP40776.2020.9054676
F Sattler, S Wiedemann, KR Müller, W Samek. Sparse Binary Compression: Towards Distributed Deep Learning with minimal Communication. Proceedings of the IEEE International Joint Conference on Neural Networks (IJCNN), 1-8, 2019. http://dx.doi.org/10.1109/IJCNN.2019.8852172
Slides and Papers available at