Transcript: Towards Scalable Machine Learning - Janis Keuper (tu-dresden.de)
Towards Scalable Machine Learning
Janis Keuper
itwm.fraunhofer.de/ml
Competence Center High Performance Computing, Fraunhofer ITWM, Kaiserslautern, Germany
Fraunhofer Center Machine Learning
Outline
I Introduction / Definitions
II Is Machine Learning a HPC Problem?
III Case Study: Scaling the Training of Deep Neural Networks
IV Towards Scalable ML Solutions [Current Projects]
V A look at the (near) future of ML Problems and HPC
Machine Learning @CC-HPC
Scalable distributed ML algorithms:
● Distributed Optimization Methods
● Communication Protocols
● Distributed DL Frameworks

"Automatic" ML:
● DL Meta-Parameter Learning
● DL Topology Learning

HPC systems for scalable ML:
● Distributed I/O
● Novel ML Hardware
● Low Cost ML Systems

DL methods:
● Semi- and Unsupervised DL
● Generative Models
● ND CNNs

Industry applications:
● DL Software Optimization for Hardware / Clusters
● DL for Seismic Analysis
● DL for Chemical Reaction Prediction
● DL for Autonomous Driving
I Introduction
I Setting the Stage | Definitions
Scalable ML vs Large Scale ML
Scalable ML:
● Large model size (implies large data as well)
● Extreme compute effort
● Goals:
  ● Larger models
  ● (Linear) strong and weak scaling through (distributed) parallelization
→ HPC

Large Scale ML:
● Very large data sets (online streams)
● "Normal" model size and compute effort (traditional ML methods)
● Goals:
  ● Make training feasible
  ● Often online training
→ Big Data
Scaling DNNs
[Diagram: DNN with Layer 1 … Layer n]
Simple strategy in DL: if it does not work, scale it!

Scaling in two dimensions:
1. Add more layers = more matrix multiplications, more convolutions
2. Add more units = larger matrix multiplications, more convolutions

Don't forget: in both cases, MORE DATA! → more iterations
Scaling DNNs
Network of Networks
137 billion free parameters !
II Is Scalable ML a HPC Problem?
● In terms of compute needed (YES)
● Typical HPC problem setting: communication bound = non-trivial parallelization (YES)
● I/O bound (new to HPC)
Success in Deep Learning is driven by compute power:
https://blog.openai.com/ai-and-compute/
→ #FLOPs needed to train the leading model is doubling every ~3.5 months!
→ Increase since 2012: a factor of ~300,000!
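The doubling arithmetic behind these figures is easy to check; a small illustrative calculation (not from the talk):

```python
def compute_growth_factor(months, doubling_time_months=3.5):
    """Growth factor of training compute over `months`, given a fixed
    doubling time (3.5 months, per the OpenAI analysis cited above)."""
    return 2.0 ** (months / doubling_time_months)

# A factor of ~300,000 corresponds to roughly 64 months of 3.5-month
# doublings, i.e. a bit over five years, consistent with a 2012 baseline.
factor = compute_growth_factor(64)
```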
Impact on HPC (Systems)
● New HPC Systems
● Like ORNL "Summit"
  ● POWER9
  ● ~30k NVIDIA Volta GPUs
  ● New storage hierarchies
● New users (= new demands)
● Still limited resources
https://www.nextplatform.com/2018/03/28/a-first-look-at-summit-supercomputer-application-performance/
III Case Study: Training DNNs
I Overview: distributed parallel training of DNNs
II Limits of Scalability
Limitation I: Communication Bounds
Limitation II: Skinny Matrix Multiplication
Limitation III: Data I/O
Based on our paper from SC 16
Deep Neural Networks in a Nutshell
[Diagram: example DNN graph with six layers]
At an abstract level, DNNs are:
● Directed (acyclic) graphs
● Nodes are compute entities (= layers)
● Edges define the data flow through the graph
Inference / Training
Forward feed of data through the network
Deep Neural Networks in a Nutshell

Inference / Training: forward feed of data through the network

[Diagram: DNN with Layer 1 … Layer n]
Common intuition
Training Deep Neural Networks: The Underlying Optimization Problem

Minimize the loss function via gradient descent (high-dimensional and NON-CONVEX!)

Computed via the backpropagation algorithm:
1. Feed forward and compute the activations
2. Compute the error by layer
3. Compute the derivatives by layer
Optimization Problem
[Diagram: DNN with Layer 1 … Layer n, forward/backward pass]
By Stochastic Gradient Descent (SGD)
1. Initialize weights W at random
2. Take a small random subset X (= batch) of the training data
3. Run X through the network (forward feed)
4. Compute the loss
5. Compute the gradient
6. Propagate backwards through the network
7. Update W

Repeat 2–7 until convergence.
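The steps above can be sketched in a few lines of NumPy; this is a toy illustration in which a noiseless linear least-squares model stands in for a DNN (all names and constants here are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: a noiseless linear model stands in for a deep network.
X_train = rng.normal(size=(1000, 10))
true_w = rng.normal(size=10)
y_train = X_train @ true_w

w = rng.normal(size=10)              # 1. initialize weights at random
lr, batch_size = 0.05, 32

for step in range(1000):
    idx = rng.choice(len(X_train), batch_size, replace=False)  # 2. random batch
    X, y = X_train[idx], y_train[idx]
    pred = X @ w                                  # 3. forward feed
    loss = np.mean((pred - y) ** 2)               # 4. compute loss
    grad = 2 * X.T @ (pred - y) / batch_size      # 5.+6. gradient (closed form here)
    w -= lr * grad                                # 7. update W
```

For a real DNN, steps 5 and 6 are the backpropagation pass rather than a closed-form expression.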
Parallelization: Common approaches to parallelize SGD for DL
Parallelization of SGD is very hard: it is an inherently sequential algorithm
1. Start at some state t (a point in a billion-dimensional space)
2. Introduce t to data batch d1
3. Compute an update (based on the objective function)
4. Apply the update → t+1
How to gain speedup?
● Make faster updates
● Make larger updates

[Diagram: state t → state t+1 → state t+2, driven by batches d1, d2, d3]
Parallelization: Common approaches to parallelize SGD for DL
[Diagram: DNN with Layer 1 … Layer n, forward/backward pass]

Internal parallelization = parallel execution of the layer operation
● Mostly dense matrix multiplication:
  ● Standard BLAS sgemm
  ● MKL, OpenBLAS
  ● cuBLAS on GPUs
● Task parallelization for special layers
  ● Cuda-CNN for fast convolutions
Parallelization: Common approaches to parallelize SGD for DL

External: data parallelization over the data batch

[Diagram: master node and worker replicas of the network]

1. Split batch and send to workers
Parallelization: Common approaches to parallelize SGD for DL

External: data parallelization over the data batch

1. Master splits the batch and sends the shards to the workers
2. Workers compute forward + backward passes and send their gradients to the master
3. Master combines the gradients, computes the update of W, and sends the new W' to the workers
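The three steps can be simulated in NumPy on one machine; again a linear model stands in for the network, and the "communication" is just array passing (an illustrative sketch, not the CaffeGPI implementation):

```python
import numpy as np

def data_parallel_step(w, X_batch, y_batch, n_workers, lr):
    """One synchronous data-parallel SGD step, simulated locally.
    A linear least-squares model stands in for the DNN."""
    # 1. Master splits the batch and "sends" one shard to each worker.
    X_shards = np.array_split(X_batch, n_workers)
    y_shards = np.array_split(y_batch, n_workers)

    # 2. Each worker runs forward + backward on its shard -> local gradient.
    grads = [2 * X.T @ (X @ w - y) / len(X)
             for X, y in zip(X_shards, y_shards)]

    # 3. Master averages the gradients, updates W, "sends" W' back.
    return w - lr * np.mean(grads, axis=0)
```

With equal-sized shards, the averaged gradient is identical to the full-batch gradient, which is why this scheme is mathematically equivalent to sequential SGD on the whole batch.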
Speedup
Limitation I: Distributed SGD is heavily Communication Bound

Gradients have the same size as the model:
● Model size can be hundreds of MB
● Iteration time (GPU) < 1 s
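A back-of-envelope estimate shows why this bounds scaling; the model size and link bandwidth below are assumed, illustrative numbers:

```python
def gradient_transfer_time(n_params, bandwidth_gb_s, bytes_per_param=4):
    """Seconds to ship one full fp32 gradient over a link of the given
    bandwidth. Back-of-envelope only: real frameworks overlap transfers
    with compute and compress the gradients."""
    return n_params * bytes_per_param / (bandwidth_gb_s * 1e9)

# Assumed example: an AlexNet-scale model (~60M parameters) over a
# ~1.25 GB/s link. The gradient alone is ~240 MB, so each exchange costs
# ~0.2 s, a large fraction of a sub-second GPU iteration.
t_exchange = gradient_transfer_time(60_000_000, 1.25)
```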
Communication Bound: Experimental Evaluation
Solving the Communication Bottleneck | Solutions in Hardware: Example NVLink
NVIDIA DGX-1 / HGX-1
● 8x P100
● DGX-1: NVLink between all GPUs
NVLink spec: ~40GB/s (single direction)NVLink II (Volta): ~75GB/s (single direction)
PCIe v3 16x: ~15GB/s
Diagram DGX-1 by NVIDIA
Solving the Communication Bottleneck | Solutions in Hardware: NVLink Benchmark
IBM Minsky
● 4x P100
● 2x 10-core POWER8 (160 hardware threads)
● NVLink between all components
NVLink spec: ~40GB/s (single direction)
[Chart: Minsky deep learning benchmark: AlexNet with NVIDIA-Caffe on ImageNet data (batch size 1024); speedup vs. #GPUs, comparing linear scaling, the theoretical speedup limit, the actual speedup, and a PCIe 4x K80 baseline]
Limitation I: Distributed SGD is heavily Communication Bound
How to solve this in distributed environments?
Limitation II: Mathematical Problems, aka the “Large Batch Size Problem”
Recall: speedup comes from pure data-parallelism
→ splitting the batch over the workers
Problem: the per-worker batch size decreases with distributed scaling

Hard theoretical limit: b > 0
→ GoogLeNet: no scaling beyond 32 nodes
→ AlexNet: limit at 256 nodes

External parallelization hurts the internal (BLAS / cuBLAS) parallelization even earlier.

In a nutshell: for skinny matrices there is simply not enough work for efficient internal parallelization over many threads.
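Concretely, a fully connected layer is one dense GEMM whose row count is the per-worker batch size; the shapes below (assumed, illustrative sizes) show how data-parallel splitting produces skinny matrices:

```python
import numpy as np

# A fully connected layer is one dense GEMM: (batch, n_in) @ (n_in, n_out).
# Sizes are assumed, illustrative values.
n_in, n_out, batch, n_workers = 4096, 4096, 256, 32

W = np.zeros((n_in, n_out), dtype=np.float32)   # layer weights
X = np.zeros((batch, n_in), dtype=np.float32)   # full batch
Y = X @ W                                       # full-batch GEMM: 256 rows of work

# Data parallelism shrinks only the batch dimension:
per_worker = batch // n_workers                 # just 8 rows per worker
Y_shard = X[:per_worker] @ W                    # skinny GEMM: too little work
                                                # to keep many BLAS threads busy
```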
The “Small Batch Size Problem”: Data Parallelization over the Batch Size
[Chart: speedup of MKL SGEMM vs. batch size (256 down to 1), against linear scaling]

Computing fully connected layers: a single dense matrix multiplication
Experimental Evaluation: Increasing the Batch Size

[Chart: accuracy vs. iteration for batch sizes 256, 512, 1024, 2048; AlexNet on ImageNet]

Solution proposed in the literature: increase the batch size.

But: we get a linear speedup against the original problem only if we can reduce the number of iterations accordingly, and this leads to a loss of accuracy.
The “Large Batch” Problem: Central Questions

[Chart: accuracy vs. iteration for batch sizes 256, 512, 1024, 2048]

Why is this happening?
Is it dependent on the topology / other parameters?
How can it be solved?

→ Large-batch SGD would solve most scalability problems!
The “Large Batch” Problem: What is causing this effect? [Theoretical and not-so-theoretical explanations]

→ The “bad minimum”
→ Gradient variance / coupling of learning rate and batch size
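One widely cited heuristic for this coupling (not claimed by the slide) is linear learning-rate scaling with warmup; a minimal sketch:

```python
def scaled_learning_rate(base_lr, base_batch, batch, step, warmup_steps):
    """Linear learning-rate scaling with warmup: scale the learning rate
    with the batch size and ramp it up over the first warmup_steps
    updates. A common large-batch heuristic, not a guaranteed fix."""
    target = base_lr * batch / base_batch
    if step < warmup_steps:
        return target * (step + 1) / warmup_steps   # linear warmup
    return target
```

The intuition: averaging over a larger batch lowers the gradient variance, which permits (and requires) a proportionally larger step size, while the warmup avoids divergence in the early, unstable phase of training.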
The “bad minimum” theory: a larger batch causes a decrease in gradient variance, causing convergence to sharp local minima...

→ Empirical evaluation shows a high correlation between sharp minima and weak generalization.
Limitation II: Mathematical Problems, aka the “Large Batch Size Problem”
Problem not fully understood
No general solutions (yet) !
Do we need novel optimization methods ?!
Limitation III: Data I/O

Huge amounts of training data need to be streamed to the GPUs:
● Usually 50x – 100x the training data set
● Random sampling (!)
● Latency + bandwidth competing with the optimization communication
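A rough feed-rate estimate illustrates the bandwidth pressure; the throughput and image size below are assumed numbers:

```python
def required_read_bandwidth(images_per_sec_per_gpu, n_gpus, bytes_per_image):
    """Sustained read bandwidth (GB/s) needed to keep the GPUs fed.
    Assumed, illustrative numbers; real pipelines add decode and
    augmentation costs and may cache or pre-stage data."""
    return images_per_sec_per_gpu * n_gpus * bytes_per_image / 1e9

# e.g. 4 GPUs at ~1000 images/s each, ~150 KB per (compressed) image:
bw = required_read_bandwidth(1000, 4, 150_000)   # GB/s
```

Under these assumptions the sustained read rate already exceeds what a single SATA SSD delivers, and random sampling makes the access pattern far worse than this sequential estimate.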
Distributed I/O: Distributed File Systems are another Bottleneck!

● Network bandwidth is already exceeded by the SGD communication
● Worst possible file access pattern: access many small files at random

This problem already affects local multi-GPU computations: e.g. on DGX-1 or Minsky, a single SSD (~0.5 GB/s) is too slow to feed >= 4 GPUs.

→ Solution: RAID 0 with 4 SSDs
Distributed I/O

Distributed File Systems are another Bottleneck!

[Chart: compute time by layer vs. batch size (256 down to 1) for AlexNet (GPU + cuDNN); layer types: Split, SoftmaxWithLoss, ReLU, Pooling, LRN, InnerProduct, Dropout, Data, Convolution, Concat. Results shown for SINGLE-node access to a Lustre working directory (HPC cluster, FDR InfiniBand)]

[Chart: the same breakdown for a SINGLE node with the data on a local SSD]
IV Towards Scalable Solutions
Distributed parallel deep learning with HPC tools + mathematics
CaffeGPI: Distributed Synchronous SGD Parallelization with Asynchronous Communication Overlay

● Better scalability using asynchronous PGAS programming of optimization algorithms with GPI-2
● Direct RDMA access to main and GPU memory instead of message passing
● Optimized data layers for distributed file systems
CaffeGPI: Approaching DNN Communication
● Communication reduction tree
● Communication quantization
● Communication overlay
● Direct RDMA GPU→GPU
● Based on GPI
● Optimized distributed data layer
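One way communication quantization can work (the slide does not specify CaffeGPI's exact scheme) is 1-bit quantization with error feedback; a minimal NumPy sketch:

```python
import numpy as np

def one_bit_quantize(grad, error):
    """1-bit gradient quantization with error feedback (illustrative).
    Only the sign of each entry plus one scale factor is transmitted;
    the quantization error is kept locally and added to the next
    gradient so that no information is permanently lost."""
    g = grad + error                 # fold in the residual from last step
    scale = np.mean(np.abs(g))       # single float sent along with the signs
    q = np.sign(g) * scale           # what the receiver reconstructs
    return q, g - q                  # transmitted value, new local residual
```

This shrinks each gradient message from 32 bits to roughly 1 bit per parameter, directly attacking the communication bound discussed earlier.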
CC-HPC Current Projects
CaffeGPI: Open Source
https://arxiv.org/abs/1706.00095
https://github.com/cc-hpc-itwm/CaffeGPI
Projects: Low Cost Deep Learning Setup, Built on CaffeGPI
[Diagram: two workstations, each with 2x GTX 1080, a local SSD, and FDR InfiniBand, connected via CaffeGPI]
Price: < 8000 EUR, standard components

Specs:
● 32 GB GPU memory
● 64 GB PGAS memory
● 2 TB BeeGFS for training data
● GPU interconnect: PCIe
CaffeGPI: Benchmarks: Single Node
Specs:
4x K80 GPU (PCIe interconnect)
CUDA 8, cuDNN 5.1
Topology: AlexNet
Batch size (global): 1024
Low Cost Deep Learning Setup: Built on CaffeGPI

Specs: Topology: GoogLeNet, CUDA 8, cuDNN 5.1, Caffe-NV 16.4, batch size/node: 64

[Chart: time till convergence (training time in h) vs. #GPUs (1, 2, 4, 8), DGX-1 CaffeNV vs. ITWM CaffeGPI]

[Chart: scaling speedup vs. #GPUs for the same two systems; measured speedups include 1.00, 1.96, 3.80, 6.84 and 1.00, 1.73, 3.21]
CaffeGPI: Benchmarks

Specs:
1x K80 GPU (per node)
CUDA 8, cuDNN 5.1
Topology: GoogLeNet
Batch size: 64 per node

Distributed scaling of CaffeGPI (speedup vs. #nodes):
1 node: 1.00 | 2 nodes: 1.97 | 4 nodes: 3.97 | 8 nodes: 7.97 | 16 nodes: 15.93
New Project: Low Cost Deep Learning Cluster
GPU Cluster for Fraunhofer Consortium @ITWM
● Low-cost hardware
  ● Consumer GPUs
  ● Novel AMD architecture
  ● Hosting cost per GPU ~1.25k EUR (compared to ~10k for DGX-1)
● Fast interconnect and data I/O
  ● Parallel FS with local NVMe
● Open source multi-user management
  ● Reservation system
  ● Scheduling
  ● Custom containers
  ● Web interface
[Diagram: node with 2x GTX 1080 Ti, a local SSD, and 2x FDR InfiniBand]
Low Cost Deep Learning Setup | Currently building: Prototype 16-Node System

CC-HPC - Fraunhofer ITWM, Kaiserslautern

[Diagram: 16 compute nodes (N0–N15) on a 32-port FDR InfiniBand HPC switch; BeeGFS storage servers and a BeeGFS parameter server; management node, test + backup node, BeeGFS management server, LDAP server, proxy node with access control, SLURM server, monitoring server, external IPs, UPS; user access via the proxy]
Project Carme: An open source software stack for multi-user GPU clusters
Carme (/ˈkɑːrmiː/ KAR-mee; Greek: Κάρμη) is a moon of Jupiter, which also gives its name to a cluster of Jupiter moons (the Carme group).

Or, in our case: an open source framework to manage resources for multiple users running Jupyter notebooks on a cluster of compute nodes.
Project Carme: An open source software stack for multi-user GPU clusters

Common problems in GPU cluster operation:
● Interactive, secure multi-user environment
  ● ML and data science users want interactive GUI access to compute resources
● Resource management
  ● How to assign (GPU) resources to competing users?
● User management
  ● Accounting
  ● Job scheduling
  ● Resource reservation
● Data I/O
  ● Get user data to the compute nodes (I/O bottleneck)
● Maintenance
  ● Meet the (fast-changing and diverse) software demands of users
Project Carme: An open source software stack for multi-user GPU clusters
Carme core idea:
● Combine established open source ML and DS tools with HPC back-ends
● Use containers
  ● (For now) Docker
● Use Jupyter notebooks as the main web-based GUI front-end
  ● All web front-end (OS independent, no installation needed on the user side)
● Use HPC job management and scheduling
  ● SLURM
● Use HPC data I/O technology
  ● ITWM's BeeGFS
● Use HPC maintenance and monitoring tools
CC-HPC - Fraunhofer ITWM, Kaiserslautern
Project Carme: An open source software stack for multi-user GPU clusters
[Diagram: compute nodes N0 … Nn backed by BeeGFS storage and parameter servers; a proxy / management server runs the user DB, SLURM, container storage, the BeeGFS manager, and monitoring; users connect through the proxy]
HP-DLF
High Performance Deep Learning Framework
[Diagram: a compiler takes the DNN graph (imported from Caffe, TensorFlow, ...), a performance model, the hardware topology, and the supported hardware, and produces a Petri net plus a data model; a runtime system (workflow engine, scheduler) executes them with automatic data flow over a global address space]

● Scalable
● Transparent
● Generic
● Auto-parallel
● Elastic
● Automatic data flow
● Automatic hardware selection
● Portable
● Monitoring
● Simulation
● New optimization methods
Optimization for HP-DLF

Our asynchronous optimization algorithm:
● Sparse communication for multi-model optimization
● Lower demands on the communication bandwidth
● Superior convergence
V A Look at the (Near) Future
For HPC:
● New hardware accelerators
● New interconnects
● New architectures

From ML:
● Models are still growing!
● Learning to learn
Automatic Design of Deep Neural Networks
[Diagram: example DNN graph with six layers]
Basically, a graph optimization problem:
● Select node types
  ● And their meta-parameters
● Connect edges to define the data flow through the graph

Optimization target: minimize the test error

Problems:
● Huge and difficult search space
● Each iteration requires training a DNN
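A minimal stand-in for this search problem is random search over a toy (depth, width) space; `evaluate` is a hypothetical stub that, in reality, would train and test a full DNN per call:

```python
import random

def random_topology_search(evaluate, n_trials=50, seed=0):
    """Random search over a toy (depth, width) topology space, a minimal
    illustration of the graph optimization described above. `evaluate`
    returns a test error; in reality each call means training a full
    DNN, which is what makes this search so expensive."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(n_trials):
        topology = {"depth": rng.randint(2, 20),
                    "width": rng.choice([64, 128, 256, 512])}
        err = evaluate(topology)      # = train + test one candidate DNN
        if err < best_err:
            best, best_err = topology, err
    return best, best_err
```

Genetic algorithms and reinforcement learning replace the blind sampling here with guided proposals, but the costly inner loop (one full training run per candidate) stays the same.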
Automatic Design: Current Approaches: Reinforcement Learning

Evaluation on CIFAR-10: better than "state of the art" (hand-designed) performance
● After ~12,500 iterations
● Compute time: ~10,000 GPU days
DeToL: Deep Topology Learning
● Genetic algorithms
● Reinforcement learning
● Early stopping
● Graph embedding
● Meta-learning
● Pruning

On application-size problems

Data basis generation
Discussion