Increasing cluster throughput with Slurm and rCUDA
Federico Silla, Technical University of Valencia, Spain
Slurm User Group Meeting 2015, September 15th
Outline
1st: What is “rCUDA”?
Basics of GPU computing

Basic behavior of CUDA: the application runs on the CPU and offloads selected computations to the GPU.

[Figure: CUDA application on the CPU sending data and kernels to the local GPU and retrieving the results]
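The offload pattern just described can be made concrete with a minimal sketch (a generic CUDA illustration, not code from the talk): allocate device memory, copy the input to the GPU, launch a kernel, and copy the result back.

#include <cuda_runtime.h>
#include <stdlib.h>

// Kernel executed on the GPU: scales every element of the vector.
__global__ void scale(float *v, float f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= f;
}

int main(void) {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);
    float *h = (float *)calloc(n, sizeof(float));    // host (CPU) buffer
    float *d;
    cudaMalloc((void **)&d, bytes);                  // 1. allocate GPU memory
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice); // 2. copy input data to the GPU
    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n);     // 3. run the kernel on the GPU
    cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost); // 4. copy results back to the CPU
    cudaFree(d);
    free(h);
    return 0;
}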
rCUDA: remote GPU virtualization

A software technology that enables a more flexible use of GPUs in computing facilities: a node with no GPU can run CUDA applications on GPUs located in other nodes.

[Figure: a node with no GPU using a GPU installed in a remote node]
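In practice the CUDA application is left unmodified: the rCUDA client library stands in for the CUDA runtime and forwards the calls to the chosen server. A hedged sketch of how a client session is configured through environment variables, following the rCUDA user guide of that period (hostnames are invented; the exact syntax may vary between rCUDA versions):

export RCUDA_DEVICE_COUNT=2          # number of remote GPUs the application will see
export RCUDA_DEVICE_0=gpuserver1:0   # device 0: GPU 0 of host gpuserver1
export RCUDA_DEVICE_1=gpuserver2:0   # device 1: GPU 0 of host gpuserver2
./cuda_app                           # unmodified CUDA binary, linked against the rCUDA client library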
Basics of remote GPU virtualization

[Figure: the CUDA application runs unmodified on the client node; its CUDA calls are intercepted and forwarded over the network to a server node, which executes them on a real GPU and returns the results]
Remote GPU virtualization vision

Remote GPU virtualization allows a new vision of a GPU deployment, moving from the usual cluster configuration:

[Figure: usual configuration: every node contains a CPU, main memory, a network interface, and PCIe-attached GPUs with their own GPU memory, all nodes linked by the interconnection network]

to the following one …
[Figure: physical configuration vs. logical configuration: physically, GPUs remain installed in particular nodes; logically, every node appears connected to all the GPUs in the cluster through the interconnection network]
Remote GPU virtualization frameworks

Several efforts to implement remote GPU virtualization have been made in recent years:
• Publicly available: rCUDA (CUDA 7.0), GVirtuS (CUDA 3.2), DS-CUDA (CUDA 4.1)
• NOT publicly available: vCUDA (CUDA 1.1), GViM (CUDA 1.1), GridCUDA (CUDA 2.3), V-GPU (CUDA 4.0)
Outline
2nd: Is “remote GPU virtualization” useful?
1: more GPUs for a single application

GPU virtualization is useful for multi-GPU applications:
• Without GPU virtualization, only the GPUs in the node can be provided to the application.
• With GPU virtualization, many GPUs in the cluster can be provided to the application.

[Figure: side-by-side cluster diagrams contrasting node-local GPU access with cluster-wide logical GPU connections]
1: more GPUs for a single application

64 GPUs!
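The 64-GPU run is simply the same client-side configuration scaled up: the application sees as many CUDA devices as the environment advertises. A hypothetical sketch, assuming 16 servers named node1..node16 with 4 GPUs each (names and layout invented for illustration):

export RCUDA_DEVICE_COUNT=64
for i in $(seq 0 63); do
    # map logical device i to GPU (i mod 4) of server node(i/4 + 1)
    export RCUDA_DEVICE_$i=node$((i / 4 + 1)):$((i % 4))
done
./monteCarloMultiGPU   # the unmodified application now enumerates 64 CUDA devices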
1: more GPUs for a single application

MonteCarlo Multi-GPU (from the NVIDIA samples), FDR InfiniBand + NVIDIA Tesla K20.

[Figure: execution time (lower is better) and throughput (higher is better) as more rCUDA GPUs are used]
2: increased cluster performance

[Figure: physical configuration vs. logical configuration of a cluster with virtualized GPUs: jobs are no longer constrained to the GPUs of the node they run on]
3: easier cluster upgrade

• A cluster without GPUs may be easily upgraded to use GPUs with rCUDA

[Figure: a cluster whose nodes have no GPUs]
[Figure: the same cluster after attaching a GPU-enabled box, whose GPUs become available to every node through rCUDA]
4: GPU task migration

Box A has 4 GPUs but only one is busy; Box B has 8 GPUs but only two are busy.
1. Move jobs from Box B to Box A and switch off Box B.
2. Migration should be transparent to applications (decided by the global scheduler).

TRUE GREEN GPU COMPUTING

[Figure: jobs migrating from Box B to Box A]
5: virtual machines can easily access GPUs

[Figure: virtual machines accessing GPUs through rCUDA, both when a high-performance network is available and when only a low-performance network is available]
Outline
3rd: Cons of “remote GPU virtualization”?
Problem with remote GPU virtualization

The main drawback of remote GPU virtualization is the reduced bandwidth to the remote GPU.

[Figure: a node without a GPU reaching a remote GPU, with the network link taking the place of the local PCIe link]
Using InfiniBand networks
Impact of optimizations on rCUDA

[Figure: H2D and D2H bandwidth for pageable and pinned memory, comparing rCUDA EDR Orig/Opt and rCUDA X3 Orig/Opt; with pinned memory, almost 100% of the available bandwidth is attained in both directions]
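The pageable/pinned gap above comes from standard CUDA behavior: pinned (page-locked) host buffers can be moved by DMA at full link speed, while pageable buffers are staged through an internal copy. A minimal fragment showing both allocation styles (generic CUDA, assuming d and bytes are defined as in a full program):

float *h_pageable = (float *)malloc(bytes);                      // pageable: staged copies, lower bandwidth
float *h_pinned;
cudaHostAlloc((void **)&h_pinned, bytes, cudaHostAllocDefault);  // pinned: direct DMA, near-full bandwidth
cudaMemcpy(d, h_pageable, bytes, cudaMemcpyHostToDevice);        // slower H2D transfer
cudaMemcpy(d, h_pinned, bytes, cudaMemcpyHostToDevice);          // H2D transfer at almost 100% of link bandwidth
cudaFreeHost(h_pinned);
free(h_pageable);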
Effect of rCUDA optimizations on applications

• Several applications executed with CUDA and rCUDA
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand
Outline
4th: What happens at the cluster level?
rCUDA at cluster level … Slurm

• GPUs can be shared among jobs running in remote clients
• A job scheduler is required for coordination: Slurm

[Figure: applications App 1 … App 9 scheduled across the cluster, sharing the available remote GPUs]
rCUDA at cluster level … Slurm

• A new resource has been added: rgpu

slurm.conf:
SelectType=select/cons_rgpu
SelectTypeParameters=CR_CORE
GresTypes=rgpu[,gpu]
NodeName=node1 NodeHostname=node1 CPUs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=32072 Gres=rgpu:1[,gpu:1]

gres.conf:
Name=rgpu File=/dev/nvidia0 Cuda=3.5 Mem=4726M
[Name=gpu File=/dev/nvidia0]

New submission options:
--rcuda-mode=(shared|excl)
--gres=rgpu(:X(:Y)?(:Z)?)?
where X = [1-9]+[0-9]* (number of GPUs), Y = [1-9]+[0-9]*[kKmMgG] (GPU memory), Z = [1-9]\.[0-9](cc|CC) (minimum compute capability)
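Putting the options together, a submission asking for two shared remote GPUs, each with at least 2 GB of GPU memory and compute capability 3.5, might look as follows (a sketch matching the syntax above; job_script.sh is a placeholder, and the rgpu gres exists only in the rCUDA-enabled Slurm, not in stock Slurm):

$ sbatch --gres=rgpu:2:2G:3.5cc --rcuda-mode=shared job_script.sh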
Ongoing work: studying rCUDA+Slurm

Analysis of different GPU assignment policies:
• Based on GPU-memory occupancy
• Based on GPU utilization

Applications used for tests:
Set 1 (short execution time):
• GPU-Blast (21 seconds; 1 GPU; 1599 MB)
• LAMMPS (15 seconds; 4 GPUs; 876 MB)
• MCUDA-MEME (165 seconds; 4 GPUs; 151 MB)
• GROMACS (167 seconds; non-GPU)
Set 2 (long execution time):
• NAMD (11 minutes; non-GPU)
• BarraCUDA (10 minutes; 1 GPU; 3319 MB)
• GPU-LIBSVM (5 minutes; 1 GPU; 145 MB)
• MUMmerGPU (5 minutes; 1 GPU; 2804 MB)
Test bench for studying rCUDA+Slurm

• InfiniBand ConnectX-3 FDR based cluster
• Dual-socket E5-2620 v2 Intel Xeon based nodes:
  • 1 node without GPU, hosting the main Slurm controller
  • 16 nodes with one NVIDIA K20 GPU each
• Three workloads: Set 1, Set 2, Set 1 + Set 2
• Three workload sizes: Small (100 jobs), Medium (200 jobs), Large (400 jobs)
Results for Set 1: execution time

[Figure: execution time (lower is better) with 4, 8, and 16 nodes]
Reducing the amount of GPUs

[Figure: results when the number of GPUs is reduced, showing reductions of 35%, 39%, and 41%]
Results for Set 1: energy consumption

[Figure: energy consumption (lower is better) with 4, 8, and 16 nodes]
Energy when removing GPUs

[Figure: energy consumption when GPUs are removed, showing reductions of 39%, 42%, and 44%]
Outline
5th: … in summary …
Slurm + rCUDA allow …

• High Throughput Computing: sharing remote GPUs makes applications execute more slowly, BUT more throughput (jobs/time) is achieved
• Green Computing: GPU migration and application migration allow devoting just the required computing resources to the current workload
• More flexible system upgrades: GPU and CPU updates become independent from each other; attaching GPU boxes to non-GPU-enabled clusters is possible
• Datacenter administrators can choose between HPC and HTC
Get a free copy of rCUDA at http://www.rcuda.net
@rcuda_
More than 650 requests worldwide

rCUDA team: Sergio Iserte, Carlos Reaño, Javier Prades, Fernando Campos
rCUDA is owned by Technical University of Valencia

Thanks! Questions?