Increasing cluster throughput with Slurm and rCUDA

Federico Silla, Technical University of Valencia

Spain


Slurm User Group Meeting 2015. September 15th 3/37

Outline

1st: What is “rCUDA”?


Slurm User Group Meeting 2015. September 15th 4/37

Basics of GPU computing

Basic behavior of CUDA

[Figure: basic CUDA behavior, with the application running on the CPU and offloading computation to the GPU]
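To make this basic behavior concrete, here is a minimal sketch of a CUDA program (the inc kernel and the sizes are hypothetical, shown only for illustration): the host allocates GPU memory, copies the input there, launches a kernel, and copies the result back.

    #include <cuda_runtime.h>
    #include <stdio.h>

    /* Trivial kernel: each thread increments one element. */
    __global__ void inc(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] += 1.0f;
    }

    int main(void) {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);
        float *h = (float *)malloc(bytes);                /* host buffer */
        for (int i = 0; i < n; i++) h[i] = (float)i;

        float *d;
        cudaMalloc(&d, bytes);                            /* allocate GPU memory */
        cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);  /* host-to-device copy */
        inc<<<(n + 255) / 256, 256>>>(d, n);              /* kernel runs on the GPU */
        cudaMemcpy(h, d, bytes, cudaMemcpyDeviceToHost);  /* device-to-host copy */

        printf("h[0] = %f\n", h[0]);                      /* expect 1.0 */
        cudaFree(d);
        free(h);
        return 0;
    }

With rCUDA, a program like this runs unmodified: its CUDA calls are forwarded over the network to a GPU hosted in another node.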


Slurm User Group Meeting 2015. September 15th 5/37

rCUDA: remote GPU virtualization

A software technology that enables a more flexible use of GPUs in computing facilities

[Figure: a node without a local GPU (“No GPU”) running CUDA applications on GPUs installed in other nodes]


Slurm User Group Meeting 2015. September 15th 6/37

Basics of remote GPU virtualization


Slurm User Group Meeting 2015. September 15th 7/37

Basics of remote GPU virtualization


Slurm User Group Meeting 2015. September 15th 8/37

Remote GPU virtualization vision

Remote GPU virtualization allows a new vision of GPU deployment, moving from the usual cluster configuration:

[Figure: the usual cluster configuration, in which every node has its own CPU, main memory, network interface, and PCIe-attached GPUs with their GPU memory, all nodes joined by an interconnection network]

to the following one ….


Slurm User Group Meeting 2015. September 15th 9/37

Remote GPU virtualization vision

[Figure: the physical configuration of the cluster contrasted with the logical configuration, in which logical connections over the interconnection network decouple the GPUs from the nodes so that any node can use any GPU]


Slurm User Group Meeting 2015. September 15th 10/37

Remote GPU virtualization frameworks

Several efforts have been made to implement remote GPU virtualization over the last few years, some publicly available and some not:

• rCUDA (CUDA 7.0)
• GVirtuS (CUDA 3.2)
• DS-CUDA (CUDA 4.1)
• vCUDA (CUDA 1.1)
• GViM (CUDA 1.1)
• GridCUDA (CUDA 2.3)
• V-GPU (CUDA 4.0)


Slurm User Group Meeting 2015. September 15th 11/37

Outline

2nd: Is “remote GPU virtualization” useful?


Slurm User Group Meeting 2015. September 15th 12/37

1: more GPUs for a single application

GPU virtualization is useful for multi-GPU applications:

• Without GPU virtualization, only the GPUs in the node can be provided to the application.
• With GPU virtualization, many GPUs in the cluster can be provided to the application.

[Figure: the same multi-GPU application shown twice, once restricted to the GPUs of its own node and once using GPUs from every node through logical connections over the interconnection network]
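As a sketch of what this looks like from the application side (hypothetical snippet, not from the talk): a multi-GPU CUDA application simply enumerates the devices it sees and spreads work across them; when running over rCUDA, that device count can include GPUs installed in other nodes of the cluster.

    #include <cuda_runtime.h>
    #include <stdio.h>

    int main(void) {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);     /* with rCUDA, remote GPUs show up here too */
        printf("Visible CUDA devices: %d\n", ndev);

        /* Give each visible device (local or remote) an independent chunk of work. */
        for (int d = 0; d < ndev; d++) {
            cudaSetDevice(d);
            float *buf;
            cudaMalloc(&buf, 64u << 20);     /* per-device buffer (64 MB) */
            cudaMemset(buf, 0, 64u << 20);   /* ... real per-device kernels would go here ... */
            cudaFree(buf);
        }
        return 0;
    }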


Slurm User Group Meeting 2015. September 15th 13/37

1: more GPUs for a single application

64 GPUs!


Slurm User Group Meeting 2015. September 15th 14/37

1: more GPUs for a single application

MonteCarlo Multi-GPU (from NVIDIA samples), FDR InfiniBand + NVIDIA Tesla K20

[Figure: MonteCarlo Multi-GPU results; one chart where lower is better and one where higher is better]


Slurm User Group Meeting 2015. September 15th 15/37

2: increased cluster performance

[Figure: physical configuration versus logical configuration; with logical connections over the interconnection network, all the GPUs in the cluster can be pooled and shared among the nodes]


Slurm User Group Meeting 2015. September 15th 16/37

3: easier cluster upgrade

• A cluster without GPUs may be easily upgraded to use GPUs with rCUDA

[Figure: a cluster whose nodes have no GPUs (“No GPU”)]


Slurm User Group Meeting 2015. September 15th 17/37

3: easier cluster upgrade

• A cluster without GPUs may be easily upgraded to use GPUs with rCUDA

[Figure: the same cluster after adding GPU-enabled boxes whose GPUs are made available to all nodes through rCUDA]


Slurm User Group Meeting 2015. September 15th 18/37

4: GPU task migration

Box A has 4 GPUs but only one is busy. Box B has 8 GPUs but only two are busy.

1. Move jobs from Box B to Box A and switch off Box B.
2. Migration should be transparent to applications (decided by the global scheduler).

TRUE GREEN GPU COMPUTING

[Figure: Box A and Box B, with the jobs on Box B migrated to Box A so that Box B can be switched off]


Slurm User Group Meeting 2015. September 15th 19/37

5: virtual machines can easily access GPUs

[Figure: virtual machines accessing remote GPUs, in one scenario with a high performance network available and in another with only a low performance network available]


Slurm User Group Meeting 2015. September 15th 20/37

Outline

3rd: Cons of “remote GPU virtualization”?


Slurm User Group Meeting 2015. September 15th 21/37

Problem with remote GPU virtualization

The main drawback of remote GPU virtualization is the reduced bandwidth to the remote GPU.

[Figure: a node without a local GPU (“No GPU”) accessing a remote GPU across the network]


Slurm User Group Meeting 2015. September 15th 22/37

Using InfiniBand networks


Slurm User Group Meeting 2015. September 15th 23/37

Impact of optimizations on rCUDA

[Figure: bandwidth of the original (“Orig”) and optimized (“Opt”) versions of rCUDA over EDR and ConnectX-3 (“X3”) InfiniBand, for host-to-device (H2D) and device-to-host (D2H) copies with pageable and pinned memory; with pinned memory, almost 100% of the available bandwidth is attained in both directions]
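To benefit from the pinned-memory bandwidth shown above, the application has to allocate its host buffers as page-locked memory rather than with plain malloc. A minimal sketch of the difference (hypothetical buffer size, not from the talk):

    #include <cuda_runtime.h>
    #include <stdlib.h>

    int main(void) {
        const size_t bytes = 256u << 20;             /* 256 MB test buffer */
        float *dev, *pageable, *pinned;

        cudaMalloc(&dev, bytes);
        pageable = (float *)malloc(bytes);           /* pageable host memory */
        cudaMallocHost(&pinned, bytes);              /* page-locked (pinned) host memory */

        /* Pageable copy: the runtime stages data through an internal buffer,
           so the transfer does not reach the full link bandwidth. */
        cudaMemcpy(dev, pageable, bytes, cudaMemcpyHostToDevice);

        /* Pinned copy: data is sent directly from the user buffer; this is the
           case in which rCUDA attains almost 100% of the available bandwidth. */
        cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);

        cudaFreeHost(pinned);
        free(pageable);
        cudaFree(dev);
        return 0;
    }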


Slurm User Group Meeting 2015. September 15th 24/37

Effect of rCUDA optimizations on applications

• Several applications executed with CUDA and rCUDA
• K20 GPU and FDR InfiniBand
• K40 GPU and EDR InfiniBand


Slurm User Group Meeting 2015. September 15th 25/37

Outline

4th: What happens at the cluster level?


Slurm User Group Meeting 2015. September 15th 26/37

rCUDA at cluster level … Slurm

• GPUs can be shared among jobs running in remote clients
• Job scheduler required for coordination
• Slurm

[Figure: applications App 1 to App 9 scheduled by Slurm onto GPUs shared across the cluster]


Slurm User Group Meeting 2015. September 15th 27/37

rCUDA at cluster level … Slurm

• A new resource has been added: rgpu

slurm.conf:
  SelectType=select/cons_rgpu
  SelectTypeParameters=CR_CORE
  GresTypes=rgpu[,gpu]
  NodeName=node1 NodeHostname=node1 CPUs=12 Sockets=2 CoresPerSocket=6 ThreadsPerCore=1 RealMemory=32072 Gres=rgpu:1[,gpu:1]

gres.conf:
  Name=rgpu File=/dev/nvidia0 Cuda=3.5 Mem=4726M
  [Name=gpu File=/dev/nvidia0]

New submission options:
  --rcuda-mode=(shared|excl)
  --gres=rgpu(:X(:Y)?(:Z)?)?
  with X = [1-9]+[0-9]*            (number of remote GPUs)
       Y = [1-9]+[0-9]*[kKmMgG]    (GPU memory)
       Z = [1-9]\.[0-9](cc|CC)     (CUDA compute capability)


Slurm User Group Meeting 2015. September 15th 28/37

Ongoing work: studying rCUDA+Slurm

Analysis of different GPU assignment policies:
• Based on GPU-memory occupancy
• Based on GPU utilization

Applications used for tests:
• Set 1 (short execution time): GPU-Blast (21 seconds; 1 GPU; 1599 MB), LAMMPS (15 seconds; 4 GPUs; 876 MB), MCUDA-MEME (165 seconds; 4 GPUs; 151 MB), GROMACS (167 seconds; non-GPU)
• Set 2 (long execution time): NAMD (11 minutes; non-GPU), BarraCUDA (10 minutes; 1 GPU; 3319 MB), GPU-LIBSVM (5 minutes; 1 GPU; 145 MB), MUMmerGPU (5 minutes; 1 GPU; 2804 MB)


Slurm User Group Meeting 2015. September 15th 29/37

Test bench for studying rCUDA+Slurm

• InfiniBand ConnectX-3 FDR based cluster
• Dual-socket Intel Xeon E5-2620 v2 based nodes:
  • 1 node without GPU, hosting the main Slurm controller
  • 16 nodes with one NVIDIA K20 GPU each
• Three workloads: Set 1, Set 2, Set 1 + Set 2
• Three workload sizes: Small (100 jobs), Medium (200 jobs), Large (400 jobs)


Slurm User Group Meeting 2015. September 15th 30/37

Results for Set 1: execution time

[Figure: execution time (lower is better) for Set 1 on 4, 8, and 16 nodes]


Slurm User Group Meeting 2015. September 15th 31/37

Reducing the amount of GPUs

[Figure: results when reducing the amount of GPUs, labelled “35% Less”, “39% Less”, and “41% Less”]


Slurm User Group Meeting 2015. September 15th 32/37

Results for Set 1: energy consumption

[Figure: energy consumption (lower is better) for Set 1 on 4, 8, and 16 nodes]


Slurm User Group Meeting 2015. September 15th 33/37

Energy when removing GPUs

[Figure: energy consumption when removing GPUs, labelled “39% Less”, “42% Less”, and “44% Less”]


Slurm User Group Meeting 2015. September 15th 34/37

Outline

5th: … in summary …


Slurm User Group Meeting 2015. September 15th 35/37

Slurm + rCUDA allow …

• High Throughput Computing: sharing remote GPUs makes applications execute slower … BUT more throughput (jobs/time) is achieved
• Green Computing: GPU migration and application migration allow devoting just the required computing resources to the current workload
• More flexible system upgrades: GPU and CPU updates become independent from each other; attaching GPU boxes to non GPU-enabled clusters is possible
• Datacenter administrators can choose between HPC and HTC


Slurm User Group Meeting 2015. September 15th 36/37

Get a free copy of rCUDA at http://www.rcuda.net

@rcuda_

More than 650 requests worldwide

Sergio Iserte, Carlos Reaño, Javier Prades, Fernando Campos

rCUDA is owned by Technical University of Valencia


Slurm User Group Meeting 2015. September 15th 37/37

Thanks! Questions?