HPC Workflows Using Containers
HPC Advisory Council, Lugano, Switzerland
Lucas Benedicic, Felipe A. Cruz, Alberto Madonna, Kean Mariotti - CSCS
April 12th, 2017
Outline
1. Background
2. GPU Support
3. MPI Support
4. Deployments at Scale
5. Conclusion
Background
About Docker
▪ Process Container
  ▪ It uses Linux kernel features to create semi-isolated “containers”.
  ▪ Captures all application requirements.
▪ Image Management
  ▪ Easy-to-use recipe file (sketched below).
  ▪ Version-control driven image creation.
▪ Environment
  ▪ Push and pull images from a community-driven hub (e.g., DockerHub).
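A minimal sketch of that workflow; the recipe contents and the image name (myuser/myapp) are illustrative, not from the talk:

$ cat Dockerfile
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y python
$ docker build -t myuser/myapp:1.0 .   # version-controlled recipe drives image creation
$ docker push myuser/myapp:1.0         # publish the image to DockerHub
$ docker pull myuser/myapp:1.0         # retrieve it on any other machine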
About Shifter (1)
▪ A runtime to increase flexibility and usability of HPC systems by enabling the deployment of Docker-like Linux containers
▪ Originally developed at NERSC by D. Jacobsen and S. Canon
▪ Flexibility
  ▪ Enable the definition of complex software stacks using different Linux flavors.
  ▪ Develop an application on your laptop and run it on an HPC system (sketched after this list).
▪ Integration
  ▪ Availability of shared resources (e.g., parallel file systems, accelerator devices and network interfaces).
▪ Compatibility
  ▪ Integration with public image repositories, e.g., DockerHub.
  ▪ Improving result reproducibility.
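A sketch of the laptop-to-HPC path described above; the image name is hypothetical, and shifterimg/shifter are invoked as on the later slides:

# On the laptop: build and publish the image
$ docker build -t myuser/myapp:1.0 .
$ docker push myuser/myapp:1.0
# On the HPC system: pull the image and run it under the workload manager
$ shifterimg pull myuser/myapp:1.0
$ srun shifter --image=myuser/myapp:1.0 ./myapp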
About Shifter (2)
▪ But containers are hardware- and platform-agnostic by design.
  ▪ How do we go about accessing specialized hardware like GPUs?
▪ CSCS and NVIDIA co-designed a solution that provides:
  ▪ direct access to the GPU character devices;
  ▪ automatic discovery of the required libraries at runtime (spot checks sketched below).
▪ NVIDIA’s DGX-1 software stack is based on this solution.
▪ CSCS extended this design to the MPI stack.
  ▪ Supports different versions of MPICH-based implementations.
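Two hypothetical spot checks of this design from inside a container (commands, output and paths are illustrative, not from the talk): the GPU character devices should be visible, and the loader should resolve the driver library to a host-provided copy.

$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 sh -c 'ls /dev/nvidia*'
/dev/nvidia0  /dev/nvidiactl  /dev/nvidia-uvm
$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 sh -c 'ldd ./deviceQuery | grep libcuda'
        libcuda.so.1 => /path/injected/by/shifter/libcuda.so.1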
GPU Support
GPU device access (1)

▪ Starting simple using the official Nvidia CUDA 8 image

FROM nvidia/cuda:8.0

RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-samples-$CUDA_PKG_VERSION && \
    rm -rf /var/lib/apt/lists/*

RUN (cd /usr/local/cuda/samples/1_Utilities/deviceQuery && make)
RUN (cd /usr/local/cuda/samples/1_Utilities/deviceQueryDrv && make)
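The Dockerfile above can be built and tagged to match the image pulled on the next slide (a sketch; the actual build invocation is not part of the talk):

$ docker build -t ethcscs/dockerfiles:cudasamples8.0 .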
GPU device access (2)

▪ Docker on a Laptop

$ nvidia-docker pull ethcscs/dockerfiles:cudasamples8.0
8.0-devel-ubuntu14.04: Pulling from ethcscs/dockerfiles

$ nvidia-docker run ethcscs/dockerfiles:cudasamples8.0 ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 940MX"
  CUDA Driver Version / Runtime Version    8.0 / 8.0
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce 940MX
Result = PASS
GPU device access (3)

▪ Same image deployed on a GPU cluster with Shifter and SLURM

$ shifterimg pull ethcscs/dockerfiles:cudasamples8.0
Pulling Image: docker:ethcscs/dockerfiles:cudasamples8.0 …

$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40m"
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K40m
Result = PASS
GPU device access (4)

▪ On the GPU device numbering
  ▪ A must-have for application portability.
  ▪ A containerized application will consistently access the GPUs starting from ID 0 (zero).

$ export CUDA_VISIBLE_DEVICES=0,2
$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40m"
[...]
Device 1: "Tesla K80"
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K80
Result = PASS
GPU device access (5)

▪ N-body double-precision calculation using 200k bodies, single node.
▪ GPU-accelerated runs using the official CUDA image from DockerHub (invocation sketched below).
▪ Relative GFLOP/s performance comparing the laptop result against different HPC systems.

          Laptop*            GPU cluster (K40)   Multi-GPU cluster (K40-K80)   Piz Daint (P100)
Native    18.34 [GFLOP/s]    46.78x              103.34x                       149.01x
Shifter   18.34 [GFLOP/s]    46.80x              103.33x                       149.04x

*Laptop run using nvidia-docker
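The talk does not show the benchmark invocation; with the stock CUDA samples, an equivalent single-node run might look like this (the flags are those of the bundled nbody sample):

$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 \
      ./nbody -benchmark -fp64 -numbodies=200000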
GPU device access (6)
▪ TensorFlow: a software library for building and training neural networks
▪ GPU-accelerated runs using the official TensorFlow image (run sketched below the table)
▪ Performance relative to the Laptop wall-clock time
Test case                    Laptop*            GPU cluster (K40)   Piz Daint (P100)
MNIST, TF tutorial           613 [seconds]      5.89x               17.17x
CIFAR-10, 100k iterations    23359 [seconds]    2.62x               3.73x

*Laptop run using nvidia-docker
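A hypothetical sketch of such a containerized TensorFlow run (the image tag and script path are assumptions; the exact commands are not shown in the talk):

$ shifterimg pull tensorflow/tensorflow:1.0.0-gpu
$ srun shifter --image=tensorflow/tensorflow:1.0.0-gpu \
      python ./models/tutorials/image/mnist/convolutional.py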
MPI Support
MPI Support (1)
▪ Based on the MPICH ABI compatibility initiative (an image-build sketch follows this list)
  ▪ MPICH v3.1 (released February 2014)
  ▪ IBM MPI v2.1 (released December 2014)
  ▪ Intel MPI Library v5.0 (released June 2014)
  ▪ CRAY MPT v7.0.0 (released June 2014)
  ▪ MVAPICH2 v2.0 (released June 2014)
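A sketch of how an MPICH-based benchmark image might be built for the run on the next slide (base image, package names and source layout are assumptions; any MPICH-ABI-compatible stack would do):

$ cat Dockerfile
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y build-essential mpich libmpich-dev
COPY osu-micro-benchmarks /opt/osu
RUN cd /opt/osu && ./configure CC=mpicc CXX=mpicxx && make && make install
$ docker build -t osu-benchmarks-image .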
MPI Support (2)

$ srun -n 2 shifter --mpi --image=osu-benchmarks-image ./osu_latency

▪ The MPI library from the container is swapped by Shifter at run time (a verification sketch follows).
▪ An ABI-compatible MPI from the host system is used.
▪ Hardware acceleration is enabled.
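One hypothetical way to confirm the swap: inspect the loader bindings inside the container; the MPICH ABI soname should resolve to a host-provided library (output and path are illustrative):

$ srun shifter --mpi --image=osu-benchmarks-image sh -c 'ldd ./osu_latency | grep libmpi'
        libmpi.so.12 => /path/to/host/mpi/libmpi.so.12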
GPU + MPI Support - PyFR
▪ Python-based framework for solving advection-diffusion type problems on streaming architectures
▪ 2016 Gordon Bell Prize finalist application
▪ Successful GPU- and MPI-accelerated runs using the same container image (launch sketched after the table)
▪ Parallel efficiency for a 10-GB test case on different systems
Number of nodes   Laptop   GPU cluster (K40)   Piz Daint (P100)
1                 -        1.000               1.000
2                 -        0.987               0.975
4                 -        -                   0.964
8                 -        -                   0.927
16                -        -                   0.874
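A hypothetical multi-node launch of such a case (the image name and input files are illustrative; pyfr run -b cuda is PyFR's standard CUDA-backend invocation):

$ srun -n 16 shifter --mpi --image=pyfr-image \
      pyfr run -b cuda mesh.pyfrm config.ini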
Deployments at Scale
Pynamic*
▪ Test startup time of workloads by simulating DLL behaviour of Python applications
▪ Compare wall-clock time for container and native
▪ Wall-clock time of pynamic test includes: pynamic startup, python module import, and visit of simulated DLLs
▪ Over 3000 MPI processes
*Pynamic parameters: 495 shared object files; 1950 avg. functions per object; 215 math-like library files; 1950 avg. math functions; function name length ~100 characters.
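For reference, a Pynamic trial at that scale might be launched as below (the executable and driver names follow the Pynamic distribution; the image name is illustrative):

$ srun -n 3072 shifter --image=pynamic-image \
      ./pynamic-pyMPI ./pynamic_driver.py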
Conclusion
Conclusion
▪ Containers are here to stay.
▪ The Docker+Shifter combination takes us closer to turn-key, cloud-like computing with scalability and high performance.
▪ The use cases shown highlighted:
  ▪ bare-metal provisioning;
  ▪ ready-to-use, high-performance software stacks;
  ▪ network file systems support;
  ▪ access to hardware accelerators like GPUs and to the high-speed interconnect through MPI.
Thank you for your attention