
HPC Workflows Using Containers
HPC Advisory Council, Lugano, Switzerland

Lucas Benedicic, Felipe A. Cruz, Alberto Madonna, Kean Mariotti - CSCS
April 12th, 2017


Outline

1. Background

2. GPU Support

3. MPI Support

4. Deployments at Scale

5. Conclusion


Background


About Docker

▪ Process Container
  ▪ It uses Linux kernel features to create semi-isolated “containers”.
  ▪ Captures all application requirements.

▪ Image Management
  ▪ Easy-to-use recipe file.
  ▪ Version-control driven image creation.

▪ Environment
  ▪ Push and pull images from a community-driven hub (i.e., DockerHub).
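As a rough sketch of that workflow (the image name and the recipe in the current directory are hypothetical; the commands themselves are standard Docker usage, not specific to this talk):

$ docker build -t myuser/myapp:1.0 .    # build an image from the recipe (Dockerfile) in the current directory
$ docker push myuser/myapp:1.0          # publish the image to DockerHub
$ docker pull myuser/myapp:1.0          # retrieve the same image on any other machine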


About Shifter (1)

▪ A runtime to increase flexibility and usability of HPC systems by enabling the deployment of Docker-like Linux containers

▪ Originally developed at NERSC by D. Jacobsen and S. Canon

▪ Flexibility
  ▪ Enable the definition of complex software stacks using different Linux flavors
  ▪ Develop an application on your laptop and run it on an HPC system (see the sketch below)

▪ Integration
  ▪ Availability of shared resources (e.g., parallel file systems, accelerator devices, and network interfaces)

▪ Compatibility
  ▪ Integration with public image repositories, e.g., DockerHub
  ▪ Improving result reproducibility
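A minimal sketch of the HPC-side half of that laptop-to-HPC workflow, assuming an image named myuser/myapp:1.0 has already been built and pushed to DockerHub (the image name and executable are placeholders; shifterimg and the srun shifter syntax are the ones shown later in this talk):

$ shifterimg pull myuser/myapp:1.0                          # make the image available on the HPC system
$ srun shifter --image=myuser/myapp:1.0 ./my_application    # run it through the workload manager (SLURM)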


About Shifter (2)

▪ But containers are hardware- and platform-agnostic by design
  ▪ How do we go about accessing specialized hardware like GPUs?

▪ CSCS and NVIDIA co-designed a solution that provides:
  ▪ direct access to the GPU character devices;
  ▪ automatic discovery of the required libraries at runtime.
  ▪ NVIDIA’s DGX-1 software stack is based on this solution.

▪ CSCS extended this design to the MPI stack
  ▪ Supports different versions of MPICH-based implementations


GPU Support


GPU device access (1)

▪ Starting simple using the official NVIDIA CUDA 8 image

FROM nvidia/cuda:8.0

RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-samples-$CUDA_PKG_VERSION && \
    rm -rf /var/lib/apt/lists/*

RUN (cd /usr/local/cuda/samples/1_Utilities/deviceQuery && make)
RUN (cd /usr/local/cuda/samples/1_Utilities/deviceQueryDrv && make)

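A hedged sketch of how the recipe above would typically be turned into the image used on the following slides (the tag matches the one pulled there; the build and push steps are standard Docker usage and are shown here only for completeness):

$ docker build -t ethcscs/dockerfiles:cudasamples8.0 .    # build the image from the Dockerfile above
$ docker push ethcscs/dockerfiles:cudasamples8.0          # publish it to DockerHub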

GPU device access (2)

▪ Docker on a Laptop

$ nvidia-docker pull ethcscs/dockerfiles:cudasamples8.0
8.0-devel-ubuntu14.04: Pulling from ethcscs/dockerfiles

$ nvidia-docker run ethcscs/dockerfiles:cudasamples8.0 ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 940MX"
  CUDA Driver Version / Runtime Version    8.0 / 8.0
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce 940MX
Result = PASS


GPU device access (3)

▪ Same image deployed on a GPU cluster with Shifter and SLURM

$ shifterimg pull ethcscs/dockerfiles:cudasamples8.0
Pulling Image: docker:ethcscs/dockerfiles:cudasamples8.0 …

$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40m"
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K40m
Result = PASS


GPU device access (4)

▪ On the GPU device numbering
  ▪ Must-have for application portability
  ▪ A containerized application will consistently access the GPUs starting from ID 0 (zero)

$ export CUDA_VISIBLE_DEVICES=0,2
$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./deviceQuery

/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40m"
[...]
Device 1: "Tesla K80"
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K80
Result = PASS


GPU device access (5)

▪ N-body double-precision calculation using 200k bodies, single node.
▪ GPU-accelerated runs using the official CUDA image from DockerHub.
▪ Relative GFLOP/s performance when comparing Laptop results vs. different HPC systems.

            Laptop*            GPU cluster (K40)   Multi-GPU cluster (K40-K80)   Piz Daint (P100)
Native      18.34 [GFLOP/s]    46.78x              103.34x                       149.01x
Shifter     18.34 [GFLOP/s]    46.80x              103.33x                       149.04x

*Laptop run using nvidia-docker
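A sketch of how a run like this might be launched with Shifter (assumptions: the image tag is reused from the earlier slides and is assumed to also contain the CUDA nbody sample; -benchmark, -fp64 and -numbodies are the standard flags of that sample):

$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 \
      ./nbody -benchmark -fp64 -numbodies=200000    # double precision, 200k bodies, reports GFLOP/s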


GPU device access (6)

▪ TensorFlow: a software library capable of building and training neural networks

▪ GPU-accelerated runs using the official TensorFlow image

▪ Performance relative to the Laptop wall-clock time

Test case                   Laptop*           GPU cluster (K40)   Piz Daint (P100)
MNIST, TF tutorial          613 [seconds]     5.89x               17.17x
CIFAR-10, 100k iterations   23359 [seconds]   2.62x               3.73x

*Laptop run using nvidia-docker
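A hedged sketch of how such a run could be launched, assuming the official TensorFlow GPU image from DockerHub; mnist_tutorial.py is a placeholder for whichever tutorial script was actually timed:

$ shifterimg pull tensorflow/tensorflow:latest-gpu
$ srun shifter --image=tensorflow/tensorflow:latest-gpu python ./mnist_tutorial.py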


MPI Support


MPI Support (1)

▪ Based on the MPICH ABI-compatibility initiative
  ▪ MPICH v3.1 (released February 2014)
  ▪ IBM MPI v2.1 (released December 2014)
  ▪ Intel MPI Library v5.0 (released June 2014)
  ▪ CRAY MPT v7.0.0 (released June 2014)
  ▪ MVAPICH2 v2.0 (released June 2014)


$ srun -n 2 shifter --mpi --image=osu-benchmarks-image ./osu_latency

■ MPI library from the container is swapped by Shifter at run time.
■ ABI-compatible MPI from the host system is used.
■ Hardware acceleration is enabled.
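One way to verify the swap (a sketch only; osu-benchmarks-image is the placeholder name from the command above, and the exact library path printed depends on the host MPI):

$ srun -n 1 shifter --mpi --image=osu-benchmarks-image ldd ./osu_latency | grep mpi
# with --mpi, libmpi should resolve to the host's MPICH-compatible library
# rather than the MPICH copy baked into the container image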


MPI Support (2)


GPU + MPI Support - PyFR

▪ Python-based framework for solving advection-diffusion type problems on streaming architectures

▪ 2016 Gordon Bell Prize finalist application

▪ Successful GPU- and MPI-accelerated runs using the same container image

▪ Parallel efficiency for a 10-GB test case on different systems

Number of nodes   Laptop   GPU cluster (K40)   Piz Daint (P100)
1                 -        1.000               1.000
2                 -        0.987               0.975
4                 -        -                   0.964
8                 -        -                   0.927
16                -        -                   0.874
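A multi-node GPU + MPI launch of this kind could look roughly as follows (a sketch only: the image name, mesh and configuration files are placeholders; pyfr run -b cuda is PyFR's standard command line, and --mpi enables the MPI swap described earlier):

$ srun -N 16 --ntasks-per-node=1 shifter --mpi --image=pyfr-image \
      pyfr run -b cuda mesh.pyfrm config.ini    # one GPU-backed PyFR rank per node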


Deployments at Scale


Pynamic*

▪ Test startup time of workloads by simulating DLL behaviour of Python applications

▪ Compare wall-clock time for container and native

▪ Wall-clock time of the Pynamic test includes: Pynamic startup, Python module import, and visiting the simulated DLLs

▪ Over 3000 MPI processes


*Pynamic parameters: 495 shared object files; 1950 avg. functions per object; 215 math-like library files; 1950 avg. math functions; function name length ~100 characters.
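A hedged sketch of the native vs. containerized comparison (the image name and the process count are placeholders; pynamic-pyMPI and pynamic_driver.py are assumed to be the launcher and driver script produced by the Pynamic benchmark):

# native run on the host software stack
$ srun -n 3072 ./pynamic-pyMPI pynamic_driver.py
# same test from a container image, with the host MPI swapped in by Shifter
$ srun -n 3072 shifter --mpi --image=pynamic-image ./pynamic-pyMPI pynamic_driver.py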


Conclusion


Conclusion

▪ Containers are here to stay.

▪ The Docker-Shifter combination takes us closer to turn-key, cloud-like computing with scalability and high performance.

▪ The use cases shown highlighted:
  ▪ bare-metal provisioning;
  ▪ ready-to-use, high-performance software stacks;
  ▪ network file system support;
  ▪ access to hardware accelerators like GPUs and to the high-speed interconnect through MPI.


Thank you for your attention
