HPC Workflows Using Containers
HPC Advisory Council, Lugano, Switzerland
Lucas Benedicic, Felipe A. Cruz, Alberto Madonna, Kean Mariotti - CSCS
April 12th, 2017
Outline
1. Background
2. GPU Support
3. MPI Support
4. Deployments at Scale
5. Conclusion
Background
About Docker
▪ Process Container
  ▪ It uses Linux kernel features to create semi-isolated “containers”.
  ▪ Captures all application requirements.
▪ Image Management
  ▪ Easy-to-use recipe file (sketched below).
  ▪ Version-control driven image creation.
▪ Environment
  ▪ Push and pull images from a community-driven hub (e.g., DockerHub).
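A minimal sketch of that workflow; the recipe contents and the image name (myuser/myapp) are illustrative, not from the talk:

$ cat Dockerfile
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y python
$ docker build -t myuser/myapp:1.0 .   # version-controlled recipe drives image creation
$ docker push myuser/myapp:1.0         # publish the image to DockerHub
$ docker pull myuser/myapp:1.0         # retrieve it on any other machine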
About Shifter (1)
▪ A runtime to increase flexibility and usability of HPC systems by enabling the deployment of Docker-like Linux containers
▪ Originally developed at NERSC by D. Jacobsen and S. Canon
▪ Flexibility
  ▪ Enable the definition of complex software stacks using different Linux flavors.
  ▪ Develop an application on your laptop and run it on an HPC system (sketched after this list).
▪ Integration
  ▪ Availability of shared resources (e.g., parallel file systems, accelerator devices and network interfaces).
▪ Compatibility
  ▪ Integration with public image repositories, e.g., DockerHub.
  ▪ Improving result reproducibility.
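A sketch of the laptop-to-HPC path described above; the image name is hypothetical, and shifterimg/shifter are invoked as on the later slides:

# On the laptop: build and publish the image
$ docker build -t myuser/myapp:1.0 .
$ docker push myuser/myapp:1.0
# On the HPC system: pull the image and run it under the workload manager
$ shifterimg pull myuser/myapp:1.0
$ srun shifter --image=myuser/myapp:1.0 ./myapp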
About Shifter (2)
▪ But containers are hardware- and platform-agnostic by design.
  ▪ How do we go about accessing specialized hardware like GPUs?
▪ CSCS and NVIDIA co-designed a solution that provides:
  ▪ direct access to the GPU character devices;
  ▪ automatic discovery of the required libraries at runtime (spot checks sketched below).
▪ NVIDIA’s DGX-1 software stack is based on this solution.
▪ CSCS extended this design to the MPI stack.
  ▪ Supports different versions of MPICH-based implementations.
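Two hypothetical spot checks of this design from inside a container (commands, output and paths are illustrative, not from the talk): the GPU character devices should be visible, and the loader should resolve the driver library to a host-provided copy.

$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 sh -c 'ls /dev/nvidia*'
/dev/nvidia0  /dev/nvidiactl  /dev/nvidia-uvm
$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 sh -c 'ldd ./deviceQuery | grep libcuda'
        libcuda.so.1 => /path/injected/by/shifter/libcuda.so.1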
GPU Support
GPU device access (1)

▪ Starting simple using the official Nvidia CUDA 8 image

FROM nvidia/cuda:8.0

RUN apt-get update && apt-get install -y --no-install-recommends \
        cuda-samples-$CUDA_PKG_VERSION && \
    rm -rf /var/lib/apt/lists/*

RUN (cd /usr/local/cuda/samples/1_Utilities/deviceQuery && make)
RUN (cd /usr/local/cuda/samples/1_Utilities/deviceQueryDrv && make)
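The Dockerfile above can be built and tagged to match the image pulled on the next slide (a sketch; the actual build invocation is not part of the talk):

$ docker build -t ethcscs/dockerfiles:cudasamples8.0 .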
GPU device access (2)

▪ Docker on a Laptop

$ nvidia-docker pull ethcscs/dockerfiles:cudasamples8.0
8.0-devel-ubuntu14.04: Pulling from ethcscs/dockerfiles

$ nvidia-docker run ethcscs/dockerfiles:cudasamples8.0 ./deviceQuery
./deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "GeForce 940MX"
  CUDA Driver Version / Runtime Version    8.0 / 8.0
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = GeForce 940MX
Result = PASS
GPU device access (3)

▪ Same image deployed on a GPU cluster with Shifter and SLURM

$ shifterimg pull ethcscs/dockerfiles:cudasamples8.0
Pulling Image: docker:ethcscs/dockerfiles:cudasamples8.0 …

$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 1 CUDA Capable device(s)

Device 0: "Tesla K40m"
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 1, Device0 = Tesla K40m
Result = PASS
GPU device access (4)

▪ On the GPU device numbering
  ▪ A must-have for application portability.
  ▪ A containerized application will consistently access the GPUs starting from ID 0 (zero).

$ export CUDA_VISIBLE_DEVICES=0,2
$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 ./deviceQuery
/usr/local/cuda/samples/bin/x86_64/linux/release/deviceQuery Starting...

CUDA Device Query (Runtime API) version (CUDART static linking)

Detected 2 CUDA Capable device(s)

Device 0: "Tesla K40m"
[...]
Device 1: "Tesla K80"
[...]
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 8.0, CUDA Runtime Version = 8.0, NumDevs = 2, Device0 = Tesla K40m, Device1 = Tesla K80
Result = PASS
GPU device access (5)

▪ N-body double-precision calculation using 200k bodies, single node.
▪ GPU-accelerated runs using the official CUDA image from DockerHub (invocation sketched below).
▪ Relative GFLOP/s performance comparing the laptop result against different HPC systems.

          Laptop*            GPU cluster (K40)   Multi-GPU cluster (K40-K80)   Piz Daint (P100)
Native    18.34 [GFLOP/s]    46.78x              103.34x                       149.01x
Shifter   18.34 [GFLOP/s]    46.80x              103.33x                       149.04x

*Laptop run using nvidia-docker
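The talk does not show the benchmark invocation; with the stock CUDA samples, an equivalent single-node run might look like this (the flags are those of the bundled nbody sample):

$ srun shifter --image=ethcscs/dockerfiles:cudasamples8.0 \
      ./nbody -benchmark -fp64 -numbodies=200000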
GPU device access (6)
▪ TensorFlow: a software library for building and training neural networks
▪ GPU-accelerated runs using the official TensorFlow image (run sketched below the table)
▪ Performance relative to the Laptop wall-clock time
Test case                    Laptop*            GPU cluster (K40)   Piz Daint (P100)
MNIST, TF tutorial           613 [seconds]      5.89x               17.17x
CIFAR-10, 100k iterations    23359 [seconds]    2.62x               3.73x

*Laptop run using nvidia-docker
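A hypothetical sketch of such a containerized TensorFlow run (the image tag and script path are assumptions; the exact commands are not shown in the talk):

$ shifterimg pull tensorflow/tensorflow:1.0.0-gpu
$ srun shifter --image=tensorflow/tensorflow:1.0.0-gpu \
      python ./models/tutorials/image/mnist/convolutional.py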
MPI Support
MPI Support (1)
▪ Based on the MPICH ABI compatibility initiative (an image-build sketch follows this list)
  ▪ MPICH v3.1 (released February 2014)
  ▪ IBM MPI v2.1 (released December 2014)
  ▪ Intel MPI Library v5.0 (released June 2014)
  ▪ CRAY MPT v7.0.0 (released June 2014)
  ▪ MVAPICH2 v2.0 (released June 2014)
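A sketch of how an MPICH-based benchmark image might be built for the run on the next slide (base image, package names and source layout are assumptions; any MPICH-ABI-compatible stack would do):

$ cat Dockerfile
FROM ubuntu:16.04
RUN apt-get update && apt-get install -y build-essential mpich libmpich-dev
COPY osu-micro-benchmarks /opt/osu
RUN cd /opt/osu && ./configure CC=mpicc CXX=mpicxx && make && make install
$ docker build -t osu-benchmarks-image .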
MPI Support (2)

$ srun -n 2 shifter --mpi --image=osu-benchmarks-image ./osu_latency

▪ The MPI library from the container is swapped by Shifter at run time (a verification sketch follows).
▪ An ABI-compatible MPI from the host system is used.
▪ Hardware acceleration is enabled.
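One hypothetical way to confirm the swap: inspect the loader bindings inside the container; the MPICH ABI soname should resolve to a host-provided library (output and path are illustrative):

$ srun shifter --mpi --image=osu-benchmarks-image sh -c 'ldd ./osu_latency | grep libmpi'
        libmpi.so.12 => /path/to/host/mpi/libmpi.so.12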
GPU + MPI Support - PyFR
▪ Python-based framework for solving advection-diffusion type problems on streaming architectures
▪ 2016 Gordon Bell Prize finalist application
▪ Successful GPU- and MPI-accelerated runs using the same container image (launch sketched after the table)
▪ Parallel efficiency for a 10-GB test case on different systems
Number of nodes   Laptop   GPU cluster (K40)   Piz Daint (P100)
1                 -        1.000               1.000
2                 -        0.987               0.975
4                 -        -                   0.964
8                 -        -                   0.927
16                -        -                   0.874
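A hypothetical multi-node launch of such a case (the image name and input files are illustrative; pyfr run -b cuda is PyFR's standard CUDA-backend invocation):

$ srun -n 16 shifter --mpi --image=pyfr-image \
      pyfr run -b cuda mesh.pyfrm config.ini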
Deployments at Scale
Pynamic*
▪ Test startup time of workloads by simulating DLL behaviour of Python applications
▪ Compare wall-clock time for container and native
▪ Wall-clock time of pynamic test includes: pynamic startup, python module import, and visit of simulated DLLs
▪ Over 3000 MPI processes
*Pynamic parameters: 495 shared object files; 1950 avg. functions per object; 215 math-like library files; 1950 avg. math functions; function name length ~100 characters.
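For reference, a Pynamic trial at that scale might be launched as below (the executable and driver names follow the Pynamic distribution; the image name is illustrative):

$ srun -n 3072 shifter --image=pynamic-image \
      ./pynamic-pyMPI ./pynamic_driver.py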
Conclusion
Conclusion
▪ Containers are here to stay.
▪ The Docker+Shifter combination takes us closer to turn-key, cloud-like computing with scalability and high performance.
▪ The use cases shown highlighted:
  ▪ bare-metal provisioning;
  ▪ ready-to-use, high-performance software stacks;
  ▪ network file systems support;
  ▪ access to hardware accelerators like GPUs and to the high-speed interconnect through MPI.
Thank you for your attention