Operational Robustness of Accelerator Aware...

16

Operational Robustness of Accelerator Aware MPI Sadaf Alam Swiss National Supercomputing Centre (CSSC) Switzerland 2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Upload
others
Category

Documents
view
2
download
0

Embed Size (px):

Transcript of Operational Robustness of Accelerator Aware...

Page 1: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Operational Robustness of Accelerator Aware MPI

Sadaf Alam Swiss National Supercomputing Centre (CSSC) Switzerland 2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 2: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Computing Systems @ CSCS

© CSCS 2014 2

and several T&D systems (incl. an IMB IDataPlex M3 with GPU and two dense GPU servers) plus networking and storage infrastructure …

http://www.cscs.ch/computers User Lab: Cray XC30 with GPU devices User Lab: Cray XE6 User Lab R&D: Cray XK7 with GPU devices User Lab: InfiniBand Cluster User Lab: InfiniBand Cluster User Lab: SGI Altix User Lab: Cray XMT User Lab: InfiniBand Cluster with GPU devices Meteo Swiss Cray XE6 EPFL Blue Brain Project IBM BG/Q & viz cluster PASC InfiniBand Cluster LCG Tier2 InfiniBand Cluster

2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 3: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Users & Operational Responsibilities

•  Users priorities:

– Robust and sustainable performance for production-level simulations

– Debugging and performance measurement tools to identify and isolate issues (e.g. TAU, Vampir, vendor tools, DDT, Totalview, etc.)

•  24/7 operational support considerations:

– Monitoring for degradation and failures, isolate components as needed (e.g. Ganglia, customized vendor interfaces)

– Quick diagnostics and fixes of known problems – Alerting mechanisms for on-call services (e.g. Nagios)

© CSCS 2014 3

# Realities of using bleeding edge technologies Tools primarily available for non-‐accelerated clusters running MPI & OpenMP applications plus processors, memories, NICs, ...

2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 4: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Quick Intro to GPU Aware MPI •  Users point of view

– MPI_xxxx (data pointers on GPU, …) – No need for explicit copies to/from device

•  For details and up-to-date info please refer to the MUG ‘14 tutorial yesterday (25-Aug-2014)

•  Usage of GPU aware MPI

– GPU accelerated MD packages & other CUDA applications @ CSCS – OpenACC applications, example, S3D

© CSCS 2014 4 2nd Annual MVAPICH User Group (MUG) Meeting, 2014

First CUDA compiler SDK was released in 2007 GPU Aware MVAPICH2 became publically available in 2010 PGI first OpenACC compiler was made available in 2012 MPI standard was first published in 1993

Page 5: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Evolution of GPU Awareness in MVAPICH2 •  MVAPICH2 1.6 (2010-2011)

– Optimization and enhanced performance for clusters with NVIDIA GPU adapters (with and without GPUDirect technology)

•  MVAPICH2 1.8 (2011-2012)

– Intra-node communication from GPU device buffers using CUDA IPC for better performance and correctness

– Enabled shared memory communication for host transfers when CUDA is enabled

– Optimized and tuned collectives for GPU device buffers – Enhanced pipelined inter-node device transfers – Non-contiguous datatype support in point-to-point and collective

communication from GPU buffers

© CSCS 2014 5 2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 6: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Evolution of GPU Awareness in MVAPICH2 •  MVAPICH2 1.9 (2012-2013)

– Pipelined design for intra-node communication performance – Non-blocking CUDA copies for inter-node communication

performance – Optimal communication channel selection (intra-IOH & inter-IOH)

•  MVAPICH2 & MVAPICH2-GDR 2.0 (2013-2014)

– Dynamic CUDA initialization & support for GPU device selection – Multi-rail support for GPU communication – Initialize GPU resources only when used by MPI transfer – Optimized sub-array data-type processing for GPU-to-GPU

communication – Non-blocking streams in asynchronous CUDA transfers – MPI-3 RMA (one-sided) and atomic operations using GPU buffers

with GPUDirect RDMA and pipelining – NVIDIA GDR-COPY module & loopback design for inter-node

communication © CSCS 2014 6 2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 7: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

GPU Aware MPI in Action (1 GPU / node)

© CSCS 2014 7

GPU CPU NIC GPU CPU NIC

network

Inter-node GPU Direct RDMA

2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 8: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

GPU Aware MPI in Action (2 GPUs / node)

© CSCS 2014 8

GPU CPU NIC GPU CPU NIC

network

GPU CPU GPU CPU

QPI QPI

Inter-node GPU Direct RDMA Inter-IOH shared memory

2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 9: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

GPU Aware MPI in Action (8 GPUs / node)

© CSCS 2014 9

GPU CPU NIC GPU CPU NIC

network

GPU CPU GPU CPU

QPI QPI GPU

GPU

GPU

GPU

GPU

GPU

GPU

GPU

GPU

GPU

GPU

GPU

Inter-node GPU Direct RDMA Inter-IOH shared memory Intra-IOH IPC

2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 10: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Performance measurements surprises

© CSCS 2014 10

0

200

400

600

800

1000

1200

1400

1600

Late

ncy

(u

sec)

Message size (bytes)

Min (1 GPU/node) Avg. (1 GPU/node) Min (2 GPU/node) Avg. (2 GPU/node) Min (8 GPU/node) Avg. (8 GPU/node)

2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 11: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Identifying affinities (CPU, GPU, memory, NIC, …)

© CSCS 2014 11 2nd Annual MVAPICH User Group (MUG) Meeting, 2014

nvidia-healthmon output for a node with 8 GPU devices and two Intel Xeon sockets

Page 12: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Using binding (CPU, GPU, memory, etc.)

•  Use hwloc to find socket, CPU, hw thread & GPU (CUDA core & NVML bindings)

– hwloc 1.9 (PCI & CUDA enabled version) – MPI can use hwloc-topology (extends to GPU?)

•  Manual binding options

– Using numactl – Using MPI_LOCAL_RANK info – Highly experimental

•  Performance evaluation results with bindings

– Consistent min and max values – Variability still high with some improvements in avg. latencies

© CSCS 2014 12 2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 13: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Using nvidia-healthmon (removing MPI effect)

© CSCS 2014 13

… PCI Bandwidth Host-to-GPU pinned memory bandwidth: 6235.334473 MB/s GPU-to-host pinned memory bandwidth: 6399.732422 MB/s Bidirectional pinned memory bandwidth: 10990.615234 MB/s Result: SUCCESS Peer to Peer Bandwidth Peer-to-peer access between devices 0 and 1 is supported Bandwidth from device 0 to 1: 6385.51 MB/s Bandwidth from device 1 to 0: 6364.49 MB/s Bidirectional bandwidth between device 0 and 1: 10449.60 MB/s Peer-to-peer access between devices 0 and 2 is supported Bandwidth from device 0 to 2: 5511.22 MB/s GPU has a lower peer bandwidth (5511.22 MB/s) than expected (6000.00 MB/s) Bandwidth from device 2 to 0: 5728.75 MB/s GPU has a lower peer bandwidth (5728.75 MB/s) than expected (6000.00 MB/s) Bidirectional bandwidth between device 0 and 2: 7044.35 MB/s Peer-to-peer access between devices 0 and 3 is supported Bandwidth from device 0 to 3: 5511.00 MB/s GPU has a lower peer bandwidth (5511.00 MB/s) than expected (6000.00 MB/s) Bandwidth from device 3 to 0: 5729.22 MB/s GPU has a lower peer bandwidth (5729.22 MB/s) than expected (6000.00 MB/s) Bidirectional bandwidth between device 0 and 3: 7044.67 MB/s Peer-to-peer access between devices 0 and 4 is not supported. Peer-to-peer transfer between devices 0 and 4 will use host memory. Bandwidth from device 0 to 4: 5726.79 MB/s Bandwidth from device 4 to 0: 5746.19 MB/s Bidirectional bandwidth between device 0 and 4: 10082.57 MB/s …. Result: WARNING

Variability

2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 14: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Next Steps to minimize variability

•  Understand PCIe technologies & impact on on- and off-node communication

– Extend user and system level tools accordingly

•  Understand behavior of driver technologies

– Extend probes/alerts & tool interfaces

•  Better integration into resource managers, e.g. SLURM

– Portable solutions—no wizardry involved

© CSCS 2014 14 2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 15: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Final Thoughts

© CSCS 2014 15

Time

Perf

orm

ance

Success

Time

Aver

age

slow

dow

n

Success

Me wearing a computer scientist hat

Me wearing an operational staff hat

2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Page 16: Operational Robustness of Accelerator Aware MPImug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2017-07-18 · Quick Intro to GPU Aware MPI • Users point of view – MPI_xxxx

Thank you

© CSCS 2014 16 2nd Annual MVAPICH User Group (MUG) Meeting, 2014

Checkpointing a Subsystem Remotely - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Northeastern University, Boston, USA and Universite F´ ed´ erale Toulouse Midi-Pyr´

Checkpointing a Subsystem Remotely - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · Northeastern University, Boston, USA and Universite F´ ed´ erale Toulouse Midi-Pyr´

HPC Virtualization: Control Your Stack - MUG :: Homemug.mvapich.cse.ohio-state.edu/.../2016/WagnerMUG2016.pdf · 2017-07-18 · HPC Virtualization: Control Your Stack Rick Wagner

HPC Virtualization: Control Your Stack - MUG :: Homemug.mvapich.cse.ohio-state.edu/.../2016/WagnerMUG2016.pdf · 2017-07-18 · HPC Virtualization: Control Your Stack Rick Wagner

How to Boost the Performance of Your MPI and PGAS …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2018-08-09 · Network Based Computing Laboratory MVAPICH User Group (MUG)

How to Boost the Performance of Your MPI and PGAS …mug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2018-08-09 · Network Based Computing Laboratory MVAPICH User Group (MUG)

3.imimg.com3.imimg.com/data3/JQ/XU/MY-880799/corporate-catalogue-2014.pdf · Mugs 'Rain:J Mug as MEN 601 Mug wp Mug MM Mug W Mug Cl Mug MM Mug Mug GL Mug MM CL Mug CLS White custom

3.imimg.com3.imimg.com/data3/JQ/XU/MY-880799/corporate-catalogue-2014.pdf · Mugs 'Rain:J Mug as MEN 601 Mug wp Mug MM Mug W Mug Cl Mug MM Mug Mug GL Mug MM CL Mug CLS White custom

HPC, GPUS AND NETWORKING - MUG :: Homemug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2014/rossetti.pdfGPUDIRECT RDMA + INFINIBAND + CUDA-AWARE MPI Less overhead = lower

HPC, GPUS AND NETWORKING - MUG :: Homemug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2014/rossetti.pdfGPUDIRECT RDMA + INFINIBAND + CUDA-AWARE MPI Less overhead = lower

MVAPICH at Petascale: Experiences in Production on the ...mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2014/... · MVAPICH at Petascale: Experiences in Production

MVAPICH at Petascale: Experiences in Production on the ...mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2014/... · MVAPICH at Petascale: Experiences in Production

MPI Performance Engineering through the Integration of ...mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/17/... · MUG 2017 MPI Performance Engineering through the

MPI Performance Engineering through the Integration of ...mug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/17/... · MUG 2017 MPI Performance Engineering through the

Distributed Algorithms for GPU Molecular Dynamicsmug.mvapich.cse.ohio-state.edu/static/media/mug/...host-memory MPI cuda-aware MPI GPUDirectRDMA 1 2 4 8 16 32 0 500 1000 1500 2000

Distributed Algorithms for GPU Molecular Dynamicsmug.mvapich.cse.ohio-state.edu/static/media/mug/...host-memory MPI cuda-aware MPI GPUDirectRDMA 1 2 4 8 16 32 0 500 1000 1500 2000

AI Bridging Cloud Infrastructure (ABCI) and its ...mug.mvapich.cse.ohio-state.edu/static/media/mug/... · •Introduce ABCI, an open AI platform –Architecture, software stack and

AI Bridging Cloud Infrastructure (ABCI) and its ...mug.mvapich.cse.ohio-state.edu/static/media/mug/... · •Introduce ABCI, an open AI platform –Architecture, software stack and

OpenHPC: Project Overview and Updates - MUG :: Homemug.mvapich.cse.ohio-state.edu/static/media/mug/... · · 2017-08-22OpenHPC: Project Overview and Updates Karl W. Schulz, ...

OpenHPC: Project Overview and Updates - MUG :: Homemug.mvapich.cse.ohio-state.edu/static/media/mug/... · · 2017-08-22OpenHPC: Project Overview and Updates Karl W. Schulz, ...

Performance Engineering using MVAPICH and TAUmug.mvapich.cse.ohio-state.edu/static/media/mug/... · TAU Performance System® • Tuning and Analysis Utilities (22+ year project) •

Performance Engineering using MVAPICH and TAUmug.mvapich.cse.ohio-state.edu/static/media/mug/... · TAU Performance System® • Tuning and Analysis Utilities (22+ year project) •

Kapil Arya - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · © 2016 Mesosphere, Inc. All Rights Reserved. 2 Overview Quick Introduction to Checkpointing DMTCP Internals

Kapil Arya - MUGmug.mvapich.cse.ohio-state.edu/static/media/mug/... · © 2016 Mesosphere, Inc. All Rights Reserved. 2 Overview Quick Introduction to Checkpointing DMTCP Internals

The Accelerated Road to Exascale - Ohio State …mug.mvapich.cse.ohio-state.edu/static/media/mug/...• Achieving exascale performance will require immense parallism • Combined thread-level

The Accelerated Road to Exascale - Ohio State …mug.mvapich.cse.ohio-state.edu/static/media/mug/...• Achieving exascale performance will require immense parallism • Combined thread-level

MVAPICH2 on Intel® Omni-Path Architecturemug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/20… · Collaborative publications and tutorial material being developed 31

MVAPICH2 on Intel® Omni-Path Architecturemug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/20… · Collaborative publications and tutorial material being developed 31

OpenHPC: Community Project Updatesmug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2018-08-09 · ARM tech preview first introduced with SC’16 release Removed tech preview

OpenHPC: Community Project Updatesmug.mvapich.cse.ohio-state.edu/static/media/mug/... · 2018-08-09 · ARM tech preview first introduced with SC’16 release Removed tech preview

DMTCP: System-Level Checkpoint-Restart in User Spacemug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2014/cooperman.pdfDMTCP: System-Level Checkpoint-Restart in User Space

DMTCP: System-Level Checkpoint-Restart in User Spacemug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/2014/cooperman.pdfDMTCP: System-Level Checkpoint-Restart in User Space

Innovations in CyberinfrastructureLearning and Workforce ...mug.mvapich.cse.ohio-state.edu/static/media/mug/... · (1554005 –co-funded: OAC, DMS, GEO/AGS, CBET PECASE 2019) The

Innovations in CyberinfrastructureLearning and Workforce ...mug.mvapich.cse.ohio-state.edu/static/media/mug/... · (1554005 –co-funded: OAC, DMS, GEO/AGS, CBET PECASE 2019) The

Paving the Road to Exascale - MUG :: Homemug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/20… · hardware Functionality: work request setup and instantiation of operations

Paving the Road to Exascale - MUG :: Homemug.mvapich.cse.ohio-state.edu/static/media/mug/presentations/20… · hardware Functionality: work request setup and instantiation of operations

Experiences with Sandia National Laboratories HPC ...mug.mvapich.cse.ohio-state.edu/static/media/mug/... · miniGhost 1.0.1 FDM/FVM explicit (halo exchange focus) miniMD 1.2 Molecular

Experiences with Sandia National Laboratories HPC ...mug.mvapich.cse.ohio-state.edu/static/media/mug/... · miniGhost 1.0.1 FDM/FVM explicit (halo exchange focus) miniMD 1.2 Molecular

Context-aware Application Scheduling in Mobile Systems ...newslab.ece.ohio-state.edu/research/resources/p1235-lee.pdf · Context-aware Application Scheduling in Mobile Systems: ...

Context-aware Application Scheduling in Mobile Systems ...newslab.ece.ohio-state.edu/research/resources/p1235-lee.pdf · Context-aware Application Scheduling in Mobile Systems: ...

Languages

Pages

Legal

Copyright © 2022 FDOCUMENTS