
Artem Y. Polyakov, Joshua S. Ladd, Boris I. Karasev

Nov 16, 2017

Towards Exascale: Leveraging InfiniBand to accelerate the performance and scalability of Slurm jobstart

© 2017 Mellanox Technologies 2

Agenda

• Problem description

• Slurm PMIx plugin status update

• Motivation of this work

• What is PMIx?

• RunTime Environment (RTE)

• Process Management Interface (PMI)

• PMIx endpoint exchange modes: full modex, direct modex, instant-on

• PMIx plugin (Slurm 16.05)

• High level overview of a Slurm RPC

• PMIx plugin (Slurm 17.11) – revamp of OOB channel

• Direct-connect feature

• Revamp of PMIx plugin collectives

• Early wireup feature

• Performance results for Open MPI

© 2017 Mellanox Technologies 3

Agenda

• Problem description

• Slurm PMIx plugin status update

• Motivation of this work

• What is PMIx?

• RunTime Environment (RTE)

• Process Management Interface (PMI)

• PMIx endpoint exchange modes: full modex, direct modex, instant-on

• PMIx plugin (Slurm 16.05)

• High level overview of a Slurm RPC

• PMIx plugin (Slurm 17.11) – revamp of OOB channel

• Direct-connect feature

• Revamp of PMIx plugin collectives

• Early wireup feature

• Performance results for Open MPI

© 2017 Mellanox Technologies 4

Slurm PMIx plugin status update

• Slurm 16.05

• PMIx plugin was provided by Mellanox in Oct, 2015 (commit 3089921)

• Supports PMIx v1.x

• Uses Slurm RPC for Out Of Band (OOB) communication (derived from PMI2 plugin)

• Slurm 17.02

• Bugfixing & maintenance

• Slurm 17.11

• Support for PMIx v2.x

• Support for TCP- and UCX-based communication infrastructure

• UCX: ./configure … --with-ucx=<ucx-path>

3089921ad9c0ad05372f955521b12f0c93ec73ec

© 2017 Mellanox Technologies 5

Motivation of this work

[Figure: time to complete shmem_init() (s) vs. PPN on 32 nodes, Slurm RPC with direct modex (sRPC/dm)]

• OpenSHMEM jobstart with Slurm PMIx/direct modex (explained below)

• Time to perform shmem_init() is measured

• Significant performance degradation as the Process Per Node (PPN) count approaches the number of available cores.

• Profiling identified that the bottleneck is the communication subsystem based on Slurm RPC (sRPC).

Measurements configuration:

• 32 nodes, 20 cores per node

• Varying PPN

• PMIx v1.2

• OMPI v2.x

• Slurm 16.05

© 2017 Mellanox Technologies 6

Agenda

• Problem description

• Slurm PMIx plugin status update

• Motivation of this work

• What is PMIx?

• RunTime Environment (RTE)

• Process Management Interface (PMI)

• PMIx endpoint exchange modes: full modex, direct modex, instant-on

• PMIx plugin (Slurm 16.05)

• High level overview of a Slurm RPC

• PMIx plugin (Slurm 17.11) – revamp of OOB channel

• Direct-connect feature

• Revamp of PMIx plugin collectives

• Early wireup feature

• Performance results for Open MPI

© 2017 Mellanox Technologies 7

RunTime Environment (RTE)

LOGICALLY

[Diagram: ranks 0-11 forming a fully-connected graph]

• MPI program: set of execution branches

• each uniquely identified by a rank

• fully-connected graph

• a set of comm. primitives provides the way for ranks to exchange data

[Diagram: ranks 0-11 mapped onto compute nodes cn1-cn4, three ranks per node]

IMPLEMENTATION:

• execution branch = OS process

• full connectivity is not scalable

• execution branches are mapped to physical resources: nodes/sockets/cores.

• comm. subsystem is heterogeneous: the intra-node and inter-node sets of communication channels are different.

• OS processes need to be:

• launched;

• transparently wired up to enable necessary abstraction level;

• controlled (I/O forward, kill, cleanup, prestage, etc.)

• Either the MPI implementation or the Resource Manager (RM) provides an RTE process to address those issues.

[Diagram: ranks 0-11 mapped onto nodes cn1-cn4; each node additionally runs an RTE daemon (a management process). This RTE daemon is the topic of this talk.]

© 2017 Mellanox Technologies 8

Process Management Interface: RTE – application

[Diagram: the RTE daemon launches and controls MPI processes, forwards their I/O, and connects to them through PMI, backed by a key-value database. The MPI RTEs (MPICH2, MVAPICH, Open MPI ORTE: ORTEd, Hydra, MPD) reach remote nodes through a Remote Execution Service: ssh/rsh (Unix), pbsdsh/tm_spawn() (TORQUE), blaunch/lsb_launch() (LSF), qrsh (SGE), llspawn (LoadLeveler), or srun (SLURM); SLURM (srun) also provides PMI directly. The distributed application (MPI, OpenSHMEM, …) runs on top.]

© 2017 Mellanox Technologies 9

PMIx endpoint exchange modes: full modex

[Diagram: two nodes, each running an RTE daemon; node 1 hosts proc 0, rank 1, …, rank K, node 2 hosts proc K+1, rank K+2, …, rank (2K+1)]

© 2017 Mellanox Technologies 10

PMIx endpoint exchange modes: full modex (2)

[Diagram: each rank gets its fabric endpoint: ep0 = open_fabric(), ep1 = …, …, ep(2K+1) = …]

© 2017 Mellanox Technologies 11

PMIx endpoint exchange modes: full modex (3)

[Diagram: each rank commits its endpoint key (ep0 … ep(2K+1)) to the node-local RTE server]

© 2017 Mellanox Technologies 12

PMIx endpoint exchange modes: full modex (4)

[Diagram: the ranks issue the fence request with data collection; each node-local RTE server receives a fence_req(collect)]

© 2017 Mellanox Technologies 13

PMIx endpoint exchange modes: full modex (5)

[Diagram: the RTE servers perform an Allgatherv over all endpoints (ep0 … ep(2K+1)) and return fence_resp to the ranks; once PMIx_Fence() completes, all keys are present on every node-local RTE process]

© 2017 Mellanox Technologies 14

PMIx endpoint exchange modes: full modex (6)

[Diagram: inside MPI_Init(), a rank that wants to MPI_Send to rank K+2 asks its node-local RTE server for rank K+2's endpoint and immediately gets ep(K+2) from the locally gathered keys]

© 2017 Mellanox Technologies 15

PMIx endpoint exchange modes: direct modex

[Diagram: ranks open and commit their endpoints locally, but no collective exchange is performed; when a rank first issues MPI_Send to rank K+2, its node-local RTE server requests ep(K+2) from the RTE server on node 2 on demand]

A hedged PMIx client sketch of both modes follows below.
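To make the two modes concrete, here is a minimal PMIx client sketch. Assumptions: PMIx v2 C API; the key name "my.endpoint" and the endpoint string are illustrative placeholders, not the keys Open MPI or OpenSHMEM actually publish. Each rank publishes its endpoint, fences (with collect=true this is the full modex path), and looks up a peer's endpoint with PMIx_Get, which under direct modex is what triggers the on-demand request to the peer's RTE server.

#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t me, peer, wildcard;
    pmix_value_t val;
    pmix_value_t *rval = NULL;
    pmix_info_t info;
    bool collect = true;   /* true: full modex path; false: rely on direct modex */

    /* Register with the node-local RTE (PMIx) server, e.g. the stepd. */
    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0)) {
        return 1;
    }

    /* Publish this rank's fabric endpoint under an illustrative key. */
    val.type = PMIX_STRING;
    val.data.string = "fabric-endpoint-address";   /* placeholder endpoint string */
    PMIx_Put(PMIX_GLOBAL, "my.endpoint", &val);
    PMIx_Commit();                                  /* hand committed keys to the local server */

    /* Fence across the whole namespace; collect=true asks the RTE servers
     * to Allgatherv all committed keys (full modex). */
    PMIX_PROC_CONSTRUCT(&wildcard);
    (void)strncpy(wildcard.nspace, me.nspace, PMIX_MAX_NSLEN);
    wildcard.rank = PMIX_RANK_WILDCARD;
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    PMIx_Fence(&wildcard, 1, &info, 1);

    /* Look up a peer's endpoint. With collect=false the data was never
     * gathered, so this Get is what triggers the on-demand (direct modex)
     * request to the peer's RTE server. */
    PMIX_PROC_CONSTRUCT(&peer);
    (void)strncpy(peer.nspace, me.nspace, PMIX_MAX_NSLEN);
    peer.rank = 0;                                  /* illustrative peer choice */
    if (PMIX_SUCCESS == PMIx_Get(&peer, "my.endpoint", NULL, 0, &rval)) {
        printf("rank %u: endpoint of rank %u is %s\n",
               me.rank, peer.rank, rval->data.string);
        PMIX_VALUE_RELEASE(rval);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}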

© 2017 Mellanox Technologies 16

PMIx endpoint exchange modes: instant-on (future)

[Diagram: ranks open their endpoints and no exchange takes place at all; when a rank issues MPI_Send to rank K+2, the endpoint address is derived locally: addr = fabric_ep(rank K+2)]

© 2017 Mellanox Technologies 17

Agenda

• Problem description

• Slurm PMIx plugin status update

• Motivation of this work

• What is PMIx?

• RunTime Environment (RTE)

• Process Management Interface (PMI)

• PMIx endpoint exchange modes: full modex, direct modex, instant-on

• PMIx plugin (Slurm 16.05)

• High level overview of a Slurm RPC & analysis.

• PMIx plugin (Slurm 17.11) – revamp of OOB channel

• Direct-connect feature

• Revamp of PMIx plugin collectives

• Early wireup feature

• Performance results for Open MPI

© 2017 Mellanox Technologies 18

Slurm RPC workflow (1)

Every node runs a slurmd daemon that controls it. slurmd listens on a well-known TCP port (6818) that other components use to communicate with it.

[Diagram: nodes cn01 and cn02, each running slurmd listening on TCP:6818]

© 2017 Mellanox Technologies 19

Slurm RPC workflow (2)

When a job is launched, a SLURM step daemon (stepd) controls the application processes on each node. stepd also runs an instance of the PMIx server and listens on a UNIX socket.

[Diagram: on each node, slurmd (TCP:6818), a stepd listening on UNIX:/tmp/pmix.JOBID, and the application ranks]

© 2017 Mellanox Technologies 20

Slurm RPC workflow (3)

SLURM provides an RPC API that offers an easy way to reach a process on a remote node without explicit connection establishment:

slurm_forward_data(nodelist, usock_path, len, data)

nodelist: SLURM representation of node names, e.g. cn[01-10,20-30]

usock_path: path to the UNIX socket file that the target process is listening on, e.g. /tmp/pmix.JOBID

len: length of the data buffer

data: pointer to the data buffer
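As a rough illustration of how an OOB sender might invoke this RPC, here is a minimal sketch. It assumes the signature exactly as shown above (string and byte-buffer arguments); slurm_forward_data() is internal to Slurm and its real prototype may differ, and the helper name pmix_oob_send is invented for the example.

/* Hedged sketch only: assumes the slide's signature
 *   slurm_forward_data(nodelist, usock_path, len, data)
 * The real prototype inside the Slurm tree may differ. */
#include <stdint.h>

extern int slurm_forward_data(char *nodelist, char *usock_path,
                              uint32_t len, const char *data);

/* Send an opaque PMIx payload to the stepd(s) on the given nodes. */
static int pmix_oob_send(const char *payload, uint32_t len)
{
    char nodelist[]   = "cn02";              /* SLURM hostlist, e.g. cn[01-10,20-30] */
    char usock_path[] = "/tmp/pmix.JOBID";   /* stepd's UNIX socket (JOBID symbolic) */

    /* slurmd on every target node receives the RPC on TCP:6818 and
     * forwards the buffer to the process listening on usock_path. */
    return slurm_forward_data(nodelist, usock_path, len, payload);
}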

© 2017 Mellanox Technologies 21

Slurm RPC workflow (4)

cn01: slurm_forward_data(“cn02”, “/tmp/pmix.JOBID”, len, data)

[Diagram: cn01's stepd sends a buffer to cn02's stepd via slurmd (TCP:6818) and the UNIX socket /tmp/pmix.JOBID]

© 2017 Mellanox Technologies 22

Slurm RPC workflow (5)

cn01: slurm_forward_data(“cn02”, “/tmp/pmix.JOBID”, len, data)

[Diagram, step 1: cn01/stepd reaches slurmd using the well-known TCP port]

© 2017 Mellanox Technologies 23

Slurm RPC workflow (6)

cn01: slurm_forward_data(“cn02”, “/tmp/pmix.JOBID”, len, data)

[Diagram, step 1: cn01/stepd reaches slurmd using the well-known TCP port; step 2: cn02/slurmd forwards the message to the final process inside the node over the UNIX socket (dedicated thread)]

Issue: in direct modex the CPUs of this node are busy serving application processes. Our observations identified that this causes significant scheduling delays.

© 2017 Mellanox Technologies 24

Agenda

• Problem description

• Slurm PMIx plugin status update

• Motivation of this work

• What is PMIx?

• RunTime Environment (RTE)

• Process Management Interface (PMI)

• PMIx endpoint exchange modes: full modex, direct modex, instant-on

• PMIx plugin (Slurm 16.05)

• High level overview of a Slurm RPC & analysis.

• PMIx plugin (Slurm 17.11) – revamp of OOB channel

• Direct-connect feature

• Revamp of PMIx plugin collectives

• Early wireup feature

• Performance results for Open MPI

© 2017 Mellanox Technologies 25

Direct-connect feature

[Diagram: the first message from node 1 carries its endpoint over Slurm RPC (REQ: Slurm RPC + ep0, where ep0 = open_fabric()); node 2 calls direct_connect(ep0) and replies with its own endpoint (RESP: direct + ep1); node 1 then calls direct_connect(ep1); all subsequent REQ/RESP traffic goes direct]

Direct-connect feature:

• the very first message is still sent using Slurm RPC;

• the endpoint information is incorporated into that initial message and is used on the remote side to establish a direct connection back to the sender;

• all further communication from the remote node goes through this direct connection;

• if needed, symmetric connection establishment is performed.

A hedged sketch of this handshake follows below.
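A conceptual sketch of the handshake described above. This is illustrative pseudo-code, not the plugin's actual implementation: the types endpoint_t and peer_t and the helpers send_slurm_rpc(), send_direct(), and direct_connect() are all assumptions standing in for the plugin's internals.

#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* All types and helpers below are illustrative assumptions, not Slurm code. */
typedef struct { char addr[64]; } endpoint_t;        /* our fabric/TCP endpoint address  */
typedef struct { bool connected; } peer_t;           /* per-remote-node connection state */

extern endpoint_t my_ep;                             /* filled by open_fabric()-like init */
extern void send_slurm_rpc(const char *node, const void *buf, size_t len);
extern void send_direct(peer_t *peer, const void *buf, size_t len);
extern void direct_connect(peer_t *peer, const endpoint_t *ep);

/* Sender side: the first message piggybacks our endpoint over Slurm RPC;
 * everything after that uses the direct connection.
 * (Sketch assumes the payload fits in the stack buffer.) */
static void oob_send(peer_t *peer, const char *node,
                     const void *payload, size_t len)
{
    if (!peer->connected) {
        char buf[sizeof(endpoint_t) + 1024];
        memcpy(buf, &my_ep, sizeof(my_ep));           /* prepend endpoint info   */
        memcpy(buf + sizeof(my_ep), payload, len);    /* then the actual payload */
        send_slurm_rpc(node, buf, sizeof(my_ep) + len);
    } else {
        send_direct(peer, payload, len);
    }
}

/* Receiver side: extract the sender's endpoint from the first RPC and
 * open a direct connection back; the response already goes direct. */
static void oob_recv_rpc(peer_t *peer, const void *buf, size_t len)
{
    endpoint_t sender_ep;
    memcpy(&sender_ep, buf, sizeof(sender_ep));
    if (!peer->connected) {
        direct_connect(peer, &sender_ep);             /* dial back the sender */
        peer->connected = true;
    }
    /* handle payload at (char *)buf + sizeof(sender_ep),
     * length len - sizeof(sender_ep) */
    (void)len;
}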

© 2017 Mellanox Technologies 26

Direct-connect feature: TCP-based

• The first version of direct-connect was TCP-based.

• Slurm RPC is still supported, but it has to be selected explicitly via the SLURM_PMIX_DIRECT_CONN environment variable (direct-connect is enabled by default).

The performance of the OpenSHMEM jobstart was significantly improved. Below is the time to perform shmem_init() on 32 nodes with varying Process Per Node (PPN) counts; sRPC stands for Slurm RPC, dTCP for TCP-based direct-connect.

[Figure: shmem_init() time (s) vs. PPN on 32 nodes for sRPC/dm and dTCP/dm]

Related environment variables:

SLURM_PMIX_DIRECT_CONN = {true | false}

Enables direct-connect (true by default)

© 2017 Mellanox Technologies 27

Direct-connect feature: UCX-based

The direct-connect infrastructure made it possible to use the HPC fabric for this communication.

Support for the UCX point-to-point communication library (www.openucx.com) was implemented.

Slurm 17.11 should be configured with “--with-ucx=<ucx-path>” to enable UCX support.

Below is the latency measured for the point-to-point exchange* for each of the communication options available in Slurm 17.11: (a) Slurm RPC (sRPC); (b) TCP-based direct-connect (dTCP); (c) UCX-based direct-connect (dUCX).

Related environment variables:

SLURM_PMIX_DIRECT_CONN = {true | false}

Enables direct-connect (true by default)

SLURM_PMIX_DIRECT_CONN_UCX = {true | false}

Enables UCX-based direct-connect (true by default)

[Figure: point-to-point latency (s) vs. message size (B) for sRPC, dTCP, and dUCX]

* See the backup slide #1 for the details about point-to-point benchmark

© 2017 Mellanox Technologies 28

Revamp of PMIx plugin collectives

The PMIx plugin collectives infrastructure was also redesigned to leverage the direct-connect feature.

The results of a collective micro-benchmark (see backup slide #2) on a 32-node cluster (one stepd per node) are shown below:

[Figure: collective latency (s) vs. message size (B) for sRPC, dTCP, and dUCX]

* See backup slide #2 for details on the collectives benchmark

© 2017 Mellanox Technologies 29

Early wireup feature

The direct-connect implementation still uses Slurm RPC for the address exchange, and this exchange is initiated at the first communication.

This is an issue for the PMIx full modex mode, because the first communication is usually the heaviest one (the Allgatherv).

To deal with that, an early-wireup feature was introduced.

• The main idea is that step daemons start wiring up right after they are launched, without waiting for the first communication.

• Open MPI, for example, usually performs some local initialization, which leaves reasonable room to do the wireup in the background.

Related environment variables:

SLURM_PMIX_DIRECT_CONN_EARLY = {true | false}

© 2017 Mellanox Technologies 30

Performance results for Open MPI modex

At small scale, the latency of PMIx_Fence() is affected by process imbalance. To get clean numbers, we modified Open MPI's ompi_mpi_init() by adding two additional PMIx_Fence(collect=0) barriers, as shown in the diagram below:

[Diagram: timeline of ompi_mpi_init() [orig] vs. [eval]. Stages: various initializations; open fabric and submit the keys; PMIx_Fence(collect=1); add_proc; other stuff; PMIx_Fence(collect=0). In [eval], 2x PMIx_Fence(collect=0) are inserted before the measured fence. Observed imbalance: 150us – 190ms vs. 200us – 4ms]
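A minimal sketch of the measurement idea. Assumptions: PMIx v2 C API; the timing helper and per-rank printing are simplifications and this is not Open MPI's actual instrumentation (in the real runs the per-rank times were reduced to a per-job maximum and averaged over repetitions).

#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include <time.h>
#include <pmix.h>

/* Monotonic wall-clock helper. */
static double now_sec(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void)
{
    pmix_proc_t me, all;
    pmix_info_t info;
    bool collect;
    double t0, t1;

    PMIx_Init(&me, NULL, 0);
    PMIX_PROC_CONSTRUCT(&all);
    (void)strncpy(all.nspace, me.nspace, PMIX_MAX_NSLEN);
    all.rank = PMIX_RANK_WILDCARD;

    /* Two extra fences without data collection, to squeeze out the
     * launch-time imbalance before the measured fence. */
    collect = false;
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    PMIx_Fence(&all, 1, &info, 1);
    PMIx_Fence(&all, 1, &info, 1);

    /* Measured fence with data collection (the "modex" fence). */
    collect = true;
    PMIX_INFO_LOAD(&info, PMIX_COLLECT_DATA, &collect, PMIX_BOOL);
    t0 = now_sec();
    PMIx_Fence(&all, 1, &info, 1);
    t1 = now_sec();

    printf("rank %u: PMIx_Fence(collect=1) took %.6f s\n", me.rank, t1 - t0);

    PMIx_Finalize(NULL, 0);
    return 0;
}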

© 2017 Mellanox Technologies 31

Performance results for Open MPI modex (2)

Below, the average of the maximum time spent in PMIx_Fence(collect=1) is plotted against the number of nodes:

Measurements configuration:

• 32 nodes, 20 cores per node

• PPN = 20

• PMIx v1.2

• OMPI v2.x

• Slurm 17.11 (pre-release)

[Figure: PMIx_Fence(collect=1) latency (s) vs. number of nodes for dTCP, dUCX, and sRPC]

© 2017 Mellanox Technologies 32

Future work

Need wider testing of new features

• Let us know if you have any issues: artemp [at] mellanox.com

Scaling tests and performance analysis

• Need to evaluate efficiency of early wireup feature

Analyze possible impacts on other jobstart stages:

• Propagation of the Slurm launch message (deviation ~2ms).

• Initialization of the PMIx and UCX libraries (local overhead)

• Impact of UCX used for resource management on application processes

• Impact of local PMIx overhead

Use this feature as an intermediate stage for instant-on

• Pre-calculate the job’s stepd endpoint information and use UCX to exchange endpoint info for application processes.

Thank You

© 2017 Mellanox Technologies 34

Integrated point-to-point micro-benchmark (backup#1)

To estimate the point-to-point latency of the available transports, a point-to-point micro-benchmark was introduced in the Slurm PMIx plugin.

To activate it, Slurm must be configured with the “--enable-debug” option.

Related environment variables:

SLURM_PMIX_WANT_PP=1

Turn the point-to-point benchmark on

SLURM_PMIX_PP_LOW_PWR2=0

SLURM_PMIX_PP_UP_PWR2=22

Message size range (powers of 2), from 1 to 4194304 in this example

SLURM_PMIX_PP_ITER_SMALL=100

Number of iterations for small messages

SLURM_PMIX_PP_ITER_LARGE=20

Number of iterations for large messages

SLURM_PMIX_PP_LARGE_PWR2=10

Switch to the large-message iteration count starting from 2^val

[Figure: point-to-point latency (s) vs. message size (B) for sRPC, dTCP, and dUCX]

© 2017 Mellanox Technologies 35

Collective micro-benchmark (backup #2)

The PMIx plugin collectives infrastructure was also redesigned to leverage the direct-connect feature. The results of the collective micro-benchmark on a 32-node cluster (one stepd per node) are shown below:

[Figure: collective latency (s) vs. message size (B) for sRPC, dTCP, and dUCX]

Related environment variables:

SLURM_PMIX_WANT_COLL_PERF=1

Turn collective benchmark on

SLURM_PMIX_COLL_PERF_LOW_PWR2=0

SLURM_PMIX_COLL_PERF_UP_PWR2=22

Message size range (powers of 2), from 1 to 65536 in this example

SLURM_PMIX_COLL_PERF_ITER_SMALL=100

Number of iterations for small messages

SLURM_PMIX_COLL_PERF_ITER_LARGE=20

Number of iterations for large messages

SLURM_PMIX_COLL_PERF_LARGE_PWR2=10

Switch to the large-message iteration count starting from 2^val