K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to...

K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and

M. Pagel

Cray Inc.

Apr. 2015

Copyright 2015 Cray Inc

Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual

property rights is granted by this document.

Cray Inc. may make changes to specifications and product descriptions at any time, without notice.

All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate

from published specifications. Current characterized errata are available on request.

Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers

and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray

Inc. internal codenames is at the sole risk of the user.

Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of

Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect

actual performance.

The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design,

SONEXION, URIKA and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER

CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks,

and trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT and XT. The registered trademark LINUX is used pursuant to a sublicense

from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.

Other names and brands may be claimed as the property of others. Other product and service names mentioned herein are the

trademarks of their respective owners..

CUG 2015 Cray Inc. Proprietary © 2015 2

Agenda

● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features

● Summary and Future work

● Discussion

3 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Introduction

4

● Multi-/Many-core architectures and accelerators are driving the HPC evolution

● Modern interconnects offer low latency, high bandwidth communication

● Cray designs some of the fastest supercomputers

● Intel MIC and NVIDIA GPUs pose new challenges and opportunities

● Critical to design MPI and SHMEM software stacks in a very efficient manner

Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Designing High Performance MPI and SHMEM Software on Cray Systems: Challenges and Objectives

5

● Minimize communication latency, maximize communication bandwidth

● Improve support for asynchronous communication (communication/computation overlap) ● Architecture-specific solutions to optimize communication

performance

● New tools and features to help users understand application performance bottlenecks

● Fault resilient communication


Cray User Group, Apr. 2015 6

MPI Pt2pt and Collectives

MPI-3 RMA

MPI IO

Multi-threaded MPI Comm.

Cray SHMEM

Features and optimizations

Fault Tolerant MPI

Improved Designs in

Software

(New Algorithms, etc.)

Leveraging Aries Hardware

Tools to understand

application performance

Designing High Performance MPI and SHMEM Software on Cray Systems: Challenges and Objectives

Research & Development Focus Areas Design Approach &

Methodology


Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized memcpy

- Collective Optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and Features

- MPI User Level Fault Mitigation


● Discussion


Architecture Optimized memcpy

Copyright 2015 Cray Inc 8

05000

1000015000200002500030000

Ban

dw

idth

(M

B/s

)

Message Length (Bytes)

CrayMPICH SNB memcpy CrayMPICH HSW memcpy

OSU BW Benchmark OSU MBW-MR Benchmark

0

5000

10000

15000

20000

25000

Ban

dw

idth

(M

B/s

)

Message Length (Bytes) 2 MPI Processes: 1 HSW (32-core) node

MPICH_OPTIMIZED_MEMCPY = 2 (Default: 1; Values: 0, 1, 2)

If MPICH_USE_SYSTEM_MEMCPY is set, Optimized memcpy is disabled

29% 17%

Cray User Group, Apr. 2015









- MPI-IO




● Discussion


MPI_Allreduce (Small Messages)

0102030405060

4 8 16 32 64 128

Avg

. L

ate

ncy (

usec)


7.0.0 (Default) 7.0.0-DMAPP

7.2.0-Shared-Mem 7.2.0-DMAPP-Shared-Mem

119,648 MPI Processes: 32 ppn, 32-core HSW nodes.

7.2.0-DMAPP-SHARED-MEM improves performance by about 23% when compared to 7.0.0-

DMAPP

MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0. Soon to be enabled by default)

MPICH_USE_DMAPP_COLL = 1 (Default: 0) (-Wl,--whole-archive,-ldmapp,--no-whole-archive)

10 Copyright 2015 Cray Inc

8%

23%


MPI_Allreduce (Small Messages, on KNC)

0

50

100

150

200

4 8 16 32 64 128 256

Avg

. L

ate

ncy (

usec)


7.0.0 7.2.0-Shared-Mem 7.2.0-DMAPP-Shared-Mem

19%

75%

256 MPI Processes: 16 ppn, 16 KNC nodes

7.2.0 with Shared-memory collective optimizations performs about 19% better

7.2.0-DMAPP + Shared-memory optimization improves performance by about 75%.

MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0)

MPICH_USE_DMAPP_COLL = 1 (Default: 0) -Wl,--whole-archive,-ldmapp,--no-whole-archive


MPI_Allreduce (Large Messages)

020406080

100120140160

128K 256K 512K 1M 2M 4M 8M 16M 32MAvg

. L

ate

ncy (

msec)


7.0.0 7.2.0

57%


Cray MPI 7.2.0 performs about 57% better than Cray MPI 7.0.0

for large message Allreduce operations. (Enabled by default)


Application Performance: POP

1,024 MPI Processes, 32 core Haswell nodes, 32 ppn.

nx_global = 2048 ny_global = 768; 40 vertical levels, 2 tracers, 1500 timesteps

MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0)

POP Total runtime improved by about 6%

13


0102030405060708090

100

1024

To

tal R

un

tim

e (

s)

# MPI Processes

7.2.0 7.2.0-Shared-Memory-Opt

6%


MPI_Bcast (Small Messages)

02468

101214

4 8 16 32

Avg

. L

ate

ncy

(usec)


7.0.0-Default 7.2.1-DMAPP 7.2.1-DMAPP-SHARED-MEM

50%

67%


7.2.1-DMAPP performs about 50% better

7.2.1-DMAPP with Shared-Memory optimization performs about 67% better

MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0. Soon to be enabled by default)

MPICH_USE_DMAPP_COLL = 1 (Default: 0) -Wl,--whole-archive,-ldmapp,--no-whole-archive


MPI_Alltoall and MPI_Alltoallv Optimizations


0

1000

2000

3000

4000

5000

6000

Ba

nd

wid

th M

B/s

ec/n

od

e

Message Size (bytes)

5X

4,096 MPI Processes. 24 ranks per node, 64M Hugepages.

Enabled by default. Optimized Alltoall/Alltoallv solutions perform up to 5X faster

(Set MPICH_GNI_COLL_OPT_OFF to disable)


7.2.0-A2A-Default 7.0.0-A2A-Default 7.2.0-A2Av-Default 7.0.0-A2AvDefault









- MPI-IO


- User Level Fault Mitigation


● Discussion


MPI_Ialltoall (Large Messages)

8,192 MPI Processes: 24-Core Ivy-Bridge nodes

New prototype delivers up to 70% comm./comp. overlap

17

0

1000

2000

3000

4000

5000

8K 16K 32K 64K

Avg

. L

ate

ncy

(msec)

Message Length (Bytes) Comm. Latency

7.0.0 7.2.0

0

20

40

60

80

8K 16K 32K 64K

Overl

ap

(%

)

Message Length (Bytes) Comm./Comp. Overlap


70%

MPICH_NEMESIS_ASYNC_PROGRESS=1;

MPICH_MAX_THREAD_SAFETY=multiple; (CLE 5.2 UP02 or newer)


MPI_Iallreduce (Small Messages)


New Iallreduce optimization delivers up to 80% comm./comp. overlap export MPICH_NEMESIS_ASYNC_PROGRESS=1; export MPICH_SHARED_MEM_COLL_OPT=1

export MPICH_MAX_THREAD_SAFETY=multiple; export MPICH_USE_DMAPP_COLL=1

(CLE 5.2 UP02 or newer)

18

0

10

20

30

40

50

4 8 16

Avg

. L

ate

ncy

(usec)


7.0.0 7.2.0

-10

10

30

50

70

90

4 8 16

Overl

ap

(%

)



80%










- MPI-IO




● Discussion


MPI-3 RMA: 8-Byte MPI_Put Latency Network Atomic Memory Operations

0

0.5

1

1.5

2

2.5

3

8

Late

ncy (

usec)


7.0.0 - Default 7.2.0 (DMAPP -- Network AMO's)

63%

7.2.0-DMAPP improves MPI-3 RMA small message latency by about 63%

Design uses Atomic Memory Operations offered by the Aries network.

MPICH_RMA_OVER_DMAPP = 1 (Default: 0)

MPICH_RMA_USE_NETWORK_AMO=1 (Default: 0)

-Wl,--whole-archive,-ldmapp,--no-whole-archive


MPI-3 RMA GPU-to-GPU

01000200030004000500060007000

Ban

dw

idth

(M

B/s

)


One-sided g2g Manual copy-through-host

50%

Optimized GPU-to-GPU communication for point-to-point and collectives

New prototype designs improve GPU-to-GPU RMA communication bandwidth

by about 50%. (Internal prototype available. Will be released in June 2015)


MPI One-sided: On-node latency (Post Start Wait Complete)

0123456

0 1 2 4 8

16

32

64

12

8

25

6

51

2

1K

2K

4K

8K

16KL

ate

ncy (

usec)


7.2.0 7.1.3

2 MPI Processes. 1 Ivy-Bridge compute node (24-core)

On-Node MPI-3 one-sided operations perform about 87% better with Cray

MPICH 7.2.0 (Enabled by default)


87%










- MPI-IO




● Discussion


020406080

100120140160

0 2 8

32

12

8

512

2K

8K

32K

128K

Late

ncy (

usec)


7.0.0 7.2.0 (FG-POBJ)

52%

Osu_latency_mt.c benchmark. 1 ppn, 2 nodes, 31 threads per process

MPICH_MAX_THREAD_SAFETY=multiple

cc –o osu_latency_mt.x –craympich-mt osu_latency_mt.c

aprun –n2 –N1 –d32 ./osu_latency_mt.x

01000200030004000500060007000

0 2 8

32

128

512

2K

8K

32K

12

8KL

ate

ncy (

usec)


352us 489us

Fine-Grained Multi-threading for MPI

32-core Haswell KNC

24 Cray User Group, Apr. 2015

>10X










- MPI-IO




● Discussion


MPI-I/O STATS

26

Timeline of MPI-I/O statistics. Many different variables tracked export MPICH_MPIIO_STATS=2

For more information contact: David Knaak, Bob Cernohous

(http://mptwiki.us.cray.com/pmwiki/pmwiki.php?n=MPI-IO.S-0013-20)


http://mptwiki.us.cray.com/pmwiki/pmwiki.php?n=MPI-IO.S-0013-20

















- MPI-IO




● Discussion


Cray SHMEM: Features and Optimizations

05000

100001500020000250003000035000

Late

ncy (

usec)

Message Length (KBytes)

Cray SHMEM-7.2.0 Cray SHMEM-7.0.0

0

10

20

30

40

508

16

32

64

128

256

512

1K

2KL

ate

ncy (

usec)


• Shmem_broadcast (about 60% improvement in comm. latency)

1,024 MPI Processes: 24-Core Ivy-Bridge nodes (Enabled by default)

• Cray SHMEM supports shmem_global_exit()

(Not enabled by default: Set SHMEM_GLOBAL_EXIT=1)


64% 60%

28 Copyright 2015 Cray Inc









- MPI-IO




● Discussion


MPI User Level Fault Mitigation (ULFM)

● Critical to design MPI libraries in a fault resilient manner ● MPI Fault Tolerance Working Group is working towards

standardizing the ULFM interface ● A prototype implementation of ULFM in Cray MPICH for XC

systems is under development ● Status of the current prototype: - supports the new MPI_ERR error classes - automatically detects node failures - supports a subset of ULFM functions to re-build MPI objects (revoke, shrink, agree..)

● Future development plans depend on MPI Forum directions

Cray User Group, Apr. 2015 30




- Architecture Optimized Memcpy

- Collective optimizations





- MPI-IO

- Cray SHMEM Optimizations and features



● Discussion


Summary and Future Work

● Conclusion: ● Improved MPI Point-to-Point, Collective and RMA performance

● New features and optimizations in Cray SHMEM

● Exploring architecture-specific optimizations for Intel Xeon and Intel Xeon Phi

● Prototyping ULFM support in Cray MPICH

● Future Work: ● Improving Cray MPICH and Cray SHMEM performance on future

Intel Xeon and Intel Xeon Phi processors

● Reduced memory footprint for Cray MPICH

● Improved support for multi-threaded MPI applications


32

MPI_Ialltoallv (Large Messages)


Improved comm./comp. overlap (by up to 80%)

35

0

1000

2000

3000

4000

5000

8K 16K 32K 64K 512K

Avg

. L

ate

ncy

(msec)


7.0.0-Async 7.2.0-Async

0

20

40

60

80

100

8K 16K 32K 64K 128K

Overl

ap

(%

)



80%

MPICH_NEMESIS_ASYNC_PROGRESS=1;

MPICH_MAX_THREAD_SAFETY=multiple; (CLE 5.2 UP02 or newer)


MPI_Allgather and MPI_Allgatherv

0200400600800

10001200

512K 1M

Avg

. L

ate

ncy (

msec)


7.0.0 7.0.3

1024 MPI Processes:

MPT 7.0.3 performs about 35-45% better.

25.9%

35%

Enabled by Default (Aries/Gemini)

36

0100200300400500600700800

256K 512K

Avg

. L

ate

ncy (

msec)


35%

512 MPI Processes:

MPT 7.0.3 performs about 25-35% better.

45%

Allgatherv Latency Comparison

(Gemini)

Allgather Latency Comparison

(Aries)


Global lock vs Per-Object locks

Pthread_lock

(Global_mutex)

MPI_Send

T1 T2 T3 T4

Pthread_unlock

(Global_mutex)

MPID_Send()

Progress_engine()

GNI netmod

Pthread_lock

(msgq_mutex)

Pthread_lock

(ch3comm_mutex)

Pthread_lock

(global_mutex)

T1 T2 T3 T4

Progress_engine()

GNI netmod

Default:

- MPI library uses a single global_mutex

MPI_Send

pthread_mutex_lock(global_mutex)

MPID_Send()

MPID_Progress_wait()

pthread_mutex_unlock(global_mutex)

Per-Obj (Fine-Grained):

- MPI library still relies on a global_mutex.

Along with several smaller locks:

msgq_mutex, handle_mutex, comp_mutex

- Global critical sections are now smaller

Fine-Grained Multi-threading for MPI


K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to...

Documents

Transcript of K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to...