K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to...

37
K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel Cray Inc. Apr. 2015 Copyright 2015 Cray Inc

Transcript of K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to...

Page 1: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and

M. Pagel

Cray Inc.

Apr. 2015

Copyright 2015 Cray Inc

Page 2: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual

property rights is granted by this document.

Cray Inc. may make changes to specifications and product descriptions at any time, without notice.

All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.

Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate

from published specifications. Current characterized errata are available on request.

Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers

and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray

Inc. internal codenames is at the sole risk of the user.

Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of

Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect

actual performance.

The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design,

SONEXION, URIKA and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER

CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks,

and trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT and XT. The registered trademark LINUX is used pursuant to a sublicense

from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.

Other names and brands may be claimed as the property of others. Other product and service names mentioned herein are the

trademarks of their respective owners..

CUG 2015 Cray Inc. Proprietary © 2015 2

Page 3: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda

● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features

● Summary and Future work

● Discussion

3 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 4: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Introduction

4

● Multi-/Many-core architectures and accelerators are driving the HPC evolution

● Modern interconnects offer low latency, high bandwidth communication

● Cray designs some of the fastest supercomputers

● Intel MIC and NVIDIA GPUs pose new challenges and opportunities

● Critical to design MPI and SHMEM software stacks in a very efficient manner

Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 5: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Designing High Performance MPI and SHMEM Software on Cray Systems: Challenges and Objectives

5

● Minimize communication latency, maximize communication bandwidth

● Improve support for asynchronous communication (communication/computation overlap) ● Architecture-specific solutions to optimize communication

performance

● New tools and features to help users understand application performance bottlenecks

● Fault resilient communication

Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 6: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Cray User Group, Apr. 2015 6

MPI Pt2pt and Collectives

MPI-3 RMA

MPI IO

Multi-threaded MPI Comm.

Cray SHMEM

Features and optimizations

Fault Tolerant MPI

Improved Designs in

Software

(New Algorithms, etc.)

Leveraging Aries Hardware

Tools to understand

application performance

Designing High Performance MPI and SHMEM Software on Cray Systems: Challenges and Objectives

Research & Development Focus Areas Design Approach &

Methodology

Copyright 2015 Cray Inc

Page 7: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized memcpy

- Collective Optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and Features

- MPI User Level Fault Mitigation

● Summary and Future work

● Discussion

7 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 8: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Architecture Optimized memcpy

Copyright 2015 Cray Inc 8

05000

1000015000200002500030000

Ban

dw

idth

(M

B/s

)

Message Length (Bytes)

CrayMPICH SNB memcpy CrayMPICH HSW memcpy

OSU BW Benchmark OSU MBW-MR Benchmark

0

5000

10000

15000

20000

25000

Ban

dw

idth

(M

B/s

)

Message Length (Bytes) 2 MPI Processes: 1 HSW (32-core) node

MPICH_OPTIMIZED_MEMCPY = 2 (Default: 1; Values: 0, 1, 2)

If MPICH_USE_SYSTEM_MEMCPY is set, Optimized memcpy is disabled

29% 17%

Cray User Group, Apr. 2015

Page 9: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized memcpy

- Collective Optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and Features

- MPI User Level Fault Mitigation

● Summary and Future work

● Discussion

9 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 10: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI_Allreduce (Small Messages)

0102030405060

4 8 16 32 64 128

Avg

. L

ate

ncy (

usec)

Message Length (Bytes)

7.0.0 (Default) 7.0.0-DMAPP

7.2.0-Shared-Mem 7.2.0-DMAPP-Shared-Mem

119,648 MPI Processes: 32 ppn, 32-core HSW nodes.

7.2.0-DMAPP-SHARED-MEM improves performance by about 23% when compared to 7.0.0-

DMAPP

MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0. Soon to be enabled by default)

MPICH_USE_DMAPP_COLL = 1 (Default: 0) (-Wl,--whole-archive,-ldmapp,--no-whole-archive)

10 Copyright 2015 Cray Inc

8%

23%

Cray User Group, Apr. 2015

Page 11: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI_Allreduce (Small Messages, on KNC)

0

50

100

150

200

4 8 16 32 64 128 256

Avg

. L

ate

ncy (

usec)

Message Length (Bytes)

7.0.0 7.2.0-Shared-Mem 7.2.0-DMAPP-Shared-Mem

19%

75%

256 MPI Processes: 16 ppn, 16 KNC nodes

7.2.0 with Shared-memory collective optimizations performs about 19% better

7.2.0-DMAPP + Shared-memory optimization improves performance by about 75%.

MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0)

MPICH_USE_DMAPP_COLL = 1 (Default: 0) -Wl,--whole-archive,-ldmapp,--no-whole-archive

11 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 12: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI_Allreduce (Large Messages)

020406080

100120140160

128K 256K 512K 1M 2M 4M 8M 16M 32MAvg

. L

ate

ncy (

msec)

Message Length (Bytes)

7.0.0 7.2.0

57%

8,192 MPI Processes: 32 ppn, 32-core HSW nodes.

Cray MPI 7.2.0 performs about 57% better than Cray MPI 7.0.0

for large message Allreduce operations. (Enabled by default)

12 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 13: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Application Performance: POP

1,024 MPI Processes, 32 core Haswell nodes, 32 ppn.

nx_global = 2048 ny_global = 768; 40 vertical levels, 2 tracers, 1500 timesteps

MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0)

POP Total runtime improved by about 6%

13

Copyright 2015 Cray Inc

0102030405060708090

100

1024

To

tal R

un

tim

e (

s)

# MPI Processes

7.2.0 7.2.0-Shared-Memory-Opt

6%

Cray User Group, Apr. 2015

Page 14: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI_Bcast (Small Messages)

02468

101214

4 8 16 32

Avg

. L

ate

ncy

(usec)

Message Length (Bytes)

7.0.0-Default 7.2.1-DMAPP 7.2.1-DMAPP-SHARED-MEM

50%

67%

8,192 MPI Processes: 32 ppn, 32-core HSW nodes.

7.2.1-DMAPP performs about 50% better

7.2.1-DMAPP with Shared-Memory optimization performs about 67% better

MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0. Soon to be enabled by default)

MPICH_USE_DMAPP_COLL = 1 (Default: 0) -Wl,--whole-archive,-ldmapp,--no-whole-archive

14 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 15: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI_Alltoall and MPI_Alltoallv Optimizations

Copyright 2015 Cray Inc 15

0

1000

2000

3000

4000

5000

6000

Ba

nd

wid

th M

B/s

ec/n

od

e

Message Size (bytes)

5X

4,096 MPI Processes. 24 ranks per node, 64M Hugepages.

Enabled by default. Optimized Alltoall/Alltoallv solutions perform up to 5X faster

(Set MPICH_GNI_COLL_OPT_OFF to disable)

Cray User Group, Apr. 2015

7.2.0-A2A-Default 7.0.0-A2A-Default 7.2.0-A2Av-Default 7.0.0-A2AvDefault

Page 16: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized memcpy

- Collective Optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and Features

- User Level Fault Mitigation

● Summary and Future work

● Discussion

16 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 17: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI_Ialltoall (Large Messages)

8,192 MPI Processes: 24-Core Ivy-Bridge nodes

New prototype delivers up to 70% comm./comp. overlap

17

0

1000

2000

3000

4000

5000

8K 16K 32K 64K

Avg

. L

ate

ncy

(msec)

Message Length (Bytes) Comm. Latency

7.0.0 7.2.0

0

20

40

60

80

8K 16K 32K 64K

Overl

ap

(%

)

Message Length (Bytes) Comm./Comp. Overlap

Cray User Group, Apr. 2015

70%

MPICH_NEMESIS_ASYNC_PROGRESS=1;

MPICH_MAX_THREAD_SAFETY=multiple; (CLE 5.2 UP02 or newer)

Copyright 2015 Cray Inc

Page 18: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI_Iallreduce (Small Messages)

8,192 MPI Processes: 24-Core Ivy-Bridge nodes

New Iallreduce optimization delivers up to 80% comm./comp. overlap export MPICH_NEMESIS_ASYNC_PROGRESS=1; export MPICH_SHARED_MEM_COLL_OPT=1

export MPICH_MAX_THREAD_SAFETY=multiple; export MPICH_USE_DMAPP_COLL=1

(CLE 5.2 UP02 or newer)

18

0

10

20

30

40

50

4 8 16

Avg

. L

ate

ncy

(usec)

Message Length (Bytes) Comm. Latency

7.0.0 7.2.0

-10

10

30

50

70

90

4 8 16

Overl

ap

(%

)

Message Length (Bytes) Comm./Comp. Overlap

Cray User Group, Apr. 2015

80%

Copyright 2015 Cray Inc

Page 19: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized memcpy

- Collective Optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and Features

- User Level Fault Mitigation

● Summary and Future work

● Discussion

19 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 20: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI-3 RMA: 8-Byte MPI_Put Latency Network Atomic Memory Operations

0

0.5

1

1.5

2

2.5

3

8

Late

ncy (

usec)

Message Length (Bytes)

7.0.0 - Default 7.2.0 (DMAPP -- Network AMO's)

63%

7.2.0-DMAPP improves MPI-3 RMA small message latency by about 63%

Design uses Atomic Memory Operations offered by the Aries network.

MPICH_RMA_OVER_DMAPP = 1 (Default: 0)

MPICH_RMA_USE_NETWORK_AMO=1 (Default: 0)

-Wl,--whole-archive,-ldmapp,--no-whole-archive

20 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 21: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI-3 RMA GPU-to-GPU

01000200030004000500060007000

Ban

dw

idth

(M

B/s

)

Message Length (Bytes)

One-sided g2g Manual copy-through-host

50%

Optimized GPU-to-GPU communication for point-to-point and collectives

New prototype designs improve GPU-to-GPU RMA communication bandwidth

by about 50%. (Internal prototype available. Will be released in June 2015)

21 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 22: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI One-sided: On-node latency (Post Start Wait Complete)

0123456

0 1 2 4 8

16

32

64

12

8

25

6

51

2

1K

2K

4K

8K

16KL

ate

ncy (

usec)

Message Length (Bytes)

7.2.0 7.1.3

2 MPI Processes. 1 Ivy-Bridge compute node (24-core)

On-Node MPI-3 one-sided operations perform about 87% better with Cray

MPICH 7.2.0 (Enabled by default)

Cray User Group, Apr. 2015

87%

Copyright 2015 Cray Inc 22

Page 23: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized memcpy

- Collective Optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and Features

- User Level Fault Mitigation

● Summary and Future work

● Discussion

23 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 24: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

020406080

100120140160

0 2 8

32

12

8

512

2K

8K

32K

128K

Late

ncy (

usec)

Message Length (Bytes)

7.0.0 7.2.0 (FG-POBJ)

52%

Osu_latency_mt.c benchmark. 1 ppn, 2 nodes, 31 threads per process

MPICH_MAX_THREAD_SAFETY=multiple

cc –o osu_latency_mt.x –craympich-mt osu_latency_mt.c

aprun –n2 –N1 –d32 ./osu_latency_mt.x

01000200030004000500060007000

0 2 8

32

128

512

2K

8K

32K

12

8KL

ate

ncy (

usec)

Message Length (Bytes)

352us 489us

Fine-Grained Multi-threading for MPI

32-core Haswell KNC

24 Cray User Group, Apr. 2015

>10X

Copyright 2015 Cray Inc

Page 25: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized memcpy

- Collective Optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and Features

- User Level Fault Mitigation

● Summary and Future work

● Discussion

25 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 27: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized memcpy

- Collective Optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and Features

- User Level Fault Mitigation

● Summary and Future work

● Discussion

27 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 28: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Cray SHMEM: Features and Optimizations

05000

100001500020000250003000035000

Late

ncy (

usec)

Message Length (KBytes)

Cray SHMEM-7.2.0 Cray SHMEM-7.0.0

0

10

20

30

40

508

16

32

64

128

256

512

1K

2KL

ate

ncy (

usec)

Message Length (Bytes)

• Shmem_broadcast (about 60% improvement in comm. latency)

1,024 MPI Processes: 24-Core Ivy-Bridge nodes (Enabled by default)

• Cray SHMEM supports shmem_global_exit()

(Not enabled by default: Set SHMEM_GLOBAL_EXIT=1)

Cray User Group, Apr. 2015

64% 60%

28 Copyright 2015 Cray Inc

Page 29: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized memcpy

- Collective Optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and Features

- MPI User Level Fault Mitigation

● Summary and Future work

● Discussion

29 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 30: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI User Level Fault Mitigation (ULFM)

● Critical to design MPI libraries in a fault resilient manner ● MPI Fault Tolerance Working Group is working towards

standardizing the ULFM interface ● A prototype implementation of ULFM in Cray MPICH for XC

systems is under development ● Status of the current prototype: - supports the new MPI_ERR error classes - automatically detects node failures - supports a subset of ULFM functions to re-build MPI objects (revoke, shrink, agree..)

● Future development plans depend on MPI Forum directions

Cray User Group, Apr. 2015 30

Copyright 2015 Cray Inc

Page 31: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Agenda ● Introduction

● Recent Cray MPI and Cray SHMEM optimizations and features:

- Architecture Optimized Memcpy

- Collective optimizations

Blocking ( Communication Latency)

Non-Blocking (Comm./Comp. Overlap)

- MPI-3 RMA and GPU Optimizations

- Fine-grained Multi-threading for MPI

- MPI-IO

- Cray SHMEM Optimizations and features

- MPI User Level Fault Mitigation

● Summary and Future work

● Discussion

31 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 32: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Summary and Future Work

● Conclusion: ● Improved MPI Point-to-Point, Collective and RMA performance

● New features and optimizations in Cray SHMEM

● Exploring architecture-specific optimizations for Intel Xeon and Intel Xeon Phi

● Prototyping ULFM support in Cray MPICH

● Future Work: ● Improving Cray MPICH and Cray SHMEM performance on future

Intel Xeon and Intel Xeon Phi processors

● Reduced memory footprint for Cray MPICH

● Improved support for multi-threaded MPI applications

Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

32

Page 33: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

33 Copyright 2015 Cray Inc Cray User Group, Apr. 2015

Page 34: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

34 Copyright 2015 Cray Inc Cray User Group, Apr. 2015

Page 35: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI_Ialltoallv (Large Messages)

4,096 MPI Processes: 24-Core Ivy-Bridge nodes

Improved comm./comp. overlap (by up to 80%)

35

0

1000

2000

3000

4000

5000

8K 16K 32K 64K 512K

Avg

. L

ate

ncy

(msec)

Message Length (Bytes) Comm. Latency

7.0.0-Async 7.2.0-Async

0

20

40

60

80

100

8K 16K 32K 64K 128K

Overl

ap

(%

)

Message Length (Bytes) Comm./Comp. Overlap

Cray User Group, Apr. 2015

80%

MPICH_NEMESIS_ASYNC_PROGRESS=1;

MPICH_MAX_THREAD_SAFETY=multiple; (CLE 5.2 UP02 or newer)

Copyright 2015 Cray Inc

Page 36: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

MPI_Allgather and MPI_Allgatherv

0200400600800

10001200

512K 1M

Avg

. L

ate

ncy (

msec)

Message Length (Bytes)

7.0.0 7.0.3

1024 MPI Processes:

MPT 7.0.3 performs about 35-45% better.

25.9%

35%

Enabled by Default (Aries/Gemini)

36

0100200300400500600700800

256K 512K

Avg

. L

ate

ncy (

msec)

Message Length (Bytes)

35%

512 MPI Processes:

MPT 7.0.3 performs about 25-35% better.

45%

Allgatherv Latency Comparison

(Gemini)

Allgather Latency Comparison

(Aries)

Cray User Group, Apr. 2015 Copyright 2015 Cray Inc

Page 37: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates

Global lock vs Per-Object locks

Pthread_lock

(Global_mutex)

MPI_Send

T1 T2 T3 T4

Pthread_unlock

(Global_mutex)

MPID_Send()

Progress_engine()

GNI netmod

Pthread_lock

(msgq_mutex)

Pthread_lock

(ch3comm_mutex)

Pthread_lock

(global_mutex)

T1 T2 T3 T4

Progress_engine()

GNI netmod

Default:

- MPI library uses a single global_mutex

MPI_Send

pthread_mutex_lock(global_mutex)

MPID_Send()

MPID_Progress_wait()

pthread_mutex_unlock(global_mutex)

Per-Obj (Fine-Grained):

- MPI library still relies on a global_mutex.

Along with several smaller locks:

msgq_mutex, handle_mutex, comp_mutex

- Global critical sections are now smaller

Fine-Grained Multi-threading for MPI

37 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc