K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to...
Transcript of K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to...
![Page 1: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/1.jpg)
K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and
M. Pagel
Cray Inc.
Apr. 2015
Copyright 2015 Cray Inc
![Page 2: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/2.jpg)
Legal Disclaimer Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual
property rights is granted by this document.
Cray Inc. may make changes to specifications and product descriptions at any time, without notice.
All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice.
Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate
from published specifications. Current characterized errata are available on request.
Cray uses codenames internally to identify products that are in development and not yet publically announced for release. Customers
and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray
Inc. internal codenames is at the sole risk of the user.
Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of
Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect
actual performance.
The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design,
SONEXION, URIKA and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER
CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks,
and trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT and XT. The registered trademark LINUX is used pursuant to a sublicense
from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis.
Other names and brands may be claimed as the property of others. Other product and service names mentioned herein are the
trademarks of their respective owners..
CUG 2015 Cray Inc. Proprietary © 2015 2
![Page 3: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/3.jpg)
Agenda
● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features
● Summary and Future work
● Discussion
3 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 4: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/4.jpg)
Introduction
4
● Multi-/Many-core architectures and accelerators are driving the HPC evolution
● Modern interconnects offer low latency, high bandwidth communication
● Cray designs some of the fastest supercomputers
● Intel MIC and NVIDIA GPUs pose new challenges and opportunities
● Critical to design MPI and SHMEM software stacks in a very efficient manner
Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 5: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/5.jpg)
Designing High Performance MPI and SHMEM Software on Cray Systems: Challenges and Objectives
5
● Minimize communication latency, maximize communication bandwidth
● Improve support for asynchronous communication (communication/computation overlap) ● Architecture-specific solutions to optimize communication
performance
● New tools and features to help users understand application performance bottlenecks
● Fault resilient communication
Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 6: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/6.jpg)
Cray User Group, Apr. 2015 6
MPI Pt2pt and Collectives
MPI-3 RMA
MPI IO
Multi-threaded MPI Comm.
Cray SHMEM
Features and optimizations
Fault Tolerant MPI
Improved Designs in
Software
(New Algorithms, etc.)
Leveraging Aries Hardware
Tools to understand
application performance
Designing High Performance MPI and SHMEM Software on Cray Systems: Challenges and Objectives
Research & Development Focus Areas Design Approach &
Methodology
Copyright 2015 Cray Inc
![Page 7: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/7.jpg)
Agenda ● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features:
- Architecture Optimized memcpy
- Collective Optimizations
Blocking ( Communication Latency)
Non-Blocking (Comm./Comp. Overlap)
- MPI-3 RMA and GPU Optimizations
- Fine-grained Multi-threading for MPI
- MPI-IO
- Cray SHMEM Optimizations and Features
- MPI User Level Fault Mitigation
● Summary and Future work
● Discussion
7 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 8: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/8.jpg)
Architecture Optimized memcpy
Copyright 2015 Cray Inc 8
05000
1000015000200002500030000
Ban
dw
idth
(M
B/s
)
Message Length (Bytes)
CrayMPICH SNB memcpy CrayMPICH HSW memcpy
OSU BW Benchmark OSU MBW-MR Benchmark
0
5000
10000
15000
20000
25000
Ban
dw
idth
(M
B/s
)
Message Length (Bytes) 2 MPI Processes: 1 HSW (32-core) node
MPICH_OPTIMIZED_MEMCPY = 2 (Default: 1; Values: 0, 1, 2)
If MPICH_USE_SYSTEM_MEMCPY is set, Optimized memcpy is disabled
29% 17%
Cray User Group, Apr. 2015
![Page 9: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/9.jpg)
Agenda ● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features:
- Architecture Optimized memcpy
- Collective Optimizations
Blocking ( Communication Latency)
Non-Blocking (Comm./Comp. Overlap)
- MPI-3 RMA and GPU Optimizations
- Fine-grained Multi-threading for MPI
- MPI-IO
- Cray SHMEM Optimizations and Features
- MPI User Level Fault Mitigation
● Summary and Future work
● Discussion
9 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 10: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/10.jpg)
MPI_Allreduce (Small Messages)
0102030405060
4 8 16 32 64 128
Avg
. L
ate
ncy (
usec)
Message Length (Bytes)
7.0.0 (Default) 7.0.0-DMAPP
7.2.0-Shared-Mem 7.2.0-DMAPP-Shared-Mem
119,648 MPI Processes: 32 ppn, 32-core HSW nodes.
7.2.0-DMAPP-SHARED-MEM improves performance by about 23% when compared to 7.0.0-
DMAPP
MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0. Soon to be enabled by default)
MPICH_USE_DMAPP_COLL = 1 (Default: 0) (-Wl,--whole-archive,-ldmapp,--no-whole-archive)
10 Copyright 2015 Cray Inc
8%
23%
Cray User Group, Apr. 2015
![Page 11: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/11.jpg)
MPI_Allreduce (Small Messages, on KNC)
0
50
100
150
200
4 8 16 32 64 128 256
Avg
. L
ate
ncy (
usec)
Message Length (Bytes)
7.0.0 7.2.0-Shared-Mem 7.2.0-DMAPP-Shared-Mem
19%
75%
256 MPI Processes: 16 ppn, 16 KNC nodes
7.2.0 with Shared-memory collective optimizations performs about 19% better
7.2.0-DMAPP + Shared-memory optimization improves performance by about 75%.
MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0)
MPICH_USE_DMAPP_COLL = 1 (Default: 0) -Wl,--whole-archive,-ldmapp,--no-whole-archive
11 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 12: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/12.jpg)
MPI_Allreduce (Large Messages)
020406080
100120140160
128K 256K 512K 1M 2M 4M 8M 16M 32MAvg
. L
ate
ncy (
msec)
Message Length (Bytes)
7.0.0 7.2.0
57%
8,192 MPI Processes: 32 ppn, 32-core HSW nodes.
Cray MPI 7.2.0 performs about 57% better than Cray MPI 7.0.0
for large message Allreduce operations. (Enabled by default)
12 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 13: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/13.jpg)
Application Performance: POP
1,024 MPI Processes, 32 core Haswell nodes, 32 ppn.
nx_global = 2048 ny_global = 768; 40 vertical levels, 2 tracers, 1500 timesteps
MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0)
POP Total runtime improved by about 6%
13
Copyright 2015 Cray Inc
0102030405060708090
100
1024
To
tal R
un
tim
e (
s)
# MPI Processes
7.2.0 7.2.0-Shared-Memory-Opt
6%
Cray User Group, Apr. 2015
![Page 14: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/14.jpg)
MPI_Bcast (Small Messages)
02468
101214
4 8 16 32
Avg
. L
ate
ncy
(usec)
Message Length (Bytes)
7.0.0-Default 7.2.1-DMAPP 7.2.1-DMAPP-SHARED-MEM
50%
67%
8,192 MPI Processes: 32 ppn, 32-core HSW nodes.
7.2.1-DMAPP performs about 50% better
7.2.1-DMAPP with Shared-Memory optimization performs about 67% better
MPICH_SHARED_MEM_COLL_OPT = 1 (Default: 0. Soon to be enabled by default)
MPICH_USE_DMAPP_COLL = 1 (Default: 0) -Wl,--whole-archive,-ldmapp,--no-whole-archive
14 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 15: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/15.jpg)
MPI_Alltoall and MPI_Alltoallv Optimizations
Copyright 2015 Cray Inc 15
0
1000
2000
3000
4000
5000
6000
Ba
nd
wid
th M
B/s
ec/n
od
e
Message Size (bytes)
5X
4,096 MPI Processes. 24 ranks per node, 64M Hugepages.
Enabled by default. Optimized Alltoall/Alltoallv solutions perform up to 5X faster
(Set MPICH_GNI_COLL_OPT_OFF to disable)
Cray User Group, Apr. 2015
7.2.0-A2A-Default 7.0.0-A2A-Default 7.2.0-A2Av-Default 7.0.0-A2AvDefault
![Page 16: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/16.jpg)
Agenda ● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features:
- Architecture Optimized memcpy
- Collective Optimizations
Blocking ( Communication Latency)
Non-Blocking (Comm./Comp. Overlap)
- MPI-3 RMA and GPU Optimizations
- Fine-grained Multi-threading for MPI
- MPI-IO
- Cray SHMEM Optimizations and Features
- User Level Fault Mitigation
● Summary and Future work
● Discussion
16 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 17: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/17.jpg)
MPI_Ialltoall (Large Messages)
8,192 MPI Processes: 24-Core Ivy-Bridge nodes
New prototype delivers up to 70% comm./comp. overlap
17
0
1000
2000
3000
4000
5000
8K 16K 32K 64K
Avg
. L
ate
ncy
(msec)
Message Length (Bytes) Comm. Latency
7.0.0 7.2.0
0
20
40
60
80
8K 16K 32K 64K
Overl
ap
(%
)
Message Length (Bytes) Comm./Comp. Overlap
Cray User Group, Apr. 2015
70%
MPICH_NEMESIS_ASYNC_PROGRESS=1;
MPICH_MAX_THREAD_SAFETY=multiple; (CLE 5.2 UP02 or newer)
Copyright 2015 Cray Inc
![Page 18: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/18.jpg)
MPI_Iallreduce (Small Messages)
8,192 MPI Processes: 24-Core Ivy-Bridge nodes
New Iallreduce optimization delivers up to 80% comm./comp. overlap export MPICH_NEMESIS_ASYNC_PROGRESS=1; export MPICH_SHARED_MEM_COLL_OPT=1
export MPICH_MAX_THREAD_SAFETY=multiple; export MPICH_USE_DMAPP_COLL=1
(CLE 5.2 UP02 or newer)
18
0
10
20
30
40
50
4 8 16
Avg
. L
ate
ncy
(usec)
Message Length (Bytes) Comm. Latency
7.0.0 7.2.0
-10
10
30
50
70
90
4 8 16
Overl
ap
(%
)
Message Length (Bytes) Comm./Comp. Overlap
Cray User Group, Apr. 2015
80%
Copyright 2015 Cray Inc
![Page 19: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/19.jpg)
Agenda ● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features:
- Architecture Optimized memcpy
- Collective Optimizations
Blocking ( Communication Latency)
Non-Blocking (Comm./Comp. Overlap)
- MPI-3 RMA and GPU Optimizations
- Fine-grained Multi-threading for MPI
- MPI-IO
- Cray SHMEM Optimizations and Features
- User Level Fault Mitigation
● Summary and Future work
● Discussion
19 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 20: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/20.jpg)
MPI-3 RMA: 8-Byte MPI_Put Latency Network Atomic Memory Operations
0
0.5
1
1.5
2
2.5
3
8
Late
ncy (
usec)
Message Length (Bytes)
7.0.0 - Default 7.2.0 (DMAPP -- Network AMO's)
63%
7.2.0-DMAPP improves MPI-3 RMA small message latency by about 63%
Design uses Atomic Memory Operations offered by the Aries network.
MPICH_RMA_OVER_DMAPP = 1 (Default: 0)
MPICH_RMA_USE_NETWORK_AMO=1 (Default: 0)
-Wl,--whole-archive,-ldmapp,--no-whole-archive
20 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 21: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/21.jpg)
MPI-3 RMA GPU-to-GPU
01000200030004000500060007000
Ban
dw
idth
(M
B/s
)
Message Length (Bytes)
One-sided g2g Manual copy-through-host
50%
Optimized GPU-to-GPU communication for point-to-point and collectives
New prototype designs improve GPU-to-GPU RMA communication bandwidth
by about 50%. (Internal prototype available. Will be released in June 2015)
21 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 22: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/22.jpg)
MPI One-sided: On-node latency (Post Start Wait Complete)
0123456
0 1 2 4 8
16
32
64
12
8
25
6
51
2
1K
2K
4K
8K
16KL
ate
ncy (
usec)
Message Length (Bytes)
7.2.0 7.1.3
2 MPI Processes. 1 Ivy-Bridge compute node (24-core)
On-Node MPI-3 one-sided operations perform about 87% better with Cray
MPICH 7.2.0 (Enabled by default)
Cray User Group, Apr. 2015
87%
Copyright 2015 Cray Inc 22
![Page 23: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/23.jpg)
Agenda ● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features:
- Architecture Optimized memcpy
- Collective Optimizations
Blocking ( Communication Latency)
Non-Blocking (Comm./Comp. Overlap)
- MPI-3 RMA and GPU Optimizations
- Fine-grained Multi-threading for MPI
- MPI-IO
- Cray SHMEM Optimizations and Features
- User Level Fault Mitigation
● Summary and Future work
● Discussion
23 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 24: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/24.jpg)
020406080
100120140160
0 2 8
32
12
8
512
2K
8K
32K
128K
Late
ncy (
usec)
Message Length (Bytes)
7.0.0 7.2.0 (FG-POBJ)
52%
Osu_latency_mt.c benchmark. 1 ppn, 2 nodes, 31 threads per process
MPICH_MAX_THREAD_SAFETY=multiple
cc –o osu_latency_mt.x –craympich-mt osu_latency_mt.c
aprun –n2 –N1 –d32 ./osu_latency_mt.x
01000200030004000500060007000
0 2 8
32
128
512
2K
8K
32K
12
8KL
ate
ncy (
usec)
Message Length (Bytes)
352us 489us
Fine-Grained Multi-threading for MPI
32-core Haswell KNC
24 Cray User Group, Apr. 2015
>10X
Copyright 2015 Cray Inc
![Page 25: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/25.jpg)
Agenda ● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features:
- Architecture Optimized memcpy
- Collective Optimizations
Blocking ( Communication Latency)
Non-Blocking (Comm./Comp. Overlap)
- MPI-3 RMA and GPU Optimizations
- Fine-grained Multi-threading for MPI
- MPI-IO
- Cray SHMEM Optimizations and Features
- User Level Fault Mitigation
● Summary and Future work
● Discussion
25 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 26: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/26.jpg)
MPI-I/O STATS
26
Timeline of MPI-I/O statistics. Many different variables tracked export MPICH_MPIIO_STATS=2
For more information contact: David Knaak, Bob Cernohous
(http://mptwiki.us.cray.com/pmwiki/pmwiki.php?n=MPI-IO.S-0013-20)
Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 27: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/27.jpg)
Agenda ● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features:
- Architecture Optimized memcpy
- Collective Optimizations
Blocking ( Communication Latency)
Non-Blocking (Comm./Comp. Overlap)
- MPI-3 RMA and GPU Optimizations
- Fine-grained Multi-threading for MPI
- MPI-IO
- Cray SHMEM Optimizations and Features
- User Level Fault Mitigation
● Summary and Future work
● Discussion
27 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 28: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/28.jpg)
Cray SHMEM: Features and Optimizations
05000
100001500020000250003000035000
Late
ncy (
usec)
Message Length (KBytes)
Cray SHMEM-7.2.0 Cray SHMEM-7.0.0
0
10
20
30
40
508
16
32
64
128
256
512
1K
2KL
ate
ncy (
usec)
Message Length (Bytes)
• Shmem_broadcast (about 60% improvement in comm. latency)
1,024 MPI Processes: 24-Core Ivy-Bridge nodes (Enabled by default)
• Cray SHMEM supports shmem_global_exit()
(Not enabled by default: Set SHMEM_GLOBAL_EXIT=1)
Cray User Group, Apr. 2015
64% 60%
28 Copyright 2015 Cray Inc
![Page 29: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/29.jpg)
Agenda ● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features:
- Architecture Optimized memcpy
- Collective Optimizations
Blocking ( Communication Latency)
Non-Blocking (Comm./Comp. Overlap)
- MPI-3 RMA and GPU Optimizations
- Fine-grained Multi-threading for MPI
- MPI-IO
- Cray SHMEM Optimizations and Features
- MPI User Level Fault Mitigation
● Summary and Future work
● Discussion
29 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 30: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/30.jpg)
MPI User Level Fault Mitigation (ULFM)
● Critical to design MPI libraries in a fault resilient manner ● MPI Fault Tolerance Working Group is working towards
standardizing the ULFM interface ● A prototype implementation of ULFM in Cray MPICH for XC
systems is under development ● Status of the current prototype: - supports the new MPI_ERR error classes - automatically detects node failures - supports a subset of ULFM functions to re-build MPI objects (revoke, shrink, agree..)
● Future development plans depend on MPI Forum directions
Cray User Group, Apr. 2015 30
Copyright 2015 Cray Inc
![Page 31: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/31.jpg)
Agenda ● Introduction
● Recent Cray MPI and Cray SHMEM optimizations and features:
- Architecture Optimized Memcpy
- Collective optimizations
Blocking ( Communication Latency)
Non-Blocking (Comm./Comp. Overlap)
- MPI-3 RMA and GPU Optimizations
- Fine-grained Multi-threading for MPI
- MPI-IO
- Cray SHMEM Optimizations and features
- MPI User Level Fault Mitigation
● Summary and Future work
● Discussion
31 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 32: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/32.jpg)
Summary and Future Work
● Conclusion: ● Improved MPI Point-to-Point, Collective and RMA performance
● New features and optimizations in Cray SHMEM
● Exploring architecture-specific optimizations for Intel Xeon and Intel Xeon Phi
● Prototyping ULFM support in Cray MPICH
● Future Work: ● Improving Cray MPICH and Cray SHMEM performance on future
Intel Xeon and Intel Xeon Phi processors
● Reduced memory footprint for Cray MPICH
● Improved support for multi-threaded MPI applications
Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
32
![Page 33: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/33.jpg)
33 Copyright 2015 Cray Inc Cray User Group, Apr. 2015
![Page 34: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/34.jpg)
34 Copyright 2015 Cray Inc Cray User Group, Apr. 2015
![Page 35: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/35.jpg)
MPI_Ialltoallv (Large Messages)
4,096 MPI Processes: 24-Core Ivy-Bridge nodes
Improved comm./comp. overlap (by up to 80%)
35
0
1000
2000
3000
4000
5000
8K 16K 32K 64K 512K
Avg
. L
ate
ncy
(msec)
Message Length (Bytes) Comm. Latency
7.0.0-Async 7.2.0-Async
0
20
40
60
80
100
8K 16K 32K 64K 128K
Overl
ap
(%
)
Message Length (Bytes) Comm./Comp. Overlap
Cray User Group, Apr. 2015
80%
MPICH_NEMESIS_ASYNC_PROGRESS=1;
MPICH_MAX_THREAD_SAFETY=multiple; (CLE 5.2 UP02 or newer)
Copyright 2015 Cray Inc
![Page 36: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/36.jpg)
MPI_Allgather and MPI_Allgatherv
0200400600800
10001200
512K 1M
Avg
. L
ate
ncy (
msec)
Message Length (Bytes)
7.0.0 7.0.3
1024 MPI Processes:
MPT 7.0.3 performs about 35-45% better.
25.9%
35%
Enabled by Default (Aries/Gemini)
36
0100200300400500600700800
256K 512K
Avg
. L
ate
ncy (
msec)
Message Length (Bytes)
35%
512 MPI Processes:
MPT 7.0.3 performs about 25-35% better.
45%
Allgatherv Latency Comparison
(Gemini)
Allgather Latency Comparison
(Aries)
Cray User Group, Apr. 2015 Copyright 2015 Cray Inc
![Page 37: K. Kandalla, D. Knaak, K. McMahon, N. Radcliffe and M. Pagel · Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates](https://reader034.fdocuments.in/reader034/viewer/2022050215/5f60bed6a6780c00b755ecc4/html5/thumbnails/37.jpg)
Global lock vs Per-Object locks
Pthread_lock
(Global_mutex)
MPI_Send
T1 T2 T3 T4
Pthread_unlock
(Global_mutex)
MPID_Send()
Progress_engine()
GNI netmod
Pthread_lock
(msgq_mutex)
Pthread_lock
(ch3comm_mutex)
Pthread_lock
(global_mutex)
T1 T2 T3 T4
Progress_engine()
GNI netmod
Default:
- MPI library uses a single global_mutex
MPI_Send
pthread_mutex_lock(global_mutex)
MPID_Send()
MPID_Progress_wait()
pthread_mutex_unlock(global_mutex)
Per-Obj (Fine-Grained):
- MPI library still relies on a global_mutex.
Along with several smaller locks:
msgq_mutex, handle_mutex, comp_mutex
- Global critical sections are now smaller
Fine-Grained Multi-threading for MPI
37 Cray User Group, Apr. 2015 Copyright 2015 Cray Inc