ORNL is managed by UT-Battelle, LLC for the US Department of Energy
An Evaluation of the CORAL Interconnects
Chris Zimmer†, Scott Atchley†, Ramesh Pankajakshan∗, Brian E. Smith†, Ian Karlin∗, Matthew L. Leininger∗, Adam Bertsch∗, Brian S. Ryujin∗, Jason Burmark∗, Andre Walker-Loud⋄, M. A. Clark§, Olga Pearce∗
†ORNL, ∗LLNL, ⋄LBL, §NVIDIA Corporation
CORAL
• Collaboration of Oak Ridge, Argonne, and Livermore
• Single Request For Proposals (RFP) for three pre-exascale systems
• RFP released in 2014
• ORNL and LLNL
  – IBM, NVIDIA, Mellanox
  – Installed in 2017-2018
  – Production in 2019
Summit: IBM, 200 PF, 4,608 nodes, 6 NVIDIA V100 GPUs and 2 POWER9 CPUs per node, 2 ports of Mellanox EDR, 1:1 Fat Tree
Sierra: IBM, 125 PF, 4,320 nodes, 4 NVIDIA V100 GPUs and 2 POWER9 CPUs per node, 2 ports of Mellanox EDR, 2:1 Fat Tree
Similar systems but not identical
[Diagram: Summit half node (POWER9, 256 GB DRAM, three 7 TF V100 GPUs with 16 GB HBM each) and Sierra half node (POWER9, 128 GB DRAM, two 7 TF V100 GPUs with 16 GB HBM each), each with a shared NIC and node-local NVM (6.0 GB/s read, 2.1 GB/s write). Link bandwidths: HBM 900 GB/s, DRAM 170 GB/s, NVLink 50 GB/s per link on Summit and 75 GB/s per link on Sierra, X-Bus (SMP) 64 GB/s, PCIe Gen4 16 GB/s, EDR IB 12.5 GB/s per port.]
                        Summit          Sierra
Nodes                   4,608           4,320
System memory per node  512 GB          256 GB
GPUs per node           6               4
NVLink BW per link      50 GB/s         75 GB/s
Topology                1:1 Fat Tree    2:1 Fat Tree

(The NVLink per-link difference is worked out below.)
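For context, a worked check on the per-link numbers, assuming the usual 25 GB/s-per-direction rating of an NVLink 2.0 brick (six bricks per V100 and per POWER9 socket): Summit groups two bricks per CPU-GPU or GPU-GPU connection, while Sierra groups three.

\[
  \text{Summit: } 2 \times 25\,\mathrm{GB/s} = 50\,\mathrm{GB/s\ per\ link},
  \qquad
  \text{Sierra: } 3 \times 25\,\mathrm{GB/s} = 75\,\mathrm{GB/s\ per\ link}
\]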
Multi-Host leads to four Virtual Ports
[Diagram: each POWER9 socket reaches the shared Mellanox HCA over its own 16 GB/s PCIe Gen4 link, and the sockets are joined by a 64 GB/s X-Bus; the HCA's two physical EDR ports P0 and P1 (12.5 GB/s each) appear to the node as virtual ports V0-V3.]
• Both sockets enumerate the PCIe bus and see two ports
• Linux lists four virtual ports
  – V0 and V2 → P0
  – V1 and V3 → P1
• The HCA has a total of 25 GB/s
• Each socket has 16 GB/s to the HCA
  – A single socket cannot drive 25 GB/s using only its local PCIe connection (see the worked bound below)
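A quick worked bound on those numbers (arithmetic only, not a measurement):

\[
  2 \times 12.5\,\mathrm{GB/s} = 25\,\mathrm{GB/s}\ \text{(both EDR ports)},
  \qquad
  \min(16\,\mathrm{GB/s},\ 25\,\mathrm{GB/s}) = 16\,\mathrm{GB/s}\ \text{(one socket over its local PCIe link alone)}
\]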
[Diagram (repeated on the following port-policy slides): the four port-selection policies (03, 0123, 0132, 0330), showing which virtual ports the ranks on socket 0 and socket 1 use and the 16 GB/s PCIe and 64 GB/s X-Bus paths each policy exercises.]
Using Virtual Ports – Default (no striping)
• Spectrum MPI defaults to no striping
• With process affinity, socket 0 → V0 and socket 1 → V3
• We call this policy 03
Using Virtual Ports – Striping
• Spectrum MPI stripes messages larger than 64 KiB across the virtual ports
• Messages ≤ 64 KiB are sent over the same physical port (P0)
• We call this policy 0123
Using Virtual Ports – Striping (improved)
• Spectrum MPI does not allow 0132
  – IBM provided a special library to enable it
• Messages ≤ 64 KiB are sent over different physical ports
Using Virtual Ports – Striping over X-Bus
• Allows a single process to achieve 25 GB/s (see the sketch below)
  – At the cost of sending 50% of each message over the SMP bus
• We call this policy 0330
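A sketch of why 0330 can reach 25 GB/s from a single process, assuming each large message is split evenly across the two paths:

\[
  \underbrace{12.5\,\mathrm{GB/s}}_{\text{local PCIe} \rightarrow P0}
  \;+\;
  \underbrace{12.5\,\mathrm{GB/s}}_{\text{X-Bus} \rightarrow \text{remote PCIe} \rightarrow P1}
  \;=\; 25\,\mathrm{GB/s}
\]

Each half is limited by its 12.5 GB/s EDR port rather than by the 16 GB/s PCIe or 64 GB/s X-Bus links it traverses.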
Mellanox added several new features
• Adaptive Routing (Switch-IB 2)
  – Improves congestion management
• SHARP (Switch-IB 2)
  – Scalable Hierarchical Aggregation and Reduction Protocol
  – Switch-based collectives up to ~2 KB
  – Barrier, Broadcast, Reductions
• Hardware Tag Matching in the HCA (ConnectX-5)
  – Offloads small MPI two-sided messages
  – Offloads rendezvous progression
• NVMe-oF™ Offload (ConnectX-5) (not reviewed in this paper)
Adaptive Routing (A/R) helps significantly
• Tested using LLNL's MPIGraph (a minimal probe in this spirit is sketched below)
• Doubles effective bandwidth
• Reduces variability
[Figure: MPIGraph bandwidth distributions on Summit and Sierra with and without A/R; A/R roughly doubles the effective bandwidth (annotated values: 2.3 GB/s and 13.4 GB/s).]
• Some Summit runs had a few ~300 MB/s measurements out of the ~20M measurements
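For reference, a minimal point-to-point bandwidth probe in the spirit of mpiGraph. This is not LLNL's benchmark; the 4 MiB message size, iteration count, and shifted pairing are illustrative, and an even number of ranks is assumed.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* 4 MiB per message (illustrative) */
#define ITERS     100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(MSG_BYTES);
    /* Pair rank r with rank (r + size/2) % size; mpiGraph itself sweeps
     * over all source/destination pairs, this sketch uses one shifted pair. */
    int peer = (rank + size / 2) % size;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank < size / 2)
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double secs = MPI_Wtime() - t0;

    if (rank < size / 2)
        printf("rank %d -> %d: %.2f GB/s\n", rank, peer,
               (double)MSG_BYTES * ITERS / secs / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}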
SHARP helps microbenchmarks…
• Dramatically reduces latency
• Scales much better than Spectrum MPI or Open MPI software-based collectives (a minimal timing loop in this style is sketched below)
[Figure: Barrier time (µs) vs. node count (8-512) for Open MPI, IBM Spectrum MPI, and SHARP™; SHARP is annotated as 72% lower.]
[Figure: Allreduce time (µs) vs. node count (8-2048) for Open MPI, IBM Spectrum MPI, and SHARP™; SHARP is annotated as 79% lower.]
From "The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems", SC18
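A minimal sketch of this style of collective latency measurement (a generic timing loop, not the harness used for the figures above; warm-up and iteration counts are illustrative):

#include <mpi.h>
#include <stdio.h>

#define WARMUP 100
#define ITERS  1000

int main(int argc, char **argv)
{
    int rank;
    double in = 1.0, out = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Warm up, then time ITERS back-to-back barriers. */
    for (int i = 0; i < WARMUP; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    /* Same loop for an 8-byte allreduce, the message range SHARP targets. */
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double allreduce_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    if (rank == 0)
        printf("Barrier: %.2f us  Allreduce(8B): %.2f us\n",
               barrier_us, allreduce_us);

    MPI_Finalize();
    return 0;
}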
But SHARP has mixed results with real apps
• AMG on Sierra (4 GPUs) and Summit (6 GPUs), 10 runs each
  – With SHARP, performance is slower and erratic
• Nekbone: 1-2% slower at ≤256 nodes, 1-2% faster at 512 nodes, 1-2% variability
• A 2018 Gordon Bell finalist, Gamera/Mothra, saw major improvement
[Figure: AMG figure of merit (higher is better) for 10 runs on 256 Summit nodes (default vs. SHARP) and on 128 nodes of Sierra and Summit (4 GPU and 6 GPU, base vs. SHARP); the SHARP runs show high variability and lower performance.]
Hardware Tag Matching – Latency is mixed
[Figure: ping-pong latency (µs) vs. message size (0 B to 8 KiB) for the base MPI and hardware tag matching (HW TM).]
• Latency is no better, or slightly higher, for small messages
• Slightly lower above 512 bytes
• What about overhead for rendezvous progression?
Hardware Tag Matching – Provides excellent overlap
[Figure: Sandia MPI overhead benchmark, maximum Send/Recv overhead (µs) for base PAMI, HW TM, and a progress thread.]
• Significantly better than base (a sketch of this kind of overlap probe follows below)
• Enabling a progress thread provides similar overlap at the cost of using another core
• No benefit for AMG, HACC, or UMT compared to base
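A minimal sketch of how overlap can be probed (in the spirit of, but not identical to, the Sandia overhead benchmark; the message size and work loop are illustrative, and only ranks 0 and 1 participate):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1024 * 1024)   /* illustrative size, large enough for the rendezvous path */

/* Host-side busy work whose wall time we can compare with and without a
 * message in flight. */
static double busy_work(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank > 1) {               /* extra ranks just exit */
        MPI_Finalize();
        return 0;
    }

    char *buf = malloc(MSG_BYTES);
    MPI_Request req;

    if (rank == 0)
        MPI_Isend(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
    else
        MPI_Irecv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);

    /* If the HCA (or a progress thread) advances the rendezvous, this work
     * overlaps the transfer; otherwise the transfer waits for MPI_Wait and
     * the total time grows by roughly the full message time. */
    double t0 = MPI_Wtime();
    double sink = busy_work(50L * 1000 * 1000);
    double work_s = MPI_Wtime() - t0;

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double total_s = MPI_Wtime() - t0;

    printf("rank %d: work %.3f s, work+wait %.3f s (sink=%g)\n",
           rank, work_s, total_s, sink);

    free(buf);
    MPI_Finalize();
    return 0;
}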
Congestion management and Tail Latency
• Ran the new GPCNeT benchmark on both systems
  – Global Performance and Congestion Network Test
  – https://xgitlab.cels.anl.gov/networkbench/GPCNET
  – Measures representative workloads in isolation and with congestion
• Focused on the congestion test
• "Victim" patterns (20% of the allocation)
  – Random-Ring 8 B latency, 2-sided Random-Ring 1 MB bandwidth, 8 B Allreduce
• Congestors (80% of the allocation)
  – 2-sided incast, 2-sided broadcast, 1-sided incast, 1-sided broadcast, all-to-all
• Reports the ratio of performance degradation (latency increase, bandwidth decrease); one plausible formalization is written out below
  – Tested striping policies, with and without A/R, and tapering
  – For more GPCNeT details, there is a GPCNeT talk at 1:30 pm in 401-404
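One way to write out that degradation metric (our reading; the GPCNeT repository above is the authoritative definition):

\[
  \text{latency ratio} = \frac{L_{\text{congested}}}{L_{\text{isolated}}},
  \qquad
  \text{bandwidth ratio} = \frac{BW_{\text{isolated}}}{BW_{\text{congested}}}
\]

so that a larger ratio means more degradation in both cases.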
Congestion Management – Striping Policies
• Tested multiple policies with A/R on both systems using ~97% of the nodes
  – 03, 0123, 0132, 0330
• Ran 10 iterations of each policy
  – 03 and 0330 were clearly superior
  – The remaining results focus on these two policies
Congestion Management – Isolation
Mean latencies:
• Very consistent
• Policy does not matter
• Tapering slightly increases latency
99th percentile (tail) latencies:
• Very consistent, except Summit had one outlier
• Policy does not matter
• Tapering slightly increases latency
Congestion Management – Congestion
Mean latencies:
• High variability – 35-50% CoV
• Policy matters
• 0330 is 65-78% lower
• Tapering increases mean latency by 30%
99th percentile (tail) latencies:
• High variability – 24-40% CoV
• Policy matters
• 0330 is 62-73% lower
• 0330 helps reduce the impact of tapering (only 4% more)
Congestion Management – with and without A/R
• Tested with the default policy 03
• 20 runs each with and without A/R
• Using A/R adds latency
  – 99th percentile (tail) latency is 76% higher with A/R (212.3 µs vs 374.8 µs)
• But it helps bandwidth
  – 99th percentile (tail) BW is 32% higher with A/R
  – Improves all-to-all and I/O performance
• Summit's GPFS can only achieve its performance targets with A/R
  – Both labs use A/R by default for MPI and GPFS traffic
Comments on GPCNeT
• The goal is to allow comparisons between interconnects
  – Provides a ratio of performance degradation for latency and bandwidth, for both the mean and the 99th percentile
• Comparisons are only valid for systems of the same scale
  – It is not valid to compare small versus large systems
• The congestion tests are only valid when run at full scale
  – Small allocations may be so spread out that the congestors cannot affect the victims
• Need to run many congestion tests
  – Large variability in results – a single run is meaningless
  – Coefficient of variation (StdDev/Mean) was 20-50% (defined below)
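For reference, the coefficient of variation used here is the standard ratio:

\[
  \mathrm{CoV} \;=\; \frac{\sigma}{\mu} \;=\; \frac{\text{standard deviation across runs}}{\text{mean across runs}}
\]

At a CoV of 20-50%, a single run can easily sit tens of percent away from the mean, which is why many iterations are needed.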
Tapering
• Tapering allowed LLNL to add 6-8% more nodes
• Tested QUDA, SW4, and Ares
• Only QUDA showed sensitivity
  – Limit to a single rack to avoid tapering
  – https://github.com/lattice/quda/
• Other apps were sensitive to NVLink bandwidth
[Figure: per-application performance distributions; farther right on the x-axis is better, and tighter groupings are better.]
Conclusions
• Both labs are happy with their interconnects
  – Both achieve a high percentage of peak bisection bandwidth
• Adaptive Routing is a must
• MPI offloads (Tag Matching & SHARP)
  – Perform well on MPI micro-benchmarks
  – Performance varies for proxy applications and production codes
  – Co-design efforts continue, to improve them for a wider variety of workloads
• Congestion and tail latency
  – EDR does a good job
  – Port policy plays a major role on IBM's AC922
• Tapering can make sense – know your workload
Questions?
• This work was performed under the auspices of the U.S. DOE by the Oak Ridge Leadership Computing Facility at ORNL under contract DE-AC05-00OR22725.
• Prepared by LLNL under Contract DE-AC52-07NA27344. LLNL-CONF-772398.