ORNL is managed by UT-Battelle, LLC for the US Department of Energy
An Evaluation of the CORAL Interconnects
Chris Zimmer†, Scott Atchley†, Ramesh Pankajakshan∗, Brian E. Smith†, Ian Karlin∗, Matthew L. Leininger∗, Adam Bertsch∗, Brian S. Ryujin∗, Jason Burmark∗, Andre Walker-Loud⋄, M. A. Clark§, Olga Pearce∗
†ORNL, ∗LLNL, ⋄LBL, §NVIDIA Corporation
CORAL
• Collaboration of Oak Ridge, Argonne, and Livermore
• Single Request For Proposals (RFP) for three pre-exascale systems
• RFP released in 2014
• ORNL and LLNL
  – IBM, NVIDIA, Mellanox
  – Installed in 2017-2018
  – Production in 2019
Summit: IBM, 200 PF, 4,608 nodes, 6 NVIDIA V100 GPUs and 2 POWER9 CPUs per node, 2 ports of Mellanox EDR, 1:1 Fat Tree
Sierra: IBM, 125 PF, 4,320 nodes, 4 NVIDIA V100 GPUs and 2 POWER9 CPUs per node, 2 ports of Mellanox EDR, 2:1 Fat Tree
Similar systems but not identical
[Diagram: Summit half node (POWER9, 256 GB DRAM, three 7 TF V100 GPUs with 16 GB HBM each) and Sierra half node (POWER9, 128 GB DRAM, two 7 TF V100 GPUs with 16 GB HBM each), each with a shared NIC and node-local NVM (6.0 GB/s read, 2.1 GB/s write). Link bandwidths: HBM 900 GB/s, DRAM 170 GB/s, NVLink 50 GB/s per link on Summit and 75 GB/s per link on Sierra, X-Bus (SMP) 64 GB/s, PCIe Gen4 16 GB/s, EDR IB 12.5 GB/s per port.]
                        Summit          Sierra
Nodes                   4,608           4,320
System memory per node  512 GB          256 GB
GPUs per node           6               4
NVLink BW per link      50 GB/s         75 GB/s
Topology                1:1 Fat Tree    2:1 Fat Tree

(The NVLink per-link difference is worked out below.)
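For context, a worked check on the per-link numbers, assuming the usual 25 GB/s-per-direction rating of an NVLink 2.0 brick (six bricks per V100 and per POWER9 socket): Summit groups two bricks per CPU-GPU or GPU-GPU connection, while Sierra groups three.

\[
  \text{Summit: } 2 \times 25\,\mathrm{GB/s} = 50\,\mathrm{GB/s\ per\ link},
  \qquad
  \text{Sierra: } 3 \times 25\,\mathrm{GB/s} = 75\,\mathrm{GB/s\ per\ link}
\]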
Multi-Host leads to four Virtual Ports
[Diagram: each POWER9 socket reaches the shared Mellanox HCA over its own 16 GB/s PCIe Gen4 link, and the sockets are joined by a 64 GB/s X-Bus; the HCA's two physical EDR ports P0 and P1 (12.5 GB/s each) appear to the node as virtual ports V0-V3.]
• Both sockets enumerate the PCIe bus and see two ports
• Linux lists four virtual ports
  – V0 and V2 → P0
  – V1 and V3 → P1
• The HCA has a total of 25 GB/s
• Each socket has 16 GB/s to the HCA
  – A single socket cannot drive 25 GB/s using only its local PCIe connection (see the worked bound below)
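A quick worked bound on those numbers (arithmetic only, not a measurement):

\[
  2 \times 12.5\,\mathrm{GB/s} = 25\,\mathrm{GB/s}\ \text{(both EDR ports)},
  \qquad
  \min(16\,\mathrm{GB/s},\ 25\,\mathrm{GB/s}) = 16\,\mathrm{GB/s}\ \text{(one socket over its local PCIe link alone)}
\]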
[Diagram (repeated on the following port-policy slides): the four port-selection policies (03, 0123, 0132, 0330), showing which virtual ports the ranks on socket 0 and socket 1 use and the 16 GB/s PCIe and 64 GB/s X-Bus paths each policy exercises.]
Using Virtual Ports – Default (no striping)
• Spectrum MPI defaults to no striping
• With process affinity, socket 0 → V0 and socket 1 → V3
• We call this policy 03
Using Virtual Ports – Striping
• Spectrum MPI stripes messages larger than 64 KiB across the virtual ports
• Messages ≤ 64 KiB are sent over the same physical port (P0)
• We call this policy 0123
Using Virtual Ports – Striping (improved)
• Spectrum MPI does not allow 0132
  – IBM provided a special library to enable it
• Messages ≤ 64 KiB are sent over different physical ports
Using Virtual Ports – Striping over X-Bus
• Allows a single process to achieve 25 GB/s (see the sketch below)
  – At the cost of sending 50% of each message over the SMP bus
• We call this policy 0330
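A sketch of why 0330 can reach 25 GB/s from a single process, assuming each large message is split evenly across the two paths:

\[
  \underbrace{12.5\,\mathrm{GB/s}}_{\text{local PCIe} \rightarrow P0}
  \;+\;
  \underbrace{12.5\,\mathrm{GB/s}}_{\text{X-Bus} \rightarrow \text{remote PCIe} \rightarrow P1}
  \;=\; 25\,\mathrm{GB/s}
\]

Each half is limited by its 12.5 GB/s EDR port rather than by the 16 GB/s PCIe or 64 GB/s X-Bus links it traverses.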
Mellanox added several new features
• Adaptive Routing (Switch-IB 2)
  – Improves congestion management
• SHARP (Switch-IB 2)
  – Scalable Hierarchical Aggregation and Reduction Protocol
  – Switch-based collectives up to ~2 KB
  – Barrier, Broadcast, Reductions
• Hardware Tag Matching in the HCA (ConnectX-5)
  – Offloads small MPI two-sided messages
  – Offloads rendezvous progression
• NVMe-oF™ Offload (ConnectX-5) (not reviewed in this paper)
Adaptive Routing (A/R) helps significantly
• Tested using LLNL's MPIGraph (a minimal probe in this spirit is sketched below)
• Doubles effective bandwidth
• Reduces variability
[Figure: MPIGraph bandwidth distributions on Summit and Sierra with and without A/R; A/R roughly doubles the effective bandwidth (annotated values: 2.3 GB/s and 13.4 GB/s).]
• Some Summit runs had a few ~300 MB/s measurements out of the ~20M measurements
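For reference, a minimal point-to-point bandwidth probe in the spirit of mpiGraph. This is not LLNL's benchmark; the 4 MiB message size, iteration count, and shifted pairing are illustrative, and an even number of ranks is assumed.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (4 * 1024 * 1024)   /* 4 MiB per message (illustrative) */
#define ITERS     100

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    char *buf = malloc(MSG_BYTES);
    /* Pair rank r with rank (r + size/2) % size; mpiGraph itself sweeps
     * over all source/destination pairs, this sketch uses one shifted pair. */
    int peer = (rank + size / 2) % size;

    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++) {
        if (rank < size / 2)
            MPI_Send(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD);
        else
            MPI_Recv(buf, MSG_BYTES, MPI_BYTE, peer, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
    }
    double secs = MPI_Wtime() - t0;

    if (rank < size / 2)
        printf("rank %d -> %d: %.2f GB/s\n", rank, peer,
               (double)MSG_BYTES * ITERS / secs / 1e9);

    free(buf);
    MPI_Finalize();
    return 0;
}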
SHARP helps microbenchmarks…
• Dramatically reduces latency
• Scales much better than Spectrum MPI or Open MPI software-based collectives (a minimal timing loop in this style is sketched below)
[Figure: Barrier time (µs) vs. node count (8-512) for Open MPI, IBM Spectrum MPI, and SHARP™; SHARP is annotated as 72% lower.]
[Figure: Allreduce time (µs) vs. node count (8-2048) for Open MPI, IBM Spectrum MPI, and SHARP™; SHARP is annotated as 79% lower.]
From "The Design, Deployment, and Evaluation of the CORAL Pre-Exascale Systems", SC18
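A minimal sketch of this style of collective latency measurement (a generic timing loop, not the harness used for the figures above; warm-up and iteration counts are illustrative):

#include <mpi.h>
#include <stdio.h>

#define WARMUP 100
#define ITERS  1000

int main(int argc, char **argv)
{
    int rank;
    double in = 1.0, out = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Warm up, then time ITERS back-to-back barriers. */
    for (int i = 0; i < WARMUP; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Barrier(MPI_COMM_WORLD);
    double barrier_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    /* Same loop for an 8-byte allreduce, the message range SHARP targets. */
    t0 = MPI_Wtime();
    for (int i = 0; i < ITERS; i++)
        MPI_Allreduce(&in, &out, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
    double allreduce_us = (MPI_Wtime() - t0) / ITERS * 1e6;

    if (rank == 0)
        printf("Barrier: %.2f us  Allreduce(8B): %.2f us\n",
               barrier_us, allreduce_us);

    MPI_Finalize();
    return 0;
}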
But SHARP has mixed results with real apps
• AMG on Sierra (4 GPUs) and Summit (6 GPUs), 10 runs each
  – With SHARP, performance is slower and erratic
• Nekbone: 1-2% slower at ≤256 nodes, 1-2% faster at 512 nodes, 1-2% variability
• A 2018 Gordon Bell finalist, Gamera/Mothra, saw major improvement
[Figure: AMG figure of merit (higher is better) for 10 runs on 256 Summit nodes (default vs. SHARP) and on 128 nodes of Sierra and Summit (4 GPU and 6 GPU, base vs. SHARP); the SHARP runs show high variability and lower performance.]
Hardware Tag Matching – Latency is mixed
[Figure: ping-pong latency (µs) vs. message size (0 B to 8 KiB) for the base MPI and hardware tag matching (HW TM).]
• Latency is no better, or slightly higher, for small messages
• Slightly lower above 512 bytes
• What about overhead for rendezvous progression?
Hardware Tag Matching – Provides excellent overlap
[Figure: Sandia MPI overhead benchmark, maximum Send/Recv overhead (µs) for base PAMI, HW TM, and a progress thread.]
• Significantly better than base (a sketch of this kind of overlap probe follows below)
• Enabling a progress thread provides similar overlap at the cost of using another core
• No benefit for AMG, HACC, or UMT compared to base
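A minimal sketch of how overlap can be probed (in the spirit of, but not identical to, the Sandia overhead benchmark; the message size and work loop are illustrative, and only ranks 0 and 1 participate):

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define MSG_BYTES (1024 * 1024)   /* illustrative size, large enough for the rendezvous path */

/* Host-side busy work whose wall time we can compare with and without a
 * message in flight. */
static double busy_work(long n)
{
    double s = 0.0;
    for (long i = 1; i <= n; i++)
        s += 1.0 / (double)i;
    return s;
}

int main(int argc, char **argv)
{
    int rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank > 1) {               /* extra ranks just exit */
        MPI_Finalize();
        return 0;
    }

    char *buf = malloc(MSG_BYTES);
    MPI_Request req;

    if (rank == 0)
        MPI_Isend(buf, MSG_BYTES, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &req);
    else
        MPI_Irecv(buf, MSG_BYTES, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &req);

    /* If the HCA (or a progress thread) advances the rendezvous, this work
     * overlaps the transfer; otherwise the transfer waits for MPI_Wait and
     * the total time grows by roughly the full message time. */
    double t0 = MPI_Wtime();
    double sink = busy_work(50L * 1000 * 1000);
    double work_s = MPI_Wtime() - t0;

    MPI_Wait(&req, MPI_STATUS_IGNORE);
    double total_s = MPI_Wtime() - t0;

    printf("rank %d: work %.3f s, work+wait %.3f s (sink=%g)\n",
           rank, work_s, total_s, sink);

    free(buf);
    MPI_Finalize();
    return 0;
}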
Congestion management and Tail Latency
• Ran the new GPCNeT benchmark on both systems
  – Global Performance and Congestion Network Test
  – https://xgitlab.cels.anl.gov/networkbench/GPCNET
  – Measures representative workloads in isolation and with congestion
• Focused on the congestion test
• "Victim" patterns (20% of the allocation)
  – Random-Ring 8 B latency, 2-sided Random-Ring 1 MB bandwidth, 8 B Allreduce
• Congestors (80% of the allocation)
  – 2-sided incast, 2-sided broadcast, 1-sided incast, 1-sided broadcast, all-to-all
• Reports the ratio of performance degradation (latency increase, bandwidth decrease); one plausible formalization is written out below
  – Tested striping policies, with and without A/R, and tapering
  – For more GPCNeT details, there is a GPCNeT talk at 1:30 pm in 401-404
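One way to write out that degradation metric (our reading; the GPCNeT repository above is the authoritative definition):

\[
  \text{latency ratio} = \frac{L_{\text{congested}}}{L_{\text{isolated}}},
  \qquad
  \text{bandwidth ratio} = \frac{BW_{\text{isolated}}}{BW_{\text{congested}}}
\]

so that a larger ratio means more degradation in both cases.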
Congestion Management – Striping Policies
• Tested multiple policies with A/R on both systems using ~97% of the nodes
  – 03, 0123, 0132, 0330
• Ran 10 iterations of each policy
  – 03 and 0330 were clearly superior
  – The remaining results focus on these two policies
Congestion Management – Isolation
Mean latencies:
• Very consistent
• Policy does not matter
• Tapering slightly increases latency
99th percentile (tail) latencies:
• Very consistent, except Summit had one outlier
• Policy does not matter
• Tapering slightly increases latency
Congestion Management – Congestion
Mean latencies:
• High variability – 35-50% CoV
• Policy matters
• 0330 is 65-78% lower
• Tapering increases mean latency by 30%
99th percentile (tail) latencies:
• High variability – 24-40% CoV
• Policy matters
• 0330 is 62-73% lower
• 0330 helps reduce the impact of tapering (only 4% more)
Congestion Management – with and without A/R
• Tested with the default policy 03
• 20 runs each with and without A/R
• Using A/R adds latency
  – 99th percentile (tail) latency is 76% higher with A/R (212.3 µs vs 374.8 µs)
• But it helps bandwidth
  – 99th percentile (tail) BW is 32% higher with A/R
  – Improves all-to-all and I/O performance
• Summit's GPFS can only achieve its performance targets with A/R
  – Both labs use A/R by default for MPI and GPFS traffic
Comments on GPCNeT
• The goal is to allow comparisons between interconnects
  – Provides a ratio of performance degradation for latency and bandwidth, for both the mean and the 99th percentile
• Comparisons are only valid for systems of the same scale
  – It is not valid to compare small versus large systems
• The congestion tests are only valid when run at full scale
  – Small allocations may be so spread out that the congestors cannot affect the victims
• Need to run many congestion tests
  – Large variability in results – a single run is meaningless
  – Coefficient of variation (StdDev/Mean) was 20-50% (defined below)
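For reference, the coefficient of variation used here is the standard ratio:

\[
  \mathrm{CoV} \;=\; \frac{\sigma}{\mu} \;=\; \frac{\text{standard deviation across runs}}{\text{mean across runs}}
\]

At a CoV of 20-50%, a single run can easily sit tens of percent away from the mean, which is why many iterations are needed.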
Tapering
• Tapering allowed LLNL to add 6-8% more nodes
• Tested QUDA, SW4, and Ares
• Only QUDA showed sensitivity
  – Limit to a single rack to avoid tapering
  – https://github.com/lattice/quda/
• Other apps were sensitive to NVLink bandwidth
[Figure: per-application performance distributions; farther right on the x-axis is better, and tighter groupings are better.]
Conclusions
• Both labs are happy with their interconnects
  – Both achieve a high percentage of peak bisection bandwidth
• Adaptive Routing is a must
• MPI offloads (Tag Matching & SHARP)
  – Perform well on MPI micro-benchmarks
  – Performance varies for proxy applications and production codes
  – Co-design efforts continue, to improve them for a wider variety of workloads
• Congestion and tail latency
  – EDR does a good job
  – Port policy plays a major role on IBM's AC922
• Tapering can make sense – know your workload
Questions?
• This work was performed under the auspices of the U.S. DOE by the Oak Ridge Leadership Computing Facility at ORNL under contract DE-AC05-00OR22725.
• Prepared by LLNL under Contract DE-AC52-07NA27344. LLNL-CONF-772398.