Scaling Results From Isambard: the First Generation of Arm-based Supercomputers
Prof Simon McIntosh-Smith
Isambard PI
University of Bristol / GW4 Alliance

Isambard system specification
• 10,752 Armv8 cores (168n x 2s x 32c)
  • Cavium ThunderX2 32core 2.1→2.5GHz
• Cray XC50 ‘Scout’ form factor
• High-speed Aries interconnect
• Cray HPC optimised software stack
  • CCE, Cray MPI, math libraries, CrayPAT, …
• Phase 2 (the Arm part):
  • Delivered Oct 22nd, handed over Oct 29th
  • Accepted Nov 9th
  • Upgrade to final B2 TX2 silicon, firmware, CPE completed March 15th 2019
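
The core count in the specification follows from the shorthand 168n x 2s x 32c (nodes, sockets per node, cores per socket). A minimal sketch of that arithmetic, with variable names chosen here purely for illustration:

```python
# Isambard Phase 2 shorthand from the slide: 168n x 2s x 32c
nodes = 168             # compute nodes
sockets_per_node = 2    # ThunderX2 sockets in each node
cores_per_socket = 32   # Armv8 cores in each ThunderX2

cores_per_node = sockets_per_node * cores_per_socket
total_cores = nodes * cores_per_node

print(f"cores per node: {cores_per_node}")
print(f"total cores:    {total_cores:,}")
```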

http://gw4.ac.uk/isambard/

Cavium ThunderX2, a seriously beefy CPU
• 32 cores at up to 2.5GHz
• Each core is 4-way superscalar, Out-of-Order
• 32KB L1, 256KB L2 per core
• Shared 32MB L3
• Dual 128-bit wide NEON vectors
  • Compared to Skylake’s 512-bit vectors, and Broadwell’s 256-bit vectors
• 8 channels of 2666MHz DDR4
  • Compared to 6 channels on Skylake, 4 channels on Broadwell
  • AMD’s EPYC also has 8 channels

Recap of Single Node results from CUG 2018

SKL 20c: Intel Skylake Gold 6148, $3,078 each
TX2 32c: Cavium ThunderX2, $1,795 each (near top-bin)

Processor          Cores    Clock speed (GHz)   TDP (Watts)   FP64 (TFLOP/s)   Bandwidth (GB/s)
Broadwell          2 × 22   2.2                 145           1.55             154
Skylake Gold       2 × 20   2.4                 150           3.07             256
Skylake Platinum   2 × 28   2.1                 165           3.76             256
ThunderX2          2 × 32   2.2                 175           1.13             320

TABLE 1 Hardware information (peak figures)
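
The peak FP64 figures in Table 1 can be approximately reproduced from the core counts, clock speeds and vector widths quoted on the slides. The sketch below is not the authors' calculation. It assumes two FMA-capable vector units per core on every processor (the slides state this only for ThunderX2, as dual 128-bit NEON) and counts each fused multiply-add as two FLOPs. The `Node` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    sockets: int
    cores_per_socket: int
    clock_ghz: float        # clock speed as listed in Table 1
    vector_bits: int        # SIMD width as quoted on the ThunderX2 slide
    vector_units: int = 2   # ASSUMPTION: two FMA-capable vector units per core

    def peak_fp64_tflops(self) -> float:
        lanes = self.vector_bits // 64                   # FP64 values per vector
        flops_per_cycle = self.vector_units * lanes * 2  # one FMA = 2 FLOPs
        cores = self.sockets * self.cores_per_socket
        return cores * self.clock_ghz * flops_per_cycle / 1000.0


nodes = [
    Node("Broadwell", 2, 22, 2.2, 256),
    Node("Skylake Gold", 2, 20, 2.4, 512),
    Node("Skylake Platinum", 2, 28, 2.1, 512),
    Node("ThunderX2", 2, 32, 2.2, 128),
]

for node in nodes:
    print(f"{node.name:<17}{node.peak_fp64_tflops():.2f} TFLOP/s")
```

Under these assumptions the results round to the FP64 column of Table 1, which shows how the narrow NEON vectors account for the lower ThunderX2 peak despite its higher core count.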

Figure 2 bar labels (absolute values; bar heights are relative figures of merit, normalized to Broadwell):

Metric                     Broadwell 22c   Skylake 28c   ThunderX2 32c
Cores                      44              56            64
TFLOP/s                    1.55            3.76          1.13
L1 bandwidth (agg. TB/s)   6.31            11.18         3.46
L2 bandwidth (agg. TB/s)   2.23            3.57          2.14
L3 bandwidth (agg. GB/s)   726             767.2         537.6
Memory bandwidth (GB/s)    131.2           214.9         253.4

FIGURE 2 Comparison of properties of Broadwell 22c, Skylake 28c and ThunderX2 32c. Results are normalized to Broadwell.
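
Figure 2 plots each property relative to Broadwell, while the bar labels carry the absolute values. A minimal sketch of that normalization, using the labelled values from the figure (the dictionary layout and the `normalize` helper are illustrative, not from the paper):

```python
# Absolute values read from the Figure 2 bar labels.
figure2 = {
    "Cores": {"Broadwell 22c": 44, "Skylake 28c": 56, "ThunderX2 32c": 64},
    "TFLOP/s": {"Broadwell 22c": 1.55, "Skylake 28c": 3.76, "ThunderX2 32c": 1.13},
    "L1 bandwidth (agg. TB/s)": {"Broadwell 22c": 6.31, "Skylake 28c": 11.18, "ThunderX2 32c": 3.46},
    "L2 bandwidth (agg. TB/s)": {"Broadwell 22c": 2.23, "Skylake 28c": 3.57, "ThunderX2 32c": 2.14},
    "L3 bandwidth (agg. GB/s)": {"Broadwell 22c": 726.0, "Skylake 28c": 767.2, "ThunderX2 32c": 537.6},
    "Memory bandwidth (GB/s)": {"Broadwell 22c": 131.2, "Skylake 28c": 214.9, "ThunderX2 32c": 253.4},
}


def normalize(values, baseline="Broadwell 22c"):
    """Divide every processor's value by the baseline processor's value."""
    base = values[baseline]
    return {cpu: value / base for cpu, value in values.items()}


relative = {metric: normalize(values) for metric, values in figure2.items()}

for metric, values in relative.items():
    row = ", ".join(f"{cpu}: {ratio:.2f}" for cpu, ratio in values.items())
    print(f"{metric:<26} {row}")
```

The Skylake memory bandwidth ratio comes out at about 1.64, in line with the 1.64× STREAM improvement over Broadwell quoted in the paper excerpt on this slide.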

that achieved the highest performance in each case was used in the results graphs displayed below. Likewise for the Intel processors, we used GCC 7, Intel 2018, and Cray CCE 8.5–8.7. Table 2 lists the compiler that achieved the highest performance for each benchmark in this study.

4.2 Mini-apps

Figure 3 compares the performance of our target platforms over a range of representative mini-applications.

STREAM: The STREAM benchmark measures the sustained memory bandwidth from the main memory. For the processors tested, the available memory bandwidth is essentially determined by the number of memory controllers. Intel Xeon Broadwell and Skylake processors have four and six memory controllers per socket, respectively. The Cavium ThunderX2 processor has eight memory controllers per socket. The results in Figure 3 show a clear trend that Skylake achieves a 1.64× improvement over Broadwell, which is to be expected, given Skylake’s faster memory speed

Benchmarking platforms

Previous single node performance results

Performance (normalized to Broadwell)

Benchmark        Broadwell 22c   Skylake 20c   Skylake 28c   ThunderX2 32c
CP2K             1               1.29          1.37          1.15
GROMACS          1               1.29          1.45          0.68
NAMD             1               0.98          1.21          1.16
NEMO             1               1.44          1.65          1.49
OpenFOAM         1               1.57          1.66          1.87
OpenSBLI         1               1.39          1.72          1.69
Unified Model    1               1.06          1.19          0.92
VASP             1               1.32          1.42          0.76
Geometric Mean   1               1.28          1.45          1.15

https://github.com/UoB-HPC/benchmarks
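
The geometric mean bars on this chart can be checked against the per-application bar labels. The sketch below assumes the labels map onto the applications in the order listed on the axis (CP2K, GROMACS, NAMD, NEMO, OpenFOAM, OpenSBLI, Unified Model, VASP); that mapping is inferred from the order of the extracted labels rather than stated on the slide.

```python
from statistics import geometric_mean

# Per-application performance normalized to Broadwell 22c, in axis order:
# CP2K, GROMACS, NAMD, NEMO, OpenFOAM, OpenSBLI, Unified Model, VASP
normalized = {
    "Skylake 20c": [1.29, 1.29, 0.98, 1.44, 1.57, 1.39, 1.06, 1.32],
    "Skylake 28c": [1.37, 1.45, 1.21, 1.65, 1.66, 1.72, 1.19, 1.42],
    "ThunderX2 32c": [1.15, 0.68, 1.16, 1.49, 1.87, 1.69, 0.92, 0.76],
}

# Geometric mean labels shown on the chart for the same processors.
reported = {"Skylake 20c": 1.28, "Skylake 28c": 1.45, "ThunderX2 32c": 1.15}

means = {cpu: geometric_mean(values) for cpu, values in normalized.items()}

for cpu, mean in means.items():
    print(f"{cpu:<14} computed {mean:.3f}   chart label {reported[cpu]:.2f}")
```

A geometric mean is the natural summary here because each value is a ratio against the Broadwell baseline, so a 2× speedup and a 0.5× slowdown cancel out.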

Scalability comparisons
• We’ve plotted results using ‘Scaling (parallel) efficiency’
• We’ve compared against two x86-based XC50 systems:
  • Horizon using Intel Skylake Gold 6148 20-core CPUs at 2.4GHz
  • Swan using Intel Skylake Platinum 8176 28-core CPUs at 2.1GHz
• Could only go up to 64 nodes on these systems, though we could have gone up to 164 on Isambard
• All the results are for strong scaling, except SNAP
• All of these systems use the same interconnect (Aries) and the same O/S and MPI library, so this is a good test of whether Arm-based ThunderX2 scales as well as x86
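
The slides do not spell out how scaling efficiency is computed. The sketch below uses the usual textbook definitions, which is an assumption about what the plots show: for strong scaling the total problem size is fixed, so the ideal runtime falls in proportion to the node count, and for weak scaling (presumably the SNAP case) the problem size per node is fixed, so the ideal runtime stays constant. The function names and the runtimes are hypothetical, not measured data.

```python
def strong_scaling_efficiency(t_base, n_base, t_n, n):
    """Fixed total problem size: ideal runtime shrinks by n_base / n."""
    return (t_base * n_base) / (t_n * n)


def weak_scaling_efficiency(t_base, t_n):
    """Fixed problem size per node: ideal runtime stays at t_base."""
    return t_base / t_n


# Hypothetical runtimes in seconds, keyed by node count (illustration only).
strong_runs = {1: 100.0, 2: 52.0, 4: 28.0, 8: 16.0}
weak_runs = {1: 100.0, 2: 104.0, 4: 111.0, 8: 125.0}

for nodes in strong_runs:
    strong = strong_scaling_efficiency(strong_runs[1], 1, strong_runs[nodes], nodes)
    weak = weak_scaling_efficiency(weak_runs[1], weak_runs[nodes])
    print(f"{nodes:>2} nodes: strong {strong:.2f}, weak {weak:.2f}")
```

Because efficiency is measured relative to each system's own baseline run, it isolates how well a code scales from how fast a single node is, which is why it suits a comparison of Arm and x86 systems sharing the same interconnect.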

CloverLeaf scaling – relative performance

CloverLeaf scaling – parallel efficiency

TeaLeaf scaling – relative performance

TeaLeaf scaling – parallel efficiency

SNAP scaling – relative performance

SNAP scaling – parallel efficiency

GROMACS scaling – relative performance

GROMACS scaling – parallel efficiency

NEMO scaling

Panels: parallel efficiency; relative performance

OpenFOAM scaling

Panels: parallel efficiency; relative performance

OpenSBLI scaling

Panels: parallel efficiency; relative performance

VASP scaling

Panels: parallel efficiency; relative performance

Which compilers were best in each case?

Isambard scaling summary
• Arm-based systems appear to scale just as well as x86 ones
• For certain codes that were compute-bound at low scale, these became network bound at ‘real’ scale, levelling the playing field
• We’re seeing a minor issue with scaling in two cases, appears to be related to MPI collectives – investigations are underway
• The software stack has been robust, reliable and high-quality (both the commercial and open source parts)
• Now have evidence that Arm-based systems are real alternatives for HPC, reintroducing much needed competition to the market

The Bristol HPC team doing this work
Tom Deakin, Andrei Poenaru, James Price

Also thanks go to:
• The Isambard project members: the GW4 Alliance, the Met Office, Arm, Marvell and Cray
• Cray for access to the Swan and Horizon x86 systems
• EPSRC for funding the project

For more information

Comparative Benchmarking of the First Generation of HPC-Optimised Arm Processors on Isambard
S. McIntosh-Smith, J. Price, T. Deakin and A. Poenaru, CUG 2018, Stockholm

http://uob-hpc.github.io/2018/05/23/CUG18.html

Bristol HPC group: https://uob-hpc.github.io/

Isambard: http://gw4.ac.uk/isambard/

Build and run scripts: https://github.com/UoB-HPC/benchmarks

Backup

Comparison of compilers on Arm