VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global...

26
1 www.esi-group.com Copyright © ESI Group, 2019. All rights reserved. VPS and Scalability on ARM Arm HPC User’s Group at ISC’19 Greg Samonds and Thorsten Queckboerner ESI Group

Transcript of VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global...

Page 1: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

1www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

VPS and Scalability on ARM

Arm HPC User’s Group at ISC’19

Greg Samonds and Thorsten Queckboerner

ESI Group

Page 2: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

2www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Agenda

VPS and Scalability on ARM

• Overview and introduction

• VPS – Major characteristics and typical applications

• Tuning activities – A historical overview

• HPC relevant code characteristics

• VPS explicit on ARM

• Compiler version comparison – Runtime and compilation speed

• Compiler options and runtime

• Scalability VPS explicit on ARM

Page 3: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

3www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Agenda

VPS and Scalability on ARM

• Overview and introduction

• VPS – Major characteristics and typical applications

• Tuning activities – A historical overview

• HPC relevant code characteristics

• VPS explicit on ARM

• Compiler version comparison – Runtime and compilation speed

• Compiler options and runtime

• Scalability VPS explicit on ARM

Page 4: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

4www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

What our Customers are doing

VPS – Virtual Performance Solution

• Characteristics of VPS (formerly also called PAM-CRASH)

• Structural mechanics FE solver

• Explicit solution scheme

• Implicit solution schemes

• General purpose Code with a strong focus on Automotive Virtual Prototyping

• Typical Customer applications

• Crash and Safety simulation

• Dummies and Human models

• Airbag unfolding with gas dynamics (embedded FPM CFD solver)

• Static load cases

• Implicit dynamics, NVH, acoustics

Page 5: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

5www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Virtual Prototyping - Crash Analysis

VPS – Virtual Performance Solution

Page 6: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

6www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

VPS – Virtual Performance SolutionVirtual Prototyping – Safety and Dummies

• Objective

• Virtual assessment of injury risks and safety ratings

• Realistic representation of deformation, injury, kinematics and sensor readings needed

• Dummy development and calibration

• Set up of experimental program for calibration of numerical model

• Validation of material, component and full body

Full body barrier tests

Source: http://www.mynrma.com.au/motoring-services/reviews/ancap/small-cars/mitsubishi-mirage-2013-ancap.htm

Injury risk example

Source: http://www.euroncap.com/en

Safety rating example

Source: https://www.safetywissen.com

Test setup example

Material tests

Component tests Full body pendulum tests

Page 7: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

7www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Virtual Prototyping - Airbag Modelling

VPS – Virtual Performance Solution

Driver airbag unfolding out of a steering wheel

Driver airbag unfolding out of a steering wheel (cut view)FPM velocity vectors

Page 8: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

8www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Virtual Prototyping - Statics & Dynamics & Acoustics

VPS – Virtual Performance Solution

• ResultsStatic loading

Displacements Stresses Torsion angle along vehicle length

Bending Test

Deformation/Strength

Indentation Test(Oil-Canning)

Buckling/ Remaining deformation

Misuse

Forces of retainer clipsStresses/

Remaining deformation

Creep

Remanent deformation after climate chamber

t

T

RT

90°C

Page 9: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

9www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Virtual Prototyping - Statics & Dynamics & Acoustics

VPS – Virtual Performance Solution

DYNAMICS

• Objective

• Improve structural dynamic strength

• Increase comfort during operation

• Noise

• Vibrations

• Methodology

• Linear dynamics

• Structural vibrations (FE)

• Acoustics in/outside of cavity (FE/BEM)

• Coupled vibro-acoustics (FE-FE, FE-BEM)

Eigenfrequencies/-modes

Resonance frequencies

Local dynamic stiffness

Frequency dependent stiffness

Coupled fluid-structure-trim

Interior/exterior sound pressure level Transfer path analysis

Page 10: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

10www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Usage at Customers/HPC-Centers/Partners

VPS – Virtual Performance Solution

• Automotive OEM: HPC usage (12 months):

• VPS-explicit 57 %

• openfoam 28 %

• foamapps 6 %

• starccm 4 %

• abaqus 2 %

• mentalray < 1 %

• starcd < 1 %

• nastran < 1 %

• vectris < 1 %

• aura < 1 %

• caebench < 1 %

Page 11: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

11www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Agenda

VPS and Scalability on ARM

• Overview and introduction

• VPS – Major characteristics and typical applications

• Tuning activities – A historical overview

• HPC relevant code characteristics

• VPS explicit on ARM

• Compiler version comparison – Runtime and compilation speed

• Compiler options and runtime

• Scalability VPS explicit on ARM

Page 12: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

12www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Tuning Activities

VPS – Virtual Performance Solution

• Constantly increasing requirements with regard to computing load• Model size increase

• More load cases

• New applications

• …

• Therefore, code tuning is daily business• Since more then 30 years

• On different areas• Sequential

• Vector

• SMP - OpenMP

• DMP - MPI

• With all major HPC vendors/players (alphabetic order)• CRAY, COMPAQ, Fujitsu, IBM, HP, SGI, SUN, NEC …

• AMD, ARM, INTEL …

Page 13: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

13www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Tuning Activities

VPS – Virtual Performance Solution

• Sequential tuning• Support new chip features

• Vector computer

• AVX units (x86)

• Code modifications + Compiler features

• Memory access

• Tuning on cache usage

• Direct access and data alignment

• Algorithms• Faster algorithms

• But keep memory consumption in mind

Page 14: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

14www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Tuning Activities

VPS – Virtual Performance Solution

• SMP tuning (OpenMP)• Load distribution

• Thread data grain size

• Thread load balance

• Reduce/avoid locks

• Thread and NUMA aware memory management

• DMP tuning (MPI)• Minimize number of communications: Interconnect latency

• Minimize size of messages: Interconnect bandwidth

• Minimize load unbalance• Static load unbalance (Domain decomposition)

• Dynamic load unbalance (Open issue)

Page 15: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

15www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

1980 20001990 2010

Simulation: ResearchCode: SequentialHardware: Vector (Cray, NEC)

Simulation: Industrial, verificationCode: SMP (openMP)Hardware: Risc (SGI, HP, IBM)

Simulation: PredictiveCode: DMP (MPI)Hardware: lean x86 cluster

Simulation: Daily businessCode: hybrid SMP-DMPHardware: fat x86 cluster

week

day

hour

VPS – Virtual Performance Solution

Evolution: Simulation, Software and Hardware evolution (explicit)

2020

Simulation: Daily businessCode: hybrid SMP-DMPHardware: many core cluster

Page 16: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

16www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Agenda

VPS and Scalability on ARM

• Overview and introduction

• VPS – Major characteristics and typical applications

• Tuning activities – A historical overview

• HPC relevant code characteristics

• VPS explicit on ARM

• Compiler version comparison – Runtime and compilation speed

• Compiler options and runtime

• Scalability VPS explicit on ARM

Page 17: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

17www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Portability

VPS – Virtual Performance Solution

• Language:• Std. Fortran (90 or higher)

• Std. C

• Size (Solver core only):• 8000 files

• 2.5 Million lines of code

• Supports all major OS• Unix, Linux, Windows

• Supports all major architecture• ia32, em64t, amd64, ia64, mips, pa-risc, ibm-power, ARM (work in progress)

• MPI Supports by dynamic load module• One executable – multiple MPI runtimes

• All major interconnect• Ethernet, Myrinet, Infiniband, Omnipath

Page 18: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

18www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Performance characteristics

VPS – Virtual Performance Solution

• Main code sections (for typical explicit crash/safety simulation):• Internal forces (FE) 50-65%

• Cache friendly (scatter-gather to local arrays)

• Many operations with few global data

• Mainly direct addressing on static data

• Scales very well for SMP and DMP

• Explicit time integration 15-20%• Few and cheap operations on large global arrays

• Direct addressing on static data

• Most demanding option for memory bandwidth

• Scalability in DMP and SMP limited by hardware

• Contact 15-40%• Indirect addressing on unstructured dynamic data

• Difficult for SMP, very difficult for DMP

• Scales reasonable well for DMP and SMP

Page 19: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

19www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Performance characteristics

VPS – Virtual Performance Solution

• A Crash/Safe simulation model is a big challenge for HPC and parallelization:

• Very feature rich and inhomogeneous input definition

• All kind of elements

• All kind of material models

• Complicated contact

• Particle methods (FPM, SPH)

• Exotic safety options (slipring, airbag, MBSYS, …)

-> Flat profile – no real hotspots

• Big risk to hit

• Load unbalance (dynamic or static)

• Serial code (keep in mind Amdahl)

Page 20: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

21www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Agenda

VPS and Scalability on ARM

• Overview and introduction

• VPS – Major characteristics and typical applications

• Tuning activities – A historical overview

• HPC relevant code characteristics

• VPS explicit on ARM

• Compiler version comparison – Runtime and compilation speed

• Compiler options and runtime

• Scalability

Page 21: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

22www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Testing the latest arm64 compilers, options, and scalability

VPS Arm Porting and Compilers Study

• Compilers: ArmCC-19.0, ArmCC-19.1, ArmCC-19.2, GCC-8.3.0, GCC-9.1.0

• Options: Flags recommended by PRACE Best Practices Guide, flags recommended by hardware provider

• Scalability: Up to 96 cores

• Test cases: Neon and Lupo explicit crash simulation

Page 22: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

23www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Machine configuration and compiler options

Test Setup

Configuration Optionsarmcc-19.0-pflag "-march=armv8.2-a -mcpu=cortex-a72 -O3 -fcray-pointer -fopenmp -fomit-frame-pointer"

armcc-19.0-hflag"-march=armv8.2-a+lse+crc+crypto+fp+simd -mcpu=cortex-a72 -O3 -ffp-contract=fast -funroll-loops -fcray-pointer -fopenmp -fomit-frame-pointer"

armcc-19.1-pflag "-march=armv8.2-a -mcpu=tsv110 -O3 -fcray-pointer -fopenmp -fomit-frame-pointer"

armcc-19.1-hflag"-march=armv8.2-a+lse+crc+crypto+fp+simd -mcpu=tsv110 -O3 -ffp-contract=fast -funroll-loops -fcray-pointer -fopenmp -fomit-frame-pointer"

armcc-19.2-pflag "-march=armv8.2-a -mcpu=thunderx2t99 -O3 -fcray-pointer -fopenmp -fomit-frame-pointer"

armcc-19.2-hflag"-march=armv8.2-a+lse+crc+crypto+fp+simd -mcpu=thunderx2t99 -O3 -ffp-contract=fast -funroll-loops -fcray-pointer -fopenmp -fomit-frame-pointer"

gcc-8.3.0-pflag"-march=armv8.2-a+crc+simd -mtune=cortex-a72 -O3 -fcray-pointer -fopenmp -floop-optimize -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer"

gcc-8.3.0-hflag "-march=armv8.2-a+crc+simd -mtune=cortex-a72 -O3 -fcray-pointer -fopenmp -fbacktrace -ffpe-summary=none"

gcc-9.1.0-pflag"-march=armv8.2-a -mtune=tsv110 -O3 -fcray-pointer -fopenmp -floop-optimize -falign-loops -falign-labels -falign-functions -falign-jumps -fomit-frame-pointer"

gcc-9.1.0-hflag "-march=armv8.2-a+crc+simd -mtune=tsv110 -O3 -fcray-pointer -fopenmp -fbacktrace -ffpe-summary=none"

CPU RAM OS / Kernel MPI Implementation VPS Version

2 X Hi1620 48p (Kunpeng 920)

256GB DDR4 2666MHz

EL 7.6 / 4.14.0-115 HPC-X 2.3.0 (OpenMPI 4.1.0a1)

2018

Page 23: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

24www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Compiled executable runtime comparison, 96p, 5 run average

Test Results

249,94251,42

227,96229,36

226,52 227,12

251,7 251,12

244,76 244,96

210

215

220

225

230

235

240

245

250

255

Elap

sed

Tim

e (s

)

Configuration

VPS Lupo Simulation

163,98 164,38

153,06 153,48

149,18

151,36

154,16 154,72153,88

152,76

140

145

150

155

160

165

170

Elap

sed

Tim

e (s

)

Configuration

VPS Neon Simulation

Page 24: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

25www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Scalability and compilation speed

Test Results

76,7767359980,8189224

0

10

20

30

40

50

60

70

80

90

100

Neon Lupo

Scal

abili

ty %

Case

VPS 48p to 96p Scalability

350

322 318

167 173

0

50

100

150

200

250

300

350

400

ArmCC-19.0 ArmCC-19.1 ArmCC-19.2 GCC-8.3.0 GCC-9.1.0

Tim

e (s

)

Compiler

VPS Compilation Time

Page 25: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

26www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved.

Observations from experiment

Summary

• Meaningful improvement in runtime with ArmCC-19.1/19.2 compared to 19.0

• ArmCC 19.2 executable was faster than GCC 9.1.0, with the margin depending on test case (Neon ~2%, Lupo ~8%)

• Significantly faster compilation speed with GCC compared to ArmCC

• Very small difference in runtime across tested compiler option sets, usually ~1% (within scatter)

• Scaling is limited by our machine’s memory bandwidth. Previously observed ~85% (+10%) scaling up to 96 procs with Neon case on machine with faster RAM

• Promising overall performance, Huawei machine produced similar runtime to dual socket Xeon Gold 6150 machine

Page 26: VPS and Scalability on ARM · •VPS –Major ... • Few and cheap operations on large global arrays • Direct addressing on static data • Most demanding option for memory bandwidth

27www.esi-group.com

Copyright © ESI Group, 2019. All rights reserved. Arm HPC User’s Group at ISC 19 Greg Samonds and Thorsten Queckboerner

Thanks