Porting LS-DYNA to ARM

Sheng Peng, LSTC

LSTC and LS-DYNA

● LSTC was founded in 1987 by John O. Hallquist to commercialize as LS-DYNA the public domain code that originated as DYNA3D. DYNA3D was developed at the Lawrence Livermore National Laboratory by LSTC’s founder, John O. Hallquist.

● LS-DYNA is a general-purpose finite element program capable of simulating complex real-world problems. It is used by the automobile, aerospace, construction, military, manufacturing, and bioengineering industries.

● LS-DYNA supports many hardware platforms and operating systems.

Processor             Operating System
AMD Opteron, EPYC     Linux
CRAY XD1              Linux
HP PA-8X00            HP-UX 11.11 and above
HP IA-64              HP-UX 11.22 and above
HP Opteron            Linux CP4000/XC
IBM Power 4/5 and 8   Linux, AIX
INTEL IA64            Linux
INTEL Xeon            Linux, Windows 64 bit
NEC SX6               Super-UX
SGI Mips              IRIX 6.5 X
SGI IA64              SUSE 9 w/Propack 4, RedHat w/Propack 3
SUN Sparc             5.8 and above
SUN x86-64            5.8 and above

Issues encountered

● Some preprocessor macros that are not strictly spec-compliant have to be corrected before armclang accepts them

● A bug in the OpenMPI 3.1.x vader (shared-memory) component coincided with our initial effort and caused some headaches. OpenMPI was also not as performant as Platform MPI or Intel MPI.

● Intel compilers are more permissive

● Code has been tuned for years targeting Intel’s platform

Benchmark problem: NEON Refined Revised

ARM Forge profile

● Flattish profile, no obvious hot spot

● To be taken with a large grain of salt...

Testing machine #1

● Thread(s) per core: 4

● Core(s) per socket: 28

● Socket(s): 2

● -O3 -fstack-arrays

● -mcpu=native
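The three topology fields above use the same names that `lscpu` prints, so the logical CPU count of the machine can be derived directly from them. A minimal Python sketch follows; the sample text only copies the values from this slide, and the helper names `parse_topology` and `logical_cpus` are hypothetical, introduced here for illustration.

```python
# Sketch: derive core and logical CPU counts from lscpu-style fields.
# The sample text copies the three topology values listed for testing machine #1.
LSCPU_SAMPLE = """\
Thread(s) per core:  4
Core(s) per socket:  28
Socket(s):           2
"""


def parse_topology(text):
    """Return a dict of the integer-valued 'key: value' lines in lscpu-style text."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip().isdigit():
            fields[key.strip()] = int(value.strip())
    return fields


def logical_cpus(fields):
    """Logical CPUs = threads per core x cores per socket x sockets."""
    return (fields["Thread(s) per core"]
            * fields["Core(s) per socket"]
            * fields["Socket(s)"])


topo = parse_topology(LSCPU_SAMPLE)
physical_cores = topo["Core(s) per socket"] * topo["Socket(s)"]
print(physical_cores, logical_cpus(topo))  # 56 224
```

With 56 physical cores and 224 hardware threads, the largest configuration reported for this machine (32 ranks with 4 cpus per rank, 128 cpus in total) still fits within the available hardware threads.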

Testing machine #1: preliminary numbers

MPI Ranks   w/o OpenMP   w/ OpenMP (4 cpus per rank)
2           7887         5690
4           3876         2870
8           3520 (?)     1549
16          1935         891
32          1070         524

● -Ofast on par with -O3
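To make the effect of OpenMP in the machine #1 table easier to read, the ratio between the two columns can be computed per rank count. This is a rough Python sketch, assuming the table values are elapsed times (the slide does not state the unit), so that a lower number is better; the function name `openmp_gain` is hypothetical.

```python
# Numbers copied from the "Testing machine #1: preliminary numbers" table.
# Assumption: values are elapsed times, so lower is better.
# rank count -> (without OpenMP, with OpenMP using 4 cpus per rank)
MACHINE1 = {
    2: (7887, 5690),
    4: (3876, 2870),
    8: (3520, 1549),   # the pure MPI value is flagged "(?)" in the source
    16: (1935, 891),
    32: (1070, 524),
}


def openmp_gain(table):
    """Ratio of the non-OpenMP value to the OpenMP value for each rank count."""
    return {ranks: pure / hybrid for ranks, (pure, hybrid) in table.items()}


for ranks, gain in openmp_gain(MACHINE1).items():
    print(f"{ranks:>3} ranks: {gain:.2f}x")
```

Under that assumption the hybrid runs come out roughly 1.4x ahead at 2 and 4 ranks and about 2x ahead from 8 ranks up. Note that each hybrid run uses four times as many cpus as the pure MPI run in the same row, so this is not an equal-resource comparison.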

Testing machine #2

● Thread(s) per core: 1

● Core(s) per socket: 16

● Socket(s): 4

● -O3 -fstack-arrays

● -mcpu=native

Testing machine #2: preliminary numbers

MPI Ranks   w/o OpenMP   w/ OpenMP (4 cpus per rank)
4           2477         1492
8           1409         819
16          987          505
32          567          N/A
64          364          N/A
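The pure MPI column for machine #2 spans 4 to 64 ranks, which is enough to estimate strong-scaling behavior. The Python sketch below computes speedup and parallel efficiency relative to the 4-rank run. It again assumes the values are elapsed times (the unit is not given on the slide), and `scaling` is a hypothetical helper name.

```python
# Numbers copied from the "w/o OpenMP" column of the machine #2 table.
# Assumption: values are elapsed times, so lower is better.
PURE_MPI = {4: 2477, 8: 1409, 16: 987, 32: 567, 64: 364}


def scaling(times):
    """Speedup and parallel efficiency relative to the smallest rank count."""
    base_ranks = min(times)
    base_time = times[base_ranks]
    result = {}
    for ranks in sorted(times):
        speedup = base_time / times[ranks]
        ideal = ranks / base_ranks          # perfect scaling from the baseline
        result[ranks] = (speedup, speedup / ideal)
    return result


for ranks, (speedup, eff) in scaling(PURE_MPI).items():
    print(f"{ranks:>3} ranks: speedup {speedup:.2f}, efficiency {eff:.0%}")
```

Read this way, going from 4 to 64 ranks gives a speedup of about 6.8 against an ideal of 16, or roughly 43% parallel efficiency. These are preliminary numbers, so the result is indicative only.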

Some thoughts on porting to ARM

● More cores & more threads

● Great memory bandwidth

● Socket vs. socket

● TOC comparison

Thank you!

Sheng Peng, LSTC