Timing-agnostic SVE Analysis of geophysics kernel

© 2020 Arm Limited (or its affiliates)

Background on geophysical stencils

• Coupling of HPC and numerical-methods approaches: Mercerat et al. (2009), Breuer et al. (2016).

• Performance characterization and projection: C. Andreolli et al. (2011), R. Cruz & M. Araya-Polo (2011).

• Impact of absorbing boundary conditions and integration in complex applications: M. Christen et al. (2011).

Sync/Comms-avoiding strategies

• Blocking (mainly spatial), with growing traction for temporal algorithms

• Runtime systems (e.g. task-based)

Refs:
✓ D. Orozco & G. Gao (2009), K. Datta et al. (2009), Malas et al. (2015).
✓ V. Martinez et al. (2015), L. Boillot et al. (2016).

Leveraging hardware features

• Heterogeneous architectures

• SIMD

• Mixed precision (converged architectures)

Refs:
✓ P. Micikevicius (2009), R. Abdelkhalak et al. (2012), I. Said et al. (2018).
✓ G. Fabien-Ouellet (2018).

Productivity

• High-level, DSL-like approaches (e.g. Devito, Patus, Yask)

• Directive-based, source-to-source

• Auto-tuning, machine learning

Refs:
✓ Christen et al. (2012), C. Yount et al. (2016), M. Louboutin et al. (2016).
✓ B. Videau et al. (2018).

Arm is ubiquitous

• Marvell (Cavium)

• Ampere (X-Gene): Altra, eMAG

• Fujitsu

• Huawei (HiSilicon): 1616

• Amazon (Annapurna)

• EPI / SiPearl: Rhea

• Other: Phytium

[Figure: chip block diagram with four CMGs, each with cores (C) and its own HBM2 memory, connected through a NOC to a PCIe controller and a Tofu interface. CMG: Core Memory Group; NOC: Network on Chip.]

Arm is ubiquitous

• Four NUMA Regions

• SVE-capable (512-bit)

• Various interfaces (more hierarchy)

• Diversity of accelerators

Theoretical roofline

[Figure: roofline plot against Arithmetic Intensity (Flop/Byte), logarithmic axes.]

Data from publicly available performance information.
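As a reminder of how a theoretical roofline is built, here is a minimal Python sketch of the usual bound, attainable performance = min(peak compute, memory bandwidth × arithmetic intensity). The peak and bandwidth values are hypothetical placeholders, not figures from this slide.

```python
def attainable_gflops(peak_gflops, bandwidth_gbs, intensity):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical machine parameters (assumptions for illustration only).
PEAK_GFLOPS = 2048.0    # peak floating-point rate, GFlop/s
BANDWIDTH_GBS = 512.0   # sustained memory bandwidth, GB/s

# Arithmetic intensity at which the two roofs meet.
ridge_point = PEAK_GFLOPS / BANDWIDTH_GBS

for ai in (0.25, 1, 4, 16, 64):
    perf = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, ai)
    print(f"AI = {ai:>5} Flop/Byte -> {perf:8.1f} GFlop/s")
```

Kernels whose arithmetic intensity lies left of the ridge point are bounded by memory bandwidth; those to the right are bounded by the compute peak.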

What makes it a Scalable Vector Extension?

• There is no preferred vector length
  • The vector length (VL) is a hardware choice, 128-2048 bits, in increments of 128 bits
  • Vector Length Agnostic (VLA) programming adjusts dynamically to the available VL

• SVE addresses traditional barriers to auto-vectorization
  • Software-managed speculative vectorization of uncounted loops
  • Extract more data-level parallelism (DLP) from existing C/C++/Fortran source code

• SVE is a new approach to vectorization, not an iteration on existing ISAs (e.g. NEON)
  • SVE is a separate, optional extension with a new set of instruction encodings
  • Initial focus is HPC and general-purpose servers, not media/image processing
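The idea of vector-length-agnostic code can be illustrated with a small Python emulation. This is only a sketch of the concept, not SVE code: the function name `vla_scale` and the per-lane predicate list are illustrative assumptions that mimic a loop governed by a "while less than" style predicate.

```python
def vla_scale(x, a, vl_bits, elem_bits=32):
    """Emulate a vector-length-agnostic loop computing y[i] = a * x[i].

    The loop body never hard-codes the number of lanes; it is derived
    from vl_bits, which stands in for the hardware vector length.
    """
    if vl_bits % 128 != 0 or not 128 <= vl_bits <= 2048:
        raise ValueError("VL must be a multiple of 128 between 128 and 2048 bits")
    lanes = vl_bits // elem_bits
    n = len(x)
    y = [0.0] * n
    iterations = 0
    i = 0
    while i < n:
        # Predicate: lane k is active only while i + k is still in bounds,
        # so the final partial vector needs no scalar tail loop.
        active = [i + k < n for k in range(lanes)]
        for k, on in enumerate(active):
            if on:
                y[i + k] = a * x[i + k]
        i += lanes
        iterations += 1
    return y, iterations


data = [float(i) for i in range(100)]
for vl in (128, 512, 2048):
    result, iters = vla_scale(data, 2.0, vl)
    print(f"VL = {vl:4d} bits: {iters:2d} loop iterations")
```

The same code yields the same result at every vector length; only the number of loop iterations changes.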

Dynamic Binary instrumentation

[Figure: an Armv8-A + SVE binary runs under DynamoRIO with ArmIE (emulation client) and its Emulation API; instrumentation clients: SVE Memtrace client, SVE Inscount client, Opcodes client, and custom SVE clients.]

• Diversity of languages, frameworks, dependencies

• Region of Interest feature for full-fledged code

Arm Research : Asvie: A Timing-Agnostic SVE Optimization Methodology (R.Rusitoru et al., IEEE SC19)

Walk through – source code modification

• Region of Interest feature to capture hotspots : _START_TRACE() / _STOP_TRACE()

• Standard pragmas, keywords to enhance vectorization (e.g. from LLVM)

Walk through - vectorization

• Add “+sve” to generate SVE instructions

• Use the “-Rpass” or “opt-report” flag for compiler insights

Walk through - scripts

• Same binary, varying vector length

• Various post-processing scripts are included : “merge, analyze, flops-bytes …”

Walk through - metrics

SVE Mem operations

SVE instr. breakdown

Native instr. breakdown

NEON Vectorization study

• NEON (128-bit)

• LLVM and GNU toolchains

• Speedup of up to 3.7×

Acoustic stencil - 8-th order (ISO)
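For readers unfamiliar with the kernel, the following Python sketch shows what an 8th-order isotropic stencil computes: a 25-point Laplacian built from the standard 8th-order central-difference coefficients. The function name, flat array layout, and grid size are illustrative assumptions, not the code that was measured.

```python
# Standard 8th-order central-difference coefficients for a second derivative.
COEFFS = [-205.0 / 72.0, 8.0 / 5.0, -1.0 / 5.0, 8.0 / 315.0, -1.0 / 560.0]
RADIUS = 4


def laplacian_8th(u, nx, ny, nz, h):
    """8th-order Laplacian of u on interior points.

    u is a flat list indexed as (z * ny + y) * nx + x (assumed layout).
    Points closer than RADIUS to a face are left at zero.
    """
    def idx(x, y, z):
        return (z * ny + y) * nx + x

    lap = [0.0] * (nx * ny * nz)
    inv_h2 = 1.0 / (h * h)
    for z in range(RADIUS, nz - RADIUS):
        for y in range(RADIUS, ny - RADIUS):
            for x in range(RADIUS, nx - RADIUS):
                # The centre point contributes once per axis.
                acc = 3.0 * COEFFS[0] * u[idx(x, y, z)]
                for r in range(1, RADIUS + 1):
                    acc += COEFFS[r] * (
                        u[idx(x - r, y, z)] + u[idx(x + r, y, z)]
                        + u[idx(x, y - r, z)] + u[idx(x, y + r, z)]
                        + u[idx(x, y, z - r)] + u[idx(x, y, z + r)]
                    )
                lap[idx(x, y, z)] = acc * inv_h2
    return lap


# Check on u = x^2 + y^2 + z^2, whose Laplacian is 6 everywhere.
N = 10
field = [float(x * x + y * y + z * z)
         for z in range(N) for y in range(N) for x in range(N)]
result = laplacian_8th(field, N, N, N, 1.0)
print(result[(4 * N + 4) * N + 4])
```

In a wave-propagation code this Laplacian would feed a time update; the innermost x loop is the one a compiler would typically vectorize.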

Dynamic instructions breakdown

• Same binary for all SVE runs

• Impact of Amdahl's law at large vector lengths
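The Amdahl's law effect on dynamic instruction counts can be sketched as follows. This is an idealized model under stated assumptions: the vectorized share of instructions shrinks in proportion to the vector length while the scalar share stays fixed. The 90 % vectorized fraction is a hypothetical value, not a measurement from this study.

```python
def dynamic_instruction_ratio(vector_fraction, vl_bits, base_bits=128):
    """Instruction count relative to the base_bits run.

    vector_fraction is the share of base-run instructions that are vector
    instructions; only that share shrinks as the vector length grows.
    """
    scale = vl_bits / base_bits
    return (1.0 - vector_fraction) + vector_fraction / scale


VECTOR_FRACTION = 0.9  # hypothetical share of vectorized instructions

for vl in (128, 256, 512, 1024, 2048):
    ratio = dynamic_instruction_ratio(VECTOR_FRACTION, vl)
    print(f"VL = {vl:4d} bits: {ratio:6.1%} of the 128-bit instruction count")
```

However long the vectors become, the ratio cannot drop below the scalar share (here 10 %), which is why the gains flatten at large vector lengths.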

Implementation-dependent

Acoustic stencil - 8-th order (ISO)

[Figure: dynamic instructions breakdown; data labels: 52 %, 85 %, 36 %, 67 %, 78 %.]

Acoustic stencil - 8-th order (ISO)

SVE instructions breakdown

• Decrease of instruction count as we increase the vector length

• Ratios (floating-point/memory or loads/stores instructions)
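A rough Python sketch of both observations, under idealized assumptions: every grid point of a 25-point stencil costs 25 loads, 1 store, and an assumed 30 floating-point operations, with no register reuse and no fused multiply-add. Real compiler output will differ; the per-point constants and the function name are illustrative only.

```python
import math

# Assumed per-point costs for a 25-point stencil (illustrative only).
LOADS_PER_POINT = 25
STORES_PER_POINT = 1
FLOPS_PER_POINT = 30


def idealized_sve_counts(n_points, vl_bits, elem_bits=32):
    """Idealized SVE instruction counts when one vector covers `lanes` points."""
    lanes = vl_bits // elem_bits
    vectors = math.ceil(n_points / lanes)
    return {
        "loads": LOADS_PER_POINT * vectors,
        "stores": STORES_PER_POINT * vectors,
        "fp": FLOPS_PER_POINT * vectors,
    }


for vl in (128, 256, 512, 1024, 2048):
    c = idealized_sve_counts(1024, vl)
    fp_per_mem = c["fp"] / (c["loads"] + c["stores"])
    loads_per_store = c["loads"] / c["stores"]
    print(f"VL = {vl:4d}: total = {sum(c.values()):6d}, "
          f"FP/mem = {fp_per_mem:.2f}, loads/stores = {loads_per_store:.0f}")
```

In this model the instruction count falls with the vector length while the ratios stay constant, which is what makes them usable as a signature of the kernel.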

Stencil-based kernels characterization

Acoustic stencil - 8-th order (ISO)

Scalar instructions breakdown

• Instruction count reduction with larger SVE vectors

Fewer loop iterations (e.g. conditional branch, b.cond)

SVE Vector lane utilization

• Number of bytes transferred

• Very useful for application characterization (SIMD lanes, scatter/gather operations, …)
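One way to turn the number of bytes transferred into a lane-utilization figure is sketched below. It assumes a trace that records the bytes moved by each SVE memory access; the helper names and the synthetic contiguous access pattern are hypothetical, not the format of any particular tool.

```python
def contiguous_access_sizes(n_elems, vl_bits, elem_bytes=4):
    """Bytes moved by each access of a unit-stride loop over n_elems elements."""
    vl_bytes = vl_bits // 8
    total = n_elems * elem_bytes
    sizes = [vl_bytes] * (total // vl_bytes)
    if total % vl_bytes:
        sizes.append(total % vl_bytes)  # final, partially filled vector
    return sizes


def mean_lane_utilization(access_sizes, vl_bits):
    """Average fraction of the vector width actually transferred per access."""
    if not access_sizes:
        raise ValueError("no memory accesses recorded")
    vl_bytes = vl_bits // 8
    return sum(access_sizes) / (len(access_sizes) * vl_bytes)


for vl in (128, 512, 2048):
    sizes = contiguous_access_sizes(100, vl)
    print(f"VL = {vl:4d} bits: {len(sizes):2d} accesses, "
          f"utilization = {mean_lane_utilization(sizes, vl):.1%}")
```

Short loops or partially active predicates lower the utilization at large vector lengths even when the access pattern is contiguous.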

Acoustic stencil - varying order (ISO)

Thank You

The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in

the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks
