
Intranode Stencil Performance Evaluation Collection

Julian Hammer <[email protected]>, Julian Hornich, Georg Hager, Thomas Gruber, Gerhard Wellein

INSPECT
Supported by the Federal Ministry of Education and Research (METACCA)

The Broadwell CPU has two NUMA domains in Cluster-on-Die mode, each with 9 cores. The code saturates at 5 cores and scales across both NUMA domains. Downclocking from 2.3 to 1.3 GHz, about 5/9 of the nominal frequency to match the 5-of-9-core saturation point, would save energy without compromising performance.

Multicore scaling is well predicted. Single-core performance is built on inefficient assembly. Transfer volumes cannot be measured on AMD Zen, since no suitable hardware counters have been published.

As for Broadwell, IACA predicts T_Reg-L1 too pessimistically. Taking this into account, traffic through the new non-inclusive cache hierarchy is well predicted by the cache simulator.

Reproducibility
To allow reproduction of INSPECT results, the entire workflow is automated and published open source on GitHub. INSPECT depends on the following tools: STEMPEL (for stencil code generation) and pycachesim (for cache simulation). Additionally, it depends on IACA for Intel architectures and requires an optimizing compiler.

Layer Conditions
Layer conditions (LCs) are an analytic cache reuse model for estimating inter-cache and memory transfer volumes. The hits for each condition in the radius-3 stencil are color coded in the LC tables below:
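As a rule of thumb, and stated here only as a sketch (kerncraft's exact conditions are what the LC tables and the interactive calculator evaluate), 2D layer caching for a radius-r 3D star stencil holds roughly while the 2r+1 read layers plus the store layer fit into a cache of size C:

$$ (2r + 2) \cdot N_j \cdot N_i \cdot s \;\lesssim\; C, $$

where N_i and N_j are the inner grid dimensions and s is the element size (16 byte for double _Complex).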


Stencil Classification


To structure the library, stencils are classified along the following scheme:
> dimensions: 2D, 3D
> radius: r1, r2, r3
> stencil type: star, box
> coefficient weighting: homogeneous, heterogeneous, point-symmetric, isotropic
> coefficient type: variable, constant
> datatype: float, double, float _Complex, double _Complex
This yields classes such as 2D/r1/star, 2D/r1/box, 3D/r1/star, 3D/r2/star and 2D/r3/star. STEMPEL [1] is used to generate example codes based on this classification; a sketch of such a kernel is shown under "Kernel Source Code" below.


Counting Complex FLOPs
Floating-point operations per lattice-site update (FLOP per LUP) are counted with respect to the high-level code, adding up to 60 real FLOP per LUP for the example stencil.
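As a worked sketch of this count for the 3D/r3/star example stencil (assuming the plausible high-level breakdown of 18 complex additions and 4 complex multiplications per LUP, consistent with the reported 22 FLOP per LUP):

$$ (a+bi) + (c+di) = (a+c) + (b+d)i \quad\Rightarrow\quad 2 \text{ real FLOP} $$
$$ (a+bi) \cdot (c+di) = (ac-bd) + (ad+bc)i \quad\Rightarrow\quad 6 \text{ real FLOP} $$
$$ 18 + 4 = 22 \text{ FLOP per LUP}, \qquad 18 \cdot 2 + 4 \cdot 6 = 60 \text{ real FLOP per LUP}. $$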

Machine Description
Each microarchitecture requires a model definition, a cache hierarchy description and benchmark data. This information is also useful as a general reference when interpreting and comparing benchmark results. The Broadwell Execution-Cache-Memory (ECM) model is presented here; all parameters may be seen on the INSPECT website.

Compiler Forensics
Allows for assembly code inspection; details on the compiler and its options are provided under "System Information" and defined by the machine description. Here, icc version 19.0.2 was used.


Model vs Measurement Visualization

Cycles-per-cacheline throughput is used as the base unit. Here, one cacheline corresponds to four iterations, because four double _Complex elements fit into one cacheline. Lower is faster! For comparison, lattice updates per second have been added.

ECM and Roofline models are built with the layer conditions or the cache simulation (left and center plots). A phenomenological ECM model is constructed from measured data transfers and port occupations (right plot). One may conclude that spatial blocking with 300 elements on the inner two dimensions would be beneficial.
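The two units convert into one another via the clock frequency. A sketch of the conversion, assuming the nominal 2.3 GHz clock (this reproduces the 230 MLUP/s gridline at 40 cy/CL on the single-core plots):

$$ P\,[\mathrm{MLUP/s}] \;=\; \frac{f\,[\mathrm{Mcy/s}] \cdot 4\,[\mathrm{LUP/CL}]}{T\,[\mathrm{cy/CL}]}, \qquad \frac{2300 \cdot 4}{40} = 230\ \mathrm{MLUP/s}. $$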

INSPECT is based on STEMPEL and Kerncraft, which in turn is based on IACA, OSACA, pycachesim and LIKWID. The Feature-Architecture Matrix below provides an overview of current architecture support.

Outlook
Improve the internal workflow, improve visualization and reporting, include support for ARM, add more architectures, make the machine description more versatile, and include support for penalty cycles on all levels of the memory hierarchy.

References
[1] Guerrera. 2019. STEMPEL: Stencil TEMPlate Engineering Library. https://github.com/RRZE-HPC/stempel
[2] Hammer et al. 2017. Kerncraft: A Tool for Analytic Performance Modeling of Loop Kernels. Tools for High Performance Computing 2016. doi:10.1007/978-3-319-56702-0_1
[3] Hornich et al. 2019. Collecting and Presenting Reproducible Intranode Stencil Performance: INSPECT. arXiv:1906.08138
[4] Luszczek et al. 2005. Introduction to the HPC Challenge Benchmark Suite. doi:10.1.1.112.1817
[5] Meuer et al. 2014. The TOP500: History, Trends, and Future Directions in High Performance Computing (1st ed.). Chapman & Hall/CRC.
[6] SPEC. 2019. Standard Performance Evaluation Corporation. https://www.spec.org
[7] Stengel et al. 2015. Quantifying Performance Bottlenecks of Stencil Computations Using the Execution-Cache-Memory Model. In Proceedings of the 29th ACM International Conference on Supercomputing (ICS '15). doi:10.1145/2751205.2751240
[8] Williams et al. 2009. Roofline. Commun. ACM 52, 4 (Apr 2009). doi:10.1145/1498765.1498785
[9] McCalpin. 1995. Memory Bandwidth and Machine Balance in Current High Performance Computers. IEEE Computer Society Technical Committee on Computer Architecture (TCCA) Newsletter.
[10] Laukemann et al. 2018. Automated Instruction Stream Throughput Prediction for Intel and AMD Microarchitectures. In 2018 IEEE/ACM Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS). doi:10.1109/PMBS.2018.8641578
[11] Treibig et al. 2010. LIKWID: A Lightweight Performance-Oriented Tool Suite for x86 Multicore Environments. In 2010 39th International Conference on Parallel Processing Workshops. doi:10.1109/ICPPW.2010.38
[12] Franchetti et al. 2018. SPIRAL: Extreme Performance Portability. Proceedings of the IEEE, Special Issue on From High Level Specifications to High Performance Code. doi:10.1109/JPROC.2018.2873289

Related Work
We are unaware of any other project that aims to build a performance reference library including modelling, measurements and performance counters. SPEC [6], HPCC [4] and STREAM [9], among others, consider neither modelling nor anything near this level of detail. SPEC is very detailed on runtime information, which is often helpful, but vital information such as the executed assembly is missing. None of the named tools are meant to act as a framework for user-provided codes. Other frameworks, such as SPIRAL [12], focus on optimization rather than on modelling and explaining behavior.

Broadwell Description: https://git.io/fjMHx
Interactive LC calculator, prefilled with the 3D/r3/star stencil: https://git.io/fjMDb

Modelling and presenting performance data, even for simple kernels such as stencils, is not trivial. We therefore present an overview of how to interpret and what to learn from an INSPECT report, as well as highlighting best practices for performance data reporting.

To make a performance report comprehensible, the following questions need to be answered:
> What type of code and underlying method is used?
> What did the compiler make of the code?
> What are the expected memory and cache data volumes?
> What is the expected maximum computational performance?
> What is the prediction of a performance model?
> How fast does it actually run?
> How does it scale across cores, NUMA domains and sockets?
> How do data volumes compare to predictions?
> What are the steps needed to reproduce?

Feature-Architecture Matrix
Architectures considered: Intel pre-Skylake, Intel Skylake, AMD Zen, ARM ThunderX2.
Kerncraft features: compute with IACA (Intel only); compute with OSACA; data transfers with pycachesim; data transfers with layer conditions.
INSPECT features: layer conditions; single-core grid scaling, cache-simulator based (OSACA); single-core grid scaling, phenomenological (LIKWID); data transfers between caches (LIKWID); multicore thread scaling (OSACA); documentation; microbenchmarks.
Notes: OSACA instruction throughput and pycachesim hit/miss prediction on ARM ThunderX2 are work-in-progress; IACA has no support for non-Intel architectures and its development is discontinued; several features depend on LIKWID or OSACA support; AMD Zen has insufficient or partial hardware counter support.

[ECM model diagram: overlapping in-core execution T_OL and transfer times T_Reg-L1, T_L1-L2, T_L2-L3 and T_L3-MEM between the Registers/Execution, L1, L2, L3 and MEM levels; the prediction takes the maximum of the overlapping part and the sum of the transfer terms]
[FLOP counting table: counted high-level operation vs. underlying operation vs. real FLOP executed; a complex addition executes 2, a complex multiplication 6 real FLOP]
[Layer condition diagram for the 3D loop nest (k, j, i), color coding hits by condition: compulsory misses / full caching, next-neighbor caching, line caching, 2D layer caching]
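The composition sketched in the ECM diagram follows the ECM model [7]: assuming data transfers do not overlap with each other, the single-core runtime per cacheline is the maximum of the overlapping in-core time and the summed transfer times, while the Roofline model [8] takes the tightest of the compute and bandwidth ceilings. In their standard formulations (the exact overlap assumptions are architecture-dependent):

$$ T_\mathrm{ECM} \;=\; \max\big(T_\mathrm{OL},\; T_\mathrm{Reg\text{-}L1} + T_\mathrm{L1\text{-}L2} + T_\mathrm{L2\text{-}L3} + T_\mathrm{L3\text{-}MEM}\big) $$
$$ P_\mathrm{Roofline} \;=\; \min\big(P_\mathrm{max},\; I \cdot b_S\big) $$

with computational intensity I and memory bandwidth b_S.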

[Schematic model-vs-measurement plot: reciprocal throughput [cycle / work unit] on the left axis (lower is faster) and performance [work / time] on the right axis; stacked ECM contributions T_Reg-L1, T_L1-L2, T_L2-L3 and T_L3-MEM, with T_OL, the Roofline prediction, the ECM prediction and the measurement]

Transfer Volumes
Validation of data volumes is accomplished with hardware performance counters, using LIKWID. The cache simulator captures some artifacts and transition behavior; the layer condition model is very accurate in steady state. Kerncraft uses the following machine-specific hardware performance counters for Broadwell:

[Diagram: LOAD and STORE counters attached to the Registers, L1, L2, L3 and MEM levels]
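One common way to obtain such per-region counter measurements is LIKWID's marker API. A minimal sketch, assuming compilation with -DLIKWID_PERFMON and a run under likwid-perfctr with the -m switch; the region name is illustrative:

#include <likwid-marker.h>  /* LIKWID 5; older versions ship likwid.h */

int main(void)
{
    LIKWID_MARKER_INIT;
    LIKWID_MARKER_START("stencil");
    /* ... stencil sweep to be measured ... */
    LIKWID_MARKER_STOP("stencil");
    LIKWID_MARKER_CLOSE;
    return 0;
}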

Report for Skylake X: https://git.io/fjDxK
Report for AMD Zen: https://git.io/fjy3g
[Plot: Performance @ 1.3 GHz [MLUP/s] vs. number of threads (2-18), comparing Benchmark, ECM LC prediction and Roofline LC prediction]

Tools
Kerncraft [2] (for performance modelling, evaluation and benchmarking)
LIKWID [11] (for performance measurements and machine model extraction)
OSACA [10] (for AMD in-core predictions)
IACA

IACA Output
[ output of the IACA, OSACA or LLVM-MCA tool ]
System Information
[ output of likwid-topology and other utilities to document and record the machine configuration ]
How to test this stencil and replicate the data
(Layer condition data and cache simulator data: see the Layer Conditions tables and the Data Transfers between Caches plots.)
Generate analyzable stencil code:
stempel gen -D 3 -r 3 -t "double _Complex" -C constant -k star -i --store stencil.c
and compilable benchmark code with:
stempel bench stencil.c -m BroadwellEP_E5-2697_CoD.yml --store

Complete analysis report available here: https://git.io/fjMDd
Multicore scaling:

[Plot: Cycles / Cacheline vs. Grid Size (N=M=P), 0-1000]
Multicore Thread Scaling
[Plot: Performance [MLUP/s] vs. number of threads (2-18), comparing Benchmark, ECM LC prediction and Roofline LC prediction]

Single Core Grid Scaling: Analytical ECM and Roofline Model Prediction vs. Performance

Modelled with layer conditions:
[Plot: Cycles / Cacheline vs. Grid Size (N=M=P)]
Phenomenological ECM model:
[Plot: Cycles / Cacheline vs. Grid Size (N=M=P)]
Legend: T_L3-MEM, T_L2-L3, T_L1-L2, T_Reg-L1, T_OL, Benchmark, Roofline

Data Transfers between Caches

[Two plots: Data Transfers [Byte/LUP] vs. Grid Size (N=M=P), modelled with cache simulation and with layer conditions; legend: L1-L2, L2-L3 and L3-MEM, each measured and predicted]

Modelled with cache simulation:
[Plot: Cycles / Cacheline vs. Grid Size (N=M=P); secondary axis MLUP/s: 230, 115, 76.7, 57.5, 46.0, 38.3, 32.8]

Layer Conditions
LC for the L1 cache with 32 KB, the L2 cache with 256 KB, and the L3 cache with 22.5 MB. Each table lists, per condition, the resulting misses and hits out of the stencil's 20 accesses; the condition expressions themselves are shown in the interactive LC calculator:

misses  hits
     0    20
     2    18
     6    14
     8    12
    12     8
else 14    6

1 miss = 16 Byte / LUP
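The checks behind such tables can be sketched in a few lines of C. This is a minimal illustration for a 3D star stencil with 16-byte elements, assuming each condition simply tests whether the relevant reuse layers fit into the cache; function names and the exact accounting are illustrative, not kerncraft's implementation:

#include <stdbool.h>
#include <stddef.h>

enum { ELEM_BYTES = 16 };  /* double _Complex */

/* Line caching: all rows touched per update stay cached. A 3D star of
 * radius r reads 4*r + 1 distinct rows; +1 accounts for the store stream. */
static bool lc_line_caching(size_t cache_bytes, size_t r, size_t n_i)
{
    size_t rows = 4 * r + 1 + 1;
    return rows * n_i * ELEM_BYTES <= cache_bytes;
}

/* 2D layer caching: the 2*r + 1 read layers plus the store layer stay cached. */
static bool lc_layer_caching(size_t cache_bytes, size_t r,
                             size_t n_j, size_t n_i)
{
    size_t layers = 2 * r + 1 + 1;
    return layers * n_j * n_i * ELEM_BYTES <= cache_bytes;
}

With r = 3 and the 256 KB L2 cache, lc_line_caching holds up to n_i = 262144 / (14 * 16) ~ 1170, consistent with the "~1170" transition annotated on the report's plots.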


Kernel Source Code
C Code (generated or hand-written):
[ generated C kernel ]
Assembly (generated by compiler):
[ compiler-generated assembly listing ]
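For illustration, a kernel of the class analyzed in this report (3D/r3/star, constant isotropic coefficients, double _Complex) might look as follows. This is a hand-written sketch in the spirit of STEMPEL-generated code; names and layout are illustrative, not STEMPEL's exact output:

#include <complex.h>

/* 3D/r3/star sweep over a P x N x M grid, i innermost; a is the input
 * grid, b the output, c0..c3 the constant isotropic coefficients. */
void kernel(int P, int N, int M,
            const double _Complex *restrict a, double _Complex *restrict b,
            double _Complex c0, double _Complex c1,
            double _Complex c2, double _Complex c3)
{
    #define IDX(k, j, i) (((long)(k) * N + (j)) * M + (i))
    for (int k = 3; k < P - 3; ++k)
        for (int j = 3; j < N - 3; ++j)
            for (int i = 3; i < M - 3; ++i)
                /* 18 complex additions + 4 complex multiplications */
                b[IDX(k, j, i)] =
                    c0 * a[IDX(k, j, i)]
                  + c1 * (a[IDX(k, j, i - 1)] + a[IDX(k, j, i + 1)]
                        + a[IDX(k, j - 1, i)] + a[IDX(k, j + 1, i)]
                        + a[IDX(k - 1, j, i)] + a[IDX(k + 1, j, i)])
                  + c2 * (a[IDX(k, j, i - 2)] + a[IDX(k, j, i + 2)]
                        + a[IDX(k, j - 2, i)] + a[IDX(k, j + 2, i)]
                        + a[IDX(k - 2, j, i)] + a[IDX(k + 2, j, i)])
                  + c3 * (a[IDX(k, j, i - 3)] + a[IDX(k, j, i + 3)]
                        + a[IDX(k, j - 3, i)] + a[IDX(k, j + 3, i)]
                        + a[IDX(k - 3, j, i)] + a[IDX(k + 3, j, i)]);
    #undef IDX
}

Counted at the high level, this is 18 complex additions and 4 complex multiplications per update, i.e. the 22 FLOP per LUP listed under Stencil Properties.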

Stencil Properties
dimension: 3D
radius: 3
weighting: isotropic
kind: star
coefficients: constant
datatype: double _Complex
machine: BroadwellEP_E5-2697_CoD
FLOP per LUP: 22

Performance Model and Benchmark Report

Annotations on the report plots:
Same single-NUMA-domain and full-socket performance, at lower clock speed.
ECM too pessimistic due to T_Reg-L1 overestimation by IACA.
T_Reg-L1 difference shows the overestimation by IACA in the ECM model.
Main memory transfer stream penalties show up with the lower T_Reg-L1.
Transition to the next layer condition (line caching in L2 at ~1170).