Performance Evaluation of SVE Enabled Arm Processor A64FX using Variable Vector Length
Shinji Sumimoto, Fujitsu Limited
Arm Research Summit 2019
Document date: 2019-10-02
Copyright 2019 FUJITSU LIMITED
Overview
Background
A64FX Overview
Application Characteristics: Compute Intensive vs. Memory Intensive
Preliminary Performance Evaluation
Background
A64FX is the world's first SVE-enabled Arm processor.
• SVE allows a single binary to run in environments with different vector lengths, which supports application binary portability.
• A64FX supports 512, 384, 256, and 128 bit vector lengths.
HPC applications can be compute intensive, memory intensive, or both.
• An SVE-enabled processor can control the execution vector length at runtime.
• A64FX can also control memory bandwidth at runtime.
• Neither control requires recompilation.
Therefore, we evaluated several application benchmarks to clarify their characteristics.
Inheriting Fujitsu HPC CPU technologies with commodity standard ISA
A64FX: High Performance Arm CPU
High Performance Arm CPU “A64FX”
Architecture features
ISA: Armv8.2-A (AArch64 only), SVE (Scalable Vector Extension)
SIMD width: 512-bit
Precision: FP64/32/16, INT64/32/16/8
Number of cores: 48 computing cores + 4 assistant cores (4 CMGs)
Memory: HBM2, peak bandwidth 1024 GB/s
Interconnect: TofuD, 28 Gbps x 2 lanes x 10 ports
I/O: PCIe Gen3, 16 lanes
CMG (Core Memory Group) specification
Cores: 13
L2 cache: 8 MiB
Memory: 8 GiB, 256 GB/s
[Figure: A64FX block diagram. Four CMGs, each with 13 cores, an L2 cache, and an HBM2 interface to its own HBM2 stack, connected by a ring bus (network on chip) together with the TofuD and PCIe controllers and interfaces.]
A64FX Memory System
Extremely high bandwidth
• Asynchronous processing in cores, caches, and memory controllers
• Maximizing the capability of each layer's bandwidth
Performance: >2.7 TFLOPS
L1 cache: >11.0 TB/s (BF = 4)
L2 cache: >3.6 TB/s (BF = 1.3)
Memory: 1024 GB/s (BF ≈ 0.37)
[Figure: memory hierarchy of one CMG (12 computing cores + 1 assistant core). Each core has 512-bit wide SIMD with 2 FMA units and a 64 KiB, 4-way L1D cache. The CMG shares an 8 MiB, 16-way L2 cache and an 8 GiB HBM2 stack at 256 GB/s. Link bandwidths labelled in the figure: >230 GB/s, >115 GB/s, >115 GB/s, >57 GB/s.]
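The BF figures on this slide can be checked with a short calculation. The sketch below assumes that BF means bytes per flop, that is, the bandwidth of a memory layer divided by peak floating-point throughput; the slide does not define the term, so this reading is an assumption. It also uses the slide's lower bounds (">2.7 TFLOPS", ">11.0 TB/s", ">3.6 TB/s"), so the results only approximate the quoted values.

```python
# Assumption: BF = bytes per flop = layer bandwidth / peak FLOPS.
# All inputs are the lower bounds printed on the slide, so results are approximate.
PEAK_TFLOPS = 2.7  # ">2.7 TFLOPS"

BANDWIDTH_TBPS = {
    "L1 cache": 11.0,      # ">11.0 TB/s"
    "L2 cache": 3.6,       # ">3.6 TB/s"
    "HBM2 memory": 1.024,  # "1024 GB/s"
}


def bytes_per_flop(bandwidth_tbps, peak_tflops=PEAK_TFLOPS):
    """Return bandwidth divided by peak throughput (TB/s over TFLOP/s = bytes per flop)."""
    return bandwidth_tbps / peak_tflops


bf = {layer: bytes_per_flop(bw) for layer, bw in BANDWIDTH_TBPS.items()}

for layer, ratio in bf.items():
    print(f"{layer}: BF ~ {ratio:.2f}")
```

With these inputs the L1 and L2 ratios come out close to the slide's 4 and 1.3. The memory ratio comes out slightly above the quoted 0.37, which is consistent with the actual peak being somewhat higher than the 2.7 TFLOPS lower bound.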
A64FX CPU Performance Evaluation
Over 2.5x faster in HPC & AI benchmarks than SPARC64 XIfx
AHUG@ISC19 Workshop
A64FX Performance Comparison (1/2)
Himeno Benchmark (Fortran90)
† “Performance evaluation of a vector supercomputer SX-aurora TSUBASA”,SC18, https://dl.acm.org/citation.cfm?id=3291728
AHUG@ISC19 Workshop
A64FX Performance Comparison (2/2)
WRF: Weather Research and Forecasting model
• Vectorizing loops that include IF constructs is the key optimization.
• Source code tuning using directives promotes compiler optimizations.
AHUG@ISC19 Workshop
Application Characteristics: Compute Intensive vs. Memory Intensive
SVE: Scalable Vector Length
The ISA does not fix the vector length (VL).
SVE supports VLs from 128 to 2048 bits in multiples of 128 bits.
The VL is set dynamically by the processor before a binary is executed.
A single executable binary can run on processors with different VLs.
Vector-Length Agnostic (VLA) programming enables ABI compatibility.
Executable binary portability: the executable binary (a.out) does not depend on the processor's VL.
[Figure: the same executable binary (a.out) running on SVE with 512-bit SIMD and on SVE with 256-bit SIMD. The wider VL halves the number of dynamic instruction steps; the narrower VL doubles it.]
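The relationship between VL and dynamic instruction steps can be illustrated with a small simulation. This is only a sketch in Python, not SVE code: the function name `vla_scale` and the 64-bit element size are assumptions made for illustration. The loop never hard-codes the number of lanes, and the final partial vector is handled by processing only the remaining elements, which plays the role of a predicate.

```python
def vla_scale(values, factor, vl_bits, element_bits=64):
    """Scale every element, processing one vector of vl_bits per loop step.

    Returns (result, steps). The result is the same for every VL;
    only the number of loop steps changes.
    """
    lanes = vl_bits // element_bits  # elements handled per vector step
    result = []
    steps = 0
    i = 0
    while i < len(values):
        # Lanes beyond the end of the array are simply inactive,
        # much like a predicated final iteration.
        active = values[i:i + lanes]
        result.extend(v * factor for v in active)
        i += lanes
        steps += 1
    return result, steps


data = [float(i) for i in range(1000)]
runs = {vl: vla_scale(data, 2.0, vl) for vl in (128, 256, 512)}
for vl, (_, steps) in runs.items():
    print(f"VL {vl}: {steps} steps")
```

Each doubling of the VL halves the step count while the computed result stays identical, which is the property the slide attributes to a single VLA binary.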
Application Characteristics: How to investigate?
Application characteristics
• Compute intensive
• Memory bandwidth intensive
How to investigate?
• Using performance profiling tools: Arm Allinea Studio
• Recompiling with different compiler options and rerunning
A64FX makes these characteristics easy to evaluate
• Compute-intensive analysis: changing the vector length
• Memory-intensive analysis: changing the memory access gap
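The two analyses can be combined into a simple decision rule. The sketch below is one possible reading, not the authors' procedure: results are normalized so that the VL 512, 100% bandwidth run equals 100, and an application counts as compute intensive if performance drops when the VL is reduced, or memory intensive if it drops when the bandwidth is reduced. The 90-point threshold, the function names, and the sample numbers are all hypothetical and are not measurements from the slides.

```python
def relative(perf, baseline=(512, 100)):
    """Normalize measurements so that the baseline configuration equals 100."""
    base = perf[baseline]
    return {key: 100.0 * value / base for key, value in perf.items()}


def classify(perf, drop_threshold=90.0):
    """Classify from three runs keyed by (vl_bits, bandwidth_percent).

    drop_threshold is a hypothetical cut-off on the relative score.
    """
    rel = relative(perf)
    compute = rel[(128, 100)] < drop_threshold  # slower with a shorter VL
    memory = rel[(512, 20)] < drop_threshold    # slower with less bandwidth
    if compute and memory:
        return "compute and memory intensive"
    if compute:
        return "compute intensive"
    if memory:
        return "memory intensive"
    return "neither"


# Invented sample numbers for illustration only.
dgemm_like = {(512, 100): 1000.0, (128, 100): 260.0, (512, 20): 990.0}
stream_like = {(512, 100): 200.0, (128, 100): 195.0, (512, 20): 45.0}

print(classify(dgemm_like))
print(classify(stream_like))
```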
Preliminary Performance Evaluation
Benchmark Applications, Evaluation Environment and Evaluation Parameters
Benchmark applications
• STREAM (TRIAD)
• DGEMM
• Himeno Benchmark
• NAS Parallel Benchmarks, OMP, Class C (EP, CG, LU, FT, IS, MG, BT, SP)
Evaluation environment
• Hardware: Fugaku prototype system with A64FX (single node)
• Compiler: Fujitsu Compiler (development version)
Evaluation parameters
• SVE vector length: 128 to 512 bits, set using a command with prctl(PR_SVE *)
• Memory bandwidth: 100% to 20%, set using a job submission option
[Table: evaluation matrix. Each application is run at VL 128, VL 256, and VL 512, combined with HBM bandwidth settings of 100%, 80%, 60%, 40%, and 20%.]
* https://lwn.net/Articles/717804/
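A wrapper command of the kind mentioned above could look roughly like the following sketch. It is not the tool used for the evaluation; it only illustrates the Linux `prctl(PR_SVE_SET_VL, ...)` interface. The constant values are taken from the Linux `prctl.h` header, the helper names `sve_vl_arg` and `run_with_vl` are hypothetical, and the call only succeeds on an AArch64 Linux kernel with SVE support. The vector length is passed in bytes, and `PR_SVE_VL_INHERIT` asks the kernel to keep the setting across `execve()`.

```python
import ctypes
import os

# Values from the Linux uapi header <linux/prctl.h>.
PR_SVE_SET_VL = 50
PR_SVE_VL_LEN_MASK = 0xFFFF
PR_SVE_VL_INHERIT = 1 << 17


def sve_vl_arg(vl_bits, inherit=True):
    """Build the prctl argument: vector length in bytes plus optional flags."""
    if vl_bits % 128 != 0 or not 128 <= vl_bits <= 2048:
        raise ValueError("SVE vector length must be a multiple of 128 bits, 128 to 2048")
    arg = vl_bits // 8
    if inherit:
        arg |= PR_SVE_VL_INHERIT
    return arg


def run_with_vl(vl_bits, argv):
    """Set the SVE vector length, then replace this process with argv.

    Only meaningful on AArch64 Linux with SVE; elsewhere prctl fails.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    zero = ctypes.c_ulong(0)
    ret = libc.prctl(PR_SVE_SET_VL, ctypes.c_ulong(sve_vl_arg(vl_bits)), zero, zero, zero)
    if ret < 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
    granted_bits = (ret & PR_SVE_VL_LEN_MASK) * 8  # kernel may grant a shorter VL
    print(f"SVE vector length set to {granted_bits} bits")
    os.execvp(argv[0], argv)
```

A call such as `run_with_vl(256, ["./a.out"])` would then run an unmodified binary at a 256-bit vector length, where `./a.out` stands for any SVE executable.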
STREAM TRIAD: 1 Thread vs. 12 Threads (1 CMG)
Relative Performance: VL512-HBM100% = 100
[Figure: two panels, "STREAM Triad 1 thread" and "STREAM Triad 12 threads". Each plots relative performance (scale 0 to 120) against memory bandwidth (scale 0% to 120%) for vector lengths 512, 256, and 128.]
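For reference, the TRIAD kernel measured here is the standard STREAM operation a[i] = b[i] + q * c[i]. The sketch below restates it in Python and estimates its bytes-per-flop demand. The traffic count assumes FP64 data and ignores any extra write-allocate traffic, so it is a lower bound on memory traffic rather than a measured value.

```python
def triad(b, c, q):
    """STREAM TRIAD: a[i] = b[i] + q * c[i]."""
    return [bi + q * ci for bi, ci in zip(b, c)]


FLOPS_PER_ELEMENT = 2      # one multiply and one add
BYTES_PER_ELEMENT = 3 * 8  # read b, read c, write a (FP64); write-allocate ignored


def triad_bytes_per_flop():
    """Memory traffic the kernel requires per floating-point operation."""
    return BYTES_PER_ELEMENT / FLOPS_PER_ELEMENT


a = triad([1.0, 2.0], [3.0, 4.0], 2.0)
print(a, triad_bytes_per_flop())
```

The kernel needs about 12 bytes per flop, far more than the memory BF of roughly 0.37 quoted on the memory system slide, which is why TRIAD is the memory-intensive reference point among the benchmarks.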
STREAM TRIAD: 1 Thread vs. 12 Threads, SIMD Effects
Relative Performance: VL128-HBM100% = 100
[Figure: two panels, "STREAM Triad 1 thread" and "STREAM Triad 12 threads". Each plots relative performance (scale 0 to 200) against memory bandwidth (scale 0% to 120%) for vector lengths 512, 256, and 128.]
DGEMM: 12 Threads (1 CMG)
Relative Performance: VL512-HBM100% = 100
[Figure: one panel, "DGEMM 12 threads". It plots relative performance (scale 0 to 120) against memory bandwidth (scale 0% to 120%) for vector lengths 512, 256, and 128.]
Himeno Benchmark: 1 CMG vs. 4 CMG
Relative Performance: VL512-HBM100% = 100
[Figure: two panels, "Himeno OpenMP 12 threads" and "Himeno OpenMP 48 threads". Each plots relative performance (scale 0 to 120) against memory bandwidth (scale 0% to 120%) for vector lengths 512, 256, and 128.]
Himeno Benchmark: 1 CMG vs. 4 CMG, SIMD Effects
Relative Performance: VL128-HBM100% = 100
[Figure: two panels, "Himeno OpenMP 12 threads" and "Himeno OpenMP 48 threads". Each plots relative performance (scale 0 to 300) against memory bandwidth (scale 0% to 120%) for vector lengths 512, 256, and 128.]
NPB: with No SIMD Effect and Compute Intensive
Relative Performance: VL512-HBM100% = 100
[Figure: two panels, "EP" and "IS". Each plots relative performance (scale 0 to 120) against memory bandwidth (scale 0% to 120%) for vector lengths 512, 256, and 128.]
NPB: with SIMD Effect and Compute Intensive
Relative Performance: VL512-HBM100% = 100
[Figure: two panels, "CG" and "BT". Each plots relative performance (scale 0 to 120) against memory bandwidth (scale 0% to 120%) for vector lengths 512, 256, and 128.]
NPB: with SIMD Effect, Compute Intensive and Memory Intensive (1)
Relative Performance: VL512-HBM100% = 100
[Figure: two panels, "SP" and "MG". Each plots relative performance (scale 0 to 120) against memory bandwidth (scale 0% to 120%) for vector lengths 512, 256, and 128.]
NPB: with SIMD Effect, Compute Intensive and Memory Intensive (2)
Relative Performance: VL512-HBM100% = 100
[Figure: two panels, "FT" and "LU". Each plots relative performance (scale 0 to 120) against memory bandwidth (scale 0% to 120%) for vector lengths 512, 256, and 128.]
Summary
Arm SVE provides application binary portability.
• The variable vector length is defined at runtime.
The SVE feature can be used for application performance analysis.
• Reducing the vector length shows whether or not an application is compute intensive.
Benchmark evaluation on A64FX enables:
• Compute-intensive analysis: changing the vector length
• Memory-intensive analysis: changing the memory access gap