An Approach to the Highest Efficiency of the HPCG...



Source: sc15.supercomputing.org/sites/all/themes/SC15images/tech...

[Figure: bar chart of HPCG results (GFLOP/s, axis from 0 to 35) for configurations A to F]

An Approach to the Highest Efficiency of the HPCG Benchmark on the SX-ACE Supercomputer

Kazuhiko Komatsu 1)4), Ryusuke Egawa 1)4), Yoko Isobe 1)3), Ryusei Ogata 3), Hiroyuki Takizawa 2)4), Hiroaki Kobayashi 1)

1) Cyberscience Center, Tohoku University 2) GSIS, Tohoku University 3) NEC Corporation 4) JST CREST

Introduction

HPCG (High Performance Conjugate Gradient) has been developed to narrow the large performance gap between real applications and the HPL benchmark. The major features of HPCG are:
- It includes the major communication and computational patterns of real applications.
- It is easy to understand, optimize, and run.
- It can examine memory and network performance, e.g., indirect memory accesses, collective operations, and point-to-point messages.

Overview of the SX-ACE Supercomputer

Preliminary Evaluation before Optimization

Performance Evaluation of HPCG on SX-ACE

Conclusions

[Figure: CPU architecture. Labels: Core 0 to Core 3 (SPU, VPU, ADB 1 MB, MSHR), crossbar, 16 memory controllers (MC), RCU; bandwidth labels 256 GB/s and 4 GB/s.]

The SX-ACE supercomputer consists of 512 nodes. Each node is equipped with an SX-ACE processor, which provides high memory bandwidth for practical HPC applications.
- High vector computational performance from a 4-core vector processor
- High sustained memory bandwidth from a strong memory subsystem
- ADB (Assignable Data Buffer) to maintain high sustained memory bandwidth
- Scalable multinode system built on a two-stage fat-tree custom network

Optimizations of HPCG for SX-ACE

<snip>
GFLOP/s Summary:
  Raw DDOT: 28.2595
  Raw WAXPBY: 17.8771
  Raw SpMV: 0.441609
  Raw MG: 0.586906
  Raw Total: 0.577329
  Total with convergence overhead: 0.577329
<snip>

To decide the optimization plan, the reference HPCG was first executed.

Preliminary Evaluation Environments
- Supercomputer: SX-ACE
- Number of nodes: 1
- Compiler: NEC SX C/C++ Compiler Version 1.0
- HPCG version: Release 2.4
- Problem size: 104 x 104 x 104 (default)
- Parallelization: Flat-MPI

HPCG-Benchmark-2.4.yaml (source of the GFLOP/s summary)

This low performance is caused by very low vectorization ratios and inefficient memory accesses. Optimizations for efficient vector calculations and memory accesses are essential.
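The gap between the kernels in the GFLOP/s summary can be made explicit with a little arithmetic. The sketch below uses only the numbers reported by the reference run, except for `ASSUMED_PEAK_GFLOPS`, which is an assumption (a per-node peak of 256 GFLOP/s is not stated on the poster) and is passed in as a parameter so that it can be replaced.

```python
# Raw kernel results copied from the GFLOP/s summary of the reference run.
RAW_GFLOPS = {
    "DDOT": 28.2595,
    "WAXPBY": 17.8771,
    "SpMV": 0.441609,
    "MG": 0.586906,
    "Total": 0.577329,
}

# Assumption, not from the poster: per-node peak performance in GFLOP/s.
ASSUMED_PEAK_GFLOPS = 256.0


def slowdown_vs(kernel, baseline="DDOT", results=RAW_GFLOPS):
    """How many times slower `kernel` runs than `baseline`."""
    return results[baseline] / results[kernel]


def efficiency_percent(sustained_gflops, peak_gflops):
    """Efficiency as the percentage of peak that is actually sustained."""
    return 100.0 * sustained_gflops / peak_gflops


if __name__ == "__main__":
    for name in ("SpMV", "MG"):
        print(f"{name}: {slowdown_vs(name):.1f}x slower than DDOT")
    eff = efficiency_percent(RAW_GFLOPS["Total"], ASSUMED_PEAK_GFLOPS)
    print(f"Reference efficiency under the assumed peak: {eff:.3f}%")
```

In the reference run, SpMV is roughly 64 times slower than DDOT, which is why the optimizations concentrate on the sparse kernels.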

Data rearrangement for continuous memory accesses

[Figure: matrix storage for data rearrangement. The per-row arrays matrixValues[0], matrixValues[1], ..., matrixValues[i] and mtxIndL[0], mtxIndL[1], ..., mtxIndL[i] are shown in two layouts, with element indices 0 to 4.]
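The idea of rearranging per-row arrays into one continuous region can be sketched as follows. This is a minimal illustration in Python, not the poster's implementation: `pack_rows` and `row_view` are hypothetical helper names, the fixed row width and the padding values are assumptions, and the standard `array` module merely stands in for a flat memory buffer.

```python
from array import array


def pack_rows(rows, width, pad, typecode):
    """Copy separately stored rows into one flat buffer.

    Row i occupies the slots [i * width, (i + 1) * width); short rows
    are padded with `pad` so that every row starts at a regular stride.
    """
    flat = array(typecode, [pad]) * (len(rows) * width)
    for i, row in enumerate(rows):
        if len(row) > width:
            raise ValueError("row longer than the fixed width")
        flat[i * width:i * width + len(row)] = array(typecode, row)
    return flat


def row_view(flat, i, width):
    """Return a copy of row i from the flat buffer."""
    return flat[i * width:(i + 1) * width]


if __name__ == "__main__":
    # Per-row storage, one small list per matrix row.
    matrix_values = [[4.0, -1.0], [-1.0, 4.0, -1.0], [-1.0, 4.0]]
    mtx_ind = [[0, 1], [0, 1, 2], [1, 2]]
    values = pack_rows(matrix_values, width=3, pad=0.0, typecode="d")
    columns = pack_rows(mtx_ind, width=3, pad=0, typecode="i")
    print(list(values))
    print(list(row_view(columns, 2, 3)))
```

With a regular stride, consecutive rows sit next to each other in memory, so a loop over rows touches addresses in order instead of chasing one pointer per row.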

Data packing for vector-friendly matrix memory allocation of sparse matrices

Eight-coloring

Parallelization approaches to eliminate data dependencies
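Eight-coloring can be sketched on a structured 3D grid as follows. The sketch assumes a 27-point stencil (each point depends on its neighbors at offsets of at most one in every dimension), which the poster does not state explicitly; the function names are illustrative. A color is derived from the parity of each coordinate, so two neighboring points never share a color and all points of one color can be updated in a single dependency-free loop.

```python
from itertools import product


def color_of(i, j, k):
    """Color in 0..7 built from the parity of each grid coordinate."""
    return (i % 2) + 2 * (j % 2) + 4 * (k % 2)


def rows_by_color(nx, ny, nz):
    """Group the row indices of an nx x ny x nz grid by color."""
    groups = [[] for _ in range(8)]
    for k, j, i in product(range(nz), range(ny), range(nx)):
        groups[color_of(i, j, k)].append(i + nx * (j + ny * k))
    return groups


def has_same_color_neighbors(nx, ny, nz):
    """Check whether any two 27-point-stencil neighbors share a color."""
    for k, j, i in product(range(nz), range(ny), range(nx)):
        for dk, dj, di in product((-1, 0, 1), repeat=3):
            if (di, dj, dk) == (0, 0, 0):
                continue
            ni, nj, nk = i + di, j + dj, k + dk
            inside = 0 <= ni < nx and 0 <= nj < ny and 0 <= nk < nz
            if inside and color_of(ni, nj, nk) == color_of(i, j, k):
                return True
    return False


if __name__ == "__main__":
    groups = rows_by_color(4, 4, 4)
    print([len(g) for g in groups])
    print(has_same_color_neighbors(4, 4, 4))
```

Coloring changes the order in which rows are updated, so a Gauss-Seidel smoother built on it generally converges differently from the lexicographic sweep.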

- CSR: default matrix data storage format of HPCG 2.4
- JAD: format used by the SpMV function in a highly optimized numerical library
- ELL: well known as suitable for vector architectures
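The difference between CSR and ELL can be illustrated with a small conversion and an ELL-based SpMV. This is a sketch under stated assumptions, not the poster's code: the function names are illustrative, the ELL arrays are stored slot by slot (one list per nonzero slot, each as long as the number of rows), and padded slots carry the value 0.0 with column index 0.

```python
def csr_to_ell(row_ptr, col_idx, vals):
    """Convert CSR arrays to slot-major ELL arrays padded to a fixed width."""
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[i + 1] - row_ptr[i] for i in range(n_rows)), default=0)
    ell_vals = [[0.0] * n_rows for _ in range(width)]
    ell_cols = [[0] * n_rows for _ in range(width)]
    for i in range(n_rows):
        for slot, p in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_vals[slot][i] = vals[p]
            ell_cols[slot][i] = col_idx[p]
    return ell_vals, ell_cols


def ell_spmv(ell_vals, ell_cols, x, n_rows):
    """Compute y = A x; the inner loop runs over all rows of one slot."""
    y = [0.0] * n_rows
    for slot_vals, slot_cols in zip(ell_vals, ell_cols):
        # Long loop with no dependency between iterations: the part that
        # a vector processor can execute as one vector operation.
        for i in range(n_rows):
            y[i] += slot_vals[i] * x[slot_cols[i]]
    return y


if __name__ == "__main__":
    # 3 x 3 tridiagonal matrix [[4,-1,0],[-1,4,-1],[0,-1,4]] in CSR form.
    row_ptr = [0, 2, 5, 7]
    col_idx = [0, 1, 0, 1, 2, 1, 2]
    vals = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
    ell_vals, ell_cols = csr_to_ell(row_ptr, col_idx, vals)
    print(ell_spmv(ell_vals, ell_cols, [1.0, 2.0, 3.0], n_rows=3))
```

In CSR the inner loop is as short as one row (a few dozen nonzeros at most for a stencil matrix), whereas in ELL the inner loop is as long as the number of rows, which suits long vector lengths at the cost of padding.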

Selective data store into the on-chip memory ADB
- Only highly reusable data are selectively stored in the ADB.

By exploiting the programmer's knowledge about reusability, data can be selectively stored in the on-chip memory ADB of the SX-ACE processor. Not all data that the compiler considers reusable actually need to be stored there.
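Why selective storing helps can be shown with a toy buffer model. The sketch below is not a model of the ADB hardware: it uses a fully associative LRU buffer in place of the real organization, the access trace is synthetic, and all names are illustrative. Each pass reads a small reused vector `x` interleaved with streaming matrix elements `a` that are touched only once.

```python
from collections import OrderedDict


def count_hits(trace, capacity, cacheable):
    """Count hits in an LRU buffer; addresses failing `cacheable` bypass it."""
    buf = OrderedDict()
    hits = 0
    for addr in trace:
        if addr in buf:
            hits += 1
            buf.move_to_end(addr)        # mark as most recently used
        elif cacheable(addr):
            buf[addr] = True
            if len(buf) > capacity:
                buf.popitem(last=False)  # evict the least recently used entry
    return hits


def spmv_like_trace(n_reused, passes):
    """Interleave reused ("x", r) accesses with one-shot ("a", n) accesses."""
    trace, next_stream = [], 0
    for _ in range(passes):
        for r in range(n_reused):
            trace.append(("x", r))
            trace.append(("a", next_stream))
            next_stream += 1
    return trace


def cache_everything(addr):
    return True


def cache_reused_only(addr):
    return addr[0] == "x"


if __name__ == "__main__":
    trace = spmv_like_trace(n_reused=4, passes=3)
    print("all data:", count_hits(trace, 4, cache_everything))
    print("selective:", count_hits(trace, 4, cache_reused_only))
```

In this trace the streaming elements push `x` out of the buffer before it is reused when everything is stored, while the selective policy keeps `x` resident after the first pass.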

Problem size tuning for efficient use of ADB

To avoid evicting highly reusable data from the ADB, the problem size of HPCG has been tuned in consideration of both the capacity of the ADB and the size of the hyperplanes. In particular, the size of each dimension is important for the hyperplane method.
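The relation between the grid shape and the hyperplane size can be explored with a short calculation. The poster does not give the hyperplane definition or the amount of data kept per grid point, so both are assumptions here: hyperplanes are taken as h = i + 2j + 4k by default, `bytes_per_point` is a free parameter, and the 1 MB ADB capacity is interpreted as 2**20 bytes.

```python
ADB_BYTES = 1 << 20  # 1 MB ADB capacity, assumed here to mean 2**20 bytes


def hyperplane_sizes(dims, weights=(1, 2, 4)):
    """Number of grid points on each hyperplane h = wx*i + wy*j + wz*k."""
    counts = [1]  # before any dimension is added, only h = 0 is reachable
    for n, w in zip(dims, weights):
        new = [0] * (len(counts) + w * (n - 1))
        for h, c in enumerate(counts):
            if c:
                for idx in range(n):
                    new[h + w * idx] += c
        counts = new
    return counts


def largest_plane_bytes(dims, bytes_per_point, weights=(1, 2, 4)):
    """Bytes needed by the most populated hyperplane."""
    return max(hyperplane_sizes(dims, weights)) * bytes_per_point


def fits_in_adb(dims, bytes_per_point, adb_bytes=ADB_BYTES, weights=(1, 2, 4)):
    """True if the largest hyperplane's working set fits in the buffer."""
    return largest_plane_bytes(dims, bytes_per_point, weights) <= adb_bytes


if __name__ == "__main__":
    # Problem sizes taken from the poster's size-tuning figure.
    sizes = [(88, 216, 512), (96, 216, 464), (104, 216, 432),
             (112, 216, 400), (120, 216, 376)]
    for dims in sizes:
        print(dims, largest_plane_bytes(dims, bytes_per_point=8))
```

The candidate sizes contain nearly the same number of grid points, so a comparison of this kind isolates the effect of the shape of the grid rather than its volume.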

To exploit the high potential of the SX-ACE supercomputer on the HPCG benchmark, this poster discusses the following optimization techniques:
- Data rearrangement for efficient continuous memory accesses
- Various sparse matrix memory allocations
- The eight-coloring method and the hyperplane method for parallelization
- Selective data storage in the ADB and problem size tuning

As a result, the SX-ACE supercomputer achieves the highest efficiency, 11.4%, in the latest HPCG ranking.

* Since the overhead of memory rearrangement is too large on the SX-ACE processor, the row data rearrangement was not applied.

Figure panels:
- Performance improvements of the optimizations
- Effect of selective caching and size tuning
- Scalability and efficiency

[Figure: HPCG benchmark flow: Problem Setup, Validation Testing, Reference SpMV and MG Kernel Timing, Reference CG Timing & Residual Reduction, Optimized CG Setup, Optimized CG Timing and Analysis, Report Results]

Hyperplane
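The hyperplane method can be sketched as follows. The poster does not state which hyperplane definition is used, so this is one possible choice under an assumed 27-point stencil: h = i + 2j + 4k. With these weights, no two stencil neighbors lie on the same hyperplane, and every neighbor that a lexicographic sweep visits earlier lies on an earlier hyperplane, so processing hyperplanes in increasing order keeps the dependencies of the original sweep. The function names are illustrative.

```python
from itertools import product


def hyperplane_index(i, j, k, weights=(1, 2, 4)):
    """Hyperplane number of grid point (i, j, k)."""
    return weights[0] * i + weights[1] * j + weights[2] * k


def build_hyperplanes(nx, ny, nz, weights=(1, 2, 4)):
    """Group grid points by hyperplane, ordered by increasing index."""
    planes = {}
    for k, j, i in product(range(nz), range(ny), range(nx)):
        h = hyperplane_index(i, j, k, weights)
        planes.setdefault(h, []).append((i, j, k))
    return [planes[h] for h in sorted(planes)]


def respects_lexicographic_order(nx, ny, nz, weights=(1, 2, 4)):
    """True if 'visited earlier in the sweep' equals 'on an earlier plane'
    for every pair of 27-point-stencil neighbors."""
    for k, j, i in product(range(nz), range(ny), range(nx)):
        for dk, dj, di in product((-1, 0, 1), repeat=3):
            ni, nj, nk = i + di, j + dj, k + dk
            if (di, dj, dk) == (0, 0, 0):
                continue
            if not (0 <= ni < nx and 0 <= nj < ny and 0 <= nk < nz):
                continue
            earlier_in_sweep = (nk, nj, ni) < (k, j, i)
            earlier_plane = (hyperplane_index(ni, nj, nk, weights)
                             < hyperplane_index(i, j, k, weights))
            if earlier_in_sweep != earlier_plane:
                return False
    return True


if __name__ == "__main__":
    planes = build_hyperplanes(3, 3, 3)
    print(len(planes), [len(p) for p in planes])
    print(respects_lexicographic_order(3, 3, 3))
```

The simpler choice h = i + j + k fails this check for a 27-point stencil, because diagonal neighbors such as (1, 0, 0) and (0, 1, 0) then fall on the same hyperplane.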

[Figure: Scalability and efficiency. HPCG results (TFLOP/s, axis ticks 0.0, 0.1, 1.0, 10.0, 100.0) and efficiency (%, axis from 0 to 16) versus number of nodes (1, 2, 4, 8, 16, 32, 64, 128, 256, 512). Legend: Performance, Efficiency.]

The vectorization ratios of SpMV and MG reach 99.37% and 99.23%, respectively.

A: Reference
B: JAD + Coloring
C: ELL + Coloring
D: ELL + Hyperplane
E: ELL + Hyperplane + ADB
F: ELL + Hyperplane + ADB + Size tuning

HPCG solves a sparse linear system Ax = b, where A is a large sparse matrix discretized by the finite element method. The linear system is solved by the multigrid preconditioned conjugate gradient method with a symmetric Gauss-Seidel smoother.
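The structure of such a solver can be sketched in a few lines. This is not HPCG's code: the multigrid preconditioner is replaced by a single symmetric Gauss-Seidel sweep, the matrix is a small dense 1D Laplacian chosen only for illustration, and all function names are hypothetical.

```python
def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def matvec(A, x):
    return [dot(row, x) for row in A]


def sym_gauss_seidel(A, r):
    """One forward and one backward Gauss-Seidel sweep for A z = r, z0 = 0."""
    n = len(r)
    z = [0.0] * n
    for order in (range(n), range(n - 1, -1, -1)):
        for i in order:
            # Each update uses the newest values of z: this is the data
            # dependency that coloring and hyperplane ordering address.
            s = sum(A[i][j] * z[j] for j in range(n) if j != i)
            z[i] = (r[i] - s) / A[i][i]
    return z


def pcg(A, b, tol=1e-10, max_iter=100):
    """Preconditioned conjugate gradient; returns (solution, iterations)."""
    x = [0.0] * len(b)
    r = list(b)
    z = sym_gauss_seidel(A, r)
    p = list(z)
    rz = dot(r, z)
    for it in range(max_iter):
        if dot(r, r) ** 0.5 < tol:
            return x, it
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * qi for ri, qi in zip(r, Ap)]
        z = sym_gauss_seidel(A, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def laplacian_1d(n):
    """Dense tridiagonal matrix with 2 on the diagonal and -1 beside it."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    A = laplacian_1d(8)
    b = matvec(A, [1.0] * 8)
    x, iterations = pcg(A, b)
    print(iterations, max(abs(xi - 1.0) for xi in x))
```

The kernels timed by the benchmark map onto this loop: `dot` corresponds to DDOT, the vector updates to WAXPBY, `matvec` to SpMV, and the preconditioner call to MG.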

IXS network by two-stage fat-tree

[Figure: IXS network. Two planes (plane #0 and plane #1), each with edge routers (RTR (edge)) and spine routers (RTR (spine)); nodes #000 to #511 attach through their RCUs. Link labels: 4 GB/s x2 and 8 GB/s x2. Port labels: 16 ports, 16 ports, 32 ports.]

ADB of SX-ACE
- Private on-chip memory
- 1 MB, 4-way, 16-bank
- 256 GB/s bandwidth
- Customized for fast random accesses

Software controllable function
- Compiler and user can specify the use of the ADB
- A bypass mechanism for memory instructions
- Avoiding cache pollution
- Enhancement of indirect memory access

[Figure: Effect of selective caching and size tuning. MG results (GFLOP/s, axis from 22 to 34) for problem sizes 88x216x512, 96x216x464, 104x216x432, 112x216x400, and 120x216x376, comparing selective caching with all data caching.]