An Approach to the Highest Efficiency of the HPCG...

[Figure: HPCG results (GFLOP/s), axis 0 to 35, for configurations A to F]

An Approach to the Highest Efficiency of the HPCG Benchmark on the SX-ACE Supercomputer

Kazuhiko Komatsu 1)4), Ryusuke Egawa 1)4), Yoko Isobe 1)3), Ryusei Ogata 3), Hiroyuki Takizawa 2)4), Hiroaki Kobayashi 1)

1) Cyberscience Center, Tohoku University 2) GSIS, Tohoku University 3) NEC Corporation 4) JST CREST

Introduction

HPCG (High Performance Conjugate Gradient) has been developed to narrow the large performance gap between real applications and the HPL benchmark. The major features of HPCG are:
- It includes the major communication and computational patterns of real applications.
- It is easy to understand, optimize, and run.
- It can examine memory and network performance, e.g. indirect memory accesses, collective operations, and point-to-point messages.

Overview of the SX-ACE Supercomputer

Preliminary Evaluation before Optimization

Performance Evaluation of HPCG on SX-ACE

Conclusions

[Figure: CPU architecture. Labels: Core 0, Core 1 (SPU, VPU, ADB 1MB, MSHR), Core 2, Core 3; Crossbar; 16 MC; RCU; bandwidths 256GB/s and 4GB/s]

The SX-ACE supercomputer consists of 512 nodes. Each node is equipped with an SX-ACE processor, which can provide a high memory bandwidth for practical HPC applications.
- High vector computational performance by a 4-core vector processor
- High sustained memory bandwidth by a strong memory subsystem
- ADB (Assignable Data Buffer) to keep a high sustained memory bandwidth
- Scalable multinode system by a two-stage fat-tree custom network

Optimizations of HPCG for SX-ACE

<snip>
GFLOP/s Summary:
  Raw DDOT: 28.2595
  Raw WAXPBY: 17.8771
  Raw SpMV: 0.441609
  Raw MG: 0.586906
  Raw Total: 0.577329
  Total with convergence overhead: 0.577329
<snip>

To decide the optimization plan, the reference HPCG is executed.

Preliminary Evaluation Environments
- Supercomputer: SX-ACE
- Number of nodes: 1
- Compiler: NEC SX C/C++ Compiler Version 1.0
- HPCG version: Release 2.4
- Problem size: 104 x 104 x 104 (default)
- Parallelization: Flat-MPI

HPCG-Benchmark-2.4.yaml

This low performance is caused by quite low vectorization rates and inefficient memory accesses. Optimizations for efficient vector calculations and memory accesses are essential.
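To put the reference result in perspective, the efficiency figure used in HPCG rankings is the measured GFLOP/s divided by the theoretical peak. The sketch below applies that ratio to the "Raw Total" value reported above. The per-node peak of 256 GFLOP/s is an assumption that does not appear on the poster, and the function and constant names are illustrative only.

```python
# Assumption (not stated on the poster): theoretical peak of one SX-ACE node.
ASSUMED_PEAK_GFLOPS_PER_NODE = 256.0


def hpcg_efficiency_percent(hpcg_gflops, nodes,
                            peak_gflops_per_node=ASSUMED_PEAK_GFLOPS_PER_NODE):
    """Return the HPCG result as a percentage of the theoretical peak."""
    return 100.0 * hpcg_gflops / (nodes * peak_gflops_per_node)


# "Raw Total" of the reference run on one node, taken from the output above.
reference_efficiency = hpcg_efficiency_percent(0.577329, nodes=1)
print(f"{reference_efficiency:.3f}%")  # prints 0.226%
```

Under that assumed peak, the unoptimized reference code uses well under 1% of the node, which is consistent with the low vectorization rates described above.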

Data rearrangement for continuous memory accesses

[Figure: the per-row arrays matrixValues[0], matrixValues[1], ..., matrixValues[i] and mtxIndL[0], mtxIndL[1], ..., mtxIndL[i], shown in two memory layouts]
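As a minimal sketch of what such a rearrangement can look like, the code below copies rows that are stored separately into one contiguous buffer with a fixed stride, so that consecutive rows sit next to each other in memory. The helper `pack_rows`, the padding scheme, and the three-row example are hypothetical; only the names `matrixValues` and `mtxIndL` come from the poster.

```python
from array import array


def pack_rows(rows, width, typecode, pad):
    """Copy separately stored rows into one contiguous buffer.

    Row i occupies flat[i * width : (i + 1) * width]; short rows are padded.
    """
    flat = array(typecode, [pad]) * (len(rows) * width)
    for i, row in enumerate(rows):
        start = i * width
        flat[start:start + len(row)] = array(typecode, row)
    return flat


# Hypothetical 3-row example; the names mirror matrixValues and mtxIndL.
matrix_values = [[26.0, -1.0, -1.0], [-1.0, 26.0], [-1.0, 26.0]]
mtx_ind_l = [[0, 1, 2], [0, 1], [0, 2]]
WIDTH = 3  # assumed maximum number of nonzeros per row in this toy case

values_flat = pack_rows(matrix_values, WIDTH, "d", 0.0)
indices_flat = pack_rows(mtx_ind_l, WIDTH, "i", 0)
```

Padded entries carry a zero value, so they contribute nothing when the packed arrays are used in a matrix-vector product.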

Parallelization approaches to eliminate data dependencies
- Eight-coloring

Data packing for vector-friendly memory allocation of sparse matrices
- CSR: default matrix data storage format of HPCG 2.4
- JAD: SpMV function in a highly optimized numerical library
- ELL: well known as suitable for the vector architecture
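To illustrate why ELL suits a vector architecture, the sketch below converts a small CSR matrix to ELL and runs SpMV on it. ELL stores the k-th nonzero of every row together, so the inner loop runs over all rows with unit stride instead of over the few nonzeros of one row. This is a generic textbook formulation, not the poster's implementation; the function names and the 4 x 4 tridiagonal test matrix are assumptions.

```python
def csr_to_ell(row_ptr, col_idx, values):
    """Convert CSR arrays to ELL: ell_vals[k][r] is the k-th nonzero of row r."""
    n_rows = len(row_ptr) - 1
    width = max(row_ptr[r + 1] - row_ptr[r] for r in range(n_rows))
    # Short rows are padded with value 0.0 and column 0 (a harmless gather).
    ell_cols = [[0] * n_rows for _ in range(width)]
    ell_vals = [[0.0] * n_rows for _ in range(width)]
    for r in range(n_rows):
        for k, p in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            ell_cols[k][r] = col_idx[p]
            ell_vals[k][r] = values[p]
    return ell_cols, ell_vals


def spmv_ell(ell_cols, ell_vals, x):
    """Compute y = A x with A stored in ELL format."""
    n_rows = len(ell_vals[0])
    y = [0.0] * n_rows
    for cols_k, vals_k in zip(ell_cols, ell_vals):
        # Long inner loop over all rows: unit stride in vals_k, gather from x.
        for r in range(n_rows):
            y[r] += vals_k[r] * x[cols_k[r]]
    return y


# Hypothetical 4x4 tridiagonal matrix (2 on the diagonal, -1 next to it).
row_ptr = [0, 2, 5, 8, 10]
col_idx = [0, 1, 0, 1, 2, 1, 2, 3, 2, 3]
values = [2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0, -1.0, -1.0, 2.0]

ell_cols, ell_vals = csr_to_ell(row_ptr, col_idx, values)
y = spmv_ell(ell_cols, ell_vals, [1.0, 2.0, 3.0, 4.0])
print(y)  # prints [0.0, 0.0, 0.0, 5.0]
```

The cost of ELL is the padding: every row is stored with as many entries as the longest row, which is cheap when row lengths are nearly uniform.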

Selective data store into the on-chip memory ADB
- Only highly reusable data are selectively stored in ADB.

By exploiting the programmer's knowledge about reusability, data can be selectively stored in the on-chip memory ADB of the SX-ACE processor. Not all data that the compiler considers reusable need to be stored there.

Problem size tuning for efficient use of ADB

To avoid evicting highly reusable data from ADB, the problem size of HPCG has been tuned in consideration of both the capacity of ADB and the size of the hyperplanes. In particular, the size of each dimension is important for the hyperplane method.
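The link between grid dimensions and hyperplane size can be sketched as follows. The code groups the points of an nx x ny x nz grid into hyperplanes and reports the footprint of the largest one, which is the quantity that would have to fit in the 1 MB ADB. The poster does not define its hyperplanes; the weights h = i + 2j + 4k are an assumption chosen so that no two points of one hyperplane are neighbors in a 27-point stencil. The 8 bytes per point are also an assumption, since the poster does not say which arrays are kept in ADB.

```python
from collections import defaultdict
from itertools import product

ADB_BYTES = 1 << 20  # 1 MB ADB capacity, as stated on the poster


def hyperplanes(nx, ny, nz):
    """Group grid points by h = i + 2*j + 4*k (assumed weights)."""
    planes = defaultdict(list)
    for k, j, i in product(range(nz), range(ny), range(nx)):
        planes[i + 2 * j + 4 * k].append((i, j, k))
    return [planes[h] for h in sorted(planes)]


def plane_is_independent(plane):
    """True if no two points of the plane are 27-point-stencil neighbors."""
    points = set(plane)
    for i, j, k in plane:
        for di, dj, dk in product((-1, 0, 1), repeat=3):
            if (di, dj, dk) != (0, 0, 0) and (i + di, j + dj, k + dk) in points:
                return False
    return True


def largest_plane_bytes(nx, ny, nz, bytes_per_point=8):
    """Footprint of the largest hyperplane, assuming one value per point."""
    return max(len(p) for p in hyperplanes(nx, ny, nz)) * bytes_per_point


planes = hyperplanes(4, 4, 4)
all_independent = all(plane_is_independent(p) for p in planes)
fits_in_adb = largest_plane_bytes(16, 16, 16) <= ADB_BYTES
```

Because the hyperplanes are processed one after another and each one is internally independent, every hyperplane can be handled as a single vector loop; changing nx, ny, or nz changes how many points each hyperplane holds.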

To exploit the high potential of the SX-ACE supercomputer on the HPCG benchmark, this poster discusses the following optimization techniques:
- Data rearrangement for efficient continuous memory accesses
- Various sparse matrix memory allocations
- The eight-coloring method and the hyperplane method for parallelization
- Selective data storing in ADB, and problem size tuning

As a result, the SX-ACE supercomputer successfully achieves the highest efficiency of 11.4% in the latest HPCG ranking.

* Because the overhead of the memory rearrangement is too large for the SX-ACE processor, the row data rearrangement was not applied.

Performance improvements of the optimizations

Effect of selective caching and size tuning

Scalability and efficiency

[Figure: HPCG benchmark phases. Labels: Problem Setup; Validation Testing; Reference SpMV and MG Kernel Timing; Reference CG Timing & Residual Reduction; Optimized CG Setup; Optimized CG Timing and Analysis; Report Results]

Hyperplane

[Figure: HPCG results (TFLOP/s) and efficiency (%) versus number of nodes (1 to 512)]

Legend: Performance, Efficiency

The vectorization ratios of SpMV and MG reach 99.37% and 99.23%, respectively.

A: Reference
B: JAD + Coloring
C: ELL + Coloring
D: ELL + Hyperplane
E: ELL + Hyperplane + ADB
F: ELL + Hyperplane + ADB + Size tuning

HPCG solves a sparse linear system Ax = b, where A is a large sparse matrix discretized by the finite element method. The linear system is solved by the multigrid preconditioned conjugate gradient method with a symmetric Gauss-Seidel smoother.
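A plain Gauss-Seidel sweep updates each unknown using values updated just before it, which blocks vectorization. The sketch below shows one way the eight-coloring named on this poster removes that dependency: points are colored by the parity of their grid coordinates, and all points of one color are updated independently. The 27-point operator with 26 on the diagonal and -1 elsewhere, the dictionary storage, and all function names are assumptions made for illustration, not the poster's code.

```python
from itertools import product

OFFSETS = [d for d in product((-1, 0, 1), repeat=3) if d != (0, 0, 0)]


def color_of(i, j, k):
    """Eight colors from coordinate parity; stencil neighbors never match."""
    return (i % 2) + 2 * (j % 2) + 4 * (k % 2)


def neighbors(p, n):
    """Yield the in-range 27-point-stencil neighbors of grid point p."""
    i, j, k = p
    for di, dj, dk in OFFSETS:
        q = (i + di, j + dj, k + dk)
        if all(0 <= c < n for c in q):
            yield q


def colored_symgs(x, b, n):
    """One symmetric sweep (colors 0..7, then 7..0) on an n*n*n grid."""
    by_color = [[] for _ in range(8)]
    for p in product(range(n), repeat=3):
        by_color[color_of(*p)].append(p)
    for c in list(range(8)) + list(range(7, -1, -1)):
        for p in by_color[c]:  # updates within one color are independent
            # Row equation: 26 * x[p] - sum(x[q]) = b[p]
            x[p] = (b[p] + sum(x[q] for q in neighbors(p, n))) / 26.0
    return x


# Toy problem whose exact solution is 1.0 at every grid point.
n = 4
points = list(product(range(n), repeat=3))
b = {p: 26.0 - sum(1 for _ in neighbors(p, n)) for p in points}
x = {p: 0.0 for p in points}
for _ in range(100):  # deliberately generous number of sweeps
    colored_symgs(x, b, n)
max_error = max(abs(x[p] - 1.0) for p in points)
```

In HPCG the smoother is applied only a few times per multigrid level; the repeated sweeps here are used purely to show that the colored ordering still converges to the solution.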

[Figure: IXS network by two-stage fat-tree. Labels: plane #0 and plane #1, each with RTR (spine) and RTR (edge) routers; nodes #000 to #015 through #496 to #511, each with an RCU; IXS; 16 ports, 16 ports, 32 ports; 4GB/s x2, 8GB/s x2]

ADB of SX-ACE
- Private on-chip memory
- 1 MB, 4-way, 16-bank
- 256 GB/s bandwidth
- Customized for fast random accesses

Software controllable function
- Compiler and user can specify the use of the ADB
- A bypass mechanism for memory instructions
- Avoiding cache pollution
- Enhancement of indirect memory access

[Figure: MG results (GFLOP/s, axis 22 to 34) versus problem size (88x216x512, 96x216x464, 104x216x432, 112x216x400, 120x216x376), comparing selective caching with all data caching]
