An Approach to the Highest Efficiency of the HPCG...


An Approach to the Highest Efficiency of the HPCG Benchmark on the SX-ACE Supercomputer

Kazuhiko Komatsu 1)4), Ryusuke Egawa 1)4), Yoko Isobe 1)3), Ryusei Ogata 3), Hiroyuki Takizawa 2)4), Hiroaki Kobayashi 1)

1) Cyberscience Center, Tohoku University 2) GSIS, Tohoku University 3) NEC Corporation 4) JST CREST

Introduction

HPCG (High Performance Conjugate Gradient) has been developed to narrow the large performance gap between real applications and the HPL benchmark. The major features of HPCG are:
- It includes the major communication and computational patterns of real applications.
- It is easy to understand, optimize, and run.
- It can examine memory and network performance, e.g., indirect memory accesses, collective operations, and point-to-point messages.

Overview of the SX-ACE Supercomputer

Preliminary Evaluation before Optimization

Performance Evaluation of HPCG on SX-ACE

Conclusions

[Figure: CPU architecture. Labels: Crossbar; Core 0, Core 1, Core 2, Core 3; SPU; VPU; ADB 1MB; MSHR; 256GB/s.]

The SX-ACE supercomputer consists of 512 nodes. Each node is equipped with an SX-ACE processor, which provides high memory bandwidth for practical HPC applications.
- High vector computational performance from a 4-core vector processor
- High sustained memory bandwidth from a strong memory subsystem
- ADB (Assignable Data Buffer) to keep the sustained memory bandwidth high
- Scalable multi-node system with a two-stage fat-tree custom network

Optimizations of HPCG for SX-ACE

To decide on the optimization plan, the reference HPCG is first executed.

Preliminary Evaluation Environments
- Supercomputer: SX-ACE
- Number of nodes: 1
- Compiler: NEC SX C/C++ Compiler Version 1.0
- HPCG version: Release 2.4
- Problem size: 104 x 104 x 104 (default)
- Parallelization: Flat-MPI

HPCG-Benchmark-2.4.yaml

<snip>
GFLOP/s Summary:
  Raw DDOT: 28.2595
  Raw WAXPBY: 17.8771
  Raw SpMV: 0.441609
  Raw MG: 0.586906
  Raw Total: 0.577329
  Total with convergence overhead: 0.577329
<snip>

This low performance is caused by very low vectorization rates and inefficient memory accesses. Optimizations for efficient vector calculations and memory accesses are essential.

Data rearrangement for continuous memory accesses

[Figure: memory layout of the per-row arrays matrixValues[0], matrixValues[1], ..., matrixValues[i] and mtxIndL[0], mtxIndL[1], ..., mtxIndL[i].]

Data packing for vector-friendly memory allocation of sparse matrices

Eight-coloring

Parallelization approaches to eliminate data dependencies
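As a minimal sketch of the eight-coloring idea, the following assumes a structured 3D grid with a 27-point stencil and assigns each point a color from the parities of its coordinates. The function names (`color_of`, `group_by_color`, `coloring_is_valid`) are hypothetical and are not taken from the poster or from the HPCG source. Two stencil neighbors always differ by one in at least one coordinate, so they never share a color, and all points of one color can be updated in one dependence-free loop.

```python
from itertools import product


def color_of(i, j, k):
    """Eight-coloring: one color per parity combination of (i, j, k)."""
    return (i % 2) + 2 * (j % 2) + 4 * (k % 2)


def neighbors(i, j, k, nx, ny, nz):
    """Yield the in-grid neighbors of (i, j, k) in a 27-point stencil."""
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        if (di, dj, dk) == (0, 0, 0):
            continue
        a, b, c = i + di, j + dj, k + dk
        if 0 <= a < nx and 0 <= b < ny and 0 <= c < nz:
            yield a, b, c


def group_by_color(nx, ny, nz):
    """Return eight lists of grid points, one list per color."""
    groups = [[] for _ in range(8)]
    for k, j, i in product(range(nz), range(ny), range(nx)):
        groups[color_of(i, j, k)].append((i, j, k))
    return groups


def coloring_is_valid(nx, ny, nz):
    """Check that no point shares a color with any of its stencil neighbors."""
    for k, j, i in product(range(nz), range(ny), range(nx)):
        c = color_of(i, j, k)
        if any(color_of(*n) == c for n in neighbors(i, j, k, nx, ny, nz)):
            return False
    return True


if __name__ == "__main__":
    print([len(g) for g in group_by_color(4, 4, 4)])
    print(coloring_is_valid(4, 4, 4))
```

A Gauss-Seidel sweep would then process the eight groups one after another, with the points inside each group handled as a single vector loop.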

- CSR: default matrix data storage format of HPCG 2.4
- JAD: format used by the SpMV function in a highly optimized numerical library
- ELL: well known as suitable for the vector architecture
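To illustrate why ELL suits vector processing, here is a small sketch that converts a CSR matrix to ELL and runs SpMV on it. The names `csr_to_ell` and `spmv_ell`, the zero padding, and the slot-major layout are assumptions chosen for illustration, not the poster's implementation. Every row is padded to the same number of entries, so the inner loop runs over all rows with a fixed trip count and no row-dependent bounds.

```python
def csr_to_ell(row_ptr, col_idx, values):
    """Convert CSR arrays to ELL: ell_cols[s][r] and ell_vals[s][r] hold the
    s-th nonzero of row r. Short rows are padded with column 0 and value 0.0."""
    n_rows = len(row_ptr) - 1
    width = max((row_ptr[r + 1] - row_ptr[r] for r in range(n_rows)), default=0)
    ell_cols = [[0] * n_rows for _ in range(width)]
    ell_vals = [[0.0] * n_rows for _ in range(width)]
    for r in range(n_rows):
        for s, p in enumerate(range(row_ptr[r], row_ptr[r + 1])):
            ell_cols[s][r] = col_idx[p]
            ell_vals[s][r] = values[p]
    return ell_cols, ell_vals


def spmv_ell(ell_cols, ell_vals, x, n_rows):
    """Compute y = A x for an ELL matrix."""
    y = [0.0] * n_rows
    for cols, vals in zip(ell_cols, ell_vals):
        # Long loop over all rows with no dependence between iterations;
        # this is the loop a vector processor would vectorize.
        for r in range(n_rows):
            y[r] += vals[r] * x[cols[r]]
    return y


def spmv_csr(row_ptr, col_idx, values, x):
    """Reference CSR SpMV: the inner loop length varies from row to row."""
    return [
        sum(values[p] * x[col_idx[p]] for p in range(row_ptr[r], row_ptr[r + 1]))
        for r in range(len(row_ptr) - 1)
    ]


# 3 x 3 tridiagonal example matrix: [[4, -1, 0], [-1, 4, -1], [0, -1, 4]]
row_ptr = [0, 2, 5, 7]
col_idx = [0, 1, 0, 1, 2, 1, 2]
values = [4.0, -1.0, -1.0, 4.0, -1.0, -1.0, 4.0]
x = [1.0, 2.0, 3.0]

if __name__ == "__main__":
    ell_cols, ell_vals = csr_to_ell(row_ptr, col_idx, values)
    print(spmv_ell(ell_cols, ell_vals, x, 3))
```

The padding costs extra memory and arithmetic, which stays small when all rows have a similar number of nonzeros.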

Selective data store into the on-chip memory ADB
- Only highly reusable data are selectively stored in the ADB.

By exploiting the programmer's knowledge of data reusability, data can be stored selectively in the ADB, the on-chip memory of the SX-ACE processor. Not all data that the compiler considers reusable actually need to be stored there.

Problem size tuning for efficient use of ADB

To avoid evicting highly reusable data from the ADB, the problem size of HPCG has been tuned with both the capacity of the ADB and the size of the hyperplanes in mind. In particular, the size of each dimension is important for the hyperplane method.
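The link between grid dimensions and hyperplane size can be explored with a rough sketch. It builds hyperplanes by level scheduling: each point is placed one level after the latest of the 27-point-stencil neighbors that precede it in lexicographic order. This is one common way to obtain a hyperplane ordering, and the poster does not state that its implementation works this way. The function names and the figure of 8 bytes per grid point are assumptions made only for this estimate.

```python
from itertools import product

ADB_BYTES = 1 << 20  # 1 MB ADB, as stated for SX-ACE


def lower_neighbors(i, j, k, nx, ny, nz):
    """27-point-stencil neighbors that precede (i, j, k) in lexicographic order."""
    me = i + nx * (j + ny * k)
    for di, dj, dk in product((-1, 0, 1), repeat=3):
        a, b, c = i + di, j + dj, k + dk
        if 0 <= a < nx and 0 <= b < ny and 0 <= c < nz:
            if a + nx * (b + ny * c) < me:
                yield a, b, c


def hyperplanes(nx, ny, nz):
    """Group grid points into levels with no dependencies inside a level."""
    level = {}
    planes = []
    # Visit points in lexicographic order (i fastest), so every lower
    # neighbor already has a level when a point is processed.
    for k, j, i in product(range(nz), range(ny), range(nx)):
        deps = (level[p] for p in lower_neighbors(i, j, k, nx, ny, nz))
        lv = 1 + max(deps, default=-1)
        level[(i, j, k)] = lv
        if lv == len(planes):
            planes.append([])
        planes[lv].append((i, j, k))
    return planes


def widest_plane_bytes(nx, ny, nz, bytes_per_point=8):
    """Estimated footprint of the largest hyperplane (assumed 8 bytes per point)."""
    return max(len(p) for p in hyperplanes(nx, ny, nz)) * bytes_per_point


if __name__ == "__main__":
    for dims in [(8, 8, 8), (16, 8, 4), (4, 8, 16)]:
        planes = hyperplanes(*dims)
        footprint = widest_plane_bytes(*dims)
        print(dims, len(planes), footprint, footprint <= ADB_BYTES)
```

The three example shapes contain the same number of points but produce different numbers and sizes of hyperplanes, which is why the size of each dimension, and not only the total problem size, matters when tuning.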

To exploit the high potential of the SX-ACE supercomputer on the HPCG benchmark, this poster discusses the following optimization techniques:
- Data rearrangement for efficient continuous memory accesses
- Various memory allocations of sparse matrices
- The eight-coloring method and the hyperplane method for parallelization
- Selective data store into the ADB and problem size tuning

As a result, the SX-ACE supercomputer achieves the highest efficiency, 11.4%, in the latest HPCG ranking.

* Because the overhead of memory rearrangement is too large on the SX-ACE processor, the row data rearrangement was not applied.

[Figure: Performance improvements of the optimizations. Legend: A: Reference; B: JAD + Coloring; C: ELL + Coloring; D: ELL + Hyperplane; F: ELL + Hyperplane + ADB; G: ELL + Hyperplane + ADB + Size tuning. The vectorization ratios of SpMV and MG become 99.37% and 99.23%.]

[Figure: Effect of selective caching and size tuning.]

[Figure: Scalability and efficiency. Horizontal axis: number of nodes (1, 2, 4, 8, 16, 32, 64, 128, 256, 512); plotted quantities: performance and efficiency.]

[Figure: HPCG phases. Labels: Reference CG Timing & Residual Reduction; Reference SpMV and MG Kernel Timing; Optimized CG Setup; Optimized CG Timing and Analysis; Report Results; Problem Setup; Validation Testing.]

[Figure label: Hyperplane]

HPCG solves a sparse linear system Ax = b, where A is a large sparse matrix discretized by the finite element method. The linear system is solved by the multigrid-preconditioned conjugate gradient method with a symmetric Gauss-Seidel smoother.
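The structure of the solver can be sketched in a few lines. This is a simplified illustration and not the HPCG code. It assumes a small dense 1D Laplacian in place of HPCG's large sparse matrix, and it uses a single symmetric Gauss-Seidel sweep as the preconditioner with the multigrid hierarchy omitted. All function names are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product (stands in for SpMV)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Dot product (DDOT in HPCG terms)."""
    return sum(a * b for a, b in zip(u, v))


def symgs(A, r):
    """One symmetric Gauss-Seidel sweep (forward, then backward) on A z = r,
    starting from z = 0. Used here as the preconditioner."""
    n = len(r)
    z = [0.0] * n
    for order in (range(n), range(n - 1, -1, -1)):
        for i in order:
            s = r[i] - sum(A[i][j] * z[j] for j in range(n) if j != i)
            z[i] = s / A[i][i]
    return z


def pcg(A, b, tol=1e-10, max_iter=50):
    """Preconditioned conjugate gradient; returns (solution, iterations)."""
    x = [0.0] * len(b)
    r = list(b)
    z = symgs(A, r)
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for it in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # WAXPBY-style update
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = symgs(A, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


# Assumed test problem: 8 x 8 1D Laplacian with exact solution (1, ..., 1).
n = 8
A = [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0 for j in range(n)]
     for i in range(n)]
b = matvec(A, [1.0] * n)

if __name__ == "__main__":
    x, iters = pcg(A, b)
    print(iters, max(abs(xi - 1.0) for xi in x))
```

The forward and backward loops inside `symgs` carry a dependence from one row to the next, which is the dependency that the coloring and hyperplane methods are meant to remove.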

[Figure: IXS network by two-stage fat-tree. Labels: plane #0 and plane #1, each with RTR (spine) and RTR (edge); node #000 ... node #015, ..., node #496 ... node #511; 16 ports; 32 ports; 4GB/s x2; 8GB/s x2.]

ADB of SX-ACE
- Private on-chip memory
- 1 MB, 4-way, 16-bank
- 256 GB/s bandwidth
- Customized for fast random accesses

Software controllable function
- Compiler and user can specify the use of the ADB
- A bypass mechanism for memory instructions
- Avoiding cache pollution
- Enhancement of indirect memory access

[Figure axis: Problem size. Series: Selective caching; All data caching.]
