Optimizing Precision for High-Performance, A joint France...

Optimizing Precision for High-Performance, Robust, and Energy-Efficient Computations

RIKEN Center for Computational Science (R-CCS), JapanDaichi Mukunoki, Toshiyuki Imamura, Yiyu Tan,Atsushi Koshiba, Jens Huthmann, Kentaro Sano

Sorbonne University,CNRS, LIP6, FranceRoman Iakymchuk, Stef Graillat, Fabienne Jézéquel

Center for Computational Sciences, University of Tsukuba, JapanNorihisa Fujita, Taisuke Boku

We proposed a new systema.c approach for minimal-precisioncomputa.ons. This approach is robust, general, comprehensive,high-performant, and realis.c. Although the proposed system is s.llin development, it can be constructed by combining already available(developed) in-house technologies and extending them. Our ongoingstep is to demonstrate the system on a proxy applica.on

* All authors contributed equally to this research

In numerical computa.ons, the precision of floa.ng-pointcomputa.ons is a key factor to determine the performance(speed and energy-efficiency) as well as the reliability(accuracy and reproducibility). However, the precisiongenerally plays a contrary role for both. Therefore, theul.mate concept for maximizing both at the same .me is theop#mized/reduced precision computa#on through preci-sion-tuning, which adjusts the minimal precision for eachopera#on and data. Several studies have been alreadyconducted for it so far, but the scope of those studies islimited to the precision-tuning alone. Instead, we propose amore broad concept of the op.mized/minimal precision androbust (numerically reliable) compu.ng with precision-tuning, involving both hardware and soIware stack

Minimal-Precision Computing

Robust (Numeraically Reliable)To ensure the requested accuracy, the precision-tuning isprocessed based on numerical valida.on, guaranteeingalso reproducibility

GeneralOur scheme is applicable for any floating-pointcomputations. It contributes to low development cost andsustainability (easy maintenance and system portability)

ComprehensiveWe propose a total system from the precision-tuning to theexecu.on of the tuned code, combining heterogeneoushardware and hierarchical soIware stack

High-performancePerformance can be improved through the minimal-precision as well as fast numerical libraries and accelerators

Realis#cOur system can be realized by combining available in-housetechnologies

System Stack System Workflow

Low-level codefor FPGA

(VHDL etc.)

Compila.on and Execu.on on FPGA

Input:C code with MPFR(and MPLAPACK)

Precision-Op,mizer (with PROMISE

and CADNA/SAM)

Code Translation for FPGA

(SPGen, Nymble, FloPoCo)

C code with MPFR (op.mized)

Performance Optimization

C code with MPFR + other fast accurate

methods

Compilation and Execution

on CPU/GPU

YesA part of the C code with MPFR, which is executed on FPGA

Precision-Optimizer• The Precision-Optimizer

determines the minimal floating-point precisions, which need to achieve the desired accuracy

Performance Optimization• At this stage, if possible to speedup

some parts of the code with some other accurate computation methods than MPFR, those parts are replaced with them

• The required-accuracy must be taken into account

• If possible, it considers to utilize FPGA (as heterogeneous computing)

Hardware

Heterogeneous System

ArithmeticLibrary

Numerical Libraries

NumericalValidation

Precision Tuning

Arbitrary-Precision

FP32, FP64,(FP16, FP128)

Fast AccurateMethods/ Libraries

others…LAPACK

Stochastic ArithmeticSAMCADNA

PROMISE with SAM

PROMISE

acceleration

GPUGPUNymble

FloPoCo

available in development

SPGenfor arbitrary-prec.

Compiler & Tools for FPGA

others…OzBLASExBLAS

QPEigenMPLAPACK (MPBLAS)

MPLAPACK [7]An open-source multi-precision BLAS and LAPACK based on several high-precision tools such as MPFR, QD, and FP128

MPFR [2]A C library for multiple (arbitrary) precision floating-point computations on CPUs

FloPoCo [16]An open-source floating-point core generator for FPGA supporting arbitrary-precision

Energy-EfficientThrough the minimal-precision as well as the energy-efficient hardware accelera.on with FPGA and GPU

FPGA as an Arbitrary-Precision Computing Platform Fast and Accurate Numerical Libraries

Error-free transforma-on of dot-product ( xTy )

52115211

5211 52

OzBLAS (TWCU, RIKEN)• OzBLAS [13] is an accurate & reproducible BLAS

using Ozaki scheme [18], which is an accurate matrix mulCplicaCon method based on the error-free transformaCon of dot-product

• The accuracy is tunable and depends on the range of the inputs and the vector length

• CPU and GPU (CUDA) versions

ExBLAS (Sorbonne University)• ExBLAS [12] is an accurate & reproducible

BLAS based on floating-point expansionswith error-free transformations (EFT:twosum and twoprod) and super-accumulator

• Assures reproducibility through assuringcorrect-rounding: it preserves every bit ofinformation until the final rounding to thedesired format

• CPU (Intel TBB) and GPU (OpenCL) versions

(x', x') = split(x) { ρ = ⌈(log2(u−1) + log2(n + 1))/2⌉ τ = ⌈log2(max1 ≤ i ≤ n |xi|)⌉ σ = 2(ρ+τ)

x' = fl((x + σ) - σ) x' = fl(x - x')}

(1) x and y are split by split()(x(1), x(2)) = split(x), (y(1), y(2)) = split(y)

it is applied recursively until x(p+1) = y(q+1) = 0x = x(1) + x(2) +…+ x(p), y = y(1) + y(2) +…+ y(q)

(2) then, xTy is computed asxTy = (x(1))Ty(1) + (x(1))Ty(2) +…+ (x(1))Ty(q)

+ (x(2))Ty(1) + (x(2))Ty(2) +…+…+ (x(p))Ty(1) + (x(p))Ty(2) +…+ (x(p))Ty(q)

(x(i))Ty(j) is error-free: (x(i))Ty(j) = fl((x(i))Ty(j))

PCIe x16PCIe x16

Intel Xeon Gold

IB HDR100

Stratix10

IB HDR100

Stratix10

PLX PLX

PCIe x16 PCIe x16

Cygnus (University of Tsukuba)• Cygnus is the world first supercomputer

system equipped with both GPU (4xTesla V100) and FPGA (2x Stratix 10),installed in CCS, University of Tsukuba

Cygnus system

Double-double format

ExBLAS scheme

SPGen (RIKEN)• SPGen (Stream Processor Generator)

[14] is a compiler to generate HW module codes in Verilog-HDL for FPGA from input codes in Stream Processing DescripCon (SPD) Format. The SPD uses a data-flow graph representaCon, which is suitable for FPGA.

• It supports FP32 only, but we are going to extend SPGen to support arbitrary-precision floaCng-point. Currently, there is no FPGA compiler supporCng arbitrary-precision.

Stochastic Arithmetic Tools

CADNA & SAM (Sorbonne University)• CADNA (Control of Accuracy and Debugging for Numerical

Applications) [18] is a DSA library for FP16/32/64/128• CADNA can be used on CPUs in Fortran/C/C++ codes with

OpenMP & MPI and on GPUs with CUDA.• SAM (Stochastic Arithmetic in Multiprecision) [23] is a DSA

library for arbitrary-precision with MPFR.

PROMISE (Sorbonne University)

Precision tuning based on Delta-Debugging

Discrete StochasCc ArithmeCc (DSA) [21] enables us to esCmate rounding errors (i.e., the number of correct digits in the result) with 95% accuracy by execuCng the code 3 Cmes with random-rounding. DSA is a general scheme applicable for any floaCng-point operaCons: no special algorithms and no code modificaCon are needed. It is a light-weight approach in terms of performance, usability, and development cost compared to the other numerical verificaCon / validaCon methods.

QPEigen & QPBLAS (JAEA, RIKEN)• Quadruple-precision Eigen solvers (QPEigen) [8,

25] is based on double-double (DD) arithmetic. It is built on a quadruple-precision BLAS (QPBLAS) [9]. They support distributed environments with MPI: equivalent to ScaLAPACK’s Eigen solver and PBLAS

Arbitrary-precision arithmeCc is performed using MPFR on CPUs, but the performance is very low. To accelerate it, we are developing several numerical libraries supporCng accurate computaCon based on high-precision arithmeCc or algorithmic approach. Some sojware also support GPU acceleraCon.

FPGA enables us to implement arbitrary-precision on hardware. High-Level Synthesis (HLS) enables us to program it in OpenCL. However, compiling arbitrary-precision code and obtaining high performance are still challenging. Heterogeneous computing with FPGA & CPU/GPU is also a challenge

Name Core; ### Define IP core “Core”Main_In {in:: x0_0, x0_1, y0_0, y0_1};Main_Out {out::x2_0, x2_1, y2_0, y2_1};

### Description of parallel pipelines for t=0HDL pe10, 123, (x1_0, y1_0) = PE(x0_0, y0_0);HDL pe11, 123, (x1_1, y1_1) = PE(x0_1, y0_1);

### Description of parallel pipelines for t=1HDL pe20, 123, (x2_0, y2_0) = PE(x1_0, y1_0);HDL pe21, 123, (x2_1, y2_1) = PE(x1_1, y1_1);

Name PE; ### Define pipeline “PE”Main_In {in:: x_in, y_in};Main_Out {out::x_out, y_out};

EQU eq1, t1 = x_in * y_in;EQU eq2, t2 = x_in / y_in;EQU eq3, x_out = t1 + t2;EQU eq3, y_out = t1 - t2;

PEpe10

PEpe11

x0_0 y0_0 x0_1 y0_1

x1_0 y1_0 x1_1 y1_1

PEpe20

PEpe21

x2_0 y2_0 x2_1 y2_1

x_in y_in

x_out y_out

Module definition with data-flow graphby describing formulae of computation

Module definition with hardware structureby describing connections of modules

SPD codesOptimization tool + Clustering DFG+ Mapping clusters

SPGen+ Wrap nodes with

data-flow control logic+ Connect nodes w/ wires+ Equalize length of paths

Hardwaremodulesin HDL

C/C++ codesC/C++ Frontend+ LLVM based+ Polyhedral trans.

in development

OptimizedSPD codes

��

FPGAbitstreams

in development

FINISH

Host codes

CPUAvailable now

Nymble (TU Darmstadt, RIKEN)• Nymble [15] is another compiler project for

FPGA. It directly accepts C codes and has alreadystarted to support arbitrary-precision.

• It is more suited for non-linear memory accesspattern, like with graph based data structures.

(1) The same code is run several Cmes with the random rounding mode (results are rounded up / down with the same probability)

(2) Different results are obtained(3) The common part in the different

results is assumed to be a reliable result

floating-point code

3.14160...3.14161…3.14159…Several

Execu-ons withrandom-rounding Reliable result

• PROMISE (PRecision OpCMISE) [17] is a tool based on Delta-Debugging [24] to automaCcally tune the precision of floaCng-point variables in C/C++ codes

• The validity of the results is checked with CADNA

• We are going to extend PROMISE for arbitrary-precision with MPFR

• Each Stratix 10 FPGA has fourexternal links at 100Gbps. 64FPGAs make 8x8 2D-Torus networkfor communication

• This project targets such aheterogeneous system with FPGA.

Notation:• fl (…): computation

with floating-point arithmetic

• u: the unit round-off

Acknowledgement:This research was partially supported by the European Union's Horizon 2020 research, innovation programme under the MarieSkłodowska-Curie grant agreement via the Robust project No. 842528, the Japan Society for the Promotion of Science (JSPS)KAKENHI Grant No. 19K20286, and Multidisciplinary Cooperative Research Program in CCS, University of Tsukuba.

References[1] D. H. Bailey, “QD (C++/Fortran-90 double-double and quad-double package),” hop://crd.lbl.gov/~dhbailey/mpdist [2] G. Hanrot et al, “MPFR : GNU MPFR Library,” hop://www.mpfr.org[3] D. H. Bailey et al., “ARPREC: An Arbitrary Precision Computa.on Package,” Lawrence Berkeley Na.onal Laboratory Technical Report, No. LBNL-53651, 2002.[4] CAMPARY, hop://homepages.laas.fr/mmjoldes/campary[5] T. Ogita et al., ”Accurate Sum and Dot Product,” SIAM J. Sci. Comput., Vol. 26, pp. 1955-1988, 2005. [6] K. Ozaki et al., “Error-free transforma.ons of matrix mul.plica.on by using fast rou.nes of matrix mul.plica.on and its applica.ons,” Numer. Algorithms, Vol. 59, No. 1, pp. 95-118, 2012.[7] M. Nakata, “The MPACK (MBLAS/MLAPACK); a mul.ple precision arithme.c version of BLAS and LAPACK,” Ver. 0.6.7, hop://mplapack.sourceforge.net, 2010.[8] Japan Atomic Energy Agency, “Quadruple Precision Eigenvalue Calcula.on Library QPEigen Ver.1.0 User’s Manual,” hops://ccse.jaea.go.jp/ja/download/qpeigen_k/qpeigen_manual_en-1.0.pdf, 2015.[9] Japan Atomic Energy Agency, “Quadruple Precision BLAS Rou.nes QPBLAS Ver.1.0 User’s Manual,” hops://ccse.jaea.go.jp/ja/download/qpblas/1.0/qpblas_manual_en-1.0.pdf, 2013.[10] X. Li et al., “XBLAS – Extra Precise Basic Linear Algebra Subrou.nes,” hop://www.netlib.org/xblas [11] P. Ahrens et al., “ReproBLAS – Reproducible Basic Linear Algebra Sub-programs,” hops://bebop.cs.berkeley.edu/reproblas in Asia Poster, Interna.onal Supercompu.ng Conference (ISC’16), 2016.[12] R.Iakymchuk et al., “ExBLAS: Reproducible and Accurate BLAS Library,” Proc. Numerical Reproducibility at Exascale(NRE2015) workshop at SC15, HAL ID: hal-01202396, 2015.

Our Contributions

Conclusion & Future Work[13] D. Mukunoki et al., “Accurate and Reproducible BLAS Routines with Ozaki Scheme for Many-core Architectures,” Proc. PPAM2019, 2019 (accepted). [14] K. Sano et al., “Stream Processor Generator for HPC to Embedded Applications on FPGA-based System Platform,” Proc. First International Workshop on FPGAs for Software Programmers (FSP 2014), pp. 43-48, 2014. [15] J. Huthmann et al., “Hardware/software co-compilation with the Nymble system,” 2013 8th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC), pp. 1-8, 2013.[16] F. Dinechin and B. Pasca, “Designing custom arithmetic data paths with FloPoCo,” IEEE Design & Test of Computers, Vol. 28, No. 4, pp. 18-27, 2011.[17] S. Graillat et al., “Auto-tuning for floating-point precision with Discrete Stochastic Arithmetic,” Journal of Computational Science, Vol.36, 101017, 2019.[18] F. Jézéquel and J.-M. Chesneaux, “CADNA: a library for estimating round-off error propagation,” Computer Physics Communications, Vol. 178, No. 12, pp. 933-955, 2008.[19] F. Févotte and B. Lathuilière, “Debugging and optimization of HPC programs in mixed precision with the Verrou tool,” hal-02044101, 2019.[20] C. Rubio-González et al., “Precimonious: Tuning Assistant for Floating-Point Precision,” Proc. SC’13, 2013.[21] J. Vignes, “Discrete Stochastic Arithmetic for Validating Results of Numerical Software,” Numer. Algorithms, Vol. 37, No. 1-4, pp. 377-390, 2004.[22] I. Laguna, P. C. Wood, R. Singh, S. Bagchi, “GPUMixer: Performance-Driven Floating-Point Tuning for GPU Scientific Applications,” Proc. ISC2019, pp. 227-246, 2019.[23] S. Graillat, et al., ”Stochastic Arithmetic in Multiprecision,” Mathematics in Computer Science, Vol. 5, No. 4, pp. 359-375, 2011.[24] A. Zeller and R. Hildebrandt, “Simplifying and isolating failure-inducing input,” IEEE Trans. Softw. Eng., Vol. 28, No. 2, pp. 183-200, 2002. [25] Y. Hirota et al., “Performance of quadruple precision eigenvalue solver libraries QPEigenK and QPEigenG on the K computer”, HPC in Asia Poster, International Supercomputing Conference (ISC’16), 2016.

Minimal-precision compu0ng is both reliable (aka robust) and sustainable as it ensures the requested accuracy of the result as well as is energy-efficient

Our Proposal

Available Components Red: Components developed by us

(2) Arbitrary-precision libraries and fast accurate numerical libraries

• Reduced-/mixed-precision with FP16/FP32/FP64 enables us to improve performance & energy-efficiency

• High-precision libraries and fast accurate computa.on methods have been developed for reliable & reproducible computa,on

(3) Field-Programmable Gate Array (FPGA) with High-Level Synthesis (HLS)

• FPGA enables us to implement any operations on hardware, including arbitrary-precision operations

• HLS enables us to use FPGAs through existing programming languages such as C/C++ and OpenCL

• FPGA can be used to perform arbitrary-precision computations on hardware efficiently (high-performance and energy-efficient)

(1) Precision-tuning with numerical validation based on stochastic arithmetic

• Rounding-errors can be es.mated stochas.cally with a reasonable cost (for details, see “(A) Stochas.c Arithme.c Tools” at the booom leI)

• General scheme applicable for any floa,ng-point computa,ons Tools:

• High-precision arithmetic: binary128 (intel, gcc), QD [1], MPFR [2], ARPREC [3], CAMPARY [4], etc.

• Accurate sum/dot: AccSum/Dot [5], Ozaki-scheme [6], etc. • Numerical libraries: MPLAPACK [7], QPEigen [8],

QPBLAS [9], XBLAS [10], ReproBLAS [11], ExBLAS [12], OzBLAS [13], etc.

Tools:• Compilers: SPGen [14], Nymble [15], etc.• Custom floating-point operation generator:

FloPoCo [16], etc.

Tools:• PROMISE [17] (based on a stochastic arithmetic

library, CADNA [18]), Verrou [19], etc.

Related works (not validation-based):• Precimonious [20], GPUMixer [22] etc.

HPC Asia 2020 – Interna.onal Conference on High-Performance Compu.ng in Asia-Pacific Region, Fukuoka, Japan, January 15-17, 2020

Introduction

A joint France-Japan research project

acceleration

Optimizing Precision for High-Performance, A joint France...

Documents

Transcript of Optimizing Precision for High-Performance, A joint France...

An Overview of Post-K Development - 情報処理学会sighpc.ipsj.or.jp/HPCAsia2018/images/YutakaIshikawa_slides.pdf · Oakforest‐PACS supercomputer, 25 PF in peak, at JCAHPC organized

apps.fortworthtexas.govapps.fortworthtexas.gov/platdirectory/Plats/FP16-019.pdf · wastewater distribution systems and treatment ... or plant of any type may ... Suite 1400, Dallas,

Mixed-Precision Iterative Reﬁnement using Tensor Cores on ...when the (working, residual) precision is (FP64, FP128). Analysis covering this GMRES-based approach when two precisions

Investigating the Benefit of FP16-Enabled Mixed …...Investigating the Beneﬁt of FP16-Enabled Mixed-Precision Solvers for Symmetric Positive Deﬁnite Matrices Using GPUs Ahmad

ChOWDER: A VDA-Based Scalable Display System for ...sighpc.ipsj.or.jp/HPCAsia2020/hpcasia2020_poster_abstracts/poster... · the HIVE visualization system [3] via ChOWDER’s Web-API.

Laserspektroskopie - mathphys.fsk.uni-heidelberg.depseyfert/fp16.pdf · Abstract: According to Lectures introducing to Physics, a measurement is always a comparison to a once set

CUDA 11: NEW FEATURES AND BEYOND...13 ANNOUNCING THE NVIDIA AMPERE GPU ARCHITECTURE V100 A100 SMs 80 108 Tensor Core Precision FP16 FP64, TF32, BF16, FP16, I8, I4, B1 Shared Memory

the relationship with our clients · 2018. 5. 4. · FP128 application example linear – line amplifiers – pre amplifiers 21 . diagram of central installation using linear FP 128

The Birth of the Administrative State: Where It Came From ...s3.amazonaws.com/thf_media/2007/pdf/fp16.pdf · The Birth of the Administrative State: Where It Came From ... state can

A Case Study on Optimizing Accurate Half Precision Average · - k-means, meanshift, average pooling FP16 hardware support incoming - Pascal GPU, ARM SVE FP16 precision imposes serious

NVIDIA™ GPU Performance Testing and PowerAI on IBM Cloud...Mixed precision (FP16/32) leverages Tensor Cores in V100 GPUs. SoftLayer bare-metal server has 48 logical CPU cores while

Accelerating Fast Fourier Transform with half-precision ... · Single to Half Precision To keep the accuracy, we split a FP32 number to the scaled sum of two FP16 number, and make

Unity Technologies · •Unofﬁcial numbers, some based on our measurements. Numbers might be wrong! Numbers are peak values. • TEX - bilinear texture fetch • FP32-FP16 - supports

JOHN RITCHIE DIANE LANGMORE - ADB Homeadb.anu.edu.au/frontpages/fp16.pdf · JOHN RITCHIE Deputy General ... the Rosicrucian Order, Royal Prince Alfred Hospital, ... registrars-general

NVIDIA TO ACCELERATE THE HPC-AI CONVERGENCE · 2020-03-23 · BIG DATA IN SCIENCE Big Science ingests/outputs Big Data Large Hadron Collider Square Kilometer Array Johns ... (FP32/FP16)

Mixed Precision Training - NVIDIA · Guideline #1 for mixed precision: weight update •FP16 mantissa is sufficient for some networks, some require FP32 •Sum of FP16 values whose

Accurate BLAS Implementations: OzBLASand BLAS-DOT2 · 2020. 2. 26. · (FP16, FP128) Fast Accurate Methods / Libraries others… LAPACK BLAS MPFR Stochastic Arithmetic CADNA SAM PROMISE

Installation Manual INMARSAT-1 MES Model FELCOM18 · iii EQUIPMENT LISTS Standard supply Optional supply Name Type Code No. Qty Remarks Antenna Unit IC-118 - 1 w/FP16-02501 Terminal

SEE PAGE THREE FOR ADDRESSES - Fort Worth, Texasapps.fortworthtexas.gov/platdirectory/Plats/FP16-013.pdf · skysail road bollard drive boat wind road topsail drive sloop street lake

apps.fortworthtexas.govapps.fortworthtexas.gov/platdirectory/Plats/FP16-102.pdffc34 fc35 fc36 f23 fc37 24' emergency access easement (by this plat) 30' gas pipeline easement cc no.