Accurate BLAS Implementations: OzBLAS and BLAS-DOT2 (2020. 2. 26.)



Page 1:

Accurate BLAS Implementations: OzBLAS and BLAS-DOT2

LSPANC 2020 January, January 30, 2020, R-CCS
Daichi Mukunoki (R-CCS)

Research Collaborators:
Takeshi Ogita (Tokyo Woman's Christian University)
Katsuhisa Ozaki (Shibaura Institute of Technology)

Acknowledgement: This research was partially supported by MEXT as "Exploratory Issue on Post-K computer" (Development of verified numerical computations and super high performance computing environment for extreme researches) and the Japan Society for the Promotion of Science (JSPS) KAKENHI Grant Number 19K20286.

Page 2:

Fast Accurate Multi-Precision Numerical Library

■ Needs in minimal-precision computing:
1. Fast computation of arbitrary-precision (incl. accurate) results on CPU/GPU
2. Accurate numerical validation (or verification)


[Diagram: a minimal-precision computing stack. Hardware (heterogeneous systems: CPU, GPU, FPGA) at the bottom; arithmetic libraries providing arbitrary precision alongside FP32/FP64 (and FP16/FP128); numerical libraries (BLAS, LAPACK, and fast accurate implementations such as OzBLAS, ExBLAS, QPEigen, MPLAPACK (MPBLAS), DD/QD, MPFR, and others); numerical validation via stochastic arithmetic (CADNA, SAM); precision tuning (PROMISE, PROMISE with SAM); CPU/GPU acceleration and FPGA acceleration with compilers and tools for FPGA (Nymble, FloPoCo, SPGen for arbitrary precision). Components are marked as either available or in development.]

Page 3:

Accurate BLAS Implementations


Examples of accurate BLAS implementations (our work marked with *):

| Library                      | Platform | Data Precision  | Computation Precision      |
|------------------------------|----------|-----------------|----------------------------|
| BLAS (netlib)                | CPU, GPU | FP32/64         | FP32/64                    |
| XBLAS (netlib)               | CPU      | FP32/64         | FP32/64/2xFP64             |
| BLAS-DOT2 (TWCU) *           | GPU      | FP64            | 2xFP64                     |
| ReproBLAS (Berkeley)         | CPU      | FP64            | Tunable                    |
| OzBLAS (TWCU/RIKEN) *        | CPU, GPU | FP64            | Tunable (correct rounding) |
| ExBLAS (Sorbonne U.)         | CPU, GPU | FP64            | Correct rounding           |
| QPBLAS (JAEA/RIKEN)          | CPU, GPU | DD              | DD                         |
| MBLAS / MPLAPACK (M. Nakata) | CPU      | FP64/DD/QD/MPFR | FP64/DD/QD/MPFR            |

Page 4:

Accurate BLAS Implementations

(This slide repeats the table of accurate BLAS implementations from Page 3.)

Page 5:

BLAS-DOT2

■ BLAS-DOT2 [1]
  • Input/output: FP64; computation: 2x FP64 (same as XBLAS)
  • Based on a 2-fold precision inner-product algorithm, "Dot2" [2]
    - Based on error-free addition & multiplication, similar to double-double
    - Accuracy depends on the condition number
  • Theoretical performance overhead compared to cuBLAS FP64 routines:
    - DOT & GEMV: no overhead; GEMM: 11x slower


[1] Daichi Mukunoki, Takeshi Ogita: Performance and Energy Consumption of Accurate and Mixed-precision Linear Algebra Kernels on GPUs, Journal of Computational and Applied Mathematics, Vol. 372, p. 112701, 2020.
[2] T. Ogita, S. M. Rump, S. Oishi: Accurate Sum and Dot Product, SIAM J. Scientific Computing 26 (2005) 1955–1988.
Code: http://www.math.twcu.ac.jp/ogita/post-k/results.html

Page 6:

Accurate Inner-Product: Dot2 [Ogita et al. 2005]


Notation:
• fl(…): computation performed with floating-point arithmetic (in our case FP64)
• fma(…): computation performed with fused multiply-add
• ulp(a): unit in the last place

Accurate inner product "Dot2" [Ogita et al. 2005]: r = x^T y, x, y ∈ F^n

function r = Dot2(x, y)
    [p, s] = TwoProd(x_0, y_0)
    for i = 1 : n-1
        [h, r] = TwoProd(x_i, y_i)
        [p, q] = TwoSum(p, h)
        s = fl(s + (q + r))
    end for
    r = fl(p + s)
end function

Page 7:

Accurate Inner-Product: Dot2 [Ogita et al. 2005]


function [s, e] = TwoSum(a, b)
    s = fl(a + b)
    v = fl(s - a)
    e = fl((a - (s - v)) + (b - v))
end function

function [p, e] = TwoProd(a, b)
    p = fl(a × b)
    e = fma(a × b - p)    (the residual a × b - p, evaluated with a single FMA)
end function

Error-free transformation for addition [Knuth 1969]: s + e = a + b, s = fl (a + b)

Error-free transformation for multiplication [Karp et al. 1997]: p + e = a × b, p = fl (a × b)

(The notation and the Dot2 listing repeat from Page 6.)

Page 8:

Accurate Inner-Product: Dot2 [Ogita et al. 2005]

(This slide repeats Page 7, adding an annotation on the Dot2 loop body:)

11 instructions (11x overhead compared with an FP64 DOT using FMA)

Page 9:

Implementation

■ Matrix multiplication (GEMM) on CUDA
  • Each thread computes one element of matrix C using Dot2
  • 2D thread blocks and a 2D grid exploit the 2D data locality of the matrices
  • Shared-memory blocking
  • Reaching nearly the theoretical peak performance is not difficult, thanks to the high compute intensity (11x higher than FP64)


[Diagram: blocked GEMM. Matrix A (m × k) times Matrix B (k × n); each thread block of 16 × 16 threads computes a 16 × 16 block of C through shared memory, and each thread computes one element of that block.]

Page 10:

Performance on Titan V


Throughput (Ops/s) on Titan V (Volta), CUDA 10.0

[Figure: three panels comparing cuBLAS with Dot2 — DDOT (GOps/s vs. n = 2^14…2^24), DGEMV (GOps/s vs. m = n up to 10240), and DGEMM (TOps/s vs. m = n = k up to 8192); the DGEMM panel shows the 11x gap. Higher is better.]

Execution-time overhead of Dot2 vs. the cuBLAS FP64 routine:
• DOT & GEMV: almost no overhead, as they are memory-bound
• GEMM: 11x slower, as it is compute-bound

Ops: number of mathematical computations of the BLAS operation.

Page 11:

Accurate BLAS Implementations

(This slide repeats the table of accurate BLAS implementations from Page 3.)

Page 12:

OzBLAS

■ OzBLAS [3]
  • Input/output: FP64; computation: tunable accuracy (including correct rounding)
  • Based on the error-free transformation of dot products / matrix multiplication, the "Ozaki scheme" [4]
    - Accuracy depends on the range of the absolute values and on the dimension (in the inner-product direction) of the input
  • No full-scratch implementation is needed: the kernel computation (the most time-consuming part) can be performed with standard FP64 BLAS
    - High performance and low development cost
  • Speed relative to cuBLAS depends on the input
    - At least 4x slower (on GEMM) and 8x slower (on DOT & GEMV)


[3] D. Mukunoki, T. Ogita, K. Ozaki: Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-core Architectures, Proc. PPAM 2019, 2019 (accepted).
[4] K. Ozaki, T. Ogita, S. Oishi, S. M. Rump: Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications, Numer. Algorithms 59(1), 95–118 (2012).
Code: http://www.math.twcu.ac.jp/ogita/post-k/results.html

Page 13:

Error-Free Transformation for Dot/Mat-mul: Ozaki Scheme [Ozaki et al. 2012]


DOT (x^T y, x, y ∈ F^n) with the Ozaki scheme:

(1) The input vectors x, y are split into several vectors:
    x = x^(1) + x^(2) + ··· + x^(p) + ··· + x^(sx)
    y = y^(1) + y^(2) + ··· + y^(q) + ··· + y^(sy),   x^(p), y^(q) ∈ F^n

The splitting is performed to meet the following properties:
1. (x^(p))^T y^(q) = fl((x^(p))^T y^(q)) for any p and q
2. where x^(p)_i and y^(q)_j are non-zero, |x^(p)_i| ≥ |x^(p+1)_i| and |y^(q)_j| ≥ |y^(q+1)_j|

→ The number of split vectors required to achieve correct rounding depends on the range of the absolute values and on the length of the input vectors.

Page 14:

Error-Free Transformation for Dot/Mat-mul: Ozaki Scheme [Ozaki et al. 2012]


(Step (1), the splitting, repeats from Page 13.)

(2) x^T y is transformed into a summation of several dot products:
    x^T y = (x^(1))^T y^(1)  + (x^(1))^T y^(2)  + (x^(1))^T y^(3)  + ··· + (x^(1))^T y^(sy)
          + (x^(2))^T y^(1)  + (x^(2))^T y^(2)  + (x^(2))^T y^(3)  + ··· + (x^(2))^T y^(sy)
          + (x^(3))^T y^(1)  + (x^(3))^T y^(2)  + (x^(3))^T y^(3)  + ··· + (x^(3))^T y^(sy)
          + ···
          + (x^(sx))^T y^(1) + (x^(sx))^T y^(2) + (x^(sx))^T y^(3) + ··· + (x^(sx))^T y^(sy)

Each of these parts can be computed with an FP64 operation (DDOT). With a correctly-rounded summation of the parts (e.g., NearSum [Rump et al. 2005]), the final result achieves correct rounding.

Page 15:

GEMM with Ozaki scheme

[Figure: GEMM with the Ozaki scheme, in three steps —
(1) Split: the input matrices A and B are split into A^(1), …, A^(sx) and B^(1), …, B^(sy).
(2) Matrix multiplications: the sx × sy products C^(p,q) = A^(p) B^(q) can all be computed with standard DGEMM.
(3) Summation: the computed split matrices are summed with NearSum, C = Sum of the C^(p,q).]

Page 16:

Performance on Titan V


Throughput (Ops/s) of the correctly-rounded operation on Titan V (Volta), CUDA 10.1

[Figure: three panels comparing cuBLAS, Ozaki (p=0), and Ozaki (p=1) — DDOT (GOps/s vs. n = 2^14…2^24), DGEMV (GOps/s vs. m = n up to 10240), and DGEMM (TOps/s vs. m = n = k up to 8192). Higher is better.]

Ops: number of mathematical computations of the BLAS operation.
Input exponent range (min/max): -10/-1 (p=0), -9/+2 (p=1).

The execution-time overhead vs. the cuBLAS FP64 routine depends on the input (the range of the absolute values of the input vectors/matrices): e.g., more than 10x on DOT & GEMV and 4x on GEMM in the above cases, at minimum.

Page 17:

Accuracy (tunable-accuracy version)


Maximum relative error vs. MPFR (2048-bit) (GEMM, m=n=k=1000, input: (rand-0.5)×exp(φ×randn)):

| Input (min/max exp) | φ=1 (-07/+01) | φ=4 (-10/+07) | φ=7 (-17/+13) | φ=10 (-24/+19) |
|---------------------|---------------|---------------|---------------|----------------|
| cuBLAS (double)     | 6.14E-10      | 2.95E-11      | 2.22E-10      | 2.27E-10       |
| OzBLAS (d=2, fast)  | 4.75E-05      | 3.32E-02      | 3.92E+01      | 5.55E+06       |
| OzBLAS (d=2)        | 1.98E-06      | 4.61E-03      | 4.11E+01      | 5.52E+06       |
| OzBLAS (d=3, fast)  | 6.28E-12      | 1.80E-08      | 2.08E-04      | 8.99E+00       |
| OzBLAS (d=3)        | 5.07E-13      | 1.17E-09      | 4.42E-05      | 6.82E+00       |
| OzBLAS (d=4, fast)  | 0             | 7.42E-15      | 2.87E-10      | 3.65E-04       |
| OzBLAS (d=4)        | 0             | 8.25E-16      | 7.79E-11      | 2.78E-04       |
| OzBLAS (d=6, fast)  | 0             | 0             | 0             | 4.37E-16       |
| OzBLAS (d=6)        | 0             | 0             | 0             | 0              |

• Accuracy depends on (1) the range of the absolute values in the inputs, (2) the dimension (in the dot-product direction), and (3) the number of split matrices (d).
• The Ozaki scheme can be less accurate than cublasDgemm (entries shown in red on the original slide).
† The results were compared after rounding to double precision.

Page 18:

Reproducibility


Results with various implementations (DOT, n=10000):

|                 | CPU (Xeon E5-2623 v4) | GPU (Titan V)        |
|-----------------|-----------------------|----------------------|
| MPFR (2048-bit) | 0x1.33b6d7b84f15cp+7  | N/A                  |
| cuBLAS (double) | N/A                   | 0x1.33b6d7b84f15bp+7 |
| MKL (double)    | 0x1.33b6d7b84f15ep+7  | N/A                  |
| OzBLAS (d=1)    | 0x1.33b4bbaep+7       | 0x1.33b4bbaep+7      |
| OzBLAS (d=2)    | 0x1.33b6d7b83efffp+7  | 0x1.33b6d7b83efffp+7 |
| OzBLAS (d=3)    | 0x1.33b6d7b84f15cp+7  | 0x1.33b6d7b84f15cp+7 |

OzBLAS:
• The CPU version relies on MKL; the GPU version relies on cuBLAS.
• For d < 3, the results are not correct, but they are reproducible between CPU and GPU.
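The entries above are C99 hexadecimal floating-point literals (the printf "%a" format), which pin down every bit of a double, so reproducibility can be checked with plain equality; a minimal sketch (function name mine):

```c
/* Bit-for-bit reproducibility check for finite FP64 results, as
   compared in the table above. Hexadecimal floating-point literals
   (C99, printf "%a") specify every bit of the double exactly. */
int reproducible(double result_a, double result_b) {
    return result_a == result_b;  /* exact comparison is intentional */
}
```

For example, the MPFR and cuBLAS entries in the table differ only in the last bit of the significand, while the OzBLAS (d=3) results agree exactly between CPU and GPU.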

Page 19:

cuOzBLAS (NVIDIA Titan V)


[Figure: nine panels for cuOzBLAS on Titan V —
• Execution-time breakdown (%) of DDOT, DGEMV, and DGEMM at d=6, split into SplitA, SplitB, Comp, Sum, and Other.
• Overhead relative to cuBLAS for DDOT (n = 2^18…2^25), DGEMV (m = n = 2048…10240), and DGEMM (m = n = k = 1024…5120), for d = 2, 3, 4, 6 and their "fast" variants.
• Throughput: GB/s for DDOT and DGEMV, GFlops for DGEMM, with cuBLAS shown for comparison.]

Page 20:

OzBLAS (Intel Xeon W-2123)


[Figure: nine panels for OzBLAS on Xeon W-2123, analogous to the Titan V page —
• Execution-time breakdown (%) of DDOT, DGEMV, and DGEMM at d=6 (SplitA, SplitB, Comp, Sum, Other).
• Overhead relative to MKL for DDOT, DGEMV, and DGEMM, for d = 2, 3, 4, 6 and their "fast" variants.
• Throughput: GB/s for DDOT and DGEMV, GFlops for DGEMM, with MKL shown for comparison.]

Page 21:

Accurate DGEMM using Tensor Cores

[Figure: throughput (TFlops, counted on FP64) of correctly-rounded DGEMM-TC on Tesla V100 vs. problem size (m = n = k up to 8192), for phi = 0, 1, 2.]

Tesla V100: 7 TFlops on FP64; 113 TFlops on Tensor Cores.

Throughput (TFlops on FP64) on Tesla V100 (Volta), CUDA 10.1.
Input exponent range (min/max): -10/-1 (p=0), -9/+2 (p=1), -9/+4 (p=2).

With the Ozaki scheme, an accurate DGEMM can be built on cublasGemmEx using Tensor Cores [5] (a paper including some extensions will be published soon).

[5] D. Mukunoki, K. Ozaki, T. Ogita: Accurate DGEMM using Tensor Cores, HPC Asia 2020 Poster Session, Jan. 2020.
http://sighpc.ipsj.or.jp/HPCAsia2020/hpcasia2020_poster_abstracts/poster_abstract_09.pdf

Page 22:

Conclusion

■ Summary
  • Two accurate BLAS implementations: OzBLAS and BLAS-DOT2
    1. BLAS-DOT2 [1]: 2-fold precision for inner products with the Dot2 algorithm
    2. OzBLAS [3]: tunable accuracy (including correct rounding) with the Ozaki scheme
  • Both are interface-compatible with standard BLAS for FP64 data, but the computation is performed with an accurate method
  • Code: http://www.math.twcu.ac.jp/ogita/post-k/results.html

■ Future work
  • Full-set implementation
  • For high-precision inputs:
    - BLAS-DOT2 → BLAS-DD (already available as MBLAS and QPBLAS)
    - Ozaki scheme for binary128 and beyond
  • Demo of the use in minimal-precision computing
