USING VOLTA TENSOR CORES
Jeff Larkin, December 04, 2018
VOLTA TENSOR CORE
TENSOR CORE
Mixed Precision Matrix Math on 4x4 matrices

D = A × B + C

[Figure: 4x4 matrices A (FP16) and B (FP16) are multiplied, and the product is added to C (FP16 or FP32) to produce D (FP16 or FP32)]
TENSOR CORE COORDINATION
Warp-synchronizing operation for cooperative matrix math
Full warp performs 16x16 matrix math
Aggregate matrix multiply and accumulate for 16x16 matrices
Result distributed across the warp
VOLTA TENSOR OPERATION
[Diagram: FP16 storage/input values (F16 × F16) feed a full-precision product; products are summed with an FP32 accumulator as more products arrive, then converted to an FP32 result]
Also supports FP16 accumulator mode for inferencing
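Written per element, the operation in the diagram amounts to the following (a restatement in math form, for the 4x4 case):

```latex
D_{ij} = \operatorname{fp32}\!\Big(\sum_{k=0}^{3} a_{ik}\, b_{kj}\Big) + C_{ij}
```

where $a$ and $b$ are FP16 inputs, each product is formed at full precision, and the running sum is kept in an FP32 accumulator.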
A GIANT LEAP FOR DEEP LEARNING
Rela
tive P
erf
orm
ance
P100 V100 – Tensor Cores
9.3xfaster
cuBLAS Mixed Precision (FP16 input, FP32 compute)
Matrix Multiply (M=N=K=2048)
(CUDA 8) (CUDA 9)
V100 measured on pre-production hardware.
USING TENSOR CORES
Volta Optimized Frameworks and Libraries: NVIDIA cuDNN, cuBLAS, TensorRT
CUDA C++ Warp-Level Matrix Operations:

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
TENSOR CORES FROM CUBLAS
cuBLAS provides an extended BLAS interface for mixed precision.
cublasGemmEx(), cublasGemmBatchedEx(), and cublasGemmStridedBatchedEx() all accept
FP16 data types for Tensor Core use.
Some solvers are also available in an Ex form.
See https://docs.nvidia.com/cuda/cublas/index.html#cublas-GemmEx
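A minimal host-side sketch of calling cublasGemmEx() with FP16 inputs and FP32 compute, assuming a CUDA 9/10-era toolkit (error checking omitted; the function name and argument layout below follow the documented cuBLAS API, but leading dimensions and transpose modes are illustrative choices):

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: C = alpha*A*B + beta*C with FP16 inputs and FP32 accumulation.
// d_A and d_B are device pointers to half data; d_C is a device pointer to float.
void gemm_fp16_tensorop(cublasHandle_t handle, int m, int n, int k,
                        const half *d_A, const half *d_B, float *d_C)
{
    const float alpha = 1.0f, beta = 0.0f;

    // Allow cuBLAS to select Tensor Core kernels.
    cublasSetMathMode(handle, CUBLAS_TENSOR_OP_MATH);

    cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                 m, n, k,
                 &alpha,
                 d_A, CUDA_R_16F, m,   // A: FP16, leading dimension m
                 d_B, CUDA_R_16F, k,   // B: FP16, leading dimension k
                 &beta,
                 d_C, CUDA_R_32F, m,   // C: FP32, leading dimension m
                 CUDA_R_32F,           // accumulate in FP32
                 CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```

Tensor Cores are only used when the problem sizes and data layouts meet cuBLAS's alignment requirements; otherwise the call falls back to a non-Tensor-Core kernel.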
9
CUDA TENSOR CORE PROGRAMMING
New WMMA datatypes
wmma::fragment<matrix_a, …> Amat;
Per-Thread fragments to hold components of matrices for use with Tensor Cores
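Spelled out, the template parameters elided above look like the following. This is a sketch for the 16x16x16 shape; note that in the shipping CUDA API the C/D fragment kind is wmma::accumulator (not matrix_c), and the input layouts shown are one valid choice, not a requirement:

```cpp
#include <mma.h>
using namespace nvcuda;

// Fragments for a 16x16x16 multiply: A and B carry FP16 inputs,
// the accumulator fragment holds FP32 values.
wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> Amat;
wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::col_major> Bmat;
wmma::fragment<wmma::accumulator, 16, 16, 16, float> Cmat;
```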
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations
wmma::load_matrix_sync(Amat, a, stride);
Warp-level operation to fetch components of matrices into fragments
CUDA TENSOR CORE PROGRAMMING
New WMMA Matrix Multiply and Accumulate Operation
wmma::mma_sync(Dmat, Amat, Bmat, Cmat);
Warp-level operation to perform matrix multiply and accumulate
D = A × B + C
CUDA TENSOR CORE PROGRAMMING
New WMMA load and store operations
wmma::store_matrix_sync(d, Dmat, stride);
Warp-level operation to store the result matrix from fragments back to memory
TENSOR CORE EXAMPLE
CUDA C++ Warp-Level Matrix Operations

__device__ void tensor_op_16_16_16(float *d, half *a, half *b, float *c)
{
    // Create fragments
    wmma::fragment<matrix_a, …> Amat;
    wmma::fragment<matrix_b, …> Bmat;
    wmma::fragment<matrix_c, …> Cmat;

    // Initialize fragments
    wmma::load_matrix_sync(Amat, a, 16);
    wmma::load_matrix_sync(Bmat, b, 16);
    wmma::fill_fragment(Cmat, 0.0f);

    // Perform matrix multiply
    wmma::mma_sync(Cmat, Amat, Bmat, Cmat);

    // Store results
    wmma::store_matrix_sync(d, Cmat, 16, wmma::row_major);
}
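A hypothetical way to invoke the device function above: WMMA operations are warp-wide, so every thread of a full 32-thread warp must reach each _sync call together. The kernel name and launch configuration here are illustrative assumptions, not part of the original example:

```cpp
// Hypothetical wrapper kernel: one warp cooperatively computes the
// 16x16 product. All 32 threads execute each _sync call together.
__global__ void wmma_kernel(float *d, half *a, half *b, half *c)
{
    tensor_op_16_16_16(d, a, b, c);
}

// Host side: launch a single 32-thread block (one full warp).
// d_d, d_a, d_b, d_c are device allocations holding 16x16 matrices.
// wmma_kernel<<<1, 32>>>(d_d, d_a, d_b, d_c);
```

Larger problems tile this pattern: each warp in a block owns one or more 16x16 output tiles and loops over the K dimension, accumulating into its fragment.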