
BLAS AT NVIDIA

Cris Cecka


AGENDA

CUTLASS 1.2

cuBLASLt and Matmul

Reduced Precision and Reproducibility


AGENDA CUTLASS 1.2

CUTLASS MOTIVATION

Problem:

Multiplicity of algorithms and data types
• GEMM, convolution, back propagation
• Mixed-precision arithmetic

Kernels specialized for layout and problem size
• NT, TN, NCHW, NHWC

Kernel fusion
• Custom operations composed with GEMM and convolution

Solution:

Template Library for Linear Algebra Computations in CUDA C++

• https://github.com/NVIDIA/cutlass

• CUTLASS Parallel Forall blog post

• GTC 2018 CUTLASS talk [video recording]

Data movement and computation primitives

• Iterators, matrix fragments, matrix computations

Inspired by CUB

Productivity Challenges in DLA and Deep Learning

Merrill, Duane. NVIDIA CUB. http://nvlabs.github.io/cub/

GENERIC PROGRAMMING FOR DLA AND DEEP LEARNING

Write code once to exploit common design patterns:
• Matrix multiply: grid-, CTA-, warp-, and thread-level scope
• Tile: bounded region of pitch-linear memory
• Tile iterator: efficient load, store, and traversal over a sequence of tiles
• Fragment: partition of tile elements among threads

CUDA C++ as a declarative programming language
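To make the template structure concrete, here is a minimal sketch in the style of the CUTLASS 1.x examples: traits select the layouts and CTA tile shape, and a Params/launch pair runs the kernel. The names (SgemmTraits, Shape, Gemm::launch) follow the public CUTLASS 1.x headers and blog post; later releases changed this API.

#include <cutlass/gemm/gemm.h>
#include <cutlass/gemm/sgemm_traits.h>

// Minimal sketch (CUTLASS 1.x API): column-major SGEMM built from the
// grid/CTA/warp/thread-level templates described above.
cudaError_t CutlassSgemmNN(int M, int N, int K,
                           float alpha, float const *A, int lda,
                           float const *B, int ldb,
                           float beta, float *C, int ldc) {
  // Traits bind layouts and the CTA tile shape (here 128x128, k-slice 8).
  typedef cutlass::gemm::SgemmTraits<
      cutlass::MatrixLayout::kColumnMajor,
      cutlass::MatrixLayout::kColumnMajor,
      cutlass::Shape<8, 128, 128> > GemmTraits;

  typedef cutlass::gemm::Gemm<GemmTraits> Gemm;

  typename Gemm::Params params;
  int result = params.initialize(M, N, K, alpha, A, lda, B, ldb,
                                 beta, C, ldc, C, ldc);
  if (result) return cudaErrorInvalidValue;

  Gemm::launch(params);   // one kernel launch on the default stream
  return cudaGetLastError();
}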

IMPLEMENTED COMPUTATIONS

Kernel      A        B        C             Accumulator
SGEMM       float    float    float         float
            half     half     half, float   float
DGEMM       double   double   double        double
HGEMM       half     half     half          half
IGEMM       int8_t   int8_t   int8_t        int32_t
            int8_t   int8_t   float         int32_t
WMMA GEMM   half     half     half          half
            half     half     half, float   float
            int8_t   int8_t   int32_t       int32_t
            int4_t   int4_t   int32_t       int32_t
            bin1_t   bin1_t   int32_t       int32_t

CUTLASS v1.2
+ Batched strided GEMM; optimizations for small GEMMs

CUTLASS 1.2 PROGRESS

COMPLETE GEMM STRUCTURAL MODEL

Embodied by CUTLASS CUDA templates, organized as a four-stage pipeline:

Global Load Stream
• GemmGlobalIteratorAb → Transformer → GemmSharedStoreTileAb

Shared Load Stream
• GemmSharedLoadTile{A,B}

Matrix Multiply
• fma, dp4a, hfma2, WMMA

Epilogue
• Transformer → GemmSharedStoreTileD → GemmSharedLoadTileD
• GemmGlobalIteratorC → Functor → GemmGlobalIteratorD
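The sketch below is not CUTLASS code; it is a deliberately simplified CUDA kernel showing the same pipeline stages the model names: a global load stream into shared-memory tiles, a shared load stream into per-thread accumulators, an fma-based multiply, and an epilogue that applies alpha/beta and stores the result.

// Simplified sketch of the pipeline above (not CUTLASS itself): a CTA
// stages tiles of A and B through shared memory, each thread accumulates
// one output element, and an epilogue writes alpha*AB + beta*C.
#define TILE 16

__global__ void tiled_sgemm(int M, int N, int K,
                            float alpha, const float *A,   // M x K, row-major
                            const float *B,                // K x N, row-major
                            float beta, float *C) {        // M x N, row-major
  __shared__ float As[TILE][TILE];   // shared-store stage for A
  __shared__ float Bs[TILE][TILE];   // shared-store stage for B

  int row = blockIdx.y * TILE + threadIdx.y;
  int col = blockIdx.x * TILE + threadIdx.x;
  float acc = 0.f;                   // thread-level fragment/accumulator

  for (int k0 = 0; k0 < K; k0 += TILE) {
    // Global load stream: each thread fetches one element of each tile.
    As[threadIdx.y][threadIdx.x] =
        (row < M && k0 + threadIdx.x < K) ? A[row * K + k0 + threadIdx.x] : 0.f;
    Bs[threadIdx.y][threadIdx.x] =
        (col < N && k0 + threadIdx.y < K) ? B[(k0 + threadIdx.y) * N + col] : 0.f;
    __syncthreads();

    // Shared load stream + matrix multiply (fma inner loop).
    for (int k = 0; k < TILE; ++k)
      acc = fmaf(As[threadIdx.y][k], Bs[k][threadIdx.x], acc);
    __syncthreads();
  }

  // Epilogue: functor applies alpha/beta scaling, then store to global.
  if (row < M && col < N)
    C[row * N + col] = alpha * acc + beta * C[row * N + col];
}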


AGENDA cuBLASLt and Matmul


CUBLASLT

Today
• cuBLAS is the GEMM library, backed by SASS kernels, CUTLASS (IMMA, FP16), and CUDA kernels.

Future
• The library splits into cuBLAS and cuBLASLt.
• cuBLASLt context: Matmul, backed by SASS kernels and CUTLASS.
• cuBLAS context (cuBLAS_Legacy): BLAS 1, 2, 3 (subset), backed by CUDA kernels.


MATMUL

cublasStatus_t cublasLtMatmul(cublasLtHandle_t handle,
                              cublasLtMatmulDesc_t computeDesc,
                              const void *alpha,  /* host or device pointer */
                              const void *A, cublasLtMatrixLayout_t Adesc,
                              const void *B, cublasLtMatrixLayout_t Bdesc,
                              const void *beta,   /* host or device pointer */
                              const void *C, cublasLtMatrixLayout_t Cdesc,
                              void *D, cublasLtMatrixLayout_t Ddesc,
                              const cublasLtMatmulAlgo_t *algo,
                              void *workSpace, size_t workSpaceSizeInBytes,
                              cudaStream_t stream);

Omnibus GEMM

cublasLtMatrixLayout_t
• cublasLtMatrixLayoutCreate(…)
• cublasLtMatrixLayoutDestroy(…)
• cublasLtMatrixLayoutSetAttribute(…)
• cublasLtMatrixLayoutGetAttribute(…)

cublasLtMatmulDesc_t
• cublasLtMatmulDescCreate(…)
• cublasLtMatmulDescDestroy(…)
• cublasLtMatmulDescSetAttribute(…)
• cublasLtMatmulDescGetAttribute(…)

cublasLtMatmulAlgo_t
• cublasLtMatmulAlgoInit(…)
• cublasLtMatmulAlgoFind(…)
• cublasLtMatmulAlgoGetHeuristic(…)
• cublasLtMatmulAlgoConfigSetAttribute(…)
• cublasLtMatmulAlgoConfigGetAttribute(…)
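Putting the handle, descriptors, and Matmul together, a minimal single-precision call sequence looks roughly like this. Descriptor-creation signatures have shifted across CUDA releases (early cuBLASLt took only a cudaDataType compute type), so treat it as a sketch rather than exact code; error checking is omitted.

#include <cublasLt.h>

// Sketch: single-precision D = alpha*A*B + beta*C via cublasLtMatmul,
// using the later signature where the compute type is a cublasComputeType_t.
void lt_sgemm(cublasLtHandle_t handle, int m, int n, int k,
              const float *A, const float *B, float *C,
              void *workspace, size_t workspaceSize, cudaStream_t stream) {
  float alpha = 1.f, beta = 0.f;

  cublasLtMatmulDesc_t opDesc;
  cublasLtMatmulDescCreate(&opDesc, CUBLAS_COMPUTE_32F, CUDA_R_32F);

  // Column-major layouts; leading dimensions equal the row counts here.
  cublasLtMatrixLayout_t Adesc, Bdesc, Cdesc;
  cublasLtMatrixLayoutCreate(&Adesc, CUDA_R_32F, m, k, m);
  cublasLtMatrixLayoutCreate(&Bdesc, CUDA_R_32F, k, n, k);
  cublasLtMatrixLayoutCreate(&Cdesc, CUDA_R_32F, m, n, m);

  // algo = NULL lets the library pick via its internal heuristics.
  cublasLtMatmul(handle, opDesc, &alpha, A, Adesc, B, Bdesc,
                 &beta, C, Cdesc, C, Cdesc, /*algo=*/NULL,
                 workspace, workspaceSize, stream);

  cublasLtMatrixLayoutDestroy(Adesc);
  cublasLtMatrixLayoutDestroy(Bdesc);
  cublasLtMatrixLayoutDestroy(Cdesc);
  cublasLtMatmulDescDestroy(opDesc);
}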


MATMUL ALGO

Flexible heuristics and implementations:

• Parameter programmability
• Adaptive algorithm heuristics: construct plans in the spirit of FFTW_ESTIMATE / FFTW_MEASURE / FFTW_EXHAUSTIVE
• Split-K and reduction algorithms
• MMA (Tensor Cores)
• Logical reordering of compute elements (multi-GPU, cache-adaptive, etc.)
• Math mode (atomics, reproducibility, high precision, Gaussian, etc.)
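For explicit algorithm selection, the released API pairs a preference object with cublasLtMatmulAlgoGetHeuristic, which returns ranked candidates. A hedged sketch (error handling omitted; real code should check that at least one result was returned):

#include <cublasLt.h>

// Sketch: ask cuBLASLt for a ranked algorithm candidate instead of
// passing algo = NULL to cublasLtMatmul.
cublasLtMatmulHeuristicResult_t
pick_algo(cublasLtHandle_t handle, cublasLtMatmulDesc_t opDesc,
          cublasLtMatrixLayout_t Adesc, cublasLtMatrixLayout_t Bdesc,
          cublasLtMatrixLayout_t Cdesc, size_t maxWorkspace) {
  cublasLtMatmulPreference_t pref;
  cublasLtMatmulPreferenceCreate(&pref);
  cublasLtMatmulPreferenceSetAttribute(
      pref, CUBLASLT_MATMUL_PREF_MAX_WORKSPACE_BYTES,
      &maxWorkspace, sizeof(maxWorkspace));

  cublasLtMatmulHeuristicResult_t result;
  int returned = 0;
  cublasLtMatmulAlgoGetHeuristic(handle, opDesc, Adesc, Bdesc,
                                 Cdesc, Cdesc, pref,
                                 /*requestedAlgoCount=*/1,
                                 &result, &returned);
  cublasLtMatmulPreferenceDestroy(pref);
  return result;   // result.algo feeds the algo argument of cublasLtMatmul
}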


AGENDA Reduced Precision and Reproducibility


REDUCED PRECISION

• cublasGemmEx()
• cublasGemmBatchedEx()
• cublasGemmStridedBatchedEx()
• cublasLtMatmul()

GemmEx and Matmul
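As an illustration of GemmEx, the sketch below multiplies half-precision inputs with float accumulation on Tensor Cores. It uses the CUDA 10-era signature in which computeType is a cudaDataType_t; CUDA 11 changed that argument to a cublasComputeType_t.

#include <cublas_v2.h>
#include <cuda_fp16.h>

// Sketch: mixed-precision GEMM with half inputs and float accumulate.
void gemm_ex_fp16(cublasHandle_t handle, int m, int n, int k,
                  const __half *A, const __half *B, float *C) {
  float alpha = 1.f, beta = 0.f;
  cublasGemmEx(handle, CUBLAS_OP_N, CUBLAS_OP_N, m, n, k,
               &alpha,
               A, CUDA_R_16F, m,      // A: half, lda = m (column-major)
               B, CUDA_R_16F, k,      // B: half, ldb = k
               &beta,
               C, CUDA_R_32F, m,      // C: float, ldc = m
               CUDA_R_32F,            // accumulate in float
               CUBLAS_GEMM_DEFAULT_TENSOR_OP);  // allow Tensor Cores
}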


NOTE ON REPRODUCIBILITY

By design, every cuBLAS routine generates bit-wise identical results from run to run, provided:

• the same toolkit version, and
• the same architecture (same number of SMs).

For some routines, such as
• cublas<T>symv
• cublas<T>hemv
reproducibility can be sacrificed for efficiency with cublasSetAtomicsMode(), as sketched below.
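A minimal sketch of opting into the faster, non-deterministic atomics path around a symv call and then restoring the default:

#include <cublas_v2.h>

// Sketch: allow atomics-based implementations (faster, but results may
// differ bit-wise between runs), then restore the reproducible default.
void symv_fast(cublasHandle_t handle, int n, const float *A, int lda,
               const float *x, float *y) {
  float alpha = 1.f, beta = 0.f;
  int incx = 1, incy = 1;

  cublasSetAtomicsMode(handle, CUBLAS_ATOMICS_ALLOWED);
  cublasSsymv(handle, CUBLAS_FILL_MODE_LOWER, n, &alpha, A, lda,
              x, incx, &beta, y, incy);
  cublasSetAtomicsMode(handle, CUBLAS_ATOMICS_NOT_ALLOWED);  // default mode
}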