Tensor Contractions with Extended BLAS Kernels on CPU and GPU

Cris Cecka
Senior Research Scientist
NVIDIA Research, Santa Clara, California

Joint work with Yang Shi, U.N. Niranjan, and Animashree Anandkumar
Electrical Engineering and Computer Science
University of California, Irvine, California

SIAM CSE 2017
July 10, 2017

Slides: …ccecka.com/resources/Slides/Tensor_BLAS_SIAM_CSE_2017.pdf

Tensor Contraction-Motivation

Modern data is inherently multi-dimensional

[Figure: layered diagram with the labels Input, Hidden 1, Hidden 2, Output.]

Tensor Contraction-Motivation

What is tensor contraction?

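$C = A\,B$

[Figure: $A$ ($4 \times 2 \times 2$), shown as the slices A(:,1,:) and A(:,2,:), contracted with $B$ ($2 \times 1$) to give $C$ ($4 \times 2 \times 1$).]

e.g. $C_{mnp} = A_{mnk} B_{kp}$, with summation over the repeated index $k$.

Why do we need tensor contraction?

1. Core primitive of multilinear algebra.

2. BLAS Level 3: Unbounded compute intensity.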
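
As a point of reference, the example contraction $C_{mnp} = A_{mnk} B_{kp}$ can be written as a plain loop nest. This is only a minimal sketch: the function name `contract_mnk_kp` is hypothetical, and it assumes packed storage with the first index fastest, which matches the column-major indexing used later in the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// C_{mnp} = sum_k A_{mnk} * B_{kp}.
// Storage assumption: packed arrays, first index fastest (column-major style).
std::vector<double> contract_mnk_kp(const std::vector<double>& A,  // M x N x K
                                    const std::vector<double>& B,  // K x P
                                    std::size_t M, std::size_t N,
                                    std::size_t K, std::size_t P) {
    assert(A.size() == M * N * K && B.size() == K * P);
    std::vector<double> C(M * N * P, 0.0);
    for (std::size_t p = 0; p < P; ++p)
        for (std::size_t k = 0; k < K; ++k)
            for (std::size_t n = 0; n < N; ++n)
                for (std::size_t m = 0; m < M; ++m)
                    C[m + n * M + p * M * N] +=
                        A[m + n * M + k * M * N] * B[k + p * K];
    return C;
}
```

Because $m$ and $n$ are adjacent in both $A$ and $C$ under this layout, the same result is a single matrix product $C_{(mn)p} = A_{(mn)k} B_{kp}$, which is the kind of flattening the later slides rely on.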

Tensor Contraction – Motivation

Lots of hot applications at the moment:

Machine learning

Deep learning

e.g. Learning latent variable model with tensor decomposition:

Topic model [1]

h: PDF of topics in a document.

A: Topic-word matrix.

$A_{ij} = P(x_m = i \mid y_m = j)$

Form third-order tensor $M_3 = \mathbb{E}(x \otimes x \otimes x) = \sum_i h_i \, a_i \otimes a_i \otimes a_i$

[1] Tensor Decompositions for Learning Latent Variable Models, Anima Anandkumar, Rong Ge, Daniel Hsu et al.
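
To make the third-order moment concrete, the sum $\sum_i h_i \, a_i \otimes a_i \otimes a_i$ can be accumulated directly. The sketch below is illustrative only: `third_moment`, `D` (vocabulary size) and `R` (number of topics) are names introduced here, and it assumes that $a_i$ is column $i$ of a packed, column-major $D \times R$ matrix.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// M3[x + y*D + z*D*D] = sum_i h[i] * a_i[x] * a_i[y] * a_i[z],
// where a_i is column i of the D x R matrix A (assumed column-major).
std::vector<double> third_moment(const std::vector<double>& h,
                                 const std::vector<double>& A,
                                 std::size_t D, std::size_t R) {
    assert(h.size() == R && A.size() == D * R);
    std::vector<double> M3(D * D * D, 0.0);
    for (std::size_t i = 0; i < R; ++i)
        for (std::size_t z = 0; z < D; ++z)
            for (std::size_t y = 0; y < D; ++y)
                for (std::size_t x = 0; x < D; ++x)
                    M3[x + y * D + z * D * D] +=
                        h[i] * A[x + i * D] * A[y + i * D] * A[z + i * D];
    return M3;
}
```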

Tensor Contraction – Motivation

Distributed FFT

Tensor Contraction-Motivation

What do we have?

Tensor computation libraries

1. Arbitrary/restricted tensor operation of any order and dimension
   1. Tensor Toolbox (Matlab)
   2. FTensor (C++)
   3. Cyclops (C++)
   4. BTAS (C++)
   5. All the Python...

Efficient computing frameworks

1. Static analysis solutions
   1. PPCG [ISL] (polyhedral)
   2. TCE (DSL)
2. Parallel and distributed primitives
   1. BLAS, cuBLAS
   2. BLIS, BLASX, cuBLASXT

Tensor Contraction-Motivation

Libraries

Explicit permutation dominates.

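Consider $C_{mnp} = A_{km} B_{pkn}$.

1. $A_{km} \to A_{mk}$

2. $B_{pkn} \to B_{kpn}$

3. $C_{mnp} \to C_{mpn}$

4. $C_{m(pn)} = A_{mk} B_{k(pn)}$

5. $C_{mpn} \to C_{mnp}$

[Figure: (Top) CPU. (Bottom) GPU. The fraction of time spent in copies/transpositions, from 0 to 1, plotted against n. Lines are shown with 1, 2, 3, and 6 transpositions.]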
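
The permutation pipeline can be sketched on the CPU as follows. This is a hedged illustration rather than any library's actual code path: `gemm_nn` and `contract_with_permutes` are hypothetical helpers, all arrays are assumed packed with the first index fastest, and step 3 is skipped because the sketch fixes beta = 0.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

// Column-major GEMM on packed arrays: C(M x N) = A(M x K) * B(K x N).
static void gemm_nn(const Vec& A, const Vec& B, Vec& C,
                    std::size_t M, std::size_t N, std::size_t K) {
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t m = 0; m < M; ++m) {
            double acc = 0.0;
            for (std::size_t k = 0; k < K; ++k) acc += A[m + k * M] * B[k + n * K];
            C[m + n * M] = acc;
        }
}

// C_{mnp} = sum_k A_{km} B_{pkn}, computed with explicit permutations.
Vec contract_with_permutes(const Vec& A, const Vec& B, std::size_t M,
                           std::size_t N, std::size_t P, std::size_t K) {
    assert(A.size() == K * M && B.size() == P * K * N);
    Vec Amk(M * K), Bkpn(K * P * N), Cmpn(M * P * N), Cmnp(M * N * P);
    // Step 1: A_{km} -> A_{mk}
    for (std::size_t k = 0; k < K; ++k)
        for (std::size_t m = 0; m < M; ++m) Amk[m + k * M] = A[k + m * K];
    // Step 2: B_{pkn} -> B_{kpn}
    for (std::size_t n = 0; n < N; ++n)
        for (std::size_t p = 0; p < P; ++p)
            for (std::size_t k = 0; k < K; ++k)
                Bkpn[k + p * K + n * K * P] = B[p + k * P + n * P * K];
    // Step 3 (C_{mnp} -> C_{mpn}) is skipped: with beta = 0 the old C is never read.
    // Step 4: C_{m(pn)} = A_{mk} B_{k(pn)}, a single GEMM with P*N columns.
    gemm_nn(Amk, Bkpn, Cmpn, M, P * N, K);
    // Step 5: C_{mpn} -> C_{mnp}
    for (std::size_t p = 0; p < P; ++p)
        for (std::size_t n = 0; n < N; ++n)
            for (std::size_t m = 0; m < M; ++m)
                Cmnp[m + n * M + p * M * N] = Cmpn[m + p * M + n * M * P];
    return Cmnp;
}
```

Steps 1, 2 and 5 only move data. For small tensors those copies are comparable in cost to the GEMM itself, which is the overhead the slide is pointing at.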

Existing Primitives

GEMM

Suboptimal for many small matrices.

Pointer-to-Pointer BatchedGEMM

Available in MKL 11.3β and cuBLAS 4.1

$C[p] = \alpha \, \mathrm{op}(A[p]) \, \mathrm{op}(B[p]) + \beta \, C[p]$

    cublas<T>gemmBatched(cublasHandle_t handle,
                         cublasOperation_t transA, cublasOperation_t transB,
                         int M, int N, int K,
                         const T* alpha,
                         const T** A, int ldA,
                         const T** B, int ldB,
                         const T* beta,
                         T** C, int ldC,
                         int batchCount)
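
The pointer-to-pointer semantics can be mimicked on the CPU. The following is a sketch under assumptions, not the cuBLAS or MKL implementation: `gemm_batched_ref` and `slice_pointers` are hypothetical names, transposes are omitted, and matrices are column-major. It also shows the extra pointer array that has to be built before batching over the slices of a packed tensor.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU sketch of C[p] = alpha * A[p] * B[p] + beta * C[p] (no transposes).
// Every batch entry is reached through one pointer indirection.
void gemm_batched_ref(int M, int N, int K, double alpha,
                      const double* const* A, int ldA,
                      const double* const* B, int ldB,
                      double beta, double* const* C, int ldC, int batchCount) {
    for (int p = 0; p < batchCount; ++p)
        for (int n = 0; n < N; ++n)
            for (int m = 0; m < M; ++m) {
                double acc = 0.0;
                for (int k = 0; k < K; ++k)
                    acc += A[p][m + k * ldA] * B[p][k + n * ldB];
                double& c = C[p][m + n * ldC];
                // With beta == 0 the old value of C is not read.
                c = alpha * acc + (beta == 0.0 ? 0.0 : beta * c);
            }
}

// Pointer array for batching over the last mode of a packed X x Y x P tensor:
// entry p points at the p-th X x Y slice.
std::vector<const double*> slice_pointers(const std::vector<double>& T,
                                          std::size_t X, std::size_t Y,
                                          std::size_t P) {
    assert(T.size() == X * Y * P);
    std::vector<const double*> ptrs(P);
    for (std::size_t p = 0; p < P; ++p) ptrs[p] = T.data() + p * X * Y;
    return ptrs;
}
```

When the slices are evenly spaced, as they are here, the pointer array carries no information beyond a base address and a stride.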

Existing Primitives

Pointer-to-Pointer BatchedGEMM

Except actually...

Solution: StridedBatchedGEMM

StridedBatchedGEMM

Exists!

... Still no documentation?!?

Documentation as of last Tuesday!

In cuBLAS 8.0:

    $ grep StridedBatched -A 17 /usr/local/cuda/include/cublas_api.h
    CUBLASAPI cublasStatus_t cublasSgemmStridedBatched (cublasHandle_t handle,
                                                        cublasOperation_t transa,
                                                        cublasOperation_t transb,
                                                        int m,
                                                        int n,
                                                        int k,
                                                        const float *alpha,  // host or device pointer
                                                        const float *A,
                                                        int lda,
                                                        long long int strideA,  // purposely signed
                                                        const float *B,
                                                        int ldb,
                                                        long long int strideB,
                                                        const float *beta,  // host or device pointer
                                                        float *C,
                                                        int ldc,
                                                        long long int strideC,
                                                        int batchCount);
    ...

StridedBatchedGEMM

    cublas<T>gemmStridedBatched(cublasHandle_t handle,
                                cublasOperation_t transA, cublasOperation_t transB,
                                int M, int N, int K,
                                const T* alpha,
                                const T* A, int ldA1, long long int strideA,
                                const T* B, int ldB1, long long int strideB,
                                const T* beta,
                                T* C, int ldC1, long long int strideC,
                                int batchCount)

Common use case for Pointer-to-pointer BatchedGEMM.

No Pointer-to-pointer data structure or overhead.

Performance on par with pure GEMM (P100 and beyond).
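
What the strided interface computes can be spelled out with a small CPU reference. This is a sketch of the semantics only, assuming column-major storage. `Op`, `at` and `gemm_strided_batched_ref` are hypothetical names that mirror the argument order of the cuBLAS call but are not part of cuBLAS.

```cpp
#include <cassert>
#include <vector>

enum class Op { N, T };

// Element (r, c) of op(X), where X is column-major with leading dimension ld.
static double at(const double* X, int ld, Op op, int r, int c) {
    return op == Op::N ? X[r + c * ld] : X[c + r * ld];
}

// For each batch index p:
//   C_p = alpha * op(A_p) * op(B_p) + beta * C_p,
// with A_p = A + p*strideA, B_p = B + p*strideB, C_p = C + p*strideC.
void gemm_strided_batched_ref(Op transA, Op transB, int M, int N, int K,
                              double alpha,
                              const double* A, int ldA, long long strideA,
                              const double* B, int ldB, long long strideB,
                              double beta,
                              double* C, int ldC, long long strideC,
                              int batchCount) {
    assert(M >= 0 && N >= 0 && K >= 0 && batchCount >= 0);
    for (int p = 0; p < batchCount; ++p) {
        const double* Ap = A + p * strideA;  // no pointer array needed
        const double* Bp = B + p * strideB;
        double* Cp = C + p * strideC;
        for (int n = 0; n < N; ++n)
            for (int m = 0; m < M; ++m) {
                double acc = 0.0;
                for (int k = 0; k < K; ++k)
                    acc += at(Ap, ldA, transA, m, k) * at(Bp, ldB, transB, k, n);
                double& c = Cp[m + n * ldC];
                // With beta == 0 the old value of C is not read.
                c = alpha * acc + (beta == 0.0 ? 0.0 : beta * c);
            }
    }
}
```

A stride of 0 reuses the same matrix for every batch entry, which is how a matrix operand is shared across all slices of a tensor operand.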

Tensor Contraction with Extended BLAS Primitives

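$C_{mnp} = A_{**} \times B_{***}$

$C_{mnp} \equiv C[m + n \cdot \mathrm{ldC1} + p \cdot \mathrm{ldC2}]$

| Case | Contraction | Kernel1 | Kernel2 |
|---|---|---|---|
| 1.1 | $A_{mk} B_{knp}$ | $C_{m(np)} = A_{mk} B_{k(np)}$ | $C_{mn[p]} = A_{mk} B_{kn[p]}$ |
| 1.2 | $A_{mk} B_{kpn}$ | $C_{mn[p]} = A_{mk} B_{k[p]n}$ | $C_{m[n]p} = A_{mk} B_{kp[n]}$ |
| 1.3 | $A_{mk} B_{nkp}$ | $C_{mn[p]} = A_{mk} B^\top_{nk[p]}$ | |
| 1.4 | $A_{mk} B_{pkn}$ | $C_{m[n]p} = A_{mk} B^\top_{pk[n]}$ | |
| 1.5 | $A_{mk} B_{npk}$ | $C_{m(np)} = A_{mk} B^\top_{(np)k}$ | $C_{mn[p]} = A_{mk} B^\top_{n[p]k}$ |
| 1.6 | $A_{mk} B_{pnk}$ | $C_{m[n]p} = A_{mk} B^\top_{p[n]k}$ | |
| 2.1 | $A_{km} B_{knp}$ | $C_{m(np)} = A^\top_{km} B_{k(np)}$ | $C_{mn[p]} = A^\top_{km} B_{kn[p]}$ |
| 2.2 | $A_{km} B_{kpn}$ | $C_{mn[p]} = A^\top_{km} B_{k[p]n}$ | $C_{m[n]p} = A^\top_{km} B_{kp[n]}$ |
| 2.3 | $A_{km} B_{nkp}$ | $C_{mn[p]} = A^\top_{km} B^\top_{nk[p]}$ | |
| 2.4 | $A_{km} B_{pkn}$ | $C_{m[n]p} = A^\top_{km} B^\top_{pk[n]}$ | |
| 2.5 | $A_{km} B_{npk}$ | $C_{m(np)} = A^\top_{km} B^\top_{(np)k}$ | $C_{mn[p]} = A^\top_{km} B^\top_{n[p]k}$ |
| 2.6 | $A_{km} B_{pnk}$ | $C_{m[n]p} = A^\top_{km} B^\top_{p[n]k}$ | |
| 3.1 | $A_{nk} B_{kmp}$ | $C_{mn[p]} = B^\top_{km[p]} A^\top_{nk}$ | |
| 3.2 | $A_{nk} B_{kpm}$ | $C_{mn[p]} = B^\top_{k[p]m} A^\top_{nk}$ | |
| 3.3 | $A_{nk} B_{mkp}$ | $C_{mn[p]} = B_{mk[p]} A^\top_{nk}$ | |
| 3.4 | $A_{nk} B_{pkm}$ | | |
| 3.5 | $A_{nk} B_{mpk}$ | $C_{mn[p]} = B_{m[p]k} A^\top_{nk}$ | |
| 3.6 | $A_{nk} B_{pmk}$ | | |
| 4.1 | $A_{kn} B_{kmp}$ | $C_{mn[p]} = B^\top_{km[p]} A_{kn}$ | |
| 4.2 | $A_{kn} B_{kpm}$ | $C_{mn[p]} = B^\top_{k[p]m} A_{kn}$ | |
| 4.3 | $A_{kn} B_{mkp}$ | $C_{mn[p]} = B_{mk[p]} A_{kn}$ | |
| 4.4 | $A_{kn} B_{pkm}$ | | |
| 4.5 | $A_{kn} B_{mpk}$ | $C_{mn[p]} = B_{m[p]k} A_{kn}$ | |
| 4.6 | $A_{kn} B_{pmk}$ | | |
| 5.1 | $A_{pk} B_{kmn}$ | $C_{(mn)p} = B^\top_{k(mn)} A^\top_{pk}$ | $C_{m[n]p} = B^\top_{km[n]} A^\top_{pk}$ |
| 5.2 | $A_{pk} B_{knm}$ | $C_{m[n]p} = B^\top_{k[n]m} A^\top_{pk}$ | |
| 5.3 | $A_{pk} B_{mkn}$ | $C_{m[n]p} = B_{mk[n]} A^\top_{pk}$ | |
| 5.4 | $A_{pk} B_{nkm}$ | | |
| 5.5 | $A_{pk} B_{mnk}$ | $C_{(mn)p} = B_{(mn)k} A^\top_{pk}$ | $C_{m[n]p} = B_{m[n]k} A^\top_{pk}$ |
| 5.6 | $A_{pk} B_{nmk}$ | | |
| 6.1 | $A_{kp} B_{kmn}$ | $C_{(mn)p} = B^\top_{k(mn)} A_{kp}$ | $C_{m[n]p} = B^\top_{km[n]} A_{kp}$ |
| 6.2 | $A_{kp} B_{knm}$ | $C_{m[n]p} = B^\top_{k[n]m} A_{kp}$ | |
| 6.3 | $A_{kp} B_{mkn}$ | $C_{m[n]p} = B_{mk[n]} A_{kp}$ | |
| 6.4 | $A_{kp} B_{nkm}$ | | |
| 6.5 | $A_{kp} B_{mnk}$ | $C_{(mn)p} = B_{(mn)k} A_{kp}$ | $C_{m[n]p} = B_{m[n]k} A_{kp}$ |
| 6.6 | $A_{kp} B_{nmk}$ | | |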

Tensor Contraction with Extended BLAS Primitives

| Case | Contraction | Kernel1 | Kernel2 | Kernel3 |
|---|---|---|---|---|
| 1.1 | $A_{mk} B_{knp}$ | $C_{m(np)} = A_{mk} B_{k(np)}$ | $C_{mn[p]} = A_{mk} B_{kn[p]}$ | $C_{m[n]p} = A_{mk} B_{k[n]p}$ |
| 6.1 | $A_{kp} B_{kmn}$ | $C_{(mn)p} = B^\top_{k(mn)} A_{kp}$ | $C_{m[n]p} = B^\top_{km[n]} A_{kp}$ | |

Example: Mappings to Level 3 BLAS routines

Case 1.1, Kernel2: $C_{mn[p]} = A_{mk} B_{kn[p]}$

    cublasDgemmStridedBatched(handle,
                              CUBLAS_OP_N, CUBLAS_OP_N,
                              M, N, K,
                              &alpha,
                              A, ldA1, 0,
                              B, ldB1, ldB2,
                              &beta,
                              C, ldC1, ldC2,
                              P)
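
To see that Kernel1 and Kernel2 of Case 1.1 agree, both can be run on the CPU and compared. The sketch below reads `(np)` as a flattened mode and `[p]` as the batched mode. `gemm_nn`, `case11_flat` and `case11_batched` are hypothetical helpers, and the tensors are assumed packed, so ldA1 = M, ldB1 = K, ldB2 = K*N, ldC1 = M and ldC2 = M*N.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Column-major C(M x N) = A(M x K) * B(K x N) on raw pointers.
static void gemm_nn(int M, int N, int K, const double* A, int ldA,
                    const double* B, int ldB, double* C, int ldC) {
    for (int n = 0; n < N; ++n)
        for (int m = 0; m < M; ++m) {
            double acc = 0.0;
            for (int k = 0; k < K; ++k) acc += A[m + k * ldA] * B[k + n * ldB];
            C[m + n * ldC] = acc;
        }
}

// Kernel1: flatten (np) and run a single GEMM with N*P columns.
std::vector<double> case11_flat(const std::vector<double>& A,
                                const std::vector<double>& B,
                                int M, int N, int K, int P) {
    std::vector<double> C(static_cast<std::size_t>(M) * N * P);
    gemm_nn(M, N * P, K, A.data(), M, B.data(), K, C.data(), M);
    return C;
}

// Kernel2: batch over p. A is shared (stride 0); B and C advance by
// ldB2 and ldC2 per batch entry, as in the strided batched call.
std::vector<double> case11_batched(const std::vector<double>& A,
                                   const std::vector<double>& B,
                                   int M, int N, int K, int P) {
    const long long strideA = 0;
    const long long ldB2 = static_cast<long long>(K) * N;
    const long long ldC2 = static_cast<long long>(M) * N;
    std::vector<double> C(static_cast<std::size_t>(M) * N * P);
    for (int p = 0; p < P; ++p)
        gemm_nn(M, N, K, A.data() + p * strideA, M,
                B.data() + p * ldB2, K, C.data() + p * ldC2, M);
    return C;
}
```

With packed storage the two kernels touch the same memory in the same order, so the choice between them is purely a performance question.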

Case 6.1, Kernel2: $C_{m[n]p} = B^\top_{km[n]} A_{kp}$

    cublasDgemmStridedBatched(handle,
                              CUBLAS_OP_T, CUBLAS_OP_N,
                              M, P, K,
                              &alpha,
                              B, ldB1, ldB2,
                              A, ldA1, 0,
                              &beta,
                              C, ldC2, ldC1,
                              N)
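
The swapped roles of ldC1 and ldC2 in Case 6.1 are easy to get wrong, so a CPU sketch of the same access pattern may help. It assumes packed tensors with the first index fastest and alpha = 1, beta = 0. `case61_batched` is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Case 6.1, Kernel2: C_{m[n]p} = sum_k B_{km[n]} A_{kp}, batched over n.
// Packed layouts: A is K x P, B is K x M x N, C is M x N x P.
std::vector<double> case61_batched(const std::vector<double>& A,
                                   const std::vector<double>& B,
                                   std::size_t M, std::size_t N,
                                   std::size_t P, std::size_t K) {
    assert(A.size() == K * P && B.size() == K * M * N);
    const std::size_t ldA1 = K, ldB1 = K, ldB2 = K * M;
    const std::size_t ldC1 = M, ldC2 = M * N;
    std::vector<double> C(M * N * P, 0.0);
    for (std::size_t n = 0; n < N; ++n) {
        const double* Bn = B.data() + n * ldB2;  // first operand, batch stride ldB2
        const double* An = A.data();             // second operand, batch stride 0
        double* Cn = C.data() + n * ldC1;        // output, batch stride ldC1
        for (std::size_t p = 0; p < P; ++p)
            for (std::size_t m = 0; m < M; ++m) {
                double acc = 0.0;
                for (std::size_t k = 0; k < K; ++k)
                    acc += Bn[k + m * ldB1] * An[k + p * ldA1];  // reads B transposed
                Cn[m + p * ldC2] = acc;  // leading dimension of the output is ldC2
            }
    }
    return C;
}
```

Each batch entry writes an M x P matrix whose columns are ldC2 apart and whose base moves by ldC1, which is why the call passes `C, ldC2, ldC1`.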

Performance

Flatten vs. SBGEMM

[Figure: two panels plotting Flattening Speedup (Batch / Flat), from 1 to 3, against n from 0 to 500. Legend: Case 1.1 [n], Case 1.1 [p], Case 1.5 [p], Case 6.1 [n].]

Prefer flattening to “pure” GEMM.

Performance

Batching in last mode versus middle mode

[Figure: two panels plotting Last Mode Speedup ([n] / [p]), from 0.9 to 1.2, against n from 0 to 500. Legend: Case 1.1, Case 2.1.]

On CPU: Prefer batching in the last mode.

Performance

Mixed mode batching

[Figure: two panels plotting Last Output Mode Speedup ([n] / [p]), from 0.9 to 1.2, against n from 0 to 500. Legend: Case 1.2, Case 2.2.]

On CPU: mode of the output tensor is more important than the batching mode of the input tensor.

Analysis

Exceptional Cases:

Cannot be computed by StridedBatchedGEMM.

| Case | Contraction |
|---|---|
| 3.4 | $C_{mnp} = A_{nk} B_{pkm}$ |
| 6.4 | $C_{mnp} = A_{kp} B_{nkm}$ |
| | $C_{mnp} = A_{mkp} B_{mkn}$ |
| | $C_{mnp} = A_{pkm} B_{nkp}$ |

Example of exceptional cases.

These cases are precisely the interleaved GEMMs.

When batching index is the major index in an argument:

That argument is interpreted as interleaved matrices.

May be one or both inputs and/or output.

3D Tiled GEMM

Implement GEMM with a 3D tile:

Transpositions performed on the way to smem/reg.

Keep canonical GEMM core.

Considers three modes rather than two:

Major mode: $A_{mnkpqr}$

Reduction mode: $A_{mnkpqr}$

Aux (batch, row, col) mode: $A_{mnkpqr}$ (Optional)

Third tile dimension interpolates between pure GEMM and interleaved GEMM.

Nested loop over remaining modes performs full contraction.

3D Tiled GEMM

Tilesize tuning with PPCG for exceptional cases:

[Figure: grid over Blocking (m,n) and Blocking (p,k), each taking the values (1,1), (2,1), (4,1), (8,1), (16,1), (32,1), (64,1), (128,1); scale from 4.5 to 6.5.]

[Figure: Time [µs], log scale from 10^1 to 10^6, against n from 0 to 250. Legend: PPCG, BATCHEDGEMV, BATCHEDGEMM, GEAM.]

3D Tiled GEMM

$C_{mnp} = A_{mkp} B_{nkp}$: Increasing BLK_P decreases effective tile size.

$C_{pmn} = A_{pmk} B_{pnk}$: Increasing BLK_P increases cache line utilization.

e.g. BLK_P = 1, 2, 4, 8

BLK_P = 1 equivalent to BLIS (strides in row and column)

3D Tiled GEMM – Interface?

Extend the StridedBatchedGEMM transpose parameters?

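| ✓ | $C_{mnp}$ | | $A_{pmk}$ | $B_{pkn}$ | EX_N | EX_N |
| ✓ | $C_{mpn}$ | = | $A_{pmk}$ | $B_{pnk}$ | EX_N | EX_T |
| ✗ | $C_{pmn}$ | | $A_{pkm}$ | $B_{pkn}$ | EX_T | EX_N |
| | | | $A_{pkm}$ | $B_{pnk}$ | EX_T | EX_T |

E.g. $C_{mn[p]} = A_{m[p]k} B_{[p]nk}$

    cublasDgemmStridedBatched(handle,
                              CUBLAS_OP_N, CUBLAS_OP_EX_T,
                              M, N, K,
                              &alpha,
                              A, ldA2, ldA1,
                              B, ldB1, ldB2,
                              &beta,
                              C, ldC1, ldC2,
                              P);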

contract

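    contract(cublas::par,
             alpha,
             A, {M,P,K}, _<'m','p','k'>,
             B, {K,N,P}, _<'k','n','p'>,
             beta,
             C, _<'m','n','p'>);

| ROWIDX | COLIDX | BATIDX | REDIDX | Kernel | e.g. |
|---|---|---|---|---|---|
| 0 | 0 | 0 | 0 | mult | $C = A\,B$ |
| 0 | 0 | 0 | 1 | dot | $C = A_{k} B_{k}$ |
| 0 | 0 | 1 | 0 | XXX (scalar-mult) | $C_{p} = A_{p} B_{p}$ |
| 0 | 0 | 1 | 1 | XXX (nested-dot?) | $C_{p} = A_{kp} B_{kp}$ |
| 0 | 1 | 0 | 0 | scal | $C_{n} = A\,B_{n}$ |
| 0 | 1 | 0 | 1 | gemv | $C_{n} = A_{k} B_{kn}$ |
| 0 | 1 | 1 | 0 | dgmm (cublas?) | $C_{np} = A_{p} B_{np}$ |
| 0 | 1 | 1 | 1 | batch gemm (m=1?) | $C_{np} = A_{kp} B_{nkp}$ |
| 1 | 0 | 0 | 0 | scal | $C_{m} = A_{m} B$ |
| 1 | 0 | 0 | 1 | gemv | $C_{m} = A_{mk} B_{k}$ |
| 1 | 0 | 1 | 0 | dgmm (cublas?) | $C_{mp} = A_{mp} B_{p}$ |
| 1 | 0 | 1 | 1 | batch gemm (n=1?) | $C_{mp} = A_{mkp} B_{kp}$ |
| 1 | 1 | 0 | 0 | ger | $C_{mn} = A_{m} B_{n}$ |
| 1 | 1 | 0 | 1 | gemm | $C_{mn} = A_{mk} B_{kn}$ |
| 1 | 1 | 1 | 0 | batch gemm (k=1?) | $C_{mnp} = A_{mp} B_{np}$ |
| 1 | 1 | 1 | 1 | batch gemm | $C_{mnp} = A_{mkp} B_{nkp}$ |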
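
The dispatch implied by the table can be sketched as a classification of index labels followed by a lookup. This is an assumption-laden illustration, not the `contract` implementation. `IndexKinds`, `classify` and `kernel_for` are hypothetical names, and the rule used to classify an index (row: in A and C; column: in B and C; batch: in all three; reduction: in A and B only) is inferred from the example column of the table.

```cpp
#include <cassert>
#include <string>

struct IndexKinds { bool row, col, bat, red; };

// Classifies the index labels of C = A * B by where each label appears.
IndexKinds classify(const std::string& a, const std::string& b,
                    const std::string& c) {
    IndexKinds k{false, false, false, false};
    for (char i : a + b) {
        const bool inA = a.find(i) != std::string::npos;
        const bool inB = b.find(i) != std::string::npos;
        const bool inC = c.find(i) != std::string::npos;
        if (inA && inB && inC) k.bat = true;   // batch index
        else if (inA && inB)   k.red = true;   // reduction index
        else if (inA && inC)   k.row = true;   // row index
        else if (inB && inC)   k.col = true;   // column index
    }
    return k;
}

// Kernel names copied from the table, ordered by the bit pattern
// (ROWIDX, COLIDX, BATIDX, REDIDX).
std::string kernel_for(const IndexKinds& k) {
    static const char* const table[16] = {
        "mult", "dot", "XXX (scalar-mult)", "XXX (nested-dot?)",
        "scal", "gemv", "dgmm (cublas?)", "batch gemm (m=1?)",
        "scal", "gemv", "dgmm (cublas?)", "batch gemm (n=1?)",
        "ger", "gemm", "batch gemm (k=1?)", "batch gemm"};
    const int key = (k.row << 3) | (k.col << 2) | (k.bat << 1) | int(k.red);
    return table[key];
}
```

For example, `kernel_for(classify("mkp", "nkp", "mnp"))` selects the last row of the table.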

Applications: Tucker Decomposition

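$T_{mnp} = G_{ijk} A_{mi} B_{nj} C_{pk}$

[Figure: diagram of $T_{mnp}$ expressed through $G_{ijk}$, $A_{mi}$, $B_{nj}$ and $C_{pk}$.]

Main steps in the algorithm

$Y_{mjk} = T_{mnp} B^{t}_{nj} C^{t}_{pk}$

$Y_{ink} = T_{mnp} A^{t+1}_{mi} C^{t}_{pk}$

$Y_{ijp} = T_{mnp} B^{t+1}_{nj} A^{t+1}_{mi}$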
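
Each of these steps is a pair of tensor-matrix contractions. As a sketch, the first step $Y_{mjk} = T_{mnp} B_{nj} C_{pk}$ can be evaluated as two pairwise contractions, the first batched over $p$ and the second a single flattened product. `tucker_step`, `J` and `Kc` (the extents of $j$ and $k$) are names introduced here, and packed first-index-fastest storage is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Y_{mjk} = sum_{n,p} T_{mnp} B_{nj} C_{pk}, as two pairwise contractions.
std::vector<double> tucker_step(const std::vector<double>& T,  // M x N x P
                                const std::vector<double>& B,  // N x J
                                const std::vector<double>& C,  // P x Kc
                                std::size_t M, std::size_t N, std::size_t P,
                                std::size_t J, std::size_t Kc) {
    assert(T.size() == M * N * P && B.size() == N * J && C.size() == P * Kc);
    // X_{mj[p]} = sum_n T_{mn[p]} B_{nj}: one (M x N)(N x J) product per p,
    // with B shared across the batch.
    std::vector<double> X(M * J * P, 0.0);
    for (std::size_t p = 0; p < P; ++p)
        for (std::size_t j = 0; j < J; ++j)
            for (std::size_t n = 0; n < N; ++n)
                for (std::size_t m = 0; m < M; ++m)
                    X[m + j * M + p * M * J] += T[m + n * M + p * M * N] * B[n + j * N];
    // Y_{(mj)k} = sum_p X_{(mj)p} C_{pk}: a single (MJ x P)(P x Kc) product.
    std::vector<double> Y(M * J * Kc, 0.0);
    for (std::size_t k = 0; k < Kc; ++k)
        for (std::size_t p = 0; p < P; ++p)
            for (std::size_t mj = 0; mj < M * J; ++mj)
                Y[mj + k * M * J] += X[mj + p * M * J] * C[p + k * P];
    return Y;
}
```

Neither stage needs an explicit transposition: the first fits the strided batched pattern with a stride of 0 on B, and the second is an ordinary GEMM on flattened modes.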

Applications: Tucker Decomposition

Performance on Tucker decomposition:

[Figure: Time (sec), log scale from 10^-2 to 10^6, against n from 20 to 120. Legend: TensorToolbox, BTAS, Cyclops, CPU Batched, GPU Batched.]

Applications: FFT

Low-Communication FFT for multiple GPUs.

StridedBatchedGEMM composes 75%+ of the runtime.

Essential to the performance.

Two custom kernels are precisely interleaved GEMMs.

2 P100 GPUs: 1.3x over cuFFTXT.

8 P100 GPUs: 2.1x over cuFFTXT.

Conclusion

StridedBatchedGEMM in cuBLAS for generalized tensor contractions.

Avoid explicit transpositions or permutations.

10x (GPU) and 2x (CPU) speedup on small and moderately sized tensors.

Available in cuBLAS 8.0

Future work:

Exceptional case kernels/performance/interface??

Library Optimizations

Matrix stride zero – Persistent Matrix Strided Batched GEMM

Staged – RNNs: Staged Persistent Matrix Strided Batched GEMM

Thank you!

Questions?
