CUBLAS Library

Dr. Bo Yuan

E-mail: yuanb@sz.tsinghua.edu.cn

What is the CUBLAS Library?

• BLAS
– Basic Linear Algebra Subprograms
– A library of routines for basic linear algebra operations
– Divided into three levels
– Implementations include MKL BLAS, CUBLAS and C++ AMP BLAS
• CUBLAS
– A high-level implementation of BLAS on top of the NVIDIA CUDA runtime
– Runs on a single GPU or multiple GPUs
– Supports CUDA streams

Three Levels of BLAS

Level 1: This level contains vector operations of the form
$y \leftarrow \alpha x + y$

Level 2: This level contains matrix-vector operations of the form
$y \leftarrow \alpha A x + \beta y$

Level 3: This level contains matrix-matrix operations of the form
$C \leftarrow \alpha A B + \beta C$
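To make the three levels concrete, here is a minimal CPU-only sketch in plain C with one representative operation per level. The function names (ref_saxpy, ref_sgemv, ref_sgemm) are hypothetical and are not part of CUBLAS. The sketch assumes column-major storage and no transposition, and it only illustrates the arithmetic, not how CUBLAS implements it.

```c
#include <stddef.h>

/* Level 1 (vector-vector): y <- alpha*x + y, the SAXPY operation. */
void ref_saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; ++i)
        y[i] += alpha * x[i];
}

/* Level 2 (matrix-vector): y <- alpha*A*x + beta*y, the SGEMV operation.
 * A is m x n, stored column by column with leading dimension lda. */
void ref_sgemv(int m, int n, float alpha, const float *A, int lda,
               const float *x, float beta, float *y)
{
    for (int i = 0; i < m; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < n; ++j)
            acc += A[(size_t)j * lda + i] * x[j];
        y[i] = alpha * acc + beta * y[i];
    }
}

/* Level 3 (matrix-matrix): C <- alpha*A*B + beta*C, the SGEMM operation.
 * A is m x k, B is k x n, C is m x n, all column-major. */
void ref_sgemm(int m, int n, int k, float alpha,
               const float *A, int lda, const float *B, int ldb,
               float beta, float *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[(size_t)p * lda + i] * B[(size_t)j * ldb + p];
            C[(size_t)j * ldc + i] = alpha * acc + beta * C[(size_t)j * ldc + i];
        }
    }
}
```

The work grows from O(n) at level 1 to O(n^2) at level 2 and O(n^3) at level 3, which is why level-3 routines usually benefit most from a GPU.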

Why do we need CUBLAS?

• CUBLAS
– Full support for all 152 standard BLAS routines
– Supports single-precision, double-precision, complex and double-complex data types
– Supports CUDA streams
– Fortran bindings
– Supports multiple GPUs and concurrent kernels
– Very efficient

Getting Started

• Basic preparation
– Install the CUDA Toolkit
– Include cublas_v2.h
– Link against cublas.lib (on Linux, libcublas via -lcublas)
• Some basic tips
– Every CUBLAS function needs a handle
– CUBLAS functions must be called between cublasCreate() and cublasDestroy()
– Every CUBLAS function returns a cublasStatus_t that reports the state of execution
– Matrices use column-major storage
• References
– http://cudazone.nvidia.cn/cublas/
– CUDA Toolkit 5.0 CUBLAS Library.pdf
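The column-major tip above is the one that most often surprises C programmers, so here is a small sketch of it in plain C. The macro IDX2C and the helper fill_col_major are illustrative names chosen for this sketch, not definitions from these slides. It assumes 0-based indices and a leading dimension ld equal to the number of allocated rows.

```c
#include <stddef.h>

/* 0-based column-major index: element (i, j) of a matrix whose
 * leading dimension (allocated rows per column) is ld. */
#define IDX2C(i, j, ld) (((size_t)(j) * (size_t)(ld)) + (size_t)(i))

/* Fill an m x n column-major matrix so that A(i, j) = 10*i + j. */
void fill_col_major(float *a, int m, int n, int ld)
{
    for (int j = 0; j < n; ++j)         /* walk column by column */
        for (int i = 0; i < m; ++i)     /* consecutive i are adjacent in memory */
            a[IDX2C(i, j, ld)] = (float)(10 * i + j);
}
```

For a 2 x 3 matrix the memory order is A(0,0), A(1,0), A(0,1), A(1,1), A(0,2), A(1,2): each column is contiguous, whereas a C two-dimensional array keeps each row contiguous.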

CUBLAS Data Types

• cublasHandle_t
• cublasStatus_t
– CUBLAS_STATUS_SUCCESS
– CUBLAS_STATUS_NOT_INITIALIZED
– CUBLAS_STATUS_ALLOC_FAILED
– CUBLAS_STATUS_INVALID_VALUE
– CUBLAS_STATUS_ARCH_MISMATCH
– CUBLAS_STATUS_MAPPING_ERROR
– CUBLAS_STATUS_EXECUTION_FAILED
– CUBLAS_STATUS_INTERNAL_ERROR

• cublasOperation_t: CUBLAS_OP_N (no transpose), CUBLAS_OP_T (transpose) or CUBLAS_OP_C (conjugate transpose)

• cublasFillMode_t: CUBLAS_FILL_MODE_LOWER or CUBLAS_FILL_MODE_UPPER, the part of the matrix that is filled

• cublasSideMode_t: CUBLAS_SIDE_LEFT or CUBLAS_SIDE_RIGHT, the side on which the matrix operand appears

• cublasPointerMode_t: CUBLAS_POINTER_MODE_HOST or CUBLAS_POINTER_MODE_DEVICE, where scalar arguments reside

• cublasAtomicsMode_t: CUBLAS_ATOMICS_NOT_ALLOWED or CUBLAS_ATOMICS_ALLOWED

Example Code

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>
#include "cublas_v2.h" /* header required for every CUBLAS call */

#define M 6
#define N 5
/* 1-based, column-major index of element (i,j) with leading dimension ld */
#define IDX2F(i,j,ld) ((((j)-1)*(ld))+((i)-1))

/* Scale row p (columns q..n) by alpha, then column q (rows p..ldm) by beta. */
static __inline__ void modify(cublasHandle_t handle, float *m, int ldm, int n,
                              int p, int q, float alpha, float beta)
{
    /* The row has n-q+1 elements; n-p+1 would run past the last column here. */
    cublasSscal(handle, n-q+1, &alpha, &m[IDX2F(p,q,ldm)], ldm);
    cublasSscal(handle, ldm-p+1, &beta, &m[IDX2F(p,q,ldm)], 1);
}

int main(void)
{
    cudaError_t cudaStat;
    cublasStatus_t stat;
    cublasHandle_t handle;
    int i, j;
    float *devPtrA;
    float *a = 0;

    a = (float *)malloc(M*N*sizeof(*a)); /* allocate the array on the host */
    if (!a) {
        printf("host memory allocation failed");
        return EXIT_FAILURE;
    }
    for (j = 1; j <= N; j++) { /* initialize the array */
        for (i = 1; i <= M; i++) {
            a[IDX2F(i,j,M)] = (float)((i-1)*M+j);
        }
    }
    cudaStat = cudaMalloc((void**)&devPtrA, M*N*sizeof(*a)); /* allocate memory on the device */
    if (cudaStat != cudaSuccess) {
        printf("device memory allocation failed");
        return EXIT_FAILURE;
    }
    stat = cublasCreate(&handle); /* initialize the CUBLAS context */
    if (stat != CUBLAS_STATUS_SUCCESS) { /* compare against the CUBLAS status, not cudaSuccess */
        printf("CUBLAS initialization failed\n");
        return EXIT_FAILURE;
    }
    stat = cublasSetMatrix(M, N, sizeof(*a), a, M, devPtrA, M); /* copy data from host to device */
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("data download failed");
        cudaFree(devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    modify(handle, devPtrA, M, N, 2, 3, 16.0f, 12.0f);
    stat = cublasGetMatrix(M, N, sizeof(*a), devPtrA, M, a, M); /* copy data from device to host */
    if (stat != CUBLAS_STATUS_SUCCESS) {
        printf("data upload failed");
        cudaFree(devPtrA);
        cublasDestroy(handle);
        return EXIT_FAILURE;
    }
    cudaFree(devPtrA);     /* release the device memory */
    cublasDestroy(handle); /* shut down the CUBLAS context */
    for (j = 1; j <= N; j++) {
        for (i = 1; i <= M; i++) {
            printf("%7.0f", a[IDX2F(i,j,M)]);
        }
        printf("\n");
    }
    free(a);
    return EXIT_SUCCESS;
}
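The two cublasSscal calls in modify differ only in their stride argument. The following CPU-only sketch, which is an illustration and not part of the original example, mimics them in plain C. The names ref_sscal and modify_host are hypothetical stand-ins. It assumes the same 1-based column-major convention as IDX2F, and it takes the row segment from column q to column n, which has n - q + 1 elements.

```c
/* 1-based column-major index, same convention as the example's IDX2F. */
#define IDX2F(i, j, ld) ((((j) - 1) * (ld)) + ((i) - 1))

/* CPU stand-in for a strided scale: x[k*incx] *= alpha for k = 0..n-1. */
static void ref_sscal(int n, float alpha, float *x, int incx)
{
    for (int k = 0; k < n; ++k)
        x[k * incx] *= alpha;
}

/* Scale row p (columns q..n) by alpha, then column q (rows p..ldm) by beta. */
void modify_host(float *m, int ldm, int n, int p, int q,
                 float alpha, float beta)
{
    /* Stride ldm jumps one column at a time: this walks along row p. */
    ref_sscal(n - q + 1, alpha, &m[IDX2F(p, q, ldm)], ldm);
    /* Stride 1 visits adjacent elements: this walks down column q. */
    ref_sscal(ldm - p + 1, beta, &m[IDX2F(p, q, ldm)], 1);
}
```

With p = 2 and q = 3 on a 6 x 5 matrix, element (2,3) is touched by both calls and ends up multiplied by alpha * beta, while the rest of row 2 is scaled by alpha only and the rest of column 3 by beta only.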

Matrix Multiply

• Use a level-3 function
• Function signature

cublasStatus_t cublasSgemm(cublasHandle_t handle,
                           cublasOperation_t transa, cublasOperation_t transb,
                           int m, int n, int k,
                           const float *alpha,
                           const float *A, int lda,
                           const float *B, int ldb,
                           const float *beta,
                           float *C, int ldc)

$C \leftarrow \alpha\,\mathrm{op}(A)\,\mathrm{op}(B) + \beta C$
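The roles of m, n, k and the leading dimensions are easier to see in a reference loop. The sketch below is a CPU-only illustration in plain C of the formula above, restricted to real matrices. The enum ref_op_t and the functions op_at and ref_sgemm_op are hypothetical names for this sketch and do not belong to CUBLAS.

```c
#include <stddef.h>

/* Hypothetical stand-in for the transpose flag (real-valued case only). */
typedef enum { REF_OP_N, REF_OP_T } ref_op_t;

/* Element (i, j) of op(X) for a column-major X with leading dimension ld. */
static float op_at(ref_op_t trans, const float *X, int ld, int i, int j)
{
    return trans == REF_OP_N ? X[(size_t)j * ld + i]
                             : X[(size_t)i * ld + j];
}

/* C <- alpha * op(A) * op(B) + beta * C, where op(A) is m x k,
 * op(B) is k x n and C is m x n; all arrays are column-major. */
void ref_sgemm_op(ref_op_t transa, ref_op_t transb, int m, int n, int k,
                  float alpha, const float *A, int lda,
                  const float *B, int ldb,
                  float beta, float *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += op_at(transa, A, lda, i, p) * op_at(transb, B, ldb, p, j);
            float *c = &C[(size_t)j * ldc + i];
            *c = alpha * acc + beta * *c;
        }
    }
}
```

Note that m, n and k describe the shapes after op() is applied, while lda and ldb always describe the arrays as they are stored.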

int MatrixMulbyCUBLAS(float *A, float *B, int HA, int WB, int WA, float *C)
{
    /* A is HA x WA, B is WA x WB and C is HA x WB.
       CUDA_SAFE_CALL is assumed to be an error-checking macro defined
       elsewhere, such as the one in the CUDA SDK's cutil.h. */
    float *d_A, *d_B, *d_C;
    CUDA_SAFE_CALL(cudaMalloc((void **)&d_A, WA*HA*sizeof(float)));
    CUDA_SAFE_CALL(cudaMalloc((void **)&d_B, WB*WA*sizeof(float)));
    CUDA_SAFE_CALL(cudaMalloc((void **)&d_C, WB*HA*sizeof(float)));

    CUDA_SAFE_CALL(cudaMemcpy(d_A, A, WA*HA*sizeof(float), cudaMemcpyHostToDevice));
    CUDA_SAFE_CALL(cudaMemcpy(d_B, B, WB*WA*sizeof(float), cudaMemcpyHostToDevice));

    cublasStatus_t status;
    cublasHandle_t handle;
    status = cublasCreate(&handle);
    if (status != CUBLAS_STATUS_SUCCESS) {
        printf("CUBLAS initialization error\n");
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        return EXIT_FAILURE;
    }

    int devID;
    cudaDeviceProp props;
    CUDA_SAFE_CALL(cudaGetDevice(&devID));
    CUDA_SAFE_CALL(cudaGetDeviceProperties(&props, devID));
    printf("Device %d: \"%s\" with Compute %d.%d capability\n",
           devID, props.name, props.major, props.minor);

    const float alpha = 1.0f;
    const float beta = 0.0f;

    /* Level-3 call. The operand order suggests row-major host arrays:
       CUBLAS is column-major, so the row-major product C = A*B is obtained
       as C^T = B^T * A^T by passing B first and swapping the roles of m and n. */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                &alpha, d_B, WB, d_A, WA, &beta, d_C, WB);
    CUDA_SAFE_CALL(cudaMemcpy(C, d_C, WB*HA*sizeof(float), cudaMemcpyDeviceToHost));

    cublasDestroy(handle);
    cudaFree(d_A);
    cudaFree(d_B);
    cudaFree(d_C);
    return 0;
}
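The cublasSgemm call above passes B before A and uses WB, HA, WA as m, n, k. The CPU-only sketch below shows why that ordering yields the row-major product when the underlying routine is column-major. It assumes, as the argument order suggests but the slides do not state, that the host arrays are row-major. The names ref_sgemm_nn and matmul_row_major are hypothetical.

```c
#include <stddef.h>

/* Column-major reference GEMM without transposes:
 * C (m x n) <- alpha * A (m x k) * B (k x n) + beta * C. */
static void ref_sgemm_nn(int m, int n, int k, float alpha,
                         const float *A, int lda,
                         const float *B, int ldb,
                         float beta, float *C, int ldc)
{
    for (int j = 0; j < n; ++j) {
        for (int i = 0; i < m; ++i) {
            float acc = 0.0f;
            for (int p = 0; p < k; ++p)
                acc += A[(size_t)p * lda + i] * B[(size_t)j * ldb + p];
            C[(size_t)j * ldc + i] = alpha * acc + beta * C[(size_t)j * ldc + i];
        }
    }
}

/* Row-major product C (ha x wb) = A (ha x wa) * B (wa x wb).
 * A row-major array read as column-major is its transpose, so the
 * column-major routine is asked for C^T = B^T * A^T with the same
 * argument order as the cublasSgemm call in the slides. */
void matmul_row_major(const float *A, const float *B,
                      int ha, int wb, int wa, float *C)
{
    ref_sgemm_nn(wb, ha, wa, 1.0f, B, wb, A, wa, 0.0f, C, wb);
}
```

No data is transposed or copied: only the interpretation of the buffers changes, which is why this trick is popular for calling column-major BLAS routines from C code.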

Some New Features

• The handle to the CUBLAS library context is initialized using the cublasCreate function and is explicitly passed to every subsequent library function call. This gives the user more control over the library setup when using multiple host threads and multiple GPUs.

• The scalars α and β can be passed by reference on the host or the device, instead of only by value on the host. This allows library functions to execute asynchronously using streams even when α and β are generated by a previous kernel.

• When a library routine returns a scalar result, it can be returned by reference on the host or the device, instead of only by value on the host. This allows library routines to be called asynchronously when the scalar result is generated and returned by reference on the device, resulting in maximum parallelism.

Stream

• Stream
– Concurrent execution between host and device
• Overlap of data transfer and kernel execution
– With devices of compute capability 1.1 or higher
– Hides data transfer time
• Rules
– Functions in the same stream execute sequentially
– Functions in different streams may execute concurrently
• References
– http://cudazone.nvidia.cn/
– CUDA C Programming Guide.pdf

Parallelism with Streams

• Create and set the stream to be used by each CUBLAS routine
– Users must call cudaStreamCreate() to create the different streams.
– Users must call cublasSetStream() to set the stream used by each individual CUBLAS routine.
• Use the asynchronous transfer function
– cudaMemcpyAsync()

/* Baseline: synchronous copies on the default stream */
start = clock();
for (int i = 0; i < nstreams; i++) {
    cudaMemcpy(d_A, A, WA*HA*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, B, WB*WA*sizeof(float), cudaMemcpyHostToDevice);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                &alpha, d_B, WB, d_A, WA, &beta, d_C, WB);
    cudaMemcpy(C, d_C, WB*HA*sizeof(float), cudaMemcpyDeviceToHost);
}
end = clock();
printf("GPU Without Stream time: %.2f s.\n", (double)(end-start)/CLOCKS_PER_SEC);

/* With streams: iteration i issues its copies and its GEMM on streams[i].
   Copies overlap with kernels only if the host buffers are page-locked,
   for example allocated with cudaMallocHost(). Every iteration reuses
   d_A, d_B and d_C; independent work would need one set of buffers per stream. */
start = clock();
for (int i = 0; i < nstreams; i++) {
    cudaMemcpyAsync(d_A, A, WA*HA*sizeof(float), cudaMemcpyHostToDevice, streams[i]);
    cudaMemcpyAsync(d_B, B, WB*WA*sizeof(float), cudaMemcpyHostToDevice, streams[i]);
    cublasSetStream(handle, streams[i]);
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, WB, HA, WA,
                &alpha, d_B, WB, d_A, WA, &beta, d_C, WB);
    cudaMemcpyAsync(C, d_C, WB*HA*sizeof(float), cudaMemcpyDeviceToHost, streams[i]);
}
cudaDeviceSynchronize(); /* wait for all streams before stopping the clock */
end = clock();
printf("GPU With Stream time: %.2f s.\n", (double)(end-start)/CLOCKS_PER_SEC);

Review

• What is the core functionality of BLAS and CUBLAS?

• What are the advantages of CUBLAS?

• Why is the handle important in CUBLAS?

• How do we perform matrix multiplication using CUBLAS?

• How is a matrix stored in CUBLAS?

• How do we use CUBLAS with streams?

• What can we do with CUBLAS in our research?