Griffon Topic2 Presentation (Tia)
GRIFFON GPU PROGRAMMING API FOR SCIENTIFIC AND GENERAL PURPOSE
PISIT MAKPAISIT 4909611727
SUPERVISOR: DR. WORAWAN DIAZ CARBALLO
DEPARTMENT OF COMPUTER SCIENCE, FACULTY OF SCIENCE AND TECHNOLOGY, THAMMASAT UNIVERSITY
04/08/2023
Griffon - GPU Programming API for Scientific and General Purpose
Motivation
• GPU-CPU performance gap
• GPGPU
• GPU programming model complexity
GPU-CPU performance gap
• We all have a graphics card in our PC.
• The processing unit on a graphics card is called a "GPU", so every PC has a GPU.
• GPU performance is now pulling away from that of traditional processors.
http://developer.download.nvidia.com/compute/cuda/2_2/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.2.pdf
GPGPU
General-Purpose computation on Graphics Processing Units
Very high computation and data throughput
Scalability
GPGPU Applications
Simulation, Finance, Fluid Dynamics, Medical Imaging, Visualization, Signal Processing, Image Processing, Optical Flow, Differential Equation, Linear Algebra, Finite Element, Fast Fourier Transform, etc.
Vector Addition
Vector A:   1 5 6 8 9 1 2 3 6 5
Vector B: + 5 4 1 1 5 6 5 8 9 2
Vector C: = 6 9 7 9 14 7 7 11 15 7
Vector Addition (Sequential Code)
    #include <stdio.h>
    #include <stdlib.h>
    #define SIZE 500

    /* Declare function */
    void VecAdd(float *A, float *B, float *C){
        int i;
        for(i=0;i<SIZE;i++)
            C[i] = A[i] + B[i];
    }

    int main(){
        /* Declare variables */
        int size = SIZE * sizeof(float);
        float *A, *B, *C;

        /* Memory allocate (filling A and B with input data is omitted) */
        A = (float*)malloc(size);
        B = (float*)malloc(size);
        C = (float*)malloc(size);

        /* Function call */
        VecAdd(A,B,C);

        /* Memory de-allocate */
        free(A);
        free(B);
        free(C);
        return 0;
    }
Vector Addition (Sequential Code)
Vector A:   1 5 6 8 9 1 2 3 6 5
Vector B: + 5 4 1 1 5 6 5 8 9 2
Vector C: = 6 9 7 9 14 7 7 11 15 7
[Animation: element-wise additions producing Vector C]
Improve Performance
• We can speed up vector addition with parallel computing.
• Data parallelism: add all elements simultaneously.
• 1st choice: multicore on CPU (OpenMP)
• 2nd choice: multicore on GPU (CUDA)
Vector Addition (OpenMP)
1. Sequential code
2. Add compiler directive
3. Finish
    #include <stdio.h>
    #include <stdlib.h>
    #define SIZE 500

    void VecAdd(float *A, float *B, float *C){
        int i;
        #pragma omp parallel for
        for(i=0;i<SIZE;i++)
            C[i] = A[i] + B[i];
    }

    int main(){
        int size = SIZE * sizeof(float);
        float *A, *B, *C;
        A = (float*)malloc(size);
        B = (float*)malloc(size);
        C = (float*)malloc(size);
        VecAdd(A,B,C);
        free(A);
        free(B);
        free(C);
        return 0;
    }
Vector Addition (OpenMP)
Vector A:   1 5 6 8 9 1 2 3 6 5
Vector B: + 5 4 1 1 5 6 5 8 9 2
Vector C: = 6 9 7 9 14 7 7 11 15 7
[Animation: element-wise additions producing Vector C]
Speed Up (Amdahl’s Law)
Execution time (sequential): vector addition takes ~80% of the total.
Execution time (parallel on CPU): new vector addition time = execution time / cores = 80% / 2 = 40%.
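The arithmetic on this slide can be written as a small helper. This is only a sketch of the textbook form of Amdahl's law; the function name `amdahl_speedup` and its parameters are illustrative and not part of Griffon.

```c
#include <assert.h>

/* Amdahl's law: overall speed-up when a fraction p of the run time
   (0 <= p <= 1) is spread evenly over n processing units and the
   remaining (1 - p) stays sequential. */
double amdahl_speedup(double p, int n)
{
    assert(p >= 0.0 && p <= 1.0 && n >= 1);
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

With p = 0.8, two cores give a speed-up of about 1.67 and sixteen cores give about 4, because the sequential 20% is never reduced.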
OpenMP
Easy, automatic thread management
Few threads on CPU
Vector Addition (GPU - CUDA)
Vector A on CPU: 1 5 6 8 9 1 2 3 6 5
Vector B on CPU: 5 4 1 1 5 6 5 8 9 2
Vector C on CPU: 6 9 7 9 14 7 7 11 15 7
[Animation: A and B are copied from CPU memory to GPU memory, added element-wise on the GPU, and the result C is copied back to CPU memory]
Parallel Vector Addition on GPU (CUDA)
    #include <stdio.h>
    #include <stdlib.h>
    #define SIZE 500

    /* Declare kernel function */
    __global__ void VecAdd(float* A, float* B, float* C){
        int idx = threadIdx.x;
        if(idx < SIZE)
            C[idx] = A[idx] + B[idx];
    }

    int main(){
        /* Declare variables */
        int size = SIZE * sizeof(float);
        float *h_A, *h_B, *h_C, *d_A, *d_B, *d_C;

        /* CPU memory allocate */
        h_A = (float*)malloc(size);
        h_B = (float*)malloc(size);
        h_C = (float*)malloc(size);

        /* GPU memory allocate */
        cudaMalloc((void**)&d_A, size);
        cudaMalloc((void**)&d_B, size);
        cudaMalloc((void**)&d_C, size);

        /* Data transfer from CPU to GPU */
        cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
        cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);

        /* Kernel call */
        VecAdd<<<1, SIZE>>>(d_A, d_B, d_C);

        /* Data transfer from GPU to CPU */
        cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);

        /* CPU memory de-allocate */
        free(h_A);
        free(h_B);
        free(h_C);

        /* GPU memory de-allocate */
        cudaFree(d_A);
        cudaFree(d_B);
        cudaFree(d_C);
        return 0;
    }
Speed Up (Amdahl’s Law)
Execution time (sequential): vector addition takes ~80% of the total.
Execution time (parallel on GPU): new vector addition time = execution time / cores = 80% / 16 = 5%.
CUDA
• Faster, but costs more programming effort and time
• Many threads on GPU
CUDA Memory Model
Global Memory – off-chip, large, shared by all threads, slow; the host can read and write it
Local Memory – private to one thread; it resides off-chip like Global Memory, so only variables kept in registers are faster
Shared Memory – on-chip, shared by all threads in a block, faster than Global Memory
Griffon
Simple programming model (OpenMP) + Computing performance (GPU - CUDA) = Easy and efficient (Griffon)
Parallel Vector Addition on GPU (Griffon)
1. Sequential code
2. Add compiler directive
3. Finish
So easy!!
    #include <stdio.h>
    #include <stdlib.h>
    #define SIZE 500

    void VecAdd(float *A, float *B, float *C){
        int i;
        #pragma gfn parallel for
        for(i=0;i<SIZE;i++)
            C[i] = A[i] + B[i];
    }

    int main(){
        int size = SIZE * sizeof(float);
        float *A, *B, *C;
        A = (float*)malloc(size);
        B = (float*)malloc(size);
        C = (float*)malloc(size);
        VecAdd(A,B,C);
        free(A);
        free(B);
        free(C);
        return 0;
    }
Griffon
• Compiler directives for the C language
• Source-to-source compiler
• Automatic data management
• Optimization
Objectives
Objectives (1/2)
To develop a set of GPU programming APIs, called Griffon, to support the development of CUDA-based programs. Griffon comprises a) compiler directives and b) a source-to-source compiler.
• Simple – The number of compiler directives does not exceed 20. The grammar of Griffon directives is similar to that of OpenMP, a standard shared-memory API.
• Thread safety – The code generated by Griffon behaves correctly, i.e. equivalently to the sequential code.
Objectives (2/2)
To demonstrate that Griffon-generated code can gain reasonable performance over sequential code on two example applications: Pi calculation using numerical integration, and the Monte Carlo method.
• Automatic – GPU memory management in the generated code is done automatically by Griffon.
• Efficient – Code generated by Griffon should achieve the speed-up predicted by Amdahl’s law, or come within 20% of it.
Project Constraint
• Griffon is a C-language API that supports both Windows and Linux environments.
• The generated executable program can only run on NVIDIA graphics cards.
• Users can use Griffon together with OpenMP.
Related Works
Brook+ & CUDA
• General-purpose computation on GPU
• Manual kernels, data transfer, and management of the various GPU memories
• Vendor dependent
OpenCL (Open Computing Language)
• Cross-platform and vendor neutral
• Approachable language for accessing heterogeneous computational resources (CPU, GPU, other processors)
• Data and task parallelism
OpenMP to GPGPU
• Translates OpenMP applications into CUDA-based GPGPU applications
• GPU optimization techniques: Parallel Loop-Swap and Loop Collapsing, to enhance inter-thread locality
hiCUDA
• Directive-based GPU programming language
• Computation model: identifies the code regions to be executed on the GPU
• Data model: allocates and de-allocates memory on the GPU and transfers data
Methodology
• Software Architecture
• Directives
• Griffon Compilation Process
• Optimization Techniques
Software Architecture
• NVCC is part of the Griffon toolchain.
• The Griffon source-to-source compiler comprises a compile-time Memory Allocator and an Optimizer.
[Diagram: Griffon C Application -> Griffon Compiler (Compile-time Memory Allocator, Optimizer) -> CUDA C Application -> NVCC (NVIDIA CUDA Compiler). NVCC produces PTX code, which the PTX compiler turns into GPU object code, and C code, which GCC (Linux) or CL (MS Windows) turns into CPU object code. Both object codes form the Executable.]
Directives
Griffon Directives
• Parallel Region: defines a parallel region
• Control Flow: specifies the kernel workflow
• GPU/CPU Overlap Compute: defines a region where the CPU computes in overlap with the GPU
• Synchronous: defines a synchronization point
Directives
General form:
    #pragma gfn directive-name [clause[ [,] clause]...] new-line
Parallel region:
    #pragma gfn parallel for [clause[ [,] clause]...] new-line
    for-loops
Clauses:
    kernelname(name)
    waitfor(kernelname-list)
    private(var-list)
    accurate([low,high])
    reduction(operator:var-list)
Parallel Region
Sequential:
    for(i=0;i<N;i++){
        C[i] = A[i] + B[i];
    }
With Griffon:
    #pragma gfn parallel for
    for(i=0;i<N;i++){
        C[i] = A[i] + B[i];
    }
Kernel Flow Control
    #pragma gfn parallel for kernelname( A )
    #pragma gfn parallel for kernelname( B ) waitfor( A )
    #pragma gfn parallel for kernelname( C ) waitfor( A )
    #pragma gfn parallel for kernelname( D ) waitfor( B,C )
[Diagram: A precedes B and C; B and C both precede D]
Kernels B and C can compute in parallel.
Synchronization
Synchronous point:
    #pragma gfn barrier new-line
Atomic:
    #pragma gfn atomic new-line
    assignment-statement
Parallel reduction:
    #pragma gfn parallel for reduction(operator:var-list)
[Diagram: threads P0 to P3 each wait at the barrier until all have arrived]
Synchronization
Sequential:
    for(i=1;i<N-1;i++){
        B[i] = A[i-1] + A[i] + A[i+1];
    }
    for(i=1;i<N-1;i++){
        A[i] = B[i];
        if(A[i] > 7){
            C[i] += x / 5;
        }
    }
Option 1:
    #pragma gfn parallel for
    for(i=1;i<N-1;i++){
        B[i] = A[i-1] + A[i] + A[i+1];
        #pragma gfn barrier
        A[i] = B[i];
        if(A[i] > 7){
            #pragma gfn atomic
            C[i] += x / 5;
        }
    }
Option 2:
    #pragma gfn parallel for
    for(i=1;i<N-1;i++){
        B[i] = A[i-1] + A[i] + A[i+1];
    }
    #pragma gfn parallel for
    for(i=1;i<N-1;i++){
        A[i] = B[i];
        if(A[i] > 7){
            #pragma gfn atomic
            C[i] += x / 5;
        }
    }
Synchronization
Sequential:
    for (i = 1; i <= n-1; i++) {
        x = a + (i * h);
        integral = integral + f(x);
    }
With Griffon:
    #pragma gfn parallel for \
        private(x) reduction(+:integral)
    for (i = 1; i <= n-1; i++) {
        x = a + (i * h);
        integral = integral + f(x);
    }
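For context, the loop above is the interior sum of the trapezoidal rule. The sketch below is a plain sequential C version that shows where `a`, `h`, and `integral` come from. The names `trapezoid` and `pi_integrand` are illustrative assumptions, and the Griffon pragma appears only as a comment because this block is ordinary C.

```c
#include <assert.h>

/* Example integrand: the integral of 4 / (1 + x^2) over [0, 1] equals pi. */
double pi_integrand(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Trapezoidal rule on [a, b] with n equal intervals. */
double trapezoid(double (*f)(double), double a, double b, int n)
{
    double h, x, integral;
    int i;

    assert(n >= 1);
    h = (b - a) / n;
    integral = (f(a) + f(b)) / 2.0;      /* the two end points count half */

    /* In Griffon this loop would carry:
       #pragma gfn parallel for private(x) reduction(+:integral) */
    for (i = 1; i <= n - 1; i++) {
        x = a + (i * h);
        integral = integral + f(x);
    }
    return integral * h;
}
```

Calling `trapezoid(pi_integrand, 0.0, 1.0, 1000)` approximates pi. Each iteration only reads `a` and `h` and writes its own `x`, which is why `x` is private and `integral` is the single reduction variable.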
GPU/CPU Overlap compute
    #pragma gfn overlapcompute(kernelname) new-line
    structured-block
[Diagram: the many threads on the GPU and the CPU function run in parallel until a GPU/CPU synchronization point]
GPU/CPU Overlap compute
Sequential:
    for(i=0;i<N;i++){
        …
    }
    independenceCpuFunction();
With Griffon:
    #pragma gfn parallel for kernelname( calA )
    for(i=0;i<N;i++){
        …
    }
    #pragma gfn overlapcompute( calA )
    independenceCpuFunction();
Accurate Level
#pragma gfn parallel for accurate( [low, high] )
Use low when speed is important
Use high when precision is important
Default is high
Griffon Compilation Process
Create Kernel
Griffon source:
    int main(){
        int sum = 0;
        int x, y;
        #pragma gfn parallel for \
            private(x, y) reduction(+:sum)
        for(i=0;i<N;i++){
            x = sin(A[i]);
            y = cos(B[i]);
            C[i] = x + y;
        }
        return 0;
    }
Generated CUDA:
    __global__ void __kernel_0(…, int __N){
        int __tid = blockIdx.x * blockDim.x + threadIdx.x;
        int i = __tid [* 1 + 0];
        if(__tid < __N){
            x = sin(A[i]);
            y = cos(B[i]);
            C[i] = x + y;
        }
    }
    int main(){
        int sum = 0;
        int x, y;
        // Insert kernel call
        __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(..., (N - 1 - 0) / 1 + 1);
        return 0;
    }
For-Loop Format and Thread Mapping
The for-loop must be in the format
    for( index = min ; index <= max ; index += increment ){
        …
    }
or
    for( index = max ; index >= min ; index -= increment ){
        …
    } // This case will be transformed to the first case
The number of threads is calculated by a formula; in the generated kernel calls it appears as (max - min) / increment + 1.
Iteration index and thread mapping:
    __tid = blockIdx.x * blockDim.x + threadIdx.x;
    index = __tid * increment + min;
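The mapping can be checked on the host with ordinary integer arithmetic. The sketch below assumes the 512-thread block size seen in the generated kernel calls and a positive increment. The function names are illustrative and do not come from the Griffon compiler.

```c
#include <assert.h>

#define BLOCK_SIZE 512  /* threads per block, as in the generated kernel calls */

/* Number of iterations (= threads needed) of
   for (index = min; index <= max; index += increment). */
int thread_count(int min, int max, int increment)
{
    assert(increment > 0);
    if (max < min)
        return 0;
    return (max - min) / increment + 1;
}

/* Blocks needed to cover `threads` threads, rounding up. */
int block_count(int threads)
{
    return (threads + BLOCK_SIZE - 1) / BLOCK_SIZE;
}

/* Loop index handled by the thread with global id `tid`. */
int index_of_thread(int tid, int min, int increment)
{
    return tid * increment + min;
}
```

For the loop `for(i=0;i<N;i++)` the bounds are min = 0, max = N - 1, increment = 1, so `thread_count` returns N. Threads whose id is at or beyond that count are the ones the generated `if(__tid < __N)` guard switches off.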
Private and shared variable management
• Shared variables must be passed to the kernel function.
• Private variables must be declared in the kernel function.
• GPU device variables are declared for the shared variables.
• Size to allocate:
  Static: the size at declaration, e.g. int A[500];
  Dynamic: taken from the allocation function (malloc, calloc, realloc)
Private and shared variable management
Griffon source:
    int main(){
        int sum = 0;
        int x, y;
        int A[N], B[N], C[N];
        #pragma gfn parallel for \
            private(x, y) reduction(+:sum)
        for(i=0;i<N;i++){
            x = sin(A[i]);
            y = cos(B[i]);
            C[i] = x + y;
        }
        return 0;
    }
Generated CUDA:
    __global__ void __kernel_0(int * A, int * B, int * C, int __N){
        int __tid = blockIdx.x * blockDim.x + threadIdx.x;
        int i = __tid [* 1 + 0];
        int x, y;
        if(__tid < __N){
            x = sin(A[i]);
            y = cos(B[i]);
            C[i] = x + y;
        }
    }
    int main(){
        int sum = 0;
        int x, y;
        int A[N], B[N], C[N];
        int * __d_A, * __d_B, * __d_C;
        cudaMalloc((void**)&__d_C, sizeof(int) * N);
        cudaMalloc((void**)&__d_B, sizeof(int) * N);
        cudaMalloc((void**)&__d_A, sizeof(int) * N);
        __kernel_0<<<(((N - 1 - 0) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(__d_A, __d_B, __d_C, (N - 1 - 0) / 1 + 1);
        cudaFree(__d_C); cudaFree(__d_B); cudaFree(__d_A);
        return 0;
    }
Reduction variable management
Griffon source:
    int main(){
        …
        #pragma gfn parallel for \
            reduction(+:sum)
        for(i=0;i<MAX;i++){
            ...
            sum += A[i];
            ...
        }
        ...
    }
Generated CUDA:
    __global__ void __kernel_0(float *A, int __N, float * global___sum_add){
        int __tid = blockIdx.x * blockDim.x + threadIdx.x;
        int i = __tid;
        int __rtid = threadIdx.x;
        __shared__ int __sum_add[512];
        int sum = 0;
        __sum_add[__rtid] = 0;
        if(__tid < __N){
            …
            sum += A[i];
        }
        /* every thread of the block takes part in the reduction */
        __sum_add[__rtid] = sum;
        __syncthreads();
        if(__rtid < 256) __sum_add[__rtid] += __sum_add[__rtid + 256];
        __syncthreads();
        if(__rtid < 128) __sum_add[__rtid] += __sum_add[__rtid + 128];
        __syncthreads();
        if(__rtid < 64) __sum_add[__rtid] += __sum_add[__rtid + 64];
        __syncthreads();
        if(__rtid < 32) __sum_add[__rtid] += __sum_add[__rtid + 32];
        __syncthreads();
        if(__rtid < 16) __sum_add[__rtid] += __sum_add[__rtid + 16];
        if(__rtid < 8) __sum_add[__rtid] += __sum_add[__rtid + 8];
        if(__rtid < 4) __sum_add[__rtid] += __sum_add[__rtid + 4];
        if(__rtid < 2) __sum_add[__rtid] += __sum_add[__rtid + 2];
        if(__rtid < 1) __sum_add[__rtid] += __sum_add[__rtid + 1];
        if(__rtid == 0)
            atomicAdd(global___sum_add, __sum_add[0]);
    }
The generated code is complex because it uses an optimized parallel reduction.
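The halving pattern in the generated kernel is easier to follow as a loop. The sketch below replays, sequentially on the host, what one 512-thread block does in shared memory. It illustrates the technique and is not Griffon output; the name `block_reduce_sum` is an assumption made for the example.

```c
#include <assert.h>

#define BLOCK 512  /* block size used on the slide */

/* Host-side walk-through of the tree reduction one block performs.
   Each pass of the outer loop corresponds to one generated line of the
   form "if (rtid < stride) buf[rtid] += buf[rtid + stride]". */
int block_reduce_sum(const int *values, int n)
{
    int buf[BLOCK];
    int rtid, stride;

    assert(n >= 0 && n <= BLOCK);
    for (rtid = 0; rtid < BLOCK; rtid++)
        buf[rtid] = (rtid < n) ? values[rtid] : 0;  /* idle threads add 0 */

    for (stride = BLOCK / 2; stride >= 1; stride /= 2)
        for (rtid = 0; rtid < stride; rtid++)       /* "threads" below stride */
            buf[rtid] += buf[rtid + stride];

    return buf[0];                                  /* the block's partial sum */
}
```

After nine halving steps (256, 128, ..., 1) slot 0 holds the block's partial sum, which the kernel then adds to the global result with `atomicAdd`.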
Replace math functions & GPU functions
Griffon source:
    int f1(int a){
        return ++a;
    }
    int f0(int a){
        return f1(a) + 5;
    }
    #pragma gfn parallel for
    for(i=0;i<N;i++){
        A[i] = f0(A[i]) + sin(B[i]);
    }
Generated CUDA:
    __device__ int __device_f1(int a){
        return ++a;
    }
    __device__ int __device_f0(int a){
        return __device_f1(a) + 5;
    }
    __global__ void __kernel_1(int *A, int *B, int N){
        …
        A[i] = __device_f0(A[i]) + __sinf(B[i]);
    }
Barrier and Atomic
Before directive replacement:
    __global__ void __kernel_A(…){
        if(tid<__N){
            B[i] = A[i-1] + A[i] + A[i+1];
            #pragma gfn barrier
            A[i] = B[i];
            #pragma gfn atomic
            C[i] += x / 5;
        }
    }
After directive replacement:
    __global__ void __kernel_A(…){
        if(tid<__N){
            B[i] = A[i-1] + A[i] + A[i+1];
            __threadfence();
            A[i] = B[i];
            atomicAdd(&C[i], x / 5);
        }
    }
Kernel call and data transfer sort
Details are in the optimization section.
    __kernel_K<<<((((N - 1) - 1 - 1) / 1 + 1) - 1 + 512.00) / 512.00,512>>>(__d_A, __d_C, ((N - 1) - 1 - 1) / 1 + 1);
    __kernel_0<<<(((N - 1 - 0) / 5 + 1) - 1 + 512.00) / 512.00,512>>>(__d_D, __d_B, __d_A, (N - 1 - 0) / 5 + 1, global___sum_add);
    cudaMemcpy(&sum, global___sum_add, sizeof(int), cudaMemcpyDeviceToHost);
    cudaMemcpy(A, __d_A, sizeof(int) * N, cudaMemcpyDeviceToHost);
    cudaMemcpy(D, __d_D, sizeof(int) * N, cudaMemcpyDeviceToHost);
Automatic cache with shared memory
Details are in the optimization section.
Griffon source:
    #pragma gfn parallel for
    for(i=1;i<(MAX-1);i++){
        B[i] = A[i-1] + A[i] + A[i+1];
    }
Generated CUDA:
    __global__ void __kernel_0 (int * B, int * A, int __N){
        int __tid = blockIdx.x * blockDim.x + threadIdx.x;
        int i = __tid * 1 + 1;
        __shared__ int sa[514];
        if(__tid < __N){
            sa[threadIdx.x + 0] = A[i + 0 - 1];
            if(threadIdx.x + 512 < 514)
                sa[threadIdx.x + 512] = A[i + 512 - 1];
            __syncthreads();
            B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
        }
    }
Optimization Techniques
• Maximum threads on GPU
• Reduce data transfer with control-flow analysis
• Reduce data transfer with kernel control flow
• Overlap kernels with data transfer, using asynchronous data transfer
• Automatic cache with shared memory
Reduce data transfer with control-flow analysis
    #pragma gfn parallel for
    for(i=0;i<N;i++){
        C[i] = A[i] + B[i] + D[i];
        D[i] = C[i] * 0.5;
    }
• Used variables (A, B) are transferred from CPU to GPU.
• Defined variables (C) are transferred from GPU to CPU.
• D is both used and defined, so it is transferred in both directions.
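The used/defined rule amounts to two independent flags per array. In the minimal sketch below, "used" is assumed to mean "read before the kernel has written it", which is why C in the example counts as defined only. The names `TO_GPU`, `TO_CPU`, and `transfers_for` are illustrative.

```c
#include <assert.h>

enum { TO_GPU = 1, TO_CPU = 2 };  /* transfer-direction flags */

/* Decide the transfers for one array from how the loop body touches it:
   used (read before written)   -> copy host to device before the kernel;
   defined (written by kernel)  -> copy device to host after the kernel. */
int transfers_for(int is_used, int is_defined)
{
    int t = 0;
    if (is_used)
        t |= TO_GPU;
    if (is_defined)
        t |= TO_CPU;
    return t;
}
```

For the loop on this slide, A and B map to `TO_GPU`, C to `TO_CPU`, and D to both flags.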
Reduce data transfer with kernel control flow
• Memcpy host-to-device for variables that are used in the kernel
• Memcpy device-to-host for variables that are defined in the kernel
Griffon source:
    #pragma gfn parallel for
    for(i=0;i<N;i++){
        C[i] = A[i] + B[i];
    }
Generated transfers and kernel call:
    cudaMemcpy(dA, A, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B, size, cudaMemcpyHostToDevice);
    Kernel <<< … , … >>> ( … )
    cudaMemcpy(C, dC, size, cudaMemcpyDeviceToHost);
[Diagram: transfer nodes A and B feed kernel K1, which feeds transfer node C]
Reduce data transfer with kernel control flow
Use the graph defined by the kernelname and waitfor constructs.
    #pragma gfn parallel for \
        kernelname(k1)
    for(i=0;i<N;i++){
        C[i] = A[i] + B[i];
    }
    #pragma gfn parallel for \
        kernelname(k2) waitfor(k1)
    for(i=0;i<N;i++){
        E[i] = A[i] * C[i] - D[i];
        C[i] = E[i] / 3.0;
    }
[Diagram: A and B feed K1, which outputs C; A, C, and D feed K2, which outputs E and C]
Reduce data transfer with kernel control flow
If there is a path from k1 to k2:
1. If an invar of k1 is the same as an invar of k2, delete the invar of k2.
2. If an outvar of k1 is the same as an outvar of k2, delete the outvar of k1.
3. If an outvar of k1 is the same as an invar of k2, delete the invar of k2.
[Diagram: the K1/K2 graph with input transfers A, B (K1) and A, C, D (K2) and output transfers C (K1) and E, C (K2)]
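The three rules can be tried out with one bit per array. The sketch below assumes a single pair of kernels with k1 already known to precede k2. The `kernel_io` struct and the `prune_transfers` function are illustrative names, not Griffon internals.

```c
#include <assert.h>

/* Transfer sets of one kernel, one bit per array (bit 0 = A, bit 1 = B, ...). */
struct kernel_io {
    unsigned invars;   /* arrays copied host-to-device before the kernel */
    unsigned outvars;  /* arrays copied device-to-host after the kernel  */
};

/* Apply the three rules to kernels k1 and k2, where k1 runs before k2. */
void prune_transfers(struct kernel_io *k1, struct kernel_io *k2)
{
    unsigned k1_out = k1->outvars;   /* snapshot before rule 2 changes it */

    k2->invars  &= ~k1->invars;      /* rule 1: already on the GPU        */
    k1->outvars &= ~k2->outvars;     /* rule 2: k2 overwrites it anyway   */
    k2->invars  &= ~k1_out;          /* rule 3: produced on the GPU by k1 */
}
```

On the k1/k2 example from the earlier slide this leaves k1 with input transfers A and B and no output transfer, and k2 with input transfer D and output transfers E and C.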
Schedule Kernel and Memcpy for Maximum overlap
[Diagram: graph of kernels K1, K2, K3 and transfer nodes A, B, C, D, E]
The transfer nodes of this graph have already been reduced. How should it be scheduled?
Schedule for synchronous function
[Diagram: kernels K1, K2, K3 and transfers A, B, C, D, E placed one after another on a single timeline]
Total Time = T(K1) + T(B) + T(A) + T(K2) + T(D) + T(C) + T(K3) + T(E)
Newer versions of the CUDA API provide asynchronous data transfer functions.
Schedule Kernel and Memcpy for Maximum overlap
• Memcpy and kernel execution can be overlapped.
• The maximum is a 3-way overlap: MemcpyHostToDevice, kernel, MemcpyDeviceToHost.
• A 4-way overlap is possible if CPU computation is included through the overlapcompute directive.
[Diagram: the nodes K1, K2, K3, A, B, C, D, E assigned to Level 1 through Level 4, with stream numbers inside each level]
Schedule Kernel and Memcpy for Maximum overlap
[Diagram: the nodes K1, K2, K3, A, B, C, D, E assigned to Level 1 through Level 4, with stream numbers inside each level]
1. Set queue to empty.
2. Until all nodes are deleted:
   2.1. Set level = 1 and stream_num = 1.
   2.2. Find a kernel node with incoming degree 0, delete the node and its links, and create a transfer command with stream_num.
        2.2.1. If found in 2.2, stream_num += 1.
   2.3. Find a GPU-to-CPU node with incoming degree 0, delete the node and its links, and create a transfer command with stream_num.
        2.3.1. If found in 2.3, stream_num += 1.
   2.4. Find a CPU-to-GPU node with incoming degree 0, delete the node and its links, and create a transfer command with stream_num.
        2.4.1. If found in 2.4, stream_num += 1.
   2.5. If nothing is found in 2.2-2.4, find a kernel node with incoming degree 0 and create a transfer command for its CPU-to-GPU node.
   2.6. Insert a synchronous function.
   2.7. Collect the maximum stream_num.
   2.8. level += 1.
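The core idea of the steps above (repeatedly take nodes with incoming degree 0, give each its own stream, then synchronize and move to the next level) is a level-by-level topological sort. The sketch below is a simplified illustration that treats kernel and transfer nodes alike and omits the per-type priorities. `schedule_levels` and `MAX_NODES` are assumed names.

```c
#include <assert.h>

#define MAX_NODES 16

/* Level-by-level topological scheduling (Kahn's algorithm).
   edge[u][v] != 0 means node v must wait for node u.
   On return level[v] is the 1-based level of node v and stream[v] its
   1-based stream number inside that level. Returns the number of levels,
   or -1 if the graph contains a cycle. */
int schedule_levels(int n, int edge[MAX_NODES][MAX_NODES],
                    int level[MAX_NODES], int stream[MAX_NODES])
{
    int indeg[MAX_NODES] = {0};
    int ready[MAX_NODES];
    int done = 0, cur = 0, u, v, k, nready;

    assert(n >= 0 && n <= MAX_NODES);
    for (u = 0; u < n; u++) {
        level[u] = 0;                        /* 0 = not scheduled yet */
        stream[u] = 0;
        for (v = 0; v < n; v++)
            if (edge[u][v])
                indeg[v]++;
    }

    while (done < n) {
        /* collect every unscheduled node with no pending predecessor */
        nready = 0;
        for (v = 0; v < n; v++)
            if (level[v] == 0 && indeg[v] == 0)
                ready[nready++] = v;
        if (nready == 0)
            return -1;                       /* cycle: nothing can run */

        cur++;
        for (k = 0; k < nready; k++) {
            u = ready[k];
            level[u] = cur;
            stream[u] = k + 1;               /* one stream per node in a level */
            for (v = 0; v < n; v++)          /* delete the node's outgoing links */
                if (edge[u][v])
                    indeg[v]--;
            done++;
        }
        /* a synchronization call would be emitted here, between levels */
    }
    return cur;
}
```

Nodes placed in the same level have no dependency on each other, so their commands can be issued on different streams and overlap. The largest stream number seen across all levels is the number of streams to create.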
Automatic cache with shared memory
When a "linear access" pattern is detected in a kernel, automatic caching is applied.
    #pragma gfn parallel for
    for(i=1;i<(MAX-1);i++){
        B[i] = A[i-1] + A[i] + A[i+1];
    }
[Diagram: thread blocks 1 to n each copy their part of Global Memory into their own Shared memory]
Automatic cache with shared memory
Griffon source:
    #pragma gfn parallel for
    for(i=1;i<(MAX-1);i++){
        B[i] = A[i-1] + A[i] + A[i+1];
    }
Generated CUDA:
    __global__ void __kernel_0 (int * B, int * A, int __N){
        int __tid = blockIdx.x * blockDim.x + threadIdx.x;
        int i = __tid * 1 + 1;
        __shared__ int sa[514];
        if(__tid < __N){
            sa[threadIdx.x + 0] = A[i + 0 - 1];
            if(threadIdx.x + 512 < 514)
                sa[threadIdx.x + 512] = A[i + 512 - 1];
            __syncthreads();
            B[i] = sa[threadIdx.x + 1 - 1] + sa[threadIdx.x + 1] + sa[threadIdx.x + 1 + 1];
        }
    }
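The 514-element buffer is the block's 512 elements plus one neighbour on each side. The sketch below replays the same staging on the host, block by block, to show which elements each block needs. It illustrates the caching idea and is not compiler output; `stencil_tiled` and `TILE` are assumed names.

```c
#include <assert.h>

#define TILE 512  /* threads per block on the slide */

/* Host-side walk-through of the cached 3-point stencil
   B[i] = A[i-1] + A[i] + A[i+1] for i = 1 .. n-2.
   For each block, the (up to) TILE + 2 elements it needs are first staged
   in a small buffer standing in for shared memory; every "thread" then
   reads its three neighbours from that buffer. */
void stencil_tiled(const int *A, int *B, int n)
{
    int sa[TILE + 2];
    int count = n - 2;                   /* number of loop iterations */
    int base, t, len;

    for (base = 0; base < count; base += TILE) {
        len = (count - base < TILE) ? count - base : TILE;
        for (t = 0; t < len + 2; t++)    /* stage A[base] .. A[base+len+1] */
            sa[t] = A[base + t];
        for (t = 0; t < len; t++)        /* thread t handles i = base+t+1 */
            B[base + t + 1] = sa[t] + sa[t + 1] + sa[t + 2];
    }
}
```

Each element of A is read from the large array about once per block instead of three times, which is the saving the shared-memory cache aims for on the GPU.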
DEMO
Evaluation
• Compiler Directives
• Compiler Performance
Compiler Directives
[Chart: time in minutes (axis 0 to 30) for Program 1, Program 2, and Program 3, comparing Griffon and CUDA]
• 5 undergraduate students who have studied the concepts of CUDA
• Only 1.5 hours of demonstration
Compiler Directives
[Chart: lines of code (axis 0 to 120) per application, comparing Sequential, CUDA, and Griffon]
PNI: Calculation of Pi Using Numerical Integration
PMC: Calculation of Pi Using the Monte Carlo Method
TR: Trapezoidal Rule
VN: Vector Normalization
SOV: Calculate Sine of Vector’s Element
Compiler Performance
[Chart: speed-up (axis 0 to 25) per application, comparing Sequential, Parallel (Griffon), and Expected speed-up]
PNI: Calculation of Pi Using Numerical Integration
PMC: Calculation of Pi Using the Monte Carlo Method
TR: Trapezoidal Rule
VN: Vector Normalization
SOV: Calculate Sine of Vector’s Element
Conclusion
Griffon Instruction
• Total number of instructions (directives + clauses): 9
• The problem is the performance of parallel programs with a high degree of communication.
• Improve the directives for describing the algorithm in a program (divide and conquer, partial summation, etc.).
• New optimization techniques, such as caching with shared memory and choosing an appropriate thread number
Performance factor and speed up
Application                                      Parallelism  Data Transfer  Computation Density  Speed Up
Calculation of Pi Using Numerical Integration    High         Very Low       Low                  1.76
Calculation of Pi Using the Monte Carlo Method   High         Average        High                 7.36
Trapezoidal Rule                                 High         Very Low       High                 19.28
Vector Normalization                             High         High           Low                  1.21
Calculate Sine of Vector’s Element               Very High    High           High                 3.78
Computation density has the greatest effect on performance.
Building S2S Compiler
• Source-to-source compilers aren’t popular.
• A compiler that transforms Griffon code to GPU object code (PTX)
• Although the programs generated by a PTX compiler could be very efficient, they cannot gain any benefit from manual optimization.
Future Work
• Optimization techniques: data structures, loop transformation
• Directives: more OpenMP support, CPU/GPU parallel regions, OpenCL support
• Compiler: support for C++ and other languages, support for popular IDEs
Reference
Brook, http://graphics.stanford.edu/projects/brookgpu
Cameron Hughes, Tracey Hughes, Professional Multicore Programming, Wiley Publishing
CUDA Zone, http://www.nvidia.com/object/cuda_home.html
Dick Grune, Henri E. Bal, Ceriel J.H. Jacobs and Koen G. Langendoen, Modern Compiler Design, John Wiley & Sons Ltd
General-Purpose Computation on Graphic Hardware, http://gpgpu.org
Ilias Leontiadis, George Tzoumas, OpenMP C Parser
Joe Stam, Maximizing GPU Efficiency in Extreme Throughput Applications, GPU Technology Conference
Mark Harris, Optimizing Parallel Reduction in CUDA
OpenCL, http://www.khronos.org/opencl
Seyong Lee, Seung-Jai Min, and Rudolf Eigenmann. OpenMP to GPGPU: A Compiler Framework for Automatic. PPoPP ’09
The OpenMP API specification for parallel programming, http://openmp.org/wp
Thomas Niemann, A Guide to Lex & Yacc
Tianyi David Han, Tarek S. Abdelrahman. hiCUDA: A High-level Directive-based Language for GPU Programming. GPGPU '09
Wolfe, M. (1996). High Performance Compilers for Parallel Computing. Addison-Wesley