Brief Introduction to OpenCL
Hu Zi Ming
2010-10-30
1 / 24 Brief Introduction to OpenCL
Outline
1 Some Background about OpenCL
    CPU vs. GPU
    What is OpenCL
    Advantages & Disadvantages
2 Programming with OpenCL
3 Demo about OpenCL
CPU vs. GPU
CPU:
Make a single thread fast
Hide latency through large caches
GPU:
Improve throughput
Hide latency through parallelism
Before OpenCL...
Nvidia CUDA
ATI Stream
Microsoft DirectCompute
...
Apple said, "Let there be a standard"
And there was OpenCL
What is OpenCL
Open Computing Language
Kernel language based on C99; similar to C for CUDA but slightly lower-level
Originally developed by Apple
Now maintained by the Khronos Group
Used for parallel computing
Advantages
Supports heterogeneous platforms
Supports both task-based (CPU) and data-based (GPU) parallelism
Can greatly improve memory bandwidth and compute throughput
Exploits GPU power without being locked in to one manufacturer
Supports extensions, as OpenGL does
Supports an embedded profile (similar to OpenGL ES) for mobile devices
Disadvantages
Tuning is hardware-specific
Algorithms are bound to the shape of the data
Recursion is not currently available
Function pointers are not currently supported
Outline
1 Some Background about OpenCL
2 Programming with OpenCL
    Prerequisites
    Main Flow of Host Code
    Four Models
3 Demo about OpenCL
Prerequisites
A driver that supports OpenCL
An SDK: ATI Stream SDK, NVIDIA CUDA Toolkit, ...
Host code: controls the kernel code
OpenCL kernel code: written in OpenCL C and runs on devices
Main Flow of Host Code
Get information about the platform and devices
Select devices to be used in execution
Create an OpenCL context
Create a command queue
Create memory buffer objects
Create a program object
Load the kernel source code and compile it
Create a kernel object
Set the kernel arguments
Execute the kernel
Copy the results from device (GPU) memory back to host (CPU) memory
OpenCL Summary
Four Models
Platform model
Execution model
Memory model
Programming model
Platform Model
A host connected to one or more OpenCL devices
Each device is divided into one or more compute units (CUs)
Each compute unit is further divided into one or more processing elements (PEs)
The application sends commands from the host to the PEs
PEs within a CU execute instructions as SIMD or SPMD units
Platform Model (Cont.)
Execution Model
A work-item is the basic unit of work
A kernel is the code run by each work-item
Executed on OpenCL devices, basically a C function
The host program executes on the host
Creates the index space as an NDRange
Organizes work-items into work-groups
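As a rough illustration of how the NDRange index space ties work-items to work-groups, the plain C sketch below reproduces the one-dimensional ID arithmetic on the host, with no OpenCL runtime involved. It assumes a zero global offset and a global size that is a multiple of the work-group size. The helper names (`ndr_num_groups`, `ndr_group_id`, `ndr_local_id`, `ndr_global_id`) are hypothetical and not part of the OpenCL API; inside a kernel the corresponding values come from `get_num_groups`, `get_group_id`, `get_local_id` and `get_global_id`.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helpers (not OpenCL API): 1-D NDRange arithmetic,
   assuming a zero global offset and a global size that is a
   multiple of the work-group (local) size. */

/* Number of work-groups covering the index space. */
size_t ndr_num_groups(size_t global_size, size_t local_size) {
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}

/* Work-group that a given global ID belongs to. */
size_t ndr_group_id(size_t global_id, size_t local_size) {
    assert(local_size > 0);
    return global_id / local_size;
}

/* Position of a given global ID inside its work-group. */
size_t ndr_local_id(size_t global_id, size_t local_size) {
    assert(local_size > 0);
    return global_id % local_size;
}

/* Inverse mapping: rebuild the global ID from group and local IDs. */
size_t ndr_global_id(size_t group_id, size_t local_id, size_t local_size) {
    assert(local_id < local_size);
    return group_id * local_size + local_id;
}
```

For example, with a global size of 1024 and a work-group size of 64 there are 16 work-groups, and global ID 130 is local ID 2 in group 2.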
Execution Model (Cont.)
Memory Model
Global memory: readable and writable by all work-items in all work-groups
Constant memory: a region of global memory that remains constant during kernel execution
Local memory: shared within a work-group
Private memory: private to a single work-item
Data movement path: host -> global -> local, and back
Memory Model
Programming Model
Data parallel programming model
Task parallel programming model
Synchronization
Outline
1 Some Background about OpenCL
2 Programming with OpenCL
3 Demo about OpenCL
    Matrix Add
    Matrix Multiply
Kernel Code
normal add
__kernel void add(__global int *a, __global int *b, __global int *c) {
    int i = get_global_id(0);   /* one work-item per element */
    c[i] = a[i] + b[i];
}
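To check the add kernel's output on the host, a serial reference in plain C can be handy. The sketch below emulates a one-dimensional NDRange by looping over every index that `get_global_id(0)` would take. `add_reference` and its `n` parameter are hypothetical names introduced here; they appear neither in the slides nor in OpenCL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference: each loop iteration does the
   work of one work-item of the add kernel. */
void add_reference(const int *a, const int *b, int *c, size_t n) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; i++) {   /* i plays the role of get_global_id(0) */
        c[i] = a[i] + b[i];
    }
}
```

Comparing the buffer read back from the device against this loop's result is a simple way to catch indexing mistakes.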
Normal Kernel Code
normal multiply
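/* WA, WB, WC: widths of a, b and c, assumed to be defined at build time */
__kernel void mul(__global int *a, __global int *b, __global int *c) {
    int x = get_global_id(1);   /* column of c */
    int y = get_global_id(0);   /* row of c */
    int sum = 0;                /* accumulate in private memory */
    for (int i = 0; i < WA; i++) {
        sum += a[y * WA + i] * b[i * WB + x];
    }
    c[y * WC + x] = sum;
}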
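A serial version of the same multiplication makes the index arithmetic easier to follow and gives something to compare device results against. In this minimal sketch the two outer loops stand in for the two-dimensional NDRange, with `y` in the role of `get_global_id(0)` and `x` in the role of `get_global_id(1)`. The function name `mul_reference` and the run-time size parameters `ha`, `wa` and `wb` are assumptions made for illustration; the kernel in the slides uses compile-time widths instead.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side reference for c = a * b, row-major storage.
   a is ha x wa, b is wa x wb, c is ha x wb. */
void mul_reference(const int *a, const int *b, int *c,
                   size_t ha, size_t wa, size_t wb) {
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t y = 0; y < ha; y++) {          /* row of c */
        for (size_t x = 0; x < wb; x++) {      /* column of c */
            int sum = 0;
            for (size_t i = 0; i < wa; i++) {  /* inner dimension */
                sum += a[y * wa + i] * b[i * wb + x];
            }
            c[y * wb + x] = sum;
        }
    }
}
```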
Kernel Code with Block Support
multiply with block support
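/* WA, WB, WC and BLOCK_SIZE are assumed to be defined at build time.
   WA must be a multiple of BLOCK_SIZE, and the work-group size must be
   BLOCK_SIZE x BLOCK_SIZE. */
__kernel void mul(__global float *a, __global float *b, __global float *c,
                  __local float *as, __local float *bs) {
    int x = get_global_id(1);   /* column of c */
    int y = get_global_id(0);   /* row of c */
    int tx = get_local_id(1);   /* column within the block */
    int ty = get_local_id(0);   /* row within the block */
    float tmp_val = 0.0f;
    for (int i = 0; i < WA / BLOCK_SIZE; i++) {
        /* Each work-item loads one element of the i-th block of a and of b. */
        as[ty * BLOCK_SIZE + tx] = a[y * WA + (i * BLOCK_SIZE + tx)];
        bs[ty * BLOCK_SIZE + tx] = b[(i * BLOCK_SIZE + ty) * WB + x];
        barrier(CLK_LOCAL_MEM_FENCE);   /* both blocks fully loaded */
        for (int j = 0; j < BLOCK_SIZE; j++) {
            tmp_val += as[ty * BLOCK_SIZE + j] * bs[j * BLOCK_SIZE + tx];
        }
        barrier(CLK_LOCAL_MEM_FENCE);   /* finish reading before the next load */
    }
    c[y * WC + x] = tmp_val;
}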
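The block-based scheme can be tried out on the host before moving to a device. The sketch below is a plain C emulation under stated assumptions: square `n` x `n` row-major matrices, `n` a multiple of the tile edge, and a hypothetical `TILE` constant playing the role of `BLOCK_SIZE`. The arrays `as_tile` and `bs_tile` stand in for the `__local` buffers of one work-group, and `acc` holds one private sum per emulated work-item. The name `mul_tiled_reference` is introduced here for illustration only.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 2   /* hypothetical tile edge, plays the role of BLOCK_SIZE */

/* Hypothetical host-side emulation of tiled c = a * b for n x n
   row-major matrices, with n a multiple of TILE. */
void mul_tiled_reference(const float *a, const float *b, float *c, size_t n) {
    assert(n % TILE == 0);
    for (size_t by = 0; by < n; by += TILE) {        /* work-group row */
        for (size_t bx = 0; bx < n; bx += TILE) {    /* work-group column */
            float acc[TILE][TILE] = {{0.0f}};        /* private sums */
            for (size_t t = 0; t < n; t += TILE) {   /* tiles along the inner dimension */
                float as_tile[TILE][TILE], bs_tile[TILE][TILE];
                /* Load phase: each (ty, tx) copies one element per tile. */
                for (size_t ty = 0; ty < TILE; ty++) {
                    for (size_t tx = 0; tx < TILE; tx++) {
                        as_tile[ty][tx] = a[(by + ty) * n + (t + tx)];
                        bs_tile[ty][tx] = b[(t + ty) * n + (bx + tx)];
                    }
                }
                /* In a kernel, a local-memory barrier separates the phases. */
                for (size_t ty = 0; ty < TILE; ty++)
                    for (size_t tx = 0; tx < TILE; tx++)
                        for (size_t j = 0; j < TILE; j++)
                            acc[ty][tx] += as_tile[ty][j] * bs_tile[j][tx];
            }
            /* Write each emulated work-item's result back to c. */
            for (size_t ty = 0; ty < TILE; ty++)
                for (size_t tx = 0; tx < TILE; tx++)
                    c[(by + ty) * n + (bx + tx)] = acc[ty][tx];
        }
    }
}
```

Each element of `a` and `b` is read from the large arrays once per tile rather than once per multiply, which is the saving that local memory is meant to provide on a device.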
Q AND A