Basic CUDA Programming


Page 1: Basic CUDA Programming

Basic CUDA Programming

Shin-Kai Chen
[email protected]

VLSI Signal Processing Laboratory
Department of Electronics Engineering
National Chiao Tung University

Page 2: Basic CUDA Programming

What will you learn in this lab?

• Concept of multicore accelerator
• Multithreaded/multicore programming
• Memory optimization

Page 3: Basic CUDA Programming

Slides

• Mostly from Prof. Wen-Mei Hwu of UIUC
– http://courses.ece.uiuc.edu/ece498/al/Syllabus.html

Page 4: Basic CUDA Programming

CUDA – Hardware? Software?

[Figure: CUDA is both software and hardware. On the software side, Figure 3.2, "An Example of CUDA Thread Organization" (courtesy: NVIDIA): the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid holds blocks such as Block (0,0) through Block (1,1), and each block holds threads indexed Thread (x,y,z), e.g. Thread (0,0,0) through Thread (3,1,0). On the hardware side, the G80 block diagram: host, input assembler, thread execution manager, an array of parallel data caches with texture units, load/store units, and global memory. A thread program executes as threads with IDs 0, 1, 2, 3, …, m. CUDA sits between the application and the platform.]

Page 5: Basic CUDA Programming

Host-Device Architecture

• CPU (host)
• GPU w/ local DRAM (device)

Page 6: Basic CUDA Programming

G80 CUDA mode – A Device Example

[Figure: the G80 block diagram again: host, input assembler, thread execution manager, an array of parallel data caches with texture units, load/store units, and global memory.]

Page 7: Basic CUDA Programming

Functional Units in G80

• Streaming Multiprocessor (SM)
– 1 instruction decoder ( 1 instruction / 4 cycles )
– 8 streaming processors (SP)
– Shared memory

[Figure: two SMs (SM 0 and SM 1), each with a multithreaded instruction unit (MT IU), SPs, and shared memory; blocks of threads t0 t1 t2 … tm are assigned to each SM.]

Page 8: Basic CUDA Programming

Setup CUDA for Windows

Page 9: Basic CUDA Programming

CUDA Environment Setup

• Get a GPU that supports CUDA
– http://www.nvidia.com/object/cuda_learn_products.html
• Download CUDA
– http://www.nvidia.com/object/cuda_get.html
• CUDA driver
• CUDA toolkit
• CUDA SDK (optional)
• Install CUDA
• Test CUDA
– Device Query

Page 10: Basic CUDA Programming

Setup CUDA for Visual Studio

• From scratch
– http://forums.nvidia.com/index.php?showtopic=30273
• CUDA VS Wizard
– http://sourceforge.net/projects/cudavswizard/

• Modified from existing project

Page 11: Basic CUDA Programming

Lab1: First CUDA Program

Page 12: Basic CUDA Programming

CUDA Computing Model

[Figure: execution alternates between serial code on the host and parallel code on the device. Each parallel section is preceded by a memory transfer from host to device and a kernel launch, and followed by a memory transfer back to the host.]

Page 13: Basic CUDA Programming

Data Manipulation between Host and Device

• cudaError_t cudaMalloc( void** devPtr, size_t count )
– Allocates count bytes of linear memory on the device and returns a pointer to the allocated memory in *devPtr

• cudaError_t cudaMemcpy( void* dst, const void* src, size_t count, enum cudaMemcpyKind kind )
– Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst
– kind indicates the direction of the transfer:
• cudaMemcpyHostToHost
• cudaMemcpyHostToDevice
• cudaMemcpyDeviceToHost
• cudaMemcpyDeviceToDevice

• cudaError_t cudaFree( void* devPtr )
– Frees the memory space pointed to by devPtr
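Each of these calls returns a cudaError_t, which is worth checking. A minimal sketch of such a check (the CUDA_CHECK wrapper is our own, not from the slides):

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper: abort with a readable message if a CUDA call fails
#define CUDA_CHECK(call)                                      \
  do {                                                        \
    cudaError_t err = (call);                                 \
    if (err != cudaSuccess) {                                 \
      fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
              __FILE__, __LINE__, cudaGetErrorString(err));   \
      exit(EXIT_FAILURE);                                     \
    }                                                         \
  } while (0)

// Usage: CUDA_CHECK( cudaMalloc((void**)&dA, sizeof(int)*SIZE) );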

Page 14: Basic CUDA Programming

Example

• Functionality:
– Given an integer array A holding 8192 elements
– For each element in array A, calculate A[i] * 256 and leave the result in B[i]

float GPU_kernel(int *B, int *A) {

  // Create pointers for memory space on device
  int *dA, *dB;

  // Allocate memory space on device
  cudaMalloc( (void**) &dA, sizeof(int)*SIZE );
  cudaMalloc( (void**) &dB, sizeof(int)*SIZE );

  // Copy data to be calculated
  cudaMemcpy( dA, A, sizeof(int)*SIZE, cudaMemcpyHostToDevice );

  // Launch kernel
  cuda_kernel<<<1,1>>>(dB, dA);

  // Copy output back
  cudaMemcpy( B, dB, sizeof(int)*SIZE, cudaMemcpyDeviceToHost );

  // Free memory spaces on device
  cudaFree( dA );
  cudaFree( dB );
}
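The slides do not show cuda_kernel itself until Lab2; a minimal sketch consistent with the <<<1,1>>> launch above (SIZE assumed to come from parameter.h) would have the single thread walk the whole array:

__global__ void cuda_kernel(int *B, int *A) {
  // One thread processes all SIZE elements sequentially
  for (int i = 0; i < SIZE; i++)
    B[i] = A[i] * 256;
}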


Page 15: Basic CUDA Programming

Now, go and finish your first CUDA program !!!

Page 16: Basic CUDA Programming

• Download http://twins.ee.nctu.edu.tw/~skchen/lab1.zip
• Open project with Visual C++ 2008 ( lab1/cuda_lab/cuda_lab.vcproj )
– main.cu
• Random input generation, output validation, result reporting
– device.cu
• Launch GPU kernel, GPU kernel code
– parameter.h
• Fill in appropriate APIs
– GPU_kernel() in device.cu

Page 17: Basic CUDA Programming

Lab2: Make the Parallel Code Faster

Page 18: Basic CUDA Programming

Parallel Processing in CUDA

• Parallel code can be partitioned into blocks and threads
– cuda_kernel<<<nBlk, nTid>>>(…)
• Multiple tasks will be initialized, each with a different block id and thread id
• The tasks are dynamically scheduled
– Tasks within the same block will be scheduled on the same streaming multiprocessor
• Each task takes care of a single data partition according to its block id and thread id

Page 19: Basic CUDA Programming

Locate Data Partition by Built-in Variables

• Built-in variables
– gridDim
• x, y
– blockIdx
• x, y
– blockDim
• x, y, z
– threadIdx
• x, y, z

[Figure 3.2 repeated (courtesy: NVIDIA): the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2; Block (1,1) contains Thread (0,0,0) through Thread (3,1,0).]
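As a minimal sketch of how these variables combine (assuming a 1-D launch; the kernel name is ours), each thread can derive a unique global index:

__global__ void write_ids(int *out) {
  // Unique global index for a 1-D grid of 1-D blocks
  int gid = blockIdx.x * blockDim.x + threadIdx.x;
  out[gid] = gid;
}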

Page 20: Basic CUDA Programming

Data Partition for Previous Example

When processing 64 integer data: cuda_kernel<<<2, 2>>>(…)

int total_task = gridDim.x * blockDim.x;
int task_sn = blockIdx.x * blockDim.x + threadIdx.x;
int length = SIZE / total_task;
int head = task_sn * length;

[Figure: the array is divided into contiguous partitions, each identified by a head offset and a length.]

TASK 0: blockIdx.x = 0, threadIdx.x = 0
TASK 1: blockIdx.x = 0, threadIdx.x = 1
TASK 2: blockIdx.x = 1, threadIdx.x = 0
TASK 3: blockIdx.x = 1, threadIdx.x = 1
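Plugging the numbers from this launch into the formulas above:

/* SIZE = 64, cuda_kernel<<<2, 2>>>:
   total_task = gridDim.x * blockDim.x = 2 * 2 = 4
   length     = SIZE / total_task      = 64 / 4 = 16
   head       = task_sn * length       = 0, 16, 32, 48 for tasks 0..3 */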

Page 21: Basic CUDA Programming

Processing Single Data Partition

__global__ void cuda_kernel ( int *B, int *A ) {

  int total_task = gridDim.x * blockDim.x;
  int task_sn = blockDim.x * blockIdx.x + threadIdx.x;
  int length = SIZE / total_task;
  int head = task_sn * length;

  for ( int i = head ; i < head + length ; i++ ) {
    B[i] = A[i] * 256;
  }

  return;
}

Page 22: Basic CUDA Programming

Parallelize Your Program !!!

Page 23: Basic CUDA Programming

• Partition kernel into threads
– Increase nTid from 1 to 512 (see the sweep sketch below)
– Keep nBlk = 1
• Group threads into blocks
– Adjust nBlk and see if it helps
• Maintain total number of threads below 512, e.g. nBlk * nTid < 512
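A sketch of that sweep, assuming dA and dB have already been set up as in Lab1 (the loop itself is ours, not from the slides):

// Sweep the thread count with a single block, as suggested above
for (int nTid = 1; nTid <= 512; nTid *= 2) {
  cuda_kernel<<<1, nTid>>>(dB, dA);
  cudaThreadSynchronize();  // wait for the kernel (CUDA 2.x-era API)
  // ... measure and record elapsed time here ...
}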

Page 24: Basic CUDA Programming

Lab3: Resolve Memory Contention

Page 25: Basic CUDA Programming

Parallel Memory Architecture

• Memory is divided into banks to achieve high bandwidth
• Each bank can service one address per cycle
• Successive 32-bit words are assigned to successive banks (see the worked example below)

[Figure: shared memory banks BANK0 through BANK15.]
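With 16 banks and one 32-bit word per bank (as on G80), the mapping is a simple modulo:

/* Element i of an int array falls in bank i % 16, so
   A[0], A[16], A[32], A[48] all map to bank 0. */
int bank_of(int i) { return i % 16; }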

Page 26: Basic CUDA Programming

Lab2 Review

When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)

[Figure: with the contiguous partitioning of Lab2, iteration 1 has THREAD0 through THREAD3 accessing A[0], A[16], A[32], A[48], which all fall in bank 0. CONFLICT!!!! Iteration 2 accesses A[1], A[17], A[33], A[49], which all fall in bank 1. CONFLICT!!!!]

Page 27: Basic CUDA Programming

How about Interleave Accessing?

When processing 64 integer data: cuda_kernel<<<1, 4>>>(…)

[Figure: with interleaved accesses, iteration 1 has THREAD0 through THREAD3 accessing A[0], A[1], A[2], A[3], which fall in banks 0 through 3. NO CONFLICT. Iteration 2 accesses A[4], A[5], A[6], A[7], which fall in banks 4 through 7. NO CONFLICT.]

Page 28: Basic CUDA Programming

Implementation of Interleave Accessing

• head = task_sn
• stripe = total_task

cuda_kernel<<<1, 4>>>(…)

[Figure: each thread starts at its own head offset and strides through the array in steps of stripe.]
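A sketch of the interleaved kernel under these definitions (the loop shape is ours; compare with the Lab2 kernel on Page 21):

__global__ void cuda_kernel ( int *B, int *A ) {
  int total_task = gridDim.x * blockDim.x;
  int task_sn = blockDim.x * blockIdx.x + threadIdx.x;
  int head = task_sn;       // start at this task's own offset
  int stripe = total_task;  // stride past all other tasks

  for ( int i = head ; i < SIZE ; i += stripe ) {
    B[i] = A[i] * 256;
  }
}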

Page 29: Basic CUDA Programming

Improve Your Program !!!

Page 30: Basic CUDA Programming

• Modify the original kernel code in an interleaving manner
– cuda_kernel() in device.cu
• Adjust nBlk and nTid as in Lab2 and examine the effect
– Maintain total number of threads below 512, e.g. nBlk * nTid < 512

Page 31: Basic CUDA Programming

Thank You

• http://twins.ee.nctu.edu.tw/~skchen/lab3.zip
• Final project issues
– Subject: porting & optimizing any algorithm on any multi-core
– Demo: 1 week after final exam @ ED412
– Group: 1 ~ 2 persons per group

* Group member & demo time should be registered after final exam @ ED412