
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

1

ECE 8823A

GPU Architectures

Module 2: Introduction to CUDA C

Objective

• To understand the major elements of a CUDA program

• To introduce the basic constructs of the programming model

• To illustrate the preceding with a simple but complete CUDA program

2


Reading Assignment

• Kirk and Hwu, "Programming Massively Parallel Processors: A Hands-on Approach," Chapter 3

• CUDA Programming Guide – http://docs.nvidia.com/cuda/cuda-c-programming-guide/#abstract

3

CUDA/OpenCL Execution Model

• Reflects a multicore processor + GPGPU execution model

• Addition of device functions and declarations to stock C programs

• Compilation separated into host and GPU paths
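For example, a typical nvcc invocation drives both paths in one step (the flags below are illustrative; pick the -arch for your GPU):

    nvcc -arch=sm_70 vecadd.cu -o vecadd    # host + device compilation and linking
    nvcc -ptx vecadd.cu -o vecadd.ptx       # emit only the device code as PTX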

4


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

Compiling A CUDA Program

[Figure: an integrated C program with CUDA extensions is fed to the NVCC Compiler, which splits it into Host Code (built by the host C compiler/linker) and Device Code in PTX, a virtual ISA; a device just-in-time compiler then translates the PTX for the heterogeneous computing platform with CPUs and GPUs.]

5

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

CUDA/OpenCL Execution Model

• Integrated host+device application C program
  – Serial or modestly parallel parts in host C code
  – Highly parallel parts in device SPMD kernel C code

[Figure: execution alternates between Serial Code on the host and Parallel Kernels on the device: KernelA<<< nBlk, nTid >>>(args); more serial host code; then KernelB<<< nBlk, nTid >>>(args);]

6
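A minimal host-side sketch of this alternation (KernelA and KernelB are placeholders from the figure, not a real API):

    #include <cuda_runtime.h>

    __global__ void KernelA(float *d) { /* highly parallel phase A */ }
    __global__ void KernelB(float *d) { /* highly parallel phase B */ }

    int main(void) {
        float *d;
        cudaMalloc((void **)&d, 256 * sizeof(float));
        /* ... serial host code ... */
        KernelA<<<1, 256>>>(d);      // parallel kernel on the device
        cudaDeviceSynchronize();     // back to serial host code
        /* ... serial host code ... */
        KernelB<<<1, 256>>>(d);
        cudaDeviceSynchronize();
        cudaFree(d);
        return 0;
    }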


Thread Hierarchy

• All threads execute the same kernel code

• Memory hierarchy is tied to the thread hierarchy (later)

From http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#memory-hierarchy

7

CUDA/OpenCL Programming Model: The Grid

[Figure: software stack — a User Application runs on the Operating System on the CPU; host tasks run on the CPU while GPU tasks are dispatched to the GPU. Each kernel executes over an N-dimensional range (the grid), gridDim.x by gridDim.y, whose elements are 3D thread blocks.]

Note: Each thread executes the same kernel code!

8
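To make the N-dimensional range concrete, a launch can configure a 2D grid of 3D blocks; a sketch (myKernel and the dimensions are illustrative):

    __global__ void myKernel(void) { /* every thread runs this */ }

    void launchExample(void) {
        dim3 DimBlock(8, 8, 4);    // 3D thread block: 8 x 8 x 4 = 256 threads
        dim3 DimGrid(16, 16, 1);   // 2D grid: gridDim.x = 16, gridDim.y = 16
        myKernel<<<DimGrid, DimBlock>>>();
    }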


Vector Addition – Conceptual View

[Figure: vector A (A[0], A[1], A[2], A[3], A[4], …, A[N-1]) and vector B (B[0] … B[N-1]) are combined element-wise with + to produce vector C (C[0] … C[N-1]).]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

9

A Simple Thread Block Model

[Figure: a 1D array of thread blocks, blockIdx = 0, 1, 2, …, gridDim-1; within each block, threads are numbered threadIdx = 0 … blockDim-1.]

• Structure an array of threads into thread blocks
• A unique id can be computed for each thread
• This id is used for workload partitioning

    id = blockIdx * blockDim + threadIdx;

10


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

Using Arrays of Parallel Threads

• A CUDA kernel is executed by a grid (array) of threads
  – All threads in a grid run the same kernel code (SPMD)
  – Each thread has an index that it uses to compute memory addresses and make control decisions

    i = blockIdx.x * blockDim.x + threadIdx.x;
    C_d[i] = A_d[i] + B_d[i];

[Figure: a grid of threads numbered 0, 1, 2, …, 254, 255]

Thread blocks can be multidimensional arrays of threads

11
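For such multidimensional blocks, each thread derives a 2D position and flattens it; a sketch assuming a row-major array with hypothetical width and height:

    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        int i = row * width + col;          // flatten to a row-major 1D index
        C_d[i] = A_d[i] + B_d[i];
    }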

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

Thread Blocks: Scalable Cooperation

• Divide the thread array into multiple blocks
  – Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  – Threads in different blocks cannot cooperate (only through global memory)

[Figure: Thread Block 0, Thread Block 1, …, Thread Block N-1; each block holds threads 0 … 255, and every thread executes the same two statements:]

    i = blockIdx.x * blockDim.x + threadIdx.x;
    C_d[i] = A_d[i] + B_d[i];

12


Vector Addition – Traditional C Code

    // Compute vector sum C = A+B
    void vecAdd(float* A, float* B, float* C, int n)
    {
        for (int i = 0; i < n; i++)
            C[i] = A[i] + B[i];
    }

    int main()
    {
        // Memory allocation for A_h, B_h, and C_h
        // I/O to read A_h and B_h, N elements
        …
        vecAdd(A_h, B_h, C_h, N);
    }

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

13

Heterogeneous Computing vecAdd Host Code

    #include <cuda.h>
    void vecAdd(float* A, float* B, float* C, int n)
    {
        int size = n * sizeof(float);
        float *A_d, *B_d, *C_d;
        …
        // Part 1: Allocate device memory for A, B, and C;
        //         copy A and B to device memory

        // Part 2: Kernel launch code – to have the device
        //         perform the actual vector addition

        // Part 3: Copy C from the device memory;
        //         free device vectors
    }

[Figure: Part 1 moves input data from Host Memory (CPU) to Device Memory (GPU); Part 2 executes on the GPU; Part 3 moves the result back to the host.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

14


Partial Overview of CUDA Memories

• Device code can:
  – R/W per-thread registers
  – R/W per-grid global memory
• Host code can:
  – Transfer data to/from per-grid global memory

[Figure: a (Device) Grid containing Block (0, 0) and Block (1, 0); each block holds Thread (0, 0) and Thread (1, 0) with private Registers; all blocks share Global Memory, which is also accessible to the Host.]

We will cover more later.

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

15

[Figure: the same grid/registers/global-memory diagram as above.]

CUDA Device Memory Management API functions

• cudaMalloc()
  – Allocates an object in device global memory
  – Two parameters:
    • Address of a pointer to the allocated object
    • Size of the allocated object in bytes
• cudaFree()
  – Frees an object from device global memory
  – One parameter:
    • Pointer to the freed object

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

16
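Both functions return a cudaError_t, so careful host code can check the result; a defensive sketch (the error-handling pattern is our addition, not part of the slides):

    float *A_d;
    cudaError_t err = cudaMalloc((void **)&A_d, size);
    if (err != cudaSuccess) {
        printf("%s in %s at line %d\n", cudaGetErrorString(err), __FILE__, __LINE__);
        exit(EXIT_FAILURE);
    }
    /* ... use A_d ... */
    cudaFree(A_d);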


Host-Device Data Transfer API functions

• cudaMemcpy()
  – Memory data transfer
  – Requires four parameters:
    • Pointer to destination
    • Pointer to source
    • Number of bytes copied
    • Type/direction of transfer
  – Transfer to device is asynchronous

[Figure: the same (Device) Grid diagram, with the Host exchanging data with device Global Memory.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

17
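For example, the four parameters line up as destination, source, byte count, and direction (using the names from the vecAdd code on the next slide):

    cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);   // host -> device
    cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);   // device -> host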

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

    void vecAdd(float* A, float* B, float* C, int n)
    {
        int size = n * sizeof(float);
        float *A_d, *B_d, *C_d;

        // 1. Transfer A and B to device memory
        cudaMalloc((void **) &A_d, size);
        cudaMemcpy(A_d, A, size, cudaMemcpyHostToDevice);
        cudaMalloc((void **) &B_d, size);
        cudaMemcpy(B_d, B, size, cudaMemcpyHostToDevice);
        // Allocate device memory for C
        cudaMalloc((void **) &C_d, size);

        // 2. Kernel invocation code – to be shown later
        …

        // 3. Transfer C from device to host
        cudaMemcpy(C, C_d, size, cudaMemcpyDeviceToHost);
        // Free device memory for A, B, C
        cudaFree(A_d); cudaFree(B_d); cudaFree(C_d);
    }

18


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

Example: Vector Addition Kernel (Device Code)

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__
    void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

    int vecAdd(float* A, float* B, float* C, int n)
    {
        // A_d, B_d, C_d allocations and copies omitted
        // Run ceil(n/256) blocks of 256 threads each
        vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
    }

19

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

Example: Vector Addition Kernel (Host Code)

    // Compute vector sum C = A+B
    // Each thread performs one pair-wise addition
    __global__
    void vecAddKernel(float* A_d, float* B_d, float* C_d, int n)
    {
        int i = threadIdx.x + blockDim.x * blockIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];
    }

    int vecAdd(float* A, float* B, float* C, int n)
    {
        // A_d, B_d, C_d allocations and copies omitted
        // Run ceil(n/256) blocks of 256 threads each
        vecAddKernel<<<ceil(n/256.0), 256>>>(A_d, B_d, C_d, n);
    }

The <<<…>>> arguments are the execution configuration.

20


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

More on Kernel Launch (Host Code)

    int vecAdd(float* A, float* B, float* C, int n)
    {
        // A_d, B_d, C_d allocations and copies omitted
        // Run ceil(n/256) blocks of 256 threads each
        dim3 DimGrid(n/256, 1, 1);
        if (n % 256) DimGrid.x++;
        dim3 DimBlock(256, 1, 1);

        vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
    }

• Any call to a kernel function is asynchronous from CUDA 1.0 on; explicit synchronization is needed for blocking → cudaDeviceSynchronize();

21
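Because the launch returns immediately, host code that must wait for the kernel (and observe any error it raised) can synchronize explicitly; a sketch with our own error check added:

    vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);  // returns at once
    cudaError_t err = cudaDeviceSynchronize();              // block until the kernel finishes
    if (err != cudaSuccess)
        printf("kernel failed: %s\n", cudaGetErrorString(err));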

Edge effects

    __global__
    void vecAddKernel(float *A_d, float *B_d, float *C_d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) C_d[i] = A_d[i] + B_d[i];   // guard threads past the end when n % 256 != 0
    }

    __host__
    void vecAdd()
    {
        dim3 DimGrid(ceil(n/256.0), 1, 1);
        dim3 DimBlock(256, 1, 1);
        vecAddKernel<<<DimGrid, DimBlock>>>(A_d, B_d, C_d, n);
    }

Kernel Execution in a Nutshell

[Figure: a kernel's thread blocks Blk 0 … Blk N-1 are scheduled onto the GPU's multiprocessors M0 … Mk, all sharing device RAM.]

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

22


© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

More on CUDA Function Declarations

                                       Executed on the:   Only callable from the:
    __device__ float DeviceFunc()      device             device
    __global__ void  KernelFunc()      device             host
    __host__   float HostFunc()        host               host

• __global__ defines a kernel function
• Each "__" consists of two underscore characters
• A kernel function must return void
• __device__ and __host__ can be used together

23
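To illustrate the last point, one definition can be compiled for both sides; a small sketch (square and squareAll are our examples, not from the slides):

    __host__ __device__ float square(float x) { return x * x; }   // compiled for host and device

    __global__ void squareAll(float *v, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) v[i] = square(v[i]);   // device-side call
    }
    // Host code may also call square(3.0f) directly.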

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011, ECE408/CS483, University of Illinois, Urbana-Champaign

Compiling A CUDA Program

[Figure: the same compilation flow as before — the NVCC Compiler splits the integrated C program with CUDA extensions into Host Code (host C compiler/linker) and Device Code in PTX, which the device just-in-time compiler finishes for the heterogeneous platform with CPUs and GPUs.]

24


CPU and GPU Address Spaces

[Figure: the CPU and the GPU each hold their own copy of X in separate address spaces, connected by PCIe.]

• Requires explicit management in each address space
• Programmer-initiated transfers between address spaces

25

Unified Memory

[Figure: a single unified address space spans the CPU and GPU across PCIe; one copy of X is visible to both.]

• All transfers managed under the hood
• Smaller code segments
• Explicit management now becomes an optimization

26
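One such optimization is prefetching managed memory to the GPU before a kernel touches it; a sketch assuming a CUDA version that provides cudaMemPrefetchAsync (x and N as in the example on the next slide):

    int device;
    cudaGetDevice(&device);
    // Migrate x up front instead of servicing page faults during the kernel.
    cudaMemPrefetchAsync(x, N * sizeof(float), device, 0);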


Using Unified Memory

    #include <iostream>
    #include <math.h>

    // Kernel function to add the elements of two arrays
    __global__
    void add(int n, float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] = x[i] + y[i];
    }

    int main(void)
    {
        int N = 1 << 20;
        float *x, *y;

        // Allocate Unified Memory – accessible from CPU or GPU
        cudaMallocManaged(&x, N * sizeof(float));
        cudaMallocManaged(&y, N * sizeof(float));

        // Initialize x and y arrays on the host
        for (int i = 0; i < N; i++) {
            x[i] = 1.0f;
            y[i] = 2.0f;
        }

        // Run kernel on 1M elements on the GPU
        add<<<1, 1>>>(N, x, y);

        // Wait for GPU to finish
        cudaDeviceSynchronize();

        // Free memory
        cudaFree(x);
        cudaFree(y);
        return 0;
    }

From https://devblogs.nvidia.com/parallelforall/even-easier-introduction-cuda/

27

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

28

The ISA

• An Instruction Set Architecture (ISA) is a contract between the hardware and the software.

• As the name suggests, it is a set of instructions that the architecture (hardware) can execute.


PTX ISA – Major Features

• Set of predefined, read-only variables
  – E.g., %tid (threadIdx), %ntid (blockDim), %ctaid (blockIdx), %nctaid (gridDim), etc.
  – Some of these are multidimensional
    • E.g., %tid.x, %tid.y, %tid.z
  – Includes architecture variables
    • E.g., %laneid, %warpid
• Can be used for auto-tuning of code
• Includes undefined performance counters
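A hand-written fragment (illustrative PTX, not compiler output) shows these variables computing the familiar global thread id:

    mov.u32    %r1, %ctaid.x;         // blockIdx.x
    mov.u32    %r2, %ntid.x;          // blockDim.x
    mov.u32    %r3, %tid.x;           // threadIdx.x
    mad.lo.s32 %r4, %r1, %r2, %r3;    // i = blockIdx.x * blockDim.x + threadIdx.x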

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

29

PTX ISA – Major Features (2)

• Multiple address spaces
  – Register, parameter
  – Constant, global, etc.
  – More later
• Predicated instruction execution
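Predication attaches a guard to an instruction instead of branching around it; a hand-written sketch with illustrative register names:

    setp.lt.s32 %p1, %r4, %r5;           // %p1 = (i < n)
    @%p1 ld.global.f32 %f1, [%rd1];      // executes only where %p1 is true
    @%p1 st.global.f32 [%rd2], %f1;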

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

30


Versions

• Compute capability
  – Architecture specific, e.g., Volta is 7.0
• CUDA version
  – Determines programming model features
  – Currently 9.0
• PTX version
  – Virtual ISA version
  – Currently 6.1
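Host code can query the compute capability at run time; a minimal sketch using the runtime API:

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);    // properties of device 0
    printf("compute capability %d.%d\n", prop.major, prop.minor);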

31

QUESTIONS?

© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2011 ECE408/CS483, University of Illinois, Urbana-Champaign

32