Hetero Lecture Slides 002 Lecture 1 Lecture 1 4 Cuda Intro

8/13/2019 Hetero Lecture Slides 002 Lecture 1 Lecture 1 4 Cuda Intro

1/12

Introduction to CUDA

- Data Parallelism and Threads

Lesson 1.4


2/12

Objective

To learn about data parallelism and the basic features of CUDA, a heterogeneous parallel programming interface that enables exploitation of data parallelism
- Hierarchical thread organization
- Main interfaces for launching parallel execution
- Thread index to data index mapping


3/12

Data Parallelism - Vector Addition

[Figure: vector A (A[0], A[1], A[2], ...) and vector B (B[0], B[1], B[2], ...) are added element by element to give vector C (C[0], C[1], C[2], ...)]
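As a point of reference, the element-wise sum on this slide can be written as an ordinary serial C loop. This is only a sketch: the function name `vec_add` and its parameters are assumptions made for illustration and do not come from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Element-wise vector addition: c[i] = a[i] + b[i] for 0 <= i < n.
   No iteration reads a value written by another iteration, which is
   what makes the loop data parallel. */
void vec_add(const float *a, const float *b, float *c, size_t n)
{
    assert(a != NULL && b != NULL && c != NULL);
    for (size_t i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}
```

Because the iterations are independent, each value of `i` could in principle be handled by a different thread.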


4/12

CUDA/OpenCL Execution Model

Heterogeneous host+device application
- Serial parts in host C code
- Parallel parts in device SPMD kernel code

Serial Code (host)

Parallel Kernel (device)

KernelA<<<nBlk, nTid>>>(args);

Serial Code (host)

Parallel Kernel (device)

KernelB<<<nBlk, nTid>>>(args);
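The alternation between serial host code and parallel kernels can be imitated in plain C. The sketch below is a host-only emulation, not the CUDA API: `thread_ctx`, `launch`, `scale_kernel`, and `run_host_program` are hypothetical names, and the nested loops only stand in for threads that a real device would run in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for CUDA's built-in index variables (x only). */
typedef struct {
    unsigned block_idx;   /* plays the role of blockIdx.x  */
    unsigned block_dim;   /* plays the role of blockDim.x  */
    unsigned thread_idx;  /* plays the role of threadIdx.x */
} thread_ctx;

/* A "kernel" here is an ordinary C function run once per emulated thread. */
typedef void (*kernel_fn)(thread_ctx ctx, void *args);

/* Emulates a launch of n_blocks blocks of n_threads threads each, using
   nested serial loops. A real device gives no ordering guarantee between
   threads, so kernels must not rely on this loop order. */
void launch(kernel_fn kernel, unsigned n_blocks, unsigned n_threads, void *args)
{
    assert(kernel != NULL);
    for (unsigned b = 0; b < n_blocks; ++b) {
        for (unsigned t = 0; t < n_threads; ++t) {
            thread_ctx ctx = { b, n_threads, t };
            kernel(ctx, args);
        }
    }
}

/* Example SPMD kernel: each thread doubles one array element. */
typedef struct { int *data; unsigned n; } scale_args;

void scale_kernel(thread_ctx ctx, void *args)
{
    scale_args *p = (scale_args *)args;
    unsigned i = ctx.block_idx * ctx.block_dim + ctx.thread_idx;
    if (i < p->n) {            /* guard threads past the end of the data */
        p->data[i] *= 2;
    }
}

/* Host program: serial code, then a parallel kernel, then serial code. */
int run_host_program(void)
{
    int data[10];
    for (unsigned i = 0; i < 10; ++i) data[i] = (int)i;   /* serial (host)   */

    scale_args args = { data, 10 };
    launch(scale_kernel, 3, 4, &args);                    /* kernel (device) */

    int sum = 0;
    for (unsigned i = 0; i < 10; ++i) sum += data[i];     /* serial (host)   */
    return sum;
}
```

Here 3 blocks of 4 threads give 12 emulated threads for 10 elements, so the last two threads fail the guard and do nothing.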


5/12

From Natural Language to Electrons

- Natural Language (e.g., English)
- Algorithm
- High-Level Language (C/C++)
- Compiler (High-Level Language to Instruction Set Architecture)
- Instruction Set Architecture
- Microarchitecture
- Circuits
- Electrons

Yale Patt and Sanjay Patel, From bits an


6/12

An Instruction Set Architecture (ISA) is a contract between the hardware and the software.

As the name suggests, it is a set of instructions that the architecture (hardware) can execute.


7/12

A Program at the ISA Level

A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware.

Program instructions operate on data stored in memory or provided by an Input/Output (I/O) device.


8/12

A Von-Neumann Processor

[Figure: Memory; Processing Unit (ALU, Reg File); Control Unit (PC, IR)]

A thread is a virtualized or abstracted Von-Neumann Processor
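One way to picture a thread as an abstracted processor is a small struct holding a program counter, an instruction register, and a register file, plus a loop that fetches and executes instructions. The toy instruction set below (`OP_LOAD`, `OP_ADD`, `OP_STORE`, `OP_HALT`) and the names `instr`, `cpu`, and `run` are invented for this sketch. Instructions and data are kept in separate arrays only for brevity, whereas a Von-Neumann machine stores both in the same memory.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical toy ISA: each instruction is {opcode, a, b, c}. */
enum { OP_HALT, OP_LOAD, OP_ADD, OP_STORE };

typedef struct { uint8_t op, a, b, c; } instr;

/* State of one (virtualized) Von-Neumann processor, i.e. one thread. */
typedef struct {
    unsigned pc;       /* program counter: index of the next instruction */
    instr    ir;       /* instruction register: instruction being executed */
    int      reg[4];   /* register file */
} cpu;

/* Fetch, decode, and execute until OP_HALT (or an unknown opcode). */
void run(cpu *c, const instr *program, int *memory)
{
    assert(c != 0 && program != 0 && memory != 0);
    c->pc = 0;
    for (;;) {
        c->ir = program[c->pc++];                              /* fetch */
        switch (c->ir.op) {                                    /* decode, execute */
        case OP_LOAD:  c->reg[c->ir.a] = memory[c->ir.b];      break;
        case OP_ADD:   c->reg[c->ir.a] = c->reg[c->ir.b] + c->reg[c->ir.c]; break;
        case OP_STORE: memory[c->ir.b] = c->reg[c->ir.a];      break;
        default:       return;                                 /* halt */
        }
    }
}
```

Each emulated thread would own its own `cpu` state while sharing the same program, which is the sense in which many threads can run the same code on different data.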


9/12

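Arrays of Parallel Threads

- A CUDA kernel is executed by a grid (array) of threads
  - All threads in a grid run the same kernel code (SPMD)
  - Each thread has indexes that it uses to compute memory addresses and make control decisions

i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];

[Figure: threads 0 1 2 ... 254 255, each executing the two statements above]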
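The index expression on this slide can be checked with a little host-side arithmetic. The helper names `global_index` and `blocks_needed` are assumptions for illustration. The second helper, which rounds the block count up so that every element is covered, is a common companion to the index formula but is not taken from the slides.

```c
#include <assert.h>

/* Mirrors i = blockIdx.x * blockDim.x + threadIdx.x for a 1D grid. */
unsigned global_index(unsigned block_idx, unsigned block_dim, unsigned thread_idx)
{
    assert(thread_idx < block_dim);   /* a thread index lies within its block */
    return block_idx * block_dim + thread_idx;
}

/* Smallest number of blocks of block_dim threads that covers n elements. */
unsigned blocks_needed(unsigned n, unsigned block_dim)
{
    assert(block_dim > 0);
    return n / block_dim + (n % block_dim != 0);
}
```

With 256 threads per block, thread 3 of block 2 gets `i = 2 * 256 + 3 = 515`, and 1000 elements need 4 blocks.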


10/12

Thread Blocks: Scalable Cooperation

- Divide thread array into multiple blocks
  - Threads within a block cooperate via shared memory, atomic operations, and barrier synchronization
  - Threads in different blocks do not interact

[Figure: Thread Block 0, Thread Block 1, ..., each containing threads 0 1 2 ... 254 255, with every thread executing:]

i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
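Cooperation within a block, and independence between blocks, can be sketched with a per-block sum. This is a serial host-side emulation under stated assumptions: `BLOCK_DIM`, `block_sum`, and `all_block_sums` are hypothetical names, the local array stands in for shared memory, and the boundary between the two loops stands in for barrier synchronization.

```c
#include <assert.h>

#define BLOCK_DIM 4   /* hypothetical block size, kept small for readability */

/* Emulates one thread block summing its slice of `in` (length n).
   Phase 1: every thread stores one element in block-local "shared" memory.
   Barrier: the end of the first loop stands in for barrier synchronization.
   Phase 2: thread 0 adds up the shared values. */
int block_sum(const int *in, unsigned n, unsigned block_idx)
{
    int shared[BLOCK_DIM];                       /* visible to this block only */

    for (unsigned t = 0; t < BLOCK_DIM; ++t) {   /* phase 1: all threads */
        unsigned i = block_idx * BLOCK_DIM + t;
        shared[t] = (i < n) ? in[i] : 0;
    }
    /* barrier: all writes to shared[] are complete past this point */

    int sum = 0;
    for (unsigned t = 0; t < BLOCK_DIM; ++t) {   /* phase 2: thread 0 only */
        sum += shared[t];
    }
    return sum;
}

/* Blocks never read each other's shared[] and each writes its own output
   slot, so they could run in any order. */
void all_block_sums(const int *in, unsigned n, int *out, unsigned n_blocks)
{
    assert(in != 0 && out != 0);
    for (unsigned b = 0; b < n_blocks; ++b) {
        out[b] = block_sum(in, n, b);
    }
}
```

Because no block depends on another, the outer loop could be distributed over however many processors are available, which is what lets the block structure scale.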


11/12

blockIdx and threadIdx

- Each thread uses indices to decide what data to work on
  - blockIdx: 1D, 2D, or 3D (CUDA 4.0)
  - threadIdx: 1D, 2D, or 3D
- Simplifies memory addressing when processing multidimensional data
  - Image processing
  - Solving PDEs on volumes

[Figure: a device Grid containing Block (0,0), Block (1,0), ...; one Block is expanded to show Thread (0,0,0), Thread (0,0,1), Thread (0,0,2), Thread (0,1,0), Thread (0,1,1), Thread (0,1,2), Thread (1,0,0), Thread (1,0,1), Thread (1,0,2), ...]
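For image processing, a 2D grid lets each thread derive a pixel's column and row directly from its indices. The sketch below does that arithmetic on the host. The type `idx2` and the functions `pixel_coords` and `pixel_offset` are illustrative assumptions, and the row-major image layout is likewise an assumption rather than something the slides specify.

```c
#include <assert.h>

/* Hypothetical stand-in for the x and y components of CUDA's index variables. */
typedef struct { unsigned x, y; } idx2;

/* Column and row of the pixel handled by one thread in a 2D grid:
   col = blockIdx.x * blockDim.x + threadIdx.x
   row = blockIdx.y * blockDim.y + threadIdx.y */
idx2 pixel_coords(idx2 block_idx, idx2 block_dim, idx2 thread_idx)
{
    assert(thread_idx.x < block_dim.x && thread_idx.y < block_dim.y);
    idx2 p = { block_idx.x * block_dim.x + thread_idx.x,
               block_idx.y * block_dim.y + thread_idx.y };
    return p;
}

/* Offset of pixel (col = p.x, row = p.y) in a row-major image of given width. */
unsigned pixel_offset(idx2 p, unsigned width)
{
    assert(p.x < width);
    return p.y * width + p.x;
}
```

With 16x16 blocks, thread (3, 5) of block (1, 2) maps to column 19 and row 37.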


12/12

To learn more,

Chapt
