Hetero Lecture Slides 002 Lecture 1 Lecture 1 4 Cuda Intro
8/13/2019 Hetero Lecture Slides 002 Lecture 1 Lecture 1 4 Cuda Intro
1/12
Introduction to CUDA
- Data Parallelism and Threads
Lesson 1.4
-
2/12
Objective
To learn about data parallelism and the basic features of CUDA, a heterogeneous parallel programming interface that enables exploitation of data parallelism
  Hierarchical thread organization
  Main interfaces for launching parallel execution
  Thread index to data index mapping
-
3/12
Data Parallelism - Vector Addition
vector A:   A[0]   A[1]   A[2]   ...
             +      +      +
vector B:   B[0]   B[1]   B[2]   ...
             =      =      =
vector C:   C[0]   C[1]   C[2]   ...
-
4/12
CUDA/OpenCL Execution Model
Heterogeneous host+device application
  Serial parts in host C code
  Parallel parts in device SPMD kernel code
Serial Code (host)
Parallel Kernel (device)
KernelA<<<nBlk, nTid>>>(args);
Serial Code (host)
Parallel Kernel (device)
KernelB<<<nBlk, nTid>>>(args);
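The alternation of serial host code and parallel kernels on this slide can be sketched roughly as follows. This is a minimal illustration, not code from the lecture: the kernels `scaleKernel` and `offsetKernel`, the host function `runTwoPhases`, and the choice of 256 threads per block are hypothetical, and error checking of the CUDA runtime calls is omitted for brevity.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Hypothetical "KernelA": every thread scales one element (SPMD).
__global__ void scaleKernel(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;   // guard threads past the end of the array
}

// Hypothetical "KernelB": every thread adds an offset to one element.
__global__ void offsetKernel(float* data, float offset, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += offset;
}

// Host function: serial code interleaved with two kernel launches.
void runTwoPhases(float* hostData, int n) {
    assert(n > 0);

    // Serial code (host): allocate device memory and copy the input over.
    float* devData = nullptr;
    size_t bytes = static_cast<size_t>(n) * sizeof(float);
    cudaMalloc((void**)&devData, bytes);
    cudaMemcpy(devData, hostData, bytes, cudaMemcpyHostToDevice);

    int nTid = 256;                      // threads per block (assumed value)
    int nBlk = (n + nTid - 1) / nTid;    // enough blocks to cover n elements

    // Parallel kernel (device).
    scaleKernel<<<nBlk, nTid>>>(devData, 2.0f, n);

    // Serial code (host): here it only picks a parameter for the next kernel.
    float offset = 1.0f;

    // Parallel kernel (device).
    offsetKernel<<<nBlk, nTid>>>(devData, offset, n);

    // Serial code (host): this copy waits for the kernels launched above
    // (same default stream), then the device buffer is released.
    cudaMemcpy(hostData, devData, bytes, cudaMemcpyDeviceToHost);
    cudaFree(devData);
}
```

Calling `runTwoPhases(h, 4)` on a host array `{1, 2, 3, 4}` should leave `{3, 5, 7, 9}` in `h`, since each element is doubled by the first kernel and incremented by the second.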
-
5/12
From Natural Language to Electrons
Natural Language (e.g., English)
Algorithm
High-Level Language (C/C++)
Instruction Set Architecture
Microarchitecture
Circuits
Electrons
Yale Patt and Sanjay Patel, From bits an
Compiler
-
6/12
An Instruction Set Architecture (ISA) is a contract between the hardware and the software.
As the name suggests, it is a set of instructions that the architecture (hardware) can execute.
-
7/12
A program at the ISA level
A program is a set of instructions stored in memory that can be read, interpreted, and executed by the hardware.
Program instructions operate on data stored in memory or provided by an Input/Output (I/O) device.
-
8/12
A Von-Neumann Processor
Memory
Processing Unit: ALU, Reg File
Control Unit: PC, IR
A thread is a "virtualized" or "abstracted" Von-Neumann Processor
-
9/12
Arrays of Parallel Threads
A CUDA kernel is executed by a grid of threads
  All threads in a grid run the same kernel code (SPMD)
  Each thread has indexes that it uses to compute memory addresses and make control decisions
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
Threads: 0 1 2 ... 254 255
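The index computation on this slide can be turned into a complete kernel along the following lines. This is a sketch under stated assumptions: the names `vecAddKernel` and `vecAdd`, the boundary check `i < n`, and the block size of 256 threads (matching the thread indices 0 to 255 drawn in the lecture's diagrams) are illustrative choices rather than the lecture's own code, and CUDA error checking is left out.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Each thread computes exactly one element of C.
__global__ void vecAddKernel(const float* A, const float* B, float* C, int n) {
    // Map the thread's position in the grid to a data index.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // The last block may have more threads than remaining elements.
    if (i < n) C[i] = A[i] + B[i];
}

// Host wrapper (hypothetical): copies inputs to the device, launches the
// kernel, and copies the result back.
void vecAdd(const float* hA, const float* hB, float* hC, int n) {
    assert(n > 0);
    size_t bytes = static_cast<size_t>(n) * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc((void**)&dA, bytes);
    cudaMalloc((void**)&dB, bytes);
    cudaMalloc((void**)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    const int threadsPerBlock = 256;
    // Ceiling division so that every element gets a thread.
    const int numBlocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAddKernel<<<numBlocks, threadsPerBlock>>>(dA, dB, dC, n);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
}
```

With `n = 1000`, the wrapper launches 4 blocks of 256 threads (1024 threads in total), and the `i < n` check keeps the last 24 threads from touching memory outside the arrays.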
-
10/12
Thread Blocks: Scalable Cooperation
Divide thread array into multiple blocks
  Threads within a block cooperate via shared memory, atomic operations and barrier synchronization
  Threads in different blocks do not interact
Thread Block 0: threads 0 1 2 ... 254 255
Thread Block 1: threads 0 1 2 ... 254 255
...
Every thread in every block runs the same code:
i = blockIdx.x * blockDim.x + threadIdx.x;
C[i] = A[i] + B[i];
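Cooperation inside a block through shared memory and barrier synchronization can be illustrated with a per-block sum. This example is not from the lecture: the names `blockSumKernel` and `sumOnDevice`, the constant `BLOCK_SIZE`, and the tree-style reduction are assumptions chosen for illustration, and error checking is omitted. Each block produces its own partial sum without interacting with other blocks; the host combines the partial sums afterwards.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

#define BLOCK_SIZE 256  // assumed block size; must be a power of two here

// Each block sums its own slice of the input using shared memory.
__global__ void blockSumKernel(const float* in, float* blockSums, int n) {
    __shared__ float partial[BLOCK_SIZE];   // visible to one block only

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[t] = (i < n) ? in[i] : 0.0f;    // pad the last block with zeros
    __syncthreads();                        // barrier: all loads are done

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) partial[t] += partial[t + stride];
        __syncthreads();                    // barrier after every step
    }

    // One thread per block writes that block's result.
    if (t == 0) blockSums[blockIdx.x] = partial[0];
}

// Host wrapper (hypothetical): launches the kernel with BLOCK_SIZE threads
// per block and adds up the per-block results serially.
float sumOnDevice(const float* hIn, int n) {
    assert(n > 0);
    int numBlocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;

    float *dIn = nullptr, *dSums = nullptr;
    cudaMalloc((void**)&dIn, static_cast<size_t>(n) * sizeof(float));
    cudaMalloc((void**)&dSums, static_cast<size_t>(numBlocks) * sizeof(float));
    cudaMemcpy(dIn, hIn, static_cast<size_t>(n) * sizeof(float),
               cudaMemcpyHostToDevice);

    blockSumKernel<<<numBlocks, BLOCK_SIZE>>>(dIn, dSums, n);

    std::vector<float> hSums(numBlocks);
    cudaMemcpy(hSums.data(), dSums,
               static_cast<size_t>(numBlocks) * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dSums);

    float total = 0.0f;
    for (int b = 0; b < numBlocks; ++b) total += hSums[b];
    return total;
}
```

Because blocks never depend on one another, the hardware is free to run them in any order or in parallel, which is the scalability property the slide title refers to.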
-
11/12
blockIdx and threadIdx
Each thread uses indices to decide what data to work on
  blockIdx: 1D, 2D, or 3D (CUDA 4.0)
  threadIdx: 1D, 2D, or 3D
Simplifies memory addressing when processing multidimensional data
  Image processing
  Solving PDEs on volumes
[Figure: a device grid containing blocks such as Block (0,0) and Block (1,0); one block is expanded to show its threads with 3D indices: Thread (0,0,0), (0,0,1), (0,0,2), (0,1,0), (0,1,1), (0,1,2), (1,0,0), (1,0,1), (1,0,2), ...]
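For image-like data, two-dimensional block and thread indices can be mapped to a row and a column, as in the sketch below. The kernel `scaleImageKernel`, the wrapper `scaleImage`, the 16x16 block shape, and the row-major layout are assumptions made for this example, not details given in the slides; error checking is again omitted.

```cuda
#include <cuda_runtime.h>
#include <cassert>

// Each thread handles one pixel of a row-major width x height image.
__global__ void scaleImageKernel(const float* in, float* out,
                                 int width, int height, float factor) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // x dimension -> column
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // y dimension -> row
    if (col < width && row < height) {
        int idx = row * width + col;                  // row-major linear index
        out[idx] = in[idx] * factor;
    }
}

// Host wrapper (hypothetical): launches a 2D grid of 2D blocks.
void scaleImage(const float* hIn, float* hOut,
                int width, int height, float factor) {
    assert(width > 0 && height > 0);
    size_t bytes = static_cast<size_t>(width) * height * sizeof(float);

    float *dIn = nullptr, *dOut = nullptr;
    cudaMalloc((void**)&dIn, bytes);
    cudaMalloc((void**)&dOut, bytes);
    cudaMemcpy(dIn, hIn, bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                               // 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,        // blocks across columns
              (height + block.y - 1) / block.y);      // blocks across rows
    scaleImageKernel<<<grid, block>>>(dIn, dOut, width, height, factor);

    cudaMemcpy(hOut, dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
}
```

The same pattern extends to volumes by adding a third coordinate computed from `blockIdx.z`, `blockDim.z`, and `threadIdx.z`.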
-
12/12
To learn more,
Chapt