Lecture 2: Introduction to Parallel Computing Using CUDA


Transcript of Lecture 2: Introduction to Parallel Computing Using CUDA

Introduction to Parallel Computing Using CUDA

Lecture 2: Introduction to Parallel Computing Using CUDA. Ken Domino, Domem Technologies. May 9, 2011.

IEEE Boston Continuing Education Program

Announcements
Course website updates:
- Syllabus: http://domemtech.com/ieee-pp/Syllabus.docx
- Lecture 1: http://domemtech.com/ieee-pp/Lecture1.pptx
- Lecture 2: http://domemtech.com/ieee-pp/Lecture2.pptx
- References: http://domemtech.com/ieee-pp/References.docx

Ocelot April 5 download is not working

PRAM
Parallel Random Access Machine (PRAM): an idealized SIMD parallel computing model.

- Unlimited number of RAMs, called Processing Units (PUs).
- The RAMs operate on the same instructions, synchronously.
- Shared memory is unlimited and accessed in one unit of time.
- Shared memory access is one of EREW, CREW, or CRCW (exclusive/concurrent read, exclusive/concurrent write).
- Communication between RAMs happens only through shared memory.

PRAM pseudo code
Parallel for loop (aka data-level parallelism):

    for P_i, 1 ≤ i ≤ n, in parallel do
        ...
    end
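As a conceptual sketch in C (the array names and loop body are my own; on a real PRAM all n iterations run simultaneously on separate processing units rather than one after another):

/* "for P_i, 1 <= i <= n, in parallel do ... end" */
for (int i = 0; i < n; i++) {
    c[i] = a[i] + b[i];    /* the body executed by processing unit P_i */
}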

Synchronization
A simple example from C:
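The code from the slide is not in this transcript; a minimal sketch of the kind of routine it likely showed, assuming only the char_in/char_out globals mentioned on the next slide:

#include <stdio.h>

char char_in, char_out;         /* shared globals */

void echo(void)
{
    char_in = getchar();        /* read a character           */
    char_out = char_in;         /* copy it to the output slot */
    putchar(char_out);          /* write it back out          */
}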

Synchronization
What happens if we have two threads competing for the same resources (char_in/char_out)?

What happens if two threads execute this code serially?

Synchronization
No problem!

What happens if two threads execute this code in parallel? We can sometimes get a problem: char_in written by T2 overwrites char_in written by T1!

Synchronization
Synchronization forces thread serialization so that concurrent access does not cause problems.

Synchronization
Two types:
- Mutual exclusion, using a mutex (semaphore), i.e., a lock.

- Cooperation: wait on an object until all other threads are ready, using wait() + notify(), or barrier synchronization.
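A host-side C sketch of both kinds (POSIX threads are my choice of illustration, not something from the slides): a mutex serializes updates to a shared counter, and a barrier makes every thread wait until all of them have finished the first phase.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_barrier_t barrier;
int counter = 0;

void * worker(void * arg)
{
    pthread_mutex_lock(&lock);        /* mutual exclusion around counter */
    counter++;
    pthread_mutex_unlock(&lock);

    pthread_barrier_wait(&barrier);   /* cooperation: wait for all threads */

    printf("thread %ld sees counter = %d\n", (long) arg, counter);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (long i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, (void *) i);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}

After the barrier, every thread prints counter = 4, because all four increments are guaranteed to have happened before any thread proceeds.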

Deadlock
Deadlock can arise from the use of mutual exclusion over two or more resources: each thread holds one resource while waiting for another.
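A minimal sketch (the names are mine, not from the slides): thread 1 holds lock A and waits for lock B, while thread 2 holds B and waits for A, so neither can proceed.

#include <pthread.h>

pthread_mutex_t A = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t B = PTHREAD_MUTEX_INITIALIZER;

void * thread1(void * arg)
{
    pthread_mutex_lock(&A);   /* holds A ...            */
    pthread_mutex_lock(&B);   /* ... and waits for B    */
    /* ... use both resources ... */
    pthread_mutex_unlock(&B);
    pthread_mutex_unlock(&A);
    return NULL;
}

void * thread2(void * arg)
{
    pthread_mutex_lock(&B);   /* holds B ...                        */
    pthread_mutex_lock(&A);   /* ... and waits for A: deadlock risk */
    pthread_mutex_unlock(&A);
    pthread_mutex_unlock(&B);
    return NULL;
}

Acquiring the two locks in the same order in both threads removes the cycle and avoids the deadlock.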

PRAM Synchronization
A processor stays idle and waits until the other processors complete: cooperative synchronization.

CUDA
Compute Unified Device Architecture
- Developed by NVIDIA, introduced November 2006.
- Based on C, extended later to work with C++.
- CUDA provides three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization.

See http://www.nvidia.com/object/IO_37226.html, http://www.gpgpu.org/oldsite/sc2006/workshop/presentations/Buck_NVIDIA_Cuda.pdf, and Nickolls, J., Buck, I., Garland, M. and Skadron, K. Scalable Parallel Programming with CUDA. Queue, 6(2), 40-53.

GPU coprocessor to CPU

PCIe (Peripheral Component Interconnect Express) is a standard bus/card interface.

Note: This is a very old generation of the Intel chipset architecture. In Sandy Bridge, most of the chipset has migrated onto the CPU.

NVIDIA GPU Architecture

Multiprocessor (MP) = texture/processor cluster (TPC)

Dynamic random-access memory (DRAM), aka global memory

Raster operation processor (ROP)

L2 = Level-2 memory cache
For more details, see Lindholm, E., Nickolls, J., Oberman, S. and Montrym, J. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 39-55.

For NVIDIA's GTX 590: the DRAM is 3 GB of GDDR5, clock speed 1.7 GHz, bandwidth 328 GB/s; processor clock at 1215 MHz. In the Fermi architecture, the TPC is just called an MP because …

NVIDIA GPU Architecture

1st generation: G80, 2006
3rd generation: Fermi, GTX 570, 2010

Streaming Multiprocessor (SM)

Streaming processor (SP)

Streaming multiprocessor control (SMC)

Texture processing unit (TPU)

Con Cache = constant memory cache

Sh. Memory = shared memory

Multithreaded instruction fetch and issue unit (MTIFI)

TPC works on graphics computations, like rasterization.

For more information, see NVIDIA's Next Generation CUDA Compute Architecture: Fermi whitepaper, NVIDIA, 2009.

Single-instruction, multiple-thread (SIMT)
SIMT = SIMD + SPMD (single program, multiple data).
Multiple threads: sort of "single instruction", except that each instruction is executed in multiple independent parallel threads.
Instruction set architecture: a register-based instruction set including floating-point, integer, bit, conversion, transcendental, flow control, memory load/store, and texture operations.

Single-instruction, multiple-thread
The Streaming Multiprocessor is a hardware multithreaded unit.
Threads are executed in groups of 32 parallel threads called warps.
Each thread has its own set of registers.
Individual threads composing a warp run the same program and start together at the same program address, but they are otherwise free to branch and execute independently.

Single-instruction, multiple-thread
The instruction executed is the same for every thread of a warp.
If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken.

A SIMT instruction controls the execution and branching behavior of one thread.

For more information, see NVIDIA's Next Generation CUDA Compute Architecture: Fermi whitepaper, NVIDIA, 2009; also Lindholm, E., Nickolls, J., Oberman, S. and Montrym, J. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 39-55.

Single-instruction, multiple-thread
Warps are serialized if there is:
- divergence in instructions (i.e., a conditional branch), or
- write access to the same memory.
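A small CUDA kernel (my own illustration, not from the slides) with a data-dependent branch: within a warp, even- and odd-numbered threads take different paths, so the hardware executes the two paths one after the other.

__global__ void diverge(int * out)
{
    int tid = threadIdx.x;
    if (tid % 2 == 0)            // half of each warp takes this path...
        out[tid] = tid * 2;
    else                         // ...the other half takes this one,
        out[tid] = tid + 100;    // so the warp runs the two paths serially
}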

Warp Scheduling
SM hardware implements near-zero-overhead warp scheduling:
- Warps whose next instruction has its operands ready for consumption can be executed.
- Eligible warps are selected for execution by priority.
- All threads in a warp execute the same instruction.
- 4 clock cycles are needed to dispatch the instruction for all threads of a warp (G80).

Cooperative Thread Array (CTA)
- An abstraction for synchronizing threads.
- AKA a thread block; blocks are organized into a grid.
- CTAs are mapped onto warps.

Each thread has a unique integer thread ID (TID).
Threads of a CTA share data in global or shared memory.
Threads synchronize with the barrier instruction.
CTA thread programs use their TIDs to select work and index shared data arrays.
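A sketch (my own example, not from the slides) that uses all three ideas: each thread uses its TID to index a shared array, __syncthreads() is CUDA's barrier instruction, and after the barrier each thread can safely read what another thread wrote. It assumes blocks of at most 256 threads.

__global__ void reverse_in_block(int * data)
{
    __shared__ int buf[256];                       // shared by the threads of this CTA
    int tid = threadIdx.x;                         // thread ID selects the work
    int base = blockIdx.x * blockDim.x;
    buf[tid] = data[base + tid];                   // each thread loads one element
    __syncthreads();                               // barrier: all writes now visible
    data[base + tid] = buf[blockDim.x - 1 - tid];  // safely read another thread's element
}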

Cooperative Thread Array (CTA)
The programmer declares a 1D, 2D, or 3D grid shape and its dimensions in threads.
The TID is a 1D, 2D, or 3D index.
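For example, a 2D grid of 2D blocks might be declared and launched like this (the kernel name and sizes are placeholders):

dim3 grid(16, 8);      // 16 x 8 = 128 blocks in the grid
dim3 block(32, 4);     // 32 x 4 = 128 threads per block (CTA)
myKernel<<<grid, block>>>(d_data);
// inside the kernel, the 2D TID is (threadIdx.x, threadIdx.y),
// and the block index is (blockIdx.x, blockIdx.y)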

Cooperative Thread Array (CTA)
Restrictions on grid sizes

Kernel
Every thread in a grid executes the same body of instructions, called a kernel. In CUDA, it's just a function.

CUDA Kernels
Kernels are declared with __global__ void. Parameters are the same for all threads.

__global__ void fun(float * d, int size)
{
    int idx = threadIdx.x + blockDim.x * blockIdx.x
            + blockDim.x * gridDim.x * blockDim.y * blockIdx.y
            + blockDim.x * gridDim.x * threadIdx.y;
    if (idx < 0) return;
    if (idx >= size) return;
    d[idx] = idx * 10.0 / 0.1;
}

CUDA Kernels
Kernels are called via the chevron syntax:

    Func<<<Dg, Db, Ns, S>>>(parameters)

- Dg is of type dim3 and specifies the dimension and size of the grid.
- Db is of type dim3 and specifies the dimension and size of the block.
- Ns is of type size_t and specifies the number of bytes in shared memory that is dynamically allocated per block.
- S is of type cudaStream_t and specifies the associated stream.
- A kernel's return type is void; results must be returned through memory referenced by its (call-by-value) parameters.

Example: Foo<<< ... >>>(parameters)

Custom Build Rules: check the box for CUDA 4.0 targets.
Add hw.cu into your empty project.

Note: The .cu suffix stands for CUDA source code. You can put CUDA syntax into .cpp files, but the build environment won't know what to compile it with (cl/g++ vs. nvcc).

Developing CUDA programs
hw.cu:

#include <stdio.h>

__global__ void fun(int * mem)
{
    *mem = 1;    // every launched thread writes 1 to the same location
}

int main()
{
    int h = 0;
    int * d;
    cudaMalloc((void **) &d, sizeof(int));            // allocate one int on the device
    cudaMemcpy(d, &h, sizeof(int), cudaMemcpyHostToDevice);
    fun<<<1, 1>>>(d);                                 // launch 1 block of 1 thread
    cudaThreadSynchronize();
    int rv = cudaGetLastError();
    cudaMemcpy(&h, d, sizeof(int), cudaMemcpyDeviceToHost);
    printf("Result = %d\n", h);
    return 0;
}

Developing CUDA programs
Compile, link, and run.

(The version 4.0 installation adjusts all environment variables.)
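On a Linux-style command line (an illustrative sketch; the Visual Studio project route described above works too), compiling, linking, and running hw.cu looks something like:

# nvcc hw.cu -o hw
# ./hw
Result = 1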

NVCC
nvcc (the NVIDIA CUDA compiler) is a driver program for the compiler phases.

Use the -keep option to see the intermediate files. (You need to add "." to the include directories when compiling.)
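For example (a sketch; the exact names and set of intermediate files vary with the toolkit version):

# nvcc -keep -I. hw.cu -o hw
# ls
hw  hw.cu  hw.cu.cpp  hw.ptx  hw.sm_10.cubin  ...

The .cu.cpp, .ptx, and .cubin files discussed on the following slides come from this step.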

NVCC
nvcc compiles a .cu file into a .cu.cpp file.
There are two types of targets: virtual and real, represented by PTX assembly code and cubin binary code, respectively.
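The pair of targets is chosen with nvcc's -arch (virtual architecture) and -code (real architecture) options; for instance (compute_20/sm_20 are arbitrary Fermi-era choices here):

# nvcc -arch=compute_20 -code=sm_20,compute_20 hw.cu -o hw

Listing a virtual architecture under -code embeds the PTX as well, so the driver can just-in-time compile it for GPUs newer than sm_20.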

PTXAS
ptxas compiles PTX assembly code into machine code, placed in an ELF module.

# cat hw.sm_10.cubin | od -t x1 | head
0000000 7f 45 4c 46 01 01 01 33 02 00 00 00 00 00 00 00
0000020 02 00 be 00 01 00 00 00 00 00 00 00 34 18 00 00
0000040 34 00 00 00 0a 01 0a 00 34 00 20 00 03 00 28 00
0000060 16 00 01 00 00 00 00 00 00 00 00 00 00 00 00 00
0000100 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0000120 00 00 00 00 00 00 00 00 00 00 00 00 01 00 00 00
0000140 03 00 00 00 00 00 00 00 00 00 00 00 a4 03 00 00
0000160 7f 01 00 00 00 00 00 00 00 00 00 00 04 00 00 00
0000200 00 00 00 00 0b 00 00 00 03 00 00 00 00 00 00 00
0000220 00 00 00 00 23 05 00 00 22 00 00 00 00 00 00 00

Disassembly of the machine code can be done using cuobjdump or my own utility, nvdis (http://forums.nvidia.com/index.php?showtopic=183438).

PTX, the GPU assembly code

.version 1.4
.target sm_10, map_f64_to_f32

// compiled with /be.exe
// nvopencc 4.0 built on 2011-03-24

.entry _Z3funPi (
    .param .u32 __cudaparm__Z3funPi_mem)
{
    .reg .u32 %r<3>;
    .loc    16  4   0
$LDWbegin__Z3funPi:
    .loc    16  6   0
    mov.s32         %r1, 1;
    ld.param.u32    %r2, [__cudaparm__Z3funPi_mem];
    st.global.s32   [%r2+0], %r1;
    .loc    16  7   0
    exit;
$LDWend__Z3funPi:
} // _Z3funPi

PTX = Parallel Thread Execution

Target for PTX is an abstract GPU machine.

Contains operations for load, store, register declarations, add, sub, mul, etc.

For more information on PTX, see NVIDIA's PTX: Parallel Thread Execution ISA Version 2.3.

CUDA GPU targets
Virtual-target PTX code is embedded in the executable as a string, then compiled just-in-time at runtime.

Code for real targets is compiled into machine code (cubin) and embedded in the target executable.

Next time
For next week, we will go into more detail:
- the CUDA runtime API;
- writing efficient CUDA code;
- a look at some important examples.