CUDA lab slides for the "Parallel Programming" course


Description

Online version: http://yszheda.github.io/CUDA-lab
I made these slides as a part-time TA for the lab course. The slides are generated with the great reveal.js.

Transcript of the CUDA lab slides for the "Parallel Programming" course

Page 1

CUDA LAB
LSALAB

Page 2

OVERVIEW
- Programming Environment
- Compile & Run a CUDA Program
- CUDA Tools
- Lab Tasks
- CUDA Programming Tips
- References

Page 3

GPU SERVER
- Intel E5-2670 v2 10-core CPU x 2
- NVIDIA K20X GPGPU card x 2

Page 4

Command to get your GPGPU HW spec:
$ /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery

Device 0: "Tesla K20Xm"
  CUDA Driver Version / Runtime Version          5.5 / 5.5
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 5760 MBytes (6039339008 bytes)
  (14) Multiprocessors, (192) CUDA Cores/MP:     2688 CUDA Cores
  GPU Clock rate:                                732 MHz (0.73 GHz)
  Memory Clock rate:                             2600 MHz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Max dimension size of a thread block (x,y,z):  (1024, 1024, 64)
  Max dimension size of a grid size   (x,y,z):   (2147483647, 65535, 65535)

theoretical memory bandwidth: $2600 \times 10^{6} \times (384 / 8) \times 2 \div 10^{9} \approx 250\ \mathrm{GB/s}$

Official HW Spec details: http://www.nvidia.com/object/tesla-servers.html

Page 5

COMPILE & RUN CUDA
Directly compile to executable code.

GPU and CPU code are compiled and linked separately

# compile the source code to an executable file
$ nvcc a.cu -o a.out

Page 6

COMPILE & RUN CUDA
The nvcc compiler translates CUDA source code into the Parallel Thread Execution (PTX) language in an intermediate phase.

# keep all intermediate-phase files
$ nvcc a.cu -keep
# or
$ nvcc a.cu -save-temps

$ nvcc a.cu -keep
$ ls
a.cpp1.ii  a.cpp4.ii    a.cudafe1.c    a.cudafe1.stub.c  a.cudafe2.stub.c  a.hash       a.out
a.cpp2.i   a.cu         a.cudafe1.cpp  a.cudafe2.c       a.fatbin          a.module_id  a.ptx
a.cpp3.i   a.cu.cpp.ii  a.cudafe1.gpu  a.cudafe2.gpu     a.fatbin.c        a.o          a.sm_10.cubin

# clean all intermediate-phase files
$ nvcc a.cu -keep -clean

Page 7

USEFUL NVCC USAGE
Print code generation statistics:

$ nvcc -Xptxas -v reduce.cu
ptxas info    : 0 bytes gmem
ptxas info    : Compiling entry function '_Z6reducePiS_' for 'sm_10'
ptxas info    : Used 6 registers, 32 bytes smem, 4 bytes cmem[1]

-Xptxas / --ptxas-options: specify options directly to the PTX optimizing assembler.

- Register count: it should be less than the number of available registers; otherwise the excess registers are spilled into local memory (off-chip).
- smem stands for shared memory.
- cmem stands for constant memory. Here constant bank #1 stores 4 bytes of constant variables.

Page 8

CUDA TOOLS
- cuda-memcheck: functional correctness checking suite
- nvidia-smi: NVIDIA System Management Interface

Page 9

CUDA-MEMCHECK
This tool checks for the following memory errors in your program, and it also reports hardware exceptions encountered by the GPU. These errors may not crash the program, but they can lead to unexpected program behavior and memory misuse.

Table: Memcheck reported error types

Name                     | Description                                                                                              | Location | Precision
Memory access error      | Errors due to out-of-bounds or misaligned accesses to memory by a global, local, shared or global atomic access. | Device   | Precise
Hardware exception       | Errors that are reported by the hardware error reporting mechanism.                                     | Device   | Imprecise
Malloc/Free errors       | Errors that occur due to incorrect use of malloc()/free() in CUDA kernels.                               | Device   | Precise
CUDA API errors          | Reported when a CUDA API call in the application returns a failure.                                     | Host     | Precise
cudaMalloc memory leaks  | Allocations of device memory using cudaMalloc() that have not been freed by the application.            | Host     | Precise
Device heap memory leaks | Allocations of device memory using malloc() in device code that have not been freed by the application. | Device   | Imprecise

Page 10

CUDA-MEMCHECK EXAMPLE

// Program with a double-free fault
int main(int argc, char *argv[])
{
    const int elemNum = 1024;
    int h_data[elemNum];
    int *d_data;
    initArray(h_data);
    int arraySize = elemNum * sizeof(int);
    cudaMalloc((void **) &d_data, arraySize);
    incrOneForAll<<< 1, 1024 >>>(d_data);
    cudaMemcpy(h_data, d_data, arraySize, cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_data);    // fault: d_data is freed twice
    printArray(h_data);
    return 0;
}
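For completeness, a rough sketch of the helper routines the snippet assumes but the slide does not show (the bodies of initArray, incrOneForAll, and printArray here are illustrative guesses, not the lab's originals):

#include <cstdio>

__global__ void incrOneForAll(int *array)
{
    array[threadIdx.x]++;              // one thread per element; launched as <<< 1, 1024 >>>
}

void initArray(int *array)
{
    for (int i = 0; i < 1024; ++i) array[i] = i;    // 1024 matches elemNum in main()
}

void printArray(const int *array)
{
    for (int i = 0; i < 1024; ++i) printf("%d ", array[i]);
    printf("\n");
}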

Page 11

CUDA-MEMCHECK EXAMPLE

$ nvcc -g -G example.cu
$ cuda-memcheck ./a.out
========= CUDA-MEMCHECK
========= Program hit error 17 on CUDA API call to cudaFree
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib64/libcuda.so [0x26d660]
=========     Host Frame:./a.out [0x42af6]
=========     Host Frame:./a.out [0x2a29]
=========     Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xfd) [0x1ecdd]
=========     Host Frame:./a.out [0x2769]
=========

No error is shown if the program is run directly, but cuda-memcheck can detect the error.

Page 12

NVIDIA SYSTEM MANAGEMENT INTERFACE (NVIDIA-SMI)

Purpose: query and modify GPU devices' state.

$ nvidia-smi
+------------------------------------------------------+
| NVIDIA-SMI 5.319.37   Driver Version: 319.37         |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K20Xm         On   | 0000:0B:00.0     Off |                    0 |
| N/A   35C    P0    60W / 235W |     84MB /  5759MB   |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Tesla K20Xm         On   | 0000:85:00.0     Off |                    0 |
| N/A   39C    P0    60W / 235W |     14MB /  5759MB   |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Compute processes:                                               GPU Memory |
|  GPU       PID  Process name                                     Usage      |
|=============================================================================|
|    0     33736  ./RS                                             69MB       |
+-----------------------------------------------------------------------------+

Page 13

NVIDIA-SMI
You can query more specific information on temperature, memory, power, etc.:

$ nvidia-smi -q -d [TEMPERATURE|MEMORY|POWER|CLOCK|...]

For example:

$ nvidia-smi -q -d POWER

==============NVSMI LOG==============
Timestamp                       :
Driver Version                  : 319.37
Attached GPUs                   : 2

GPU 0000:0B:00.0
    Power Readings
        Power Management        : Supported
        Power Draw              : 60.71 W
        Power Limit             : 235.00 W
        Default Power Limit     : 235.00 W
        Enforced Power Limit    : 235.00 W
        Min Power Limit         : 150.00 W
        Max Power Limit         : 235.00 W

GPU 0000:85:00.0
    Power Readings
        Power Management        : Supported
        Power Draw              : 31.38 W
        Power Limit             : 235.00 W
        Default Power Limit     : 235.00 W

Page 14

LAB ASSIGNMENTS
1. Program #1: increase each element in an array by one.
   (You are required to rewrite a CPU program into a CUDA one.)
2. Program #2: use parallel reduction to calculate the sum of all the elements in an array.
   (You are required to fill in the blanks of a template CUDA program, and report your GPU bandwidth to the TA after you finish each assignment.)
   1. SUM CUDA programming with "multi-kernel and shared memory"
   2. SUM CUDA programming with "interleaved addressing"
   3. SUM CUDA programming with "sequential addressing"
   4. SUM CUDA programming with "first add during load"

0.2 points per task.

Page 15

LAB ASSIGNMENT #1
Rewrite the following CPU function into a CUDA kernel function and complete the main function by yourself:

// increase each element by one
void incrOneForAll(int *array, const int elemNum)
{
    int i;
    for (i = 0; i < elemNum; ++i) {
        array[i]++;
    }
}
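As a hint of the shape such a kernel can take (a sketch only, assuming a 1-D grid/block mapping; choosing the launch configuration and writing the main function is still the assignment):

__global__ void incrOneForAll(int *array, const int elemNum)
{
    // one thread per element; the grid may be larger than elemNum, so guard the access
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < elemNum) {
        array[i]++;
    }
}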

Page 16

LAB ASSIGNMENT #2
Fill in the CUDA kernel function. Part of the main function is given; you are required to fill in the blanks according to the comments:

__global__ void reduce(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    // TODO: load the content of global memory into shared memory
    // NOTE: synchronize all the threads after this step

    // TODO: sum calculation
    // NOTE: synchronize all the threads after each iteration

    // TODO: write the result back into the corresponding entry of global memory
    // NOTE: only one thread is enough to do the job
}

// parameters for the first kernel
// TODO: set grid and block size
// threadNum = ?
// blockNum = ?
int sMemSize = 1024 * sizeof(int);
reduce<<< blockNum, threadNum, sMemSize >>>(d_idata, d_odata);
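A rough sketch of the first and last TODOs, which are common to all sub-assignments (the middle reduction loop differs per sub-assignment and is sketched under each one below); this assumes one input element per thread and one output entry per block:

unsigned int tid = threadIdx.x;
unsigned int i   = blockIdx.x * blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];              // load one element from global into shared memory
__syncthreads();                      // wait until the whole block has loaded

// ... per-block sum over sdata[] (see Assignments #2-1 to #2-4) ...

if (tid == 0) {
    g_odata[blockIdx.x] = sdata[0];   // a single thread writes the block's partial sum
}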

Page 17

LAB ASSIGNMENT #2
Given $2^{22}$ ints, and a maximum block size of $2^{10}$ threads: how can 3 kernel launches be used to synchronize between reduction iterations?

Hint: with the "first add during global load" optimization (Assignment #2-4), the third kernel is unnecessary.
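A hedged sketch of the host-side idea (the buffer names d_in, d_tmp1, d_tmp2, and d_out are illustrative): each launch reduces every block of $2^{10}$ values to one partial sum, and the implicit synchronization between kernel launches acts as the global barrier between iterations.

// 1st kernel: 2^22 ints -> 2^12 per-block partial sums
reduce<<< 4096, 1024, 1024 * sizeof(int) >>>(d_in, d_tmp1);
// 2nd kernel: 2^12 partial sums -> 4 partial sums
reduce<<< 4, 1024, 1024 * sizeof(int) >>>(d_tmp1, d_tmp2);
// 3rd kernel: 4 partial sums -> the final result
reduce<<< 1, 4, 4 * sizeof(int) >>>(d_tmp2, d_out);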

Page 18

LAB ASSIGNMENT #2-1
Implement the naïve data-parallel version as follows:
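One possible shape of the reduction loop inside reduce() for this naïve version (a sketch, assuming tid = threadIdx.x, a power-of-two block size, and sdata[] already loaded as in the template above): at each stride, a thread whose index is an even multiple of the stride adds in its neighbour.

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    if (tid % (2 * s) == 0) {
        sdata[tid] += sdata[tid + s];   // interleaved addressing with a modulo test
    }
    __syncthreads();                    // every thread must finish this stride first
}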

Page 19

LAB ASSIGNMENT #2-2
Reduce the number of active warps in your program:
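A common way to achieve this (a sketch under the same assumptions as above, not necessarily the exact template the lab expects): replace the modulo test with a strided index, so the working threads are packed into the lowest-numbered warps and the remaining warps can retire early.

for (unsigned int s = 1; s < blockDim.x; s *= 2) {
    unsigned int index = 2 * s * tid;     // contiguous thread IDs do the work
    if (index < blockDim.x) {
        sdata[index] += sdata[index + s];
    }
    __syncthreads();
}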

Page 20

LAB ASSIGNMENT #2-3
Prevent shared memory bank conflicts:
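A sketch of the usual fix (same assumptions as above): switch to sequential addressing, so consecutive threads touch consecutive shared-memory words and the conflict-prone strided access pattern disappears.

for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
        sdata[tid] += sdata[tid + s];   // consecutive threads -> consecutive banks
    }
    __syncthreads();
}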

Page 21

LAB ASSIGNMENT #2-4
Reduce the number of blocks in each kernel.
Notice:
- Only 2 kernels are needed in this case, because each kernel can now process twice as much data as before.
- Global memory should be accessed in a sequential-addressing way.
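A sketch of the "first add during load" idea (same names as the template; the index math assumes each block now covers 2 × blockDim.x input elements): each thread adds two global-memory elements while loading them, so only half as many blocks are needed.

unsigned int tid = threadIdx.x;
unsigned int i   = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];   // the first add happens during the load
__syncthreads();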

Page 22

CUDA PROGRAMMING TIPS

Page 23

KERNEL LAUNCH
mykernel<<< gridSize, blockSize, sMemSize, streamID >>>(args);

- gridSize: number of blocks per grid
- blockSize: number of threads per block
- sMemSize [optional]: shared memory size (in bytes)
- streamID [optional]: stream ID; the default is 0
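A minimal usage sketch (the kernel name mykernel, the sizes, and d_data are illustrative):

dim3 gridSize(64);                     // 64 blocks per grid
dim3 blockSize(256);                   // 256 threads per block
size_t sMemSize = 256 * sizeof(int);   // dynamic shared memory per block, in bytes
mykernel<<< gridSize, blockSize, sMemSize >>>(d_data);   // stream omitted: default stream 0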

Page 24

BUILT-IN VARIABLES FOR INDEXING IN A KERNEL FUNCTION

- blockIdx.x, blockIdx.y, blockIdx.z: block index
- threadIdx.x, threadIdx.y, threadIdx.z: thread index
- gridDim.x, gridDim.y, gridDim.z: grid size (number of blocks per grid) per dimension
- blockDim.x, blockDim.y, blockDim.z: block size (number of threads per block) per dimension
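A typical use of these variables to build a global 1-D index (a sketch; the kernel and its arguments are illustrative):

__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index across the grid
    if (i < n) {                                     // guard: the grid may overshoot n
        data[i] *= 2.0f;
    }
}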

Page 25

CUDAMEMCPY
cudaError_t cudaMemcpy(void *dst, const void *src, size_t count, enum cudaMemcpyKind kind)

Enumerator:

- cudaMemcpyHostToHost: Host -> Host
- cudaMemcpyHostToDevice: Host -> Device
- cudaMemcpyDeviceToHost: Device -> Host
- cudaMemcpyDeviceToDevice: Device -> Device
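A short usage sketch (h_data, d_data, and the size are illustrative):

int h_data[1024];
int *d_data;
size_t size = 1024 * sizeof(int);
cudaMalloc((void **) &d_data, size);
// ... fill h_data ...
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);   // host -> device
// ... launch kernels that operate on d_data ...
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(d_data);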

Page 26

SYNCHRONIZATION
- __syncthreads(): synchronizes all threads in a block (used inside a kernel function).
- cudaDeviceSynchronize(): blocks until the device has completed all preceding requested tasks (used between two kernel launches).

kernel1<<< gridSize, blockSize >>>(args);
cudaDeviceSynchronize();
kernel2<<< gridSize, blockSize >>>(args);
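For the in-kernel case, a minimal sketch of where __syncthreads() typically goes (names are illustrative): data is staged through shared memory, and the barrier guarantees every thread's write is visible before any thread reads another thread's entry.

__global__ void reverseWithinBlock(const int *in, int *out)
{
    extern __shared__ int tile[];                  // launched with blockDim.x * sizeof(int) bytes
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = in[i];
    __syncthreads();                               // all writes to tile[] are now visible
    out[i] = tile[blockDim.x - 1 - threadIdx.x];   // safe to read another thread's element
}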

Page 27

HOW TO MEASURE KERNEL EXECUTION TIME USING CUDA GPU TIMERS

Methods:

- cudaEventCreate(): init timer
- cudaEventDestroy(): destroy timer
- cudaEventRecord(): set timer
- cudaEventSynchronize(): sync timer after each kernel call
- cudaEventElapsedTime(): returns the elapsed time in milliseconds

Page 28

HOW TO MEASURE KERNEL EXECUTION TIME USING CUDA GPU TIMERS
Example:

cudaEvent_t start, stop;
float time;

cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord( start, 0 );
kernel<<< grid, threads >>>(d_idata, d_odata);
cudaEventRecord( stop, 0 );
cudaEventSynchronize( stop );

cudaEventElapsedTime( &time, start, stop );
cudaEventDestroy( start );
cudaEventDestroy( stop );

Page 30

THE END
ENJOY CUDA & HAPPY NEW YEAR!