CS377P Programming for Performance, Assignment 4
Due: November 25, 2015, 7 PM

Submit your answers as a PDF file to Canvas.

Submit code to the CS377P Gitolite service.

Do not repeat the text of the questions in your answer sheet.

Do not include scanned images or photographs. Use short, complete sentences when composing your answers.

All code submitted in this assignment must be your own.

This assignment requires use of CUDA. In particular, you must know how to:

1. Compile CUDA programs (use nvcc)

2. Use textures

Your guide to CUDA programming is the NVIDIA CUDA C Programming Guide and the NVIDIA CUDA Runtime API.

If you run into issues, post on Piazza, talk to your TA or instructor. Start early!

Exercise 1

Implement a GPU vector addition program. This adds two float vectors a and b to form a separate vector c. You can initialize a and b to be random vectors. Check (on the CPU) that c contains the sum of a and b.

Your kernel should run correctly for any legal number of threads per block and any legal number of blocks.

Use CUDA events to time the kernel. See the section Timing using CUDA events in this blog post.
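A minimal sketch of such a kernel and its event-based timing is shown below. The names vecAdd and n, the 240 x 256 launch configuration, and the omitted initialization are illustrative only, not required by the assignment; a grid-stride loop is one way to keep the kernel correct for any legal launch configuration. Error checking is omitted for brevity.

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        // Grid-stride loop: each thread strides over the whole array, so the
        // kernel is correct for any legal (blocks, threads per block) pair.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            c[i] = a[i] + b[i];
    }

    int main() {
        const int n = 1 << 20;                   // 1048576 floats
        size_t bytes = n * sizeof(float);
        float *a, *b, *c;
        cudaMalloc(&a, bytes);
        cudaMalloc(&b, bytes);
        cudaMalloc(&c, bytes);
        // ... fill a and b with random values (e.g., cudaMemcpy from host arrays) ...

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        cudaEventRecord(start);
        vecAdd<<<240, 256>>>(a, b, c, n);        // example launch configuration
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);  // elapsed kernel time in milliseconds
        printf("kernel time: %f ms\n", ms);

        // ... copy c back and verify the sum on the CPU ...

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(a);
        cudaFree(b);
        cudaFree(c);
        return 0;
    }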

1. Submit the code to the Gitolite service.

2. Let a and b contain 1048576 floats. Fix the threads per block to 256. Vary the number of blocks from 1 to 4096 in any suitable increment. Is there a point after which varying the number of blocks no longer affects performance? Plot a graph of number of blocks (x-axis) vs runtime (y-axis).

3. Let a and b contain 1048576 floats. Fix the number of thread blocks to 240. Vary the threads per block from 32 to 1024 in steps of 64. Plot a graph of number of threads (x-axis) vs runtime (y-axis). State your inferences. Also, compare against the graph in the previous question.

4. Declare a and b to be const __restrict__ arguments to the GPU kernel. Pick an interesting (blocks, threads per block) pair from the above experiments and time this version of the kernel. Do you see any performance change?
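For item 4, one possible signature is sketched below, assuming the grid-stride kernel from the earlier sketch (the name vecAddRestrict is illustrative). The const and __restrict__ qualifiers assert that the pointers do not alias, which may let the compiler schedule the input loads more aggressively, for example through the read-only data path on newer GPUs.

    __global__ void vecAddRestrict(const float * __restrict__ a,
                                   const float * __restrict__ b,
                                   float * __restrict__ c, int n) {
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
             i += gridDim.x * blockDim.x)
            c[i] = a[i] + b[i];
    }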

Exercise 2

Implement 2D convolution on the GPU. Reuse the image reading/writing code from Assignment 2 and the convolution matrices and images.

Create texture objects (not texture references) for each of the channels. Use cudaAddressModeBorder instead of padding. See Texture memory in the NVIDIA CUDA Programming Guide for an explanation of these terms. Sample code for texture objects is available on the Parallel ForAll blog.
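A hedged sketch of the texture-object setup for one channel follows. The helper name makeChannelTexture and the parameters hostChannel, width, and height are assumptions; the channel is assumed to be width * height unsigned chars in row-major order, and error checking is omitted.

    #include <cuda_runtime.h>

    cudaTextureObject_t makeChannelTexture(const unsigned char *hostChannel,
                                           int width, int height) {
        cudaChannelFormatDesc desc = cudaCreateChannelDesc<unsigned char>();
        cudaArray_t arr;
        cudaMallocArray(&arr, &desc, width, height);
        cudaMemcpy2DToArray(arr, 0, 0, hostChannel, width, width, height,
                            cudaMemcpyHostToDevice);

        cudaResourceDesc resDesc = {};
        resDesc.resType = cudaResourceTypeArray;
        resDesc.res.array.array = arr;

        cudaTextureDesc texDesc = {};
        texDesc.addressMode[0] = cudaAddressModeBorder;   // out-of-image reads return 0,
        texDesc.addressMode[1] = cudaAddressModeBorder;   //   so no explicit padding is needed
        texDesc.filterMode     = cudaFilterModePoint;     // no interpolation
        texDesc.readMode       = cudaReadModeElementType; // return the raw unsigned char
        texDesc.normalizedCoords = 0;                     // use integer pixel coordinates

        cudaTextureObject_t tex = 0;
        cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);
        return tex;
    }

    // Inside a kernel, a pixel is then read with:
    //   unsigned char v = tex2D<unsigned char>(tex, x, y);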

Store the convolution matrix in constant memory. Pick appropriate threads per block and number of blocks. Use CUDA events to measure the kernel runtime.
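The constant-memory declaration and copy could look like the sketch below; K and d_filter are illustrative names, and the filter width K = 3 is only an example.

    #define K 3                                  // example filter width

    __constant__ int d_filter[K * K];            // convolution matrix in constant memory

    // Host side, before launching the kernel:
    //   int h_filter[K * K] = { /* matrix from Assignment 2 */ };
    //   cudaMemcpyToSymbol(d_filter, h_filter, sizeof(h_filter));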

Your code must operate on char or int, not floats.
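Putting the pieces together, the kernel itself might look like the sketch below. It reuses the illustrative tex, d_filter, and K from the sketches above, reads unsigned chars through the texture, and accumulates in int; the names convolve and out are also assumptions.

    __global__ void convolve(cudaTextureObject_t tex, int *out,
                             int width, int height) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per output pixel
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;

        int sum = 0;
        for (int j = 0; j < K; j++)
            for (int i = 0; i < K; i++) {
                // cudaAddressModeBorder makes reads outside the image return 0.
                int pixel = tex2D<unsigned char>(tex, x + i - K / 2, y + j - K / 2);
                sum += d_filter[j * K + i] * pixel;
            }
        out[y * width + x] = sum;
    }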

1. Submit the code to the Gitolite service.

2. List the runtimes for each input. Do not include memory transfer times.

END.

Fall 2015