GPU Architecture and Programming. GPU vs CPU

Transcript of GPU Architecture and Programming.

Page 1:

GPU Architecture and Programming

Page 2:

GPU vs CPU: https://www.youtube.com/watch?v=fKK933KK6Gg

Page 3:

GPU Architecture

• GPUs (Graphics Processing Units) were originally designed as graphics accelerators, used for real-time graphics rendering.

• Starting in the late 1990s, the hardware became increasingly programmable, culminating in NVIDIA's first GPU in 1999.

Page 4:

• CPU + GPU is a powerful combination:
– CPUs consist of a few cores optimized for serial processing.
– GPUs consist of thousands of smaller, more efficient cores designed for parallel performance.
– Serial portions of the code run on the CPU, while parallel portions run on the GPU.

Page 5:

Architecture of GPU

Image copied from http://www.pgroup.com/lit/articles/insider/v2n1a5.htm
Image copied from http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf

Page 6:

CUDA Programming

• CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA for its GPUs.

• By using CUDA, you can write programs that directly access the GPU.

• The CUDA platform is accessible to programmers through CUDA libraries and extensions to programming languages such as C, C++, and Fortran.
– C/C++ programmers use “CUDA C/C++”, compiled with the nvcc compiler.
– Fortran programmers can use CUDA Fortran, compiled with the PGI CUDA Fortran compiler.

Page 7:

• Terminology:
– Host: the CPU and its memory (host memory)
– Device: the GPU and its memory (device memory)

Page 8:

Programming Paradigm

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf

Each parallel function of the application is executed as a kernel.

Page 9:

Programming Flow

1. Copy input data from CPU memory to GPU memory
2. Load the GPU program and execute it
3. Copy results from GPU memory back to CPU memory
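These three steps map directly onto CUDA API calls. A minimal sketch of the flow (not taken from the slides; the kernel double_elements and all variable names are illustrative only):

#include <stdio.h>

// Trivial kernel used only to illustrate the flow: each thread doubles one element.
__global__ void double_elements(int *data) {
    data[threadIdx.x] *= 2;
}

int main(void) {
    const int N = 8;
    int h_data[N];
    for (int i = 0; i < N; ++i) h_data[i] = i;

    int *d_data;
    cudaMalloc((void **)&d_data, N * sizeof(int));

    // 1. Copy input data from CPU (host) memory to GPU (device) memory
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);

    // 2. Load the GPU program (kernel) and execute it
    double_elements<<<1, N>>>(d_data);

    // 3. Copy results from GPU memory back to CPU memory
    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    for (int i = 0; i < N; ++i) printf("%d ", h_data[i]);
    printf("\n");
    return 0;
}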

Page 10:

• Each parallel function of the application is executed as a kernel.

• That means GPUs are programmed as a sequence of kernels; typically, each kernel completes execution before the next kernel begins.

• Fermi has some support for multiple, independent kernels to execute simultaneously, but most kernels are large enough to fill the entire machine.

Page 11:

Image copied from http://people.maths.ox.ac.uk/~gilesm/hpc/NVIDIA/NVIDIA_CUDA_Tutorial_No_NDA_Apr08.pdf

Page 12:

Hello World! Example

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf

__global__ is a CUDA C/C++ keyword meaning:
• mykernel() will be executed on the device
• mykernel() will be called from the host
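The slide's code is only referenced by URL above; a minimal sketch of the classic CUDA "Hello World!" along these lines, using the mykernel() name mentioned in the bullets:

#include <stdio.h>

// Empty kernel: __global__ marks it as device code that is callable from the host.
__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1, 1>>>();      // launch the kernel on the device with 1 block of 1 thread
    printf("Hello World!\n");  // printed by the host
    return 0;
}

The kernel body is empty; the point is that nvcc separates host code (main, printf) from device code (mykernel), and the <<<1,1>>> syntax launches the kernel on the device.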

Page 13:

Addition Example

• Since add() runs on the device, the pointers a, b, and c must point to device memory.

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
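The kernel itself appears only in the linked slide; based on the bullet above, a sketch is:

// Device code: a, b, and c must point to device memory.
__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;
}

Because a, b, and c are device pointers, the host cannot dereference them directly; it must use cudaMemcpy to move the values in and out.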

Page 14:

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
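This slide presumably shows the host side; a reconstruction (not the slide's exact code) that reuses the add() kernel sketched above might look like:

#include <stdio.h>

int main(void) {
    int a = 2, b = 7, c;       // host copies of a, b, c
    int *d_a, *d_b, *d_c;      // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for the device copies
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to the device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() on the GPU with one block of one thread
    add<<<1, 1>>>(d_a, d_b, d_c);

    // Copy the result back to the host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    printf("%d + %d = %d\n", a, b, c);

    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}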

Page 15:

Vector Addition Example

Kernel Function:

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
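The kernel code is only referenced by URL; assuming the block-indexed formulation used in the referenced deck (an assumption here), it looks like:

// Each block handles one element: the block index selects the array element.
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

With this indexing, the kernel is launched with N blocks of one thread each: add<<<N, 1>>>(d_a, d_b, d_c).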

Page 16:

main:

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
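Again, only the image reference survives; a self-contained reconstruction of the host code for the block-indexed kernel (array contents and N chosen arbitrarily here) might be:

#include <stdio.h>
#include <stdlib.h>

#define N 512

// Block-indexed kernel from the previous page
__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

int main(void) {
    int *a, *b, *c;            // host copies
    int *d_a, *d_b, *d_c;      // device copies
    int size = N * sizeof(int);

    // Allocate device memory
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Allocate and initialize host memory
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    // Copy inputs to the device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() with N blocks of 1 thread each
    add<<<N, 1>>>(d_a, d_b, d_c);

    // Copy the result back to the host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    printf("c[1] = %d, c[N-1] = %d\n", c[1], c[N - 1]);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}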

Page 17:

Alternative 1:

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
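What Alternative 1 shows is not recoverable from the transcript; a plausible reading (a guess, not confirmed by the source) is the thread-indexed variant, where a single block of N threads does the work:

// Each thread in a single block handles one element.
__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

// Launched as: add<<<1, N>>>(d_a, d_b, d_c);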

Page 18:

Alternative 2:

With M threads per block, each thread's global index is:

int globalThreadId = threadIdx.x + blockIdx.x * M;          // M is the number of threads in a block

int globalThreadId = threadIdx.x + blockIdx.x * blockDim.x; // equivalently, using the built-in blockDim.x

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf

Page 19:

• So the kernel becomes

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
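The kernel is only referenced by URL; a sketch of the combined-index kernel described above:

// Each thread computes its global index and handles one element.
__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}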

Page 20:

• The main becomes

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
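Only the image reference survives; a sketch of the corresponding host code, reusing the combined-index add() kernel above and assuming N is an exact multiple of THREADS_PER_BLOCK (both names illustrative):

#include <stdlib.h>

#define N (2048 * 2048)
#define THREADS_PER_BLOCK 512

// assumes the combined-index add() kernel from the previous page

int main(void) {
    int *a, *b, *c;            // host copies
    int *d_a, *d_b, *d_c;      // device copies
    int size = N * sizeof(int);

    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    a = (int *)malloc(size); b = (int *)malloc(size); c = (int *)malloc(size);
    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch with N/THREADS_PER_BLOCK blocks of THREADS_PER_BLOCK threads each
    add<<<N / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}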

Page 21:

Handling Arbitrary Vector Sizes

Copy from http://on-demand.gputechconf.com/gtc-express/2011/presentations/GTC_Express_Sarah_Tariq_June2011.pdf
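The slide itself is only referenced by URL; the standard technique, sketched under the assumption that it matches the referenced deck, is to pass the vector length n to the kernel, guard the array access, and round the block count up:

// Guard against the extra threads in the last block when n is not a multiple of the block size.
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)
        c[index] = a[index] + b[index];
}

// With M threads per block, launch enough blocks to cover all N elements:
// add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);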