GPUs: Overview of Architecture and
Programming Options
Lee Barford
firstname dot lastname at gmail dot com
Outline
Why parallel computing is now important
What GPUs are and what they provide
Overview of GPU architecture
• Enough to orient the discussion of programming them
• Future changes
Three “languages” for programming GPUs
• Those we’re not covering include CUDA Fortran, Python CUDA & CL bindings, and WebCL
[Figure: serial app performance, showing an exponentially growing gap. Graph from UC Berkeley ParLab.]
Graphics Processor (GPU) as Parallel Accelerator
• Commodity priced, massively parallel floating point
• Claimed speedups on various problems: 50-2500x over a CPU running serial code
Graph from http://drdobbs.com/high-performance-computing/231500166
The GPU as a Co-Processor to the CPU: The physical and logical connections
[Diagram. Components: CPU, main memory, chipset, GPU, GPU memory, and I/Os (video, Ethernet, USB hub, FireWire, …). The PCIe link is marked slow and carries control actions and code (kernels) to run.]
Running GPU code is like requesting asynchronous I/O
0.5-3 years from now: Fusion of CPU and GPU
[Diagram labels: CPU (multiple cores), GPU, hardware task scheduler, main memory, I/O subsystem]
Running GPU code will be like queuing method pointers for future execution (as in C++11, TBB, TPL, PPL).
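One way to picture queuing work for future execution is a list of callables that a scheduler runs later. The sketch below is a plain C++11 illustration of that idea only. `TaskQueue`, `pend`, and `run_all` are hypothetical names, and a real hardware task scheduler (or TBB, TPL, PPL) would run tasks in parallel rather than in this single-threaded loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical, single-threaded stand-in for a task scheduler:
// work is queued now and executed later, when run_all() is called.
class TaskQueue {
public:
    // Queue a callable for future execution; nothing runs yet.
    void pend(std::function<void()> task) {
        assert(task && "task must be callable");
        tasks_.push(std::move(task));
    }

    std::size_t pending() const { return tasks_.size(); }

    // Run every queued task in FIFO order; returns how many ran.
    std::size_t run_all() {
        std::size_t count = 0;
        while (!tasks_.empty()) {
            std::function<void()> task = std::move(tasks_.front());
            tasks_.pop();
            task();
            ++count;
        }
        return count;
    }

private:
    std::queue<std::function<void()>> tasks_;
};

// Example "kernel": scale every element of a vector in place.
inline void scale(std::vector<float>& data, float factor) {
    for (float& x : data) x *= factor;
}
```

For example, `queue.pend([&v] { scale(v, 2.0f); });` leaves `v` untouched until `queue.run_all()` is called.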
Programming Tomorrow’s CPU will be Like Programming Today’s GPU
• GPUs that compute will come “for free” with computers
• Slow step of moving data to/from GPU will be eliminated
• Hardware task scheduler for both CPU and GPU will:
  – Almost eliminate OS & I/O overhead for invoking GPU kernels
  – Also almost eliminate OS overhead for invoking parallel tasks on CPU
• AMD laptop chip available now (but no boards/systems)
• NVIDIA GPU+ARM chip available now for battery-operated devices
• Both promise desktop chips in the next year or two
• Programming models will probably evolve from what we’ll cover
• Course will use current, PCIe-based GPUs
• We will be dealing with overheads that will pass away over the next few years
CUDA (NVIDIA) GPU Compute Architecture: Many Simple, Floating-Point Cores
32 cores (one Streaming Multiprocessor, or SM):
• Share an instruction stream
• Share registers
• Execute the same program (kernel)
• SPMD: roughly the same place in the same kernel at the same time
• Act as 100s to 1000s more cores by switching context instead of waiting for memory
1000s of virtual cores execute the same lines of code together, but share limited resources
Cores organized into groups
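The SPMD idea can be sketched on the CPU: one kernel function is applied to every index, and each invocation uses its index to pick its own data. The C++ below is a serial emulation under that assumption, not GPU code. `launch`, `saxpy_kernel`, and the `thread_id` parameter are illustrative names. On a real GPU the inner loop would run on a group of cores at once.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One "thread" of the kernel: computes a single output element.
// Every thread runs this same code; only thread_id differs.
inline void saxpy_kernel(std::size_t thread_id, float a,
                         const std::vector<float>& x,
                         std::vector<float>& y) {
    // Guard: threads past the end of the data do nothing.
    if (thread_id < x.size() && thread_id < y.size()) {
        y[thread_id] = a * x[thread_id] + y[thread_id];
    }
}

// Serial emulation of a launch: call the kernel once per virtual thread.
// Threads are issued in whole groups of group_size, so the thread count is
// rounded up to a multiple of group_size and kernels must bounds-check.
template <typename Kernel, typename... Args>
void launch(std::size_t num_threads, std::size_t group_size,
            Kernel kernel, Args&&... args) {
    assert(group_size > 0);
    for (std::size_t base = 0; base < num_threads; base += group_size) {
        for (std::size_t lane = 0; lane < group_size; ++lane) {
            kernel(base + lane, args...);
        }
    }
}
```

A call such as `launch(x.size(), 32, saxpy_kernel, 2.0f, x, y);` updates every element of `y` with the same code, which is the pattern the later programming environments express directly.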
GPU has multiple SMs
• SMs run in parallel
• Different SMs do not need to be executing the same location in the same program at the same time
• In aggregate, many 1000s of parallel copies of the same kernel running simultaneously
• Total of up to 1 Tflop/s at peak
CENTRAL SOFTWARE ISSUE:
• How to generate and control this much parallelism
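A peak figure such as "up to 1 Tflop/s" is usually just a product: number of SMs × cores per SM × clock rate × floating-point operations per core per cycle. The sketch below works that arithmetic in C++. The example numbers (14 SMs, 32 cores each, 1.15 GHz, 2 flops per cycle from a fused multiply-add) are illustrative assumptions, not values taken from these slides.

```cpp
#include <cassert>

// Theoretical peak in Gflop/s, assuming every core retires flops_per_cycle
// floating-point operations on every clock cycle (not reached in practice).
inline double peak_gflops(int num_sms, int cores_per_sm,
                          double clock_ghz, int flops_per_cycle) {
    assert(num_sms > 0 && cores_per_sm > 0 && flops_per_cycle > 0);
    return static_cast<double>(num_sms) * cores_per_sm * clock_ghz * flops_per_cycle;
}
```

With the assumed numbers, `peak_gflops(14, 32, 1.15, 2)` gives about 1030 Gflop/s, or roughly 1 Tflop/s.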
GPUs: Programming Options
• Libraries: called from CPU code. Write no GPU code. Examples:
  • Image/video processing, dense & sparse matrix, FFT, random numbers
• Generic programming for GPU
  • Thrust
    • Like C++ Standard Template Library
    • Specialize & use built-in data structures and algorithms
    • NVIDIA GPUs only
• Programming the GPU directly
  • CUDA C/C++, OpenCL, WebCL, CUDA Fortran, various Python libraries
  • Write code that runs on GPU (kernels)
  • Write CPU code that directly controls and coordinates
    – Data movement between CPU memory and GPU memory
    – Startup of kernels on GPU
    – CPU processing of results from GPU when they become available
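The three coordination duties of the CPU code (data movement, kernel startup, processing of results) can be sketched without a GPU. The C++ below is a CPU-only emulation in which `DeviceBuffer`, `copy_to_device`, `launch_square_kernel`, and `copy_to_host` are hypothetical stand-ins. In CUDA C/C++ the corresponding steps would typically use `cudaMalloc`, `cudaMemcpy`, and a kernel launch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for GPU memory: a buffer the CPU side reaches only by copying.
struct DeviceBuffer {
    std::vector<float> storage;
};

// Step 1: move input from CPU memory to GPU memory (the slow PCIe transfer).
inline DeviceBuffer copy_to_device(const std::vector<float>& host) {
    return DeviceBuffer{host};
}

// Step 2: start a kernel on the GPU; this one squares each element.
inline void launch_square_kernel(DeviceBuffer& buf) {
    for (std::size_t i = 0; i < buf.storage.size(); ++i) {  // one index per GPU thread
        buf.storage[i] *= buf.storage[i];
    }
}

// Step 3: bring results back so the CPU can process them.
inline std::vector<float> copy_to_host(const DeviceBuffer& buf) {
    return buf.storage;
}

// CPU code that controls and coordinates the three steps.
inline float sum_of_squares(const std::vector<float>& input) {
    DeviceBuffer device = copy_to_device(input);
    launch_square_kernel(device);
    const std::vector<float> result = copy_to_host(device);
    assert(result.size() == input.size());
    float total = 0.0f;
    for (float v : result) total += v;  // CPU processing of GPU results
    return total;
}
```

The emulation runs the steps one after another. On real hardware the copies and the kernel can be issued asynchronously, so the CPU can do other work until the results become available.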
CUDA C/C++ vs OpenCL
CUDA C/C++
• Proprietary (NVIDIA)
• Code runs on NVIDIA GPUs
• Reportedly 10-50% faster than OpenCL
• Compiles at build time to binary code for particular targeted hardware
  • Specific NVIDIA hardware architecture versions
• No compiler available at run time
OpenCL
• Open standard (Khronos)
• Code runs on NVIDIA & AMD GPUs, x86 multicore, FPGAs (academic research) at the same time
• Compiles at build time to intermediate form that is compiled at run time for the hardware that is present
• Compiler is available at run time
  • Can execute downloaded or dynamically generated source code
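Because the OpenCL compiler is present at run time, kernel source can be an ordinary string that the host program builds on the fly. The sketch below shows only that string-generation step in C++. `make_scale_kernel_source` and the kernel name `scale` are hypothetical. In a real program the result would be passed to the OpenCL host API (`clCreateProgramWithSource`, then `clBuildProgram`), which is not called here because it needs an OpenCL installation.

```cpp
#include <cassert>
#include <cmath>
#include <string>

// Build OpenCL C source for a kernel that multiplies each element by a
// constant chosen at run time and baked into the generated code.
inline std::string make_scale_kernel_source(float factor) {
    assert(std::isfinite(factor));  // "inf"/"nan" would not be valid literals
    return
        "__kernel void scale(__global float* data, const unsigned int n) {\n"
        "    size_t i = get_global_id(0);\n"
        "    if (i < n) data[i] *= " + std::to_string(factor) + "f;\n"
        "}\n";
}
```

Baking the constant into the source lets the run-time compiler specialize the kernel for it. CUDA C/C++ as described above cannot do this because no compiler is available at run time.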
The Three Programming Environments We’ll Cover
OpenCL:
• Write once, run many
• Supports heterogeneous parallel machines (fusion)
• Tool chains good enough for research
• IMHO, will eventually replace CUDA C/C++
CUDA C/C++:
• Very efficient code
• Lots of fussy detail to get that efficiency
• Robust tool chains for Linux, Windows, MacOS
• Specific to NVIDIA
Thrust:
• Easy to write
• Algorithms provided among the fastest (e.g., sort)
• NVIDIA GPUs only
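Since Thrust is modeled on the C++ Standard Template Library, the flavor of Thrust code can be previewed with the STL itself. The sketch below uses only standard C++ on the CPU, and `sort_square_sum` is a hypothetical helper name. It is not Thrust code, just the same style of composing generic containers and algorithms.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Sorts `values` ascending, squares each element in place, and returns
// the sum of the squares, using only generic STL algorithms.
inline float sort_square_sum(std::vector<float>& values) {
    std::sort(values.begin(), values.end());
    assert(std::is_sorted(values.begin(), values.end()));
    std::transform(values.begin(), values.end(), values.begin(),
                   [](float v) { return v * v; });
    return std::accumulate(values.begin(), values.end(), 0.0f);
}
```

In Thrust the corresponding building blocks are `thrust::device_vector`, `thrust::sort`, `thrust::transform`, and `thrust::reduce`. The overall structure of the code stays much the same while the work runs on the GPU.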
Class Project Idea
• Accurate edge finding in a 1D signal
• Journal paper published on multicore version
• Student project last year doing Thrust implementation
• Project: Do CUDA version + performance tests
• Paper combining previous student’s work with above: 60% probability of getting accepted in a particular IEEE conference
• 3 co-authors, including previous student & Lee
• Extended abstract due: Nov 6
• Class project due during finals, same as everyone else
• Camera ready paper due: March 4
• See or email me in the next week or two if interested
Questions