Trial Lecture
The Use of GPUs for High-Performance Computing
12 October 2010
Magnus Jahre
Graphics Processors (GPUs)
• Modern computers are graphics intensive
• Advanced 3D graphics require a significant amount of computation
• Solution: add a dedicated graphics processor (GPU)
Graphics card (Source: nvidia.com)
High-Performance Computing
High-Performance Computing (HPC): efficient use of computers for computationally intensive problems in science or engineering
General-Purpose Programming on GPUs (GPGPU)
(Figure: application domains plotted by processing demand against communication demand — office applications sit low on both axes, while weather forecasting, climate modeling, molecular dynamics simulation, and computational computer architecture have high demands. A third dimension is main memory capacity.)
Outline
• GPU Evolution
• GPU Programming
• GPU Architecture
• Achieving High GPU Performance
• Future Trends
• Conclusions
GPU EVOLUTION
First GPUs: Fixed Hardware
(Figure: fixed-function graphics pipeline — Vertex Data and Texture Maps feed Vertex Processing → Rasterization → Fragment Processing → Framebuffer Operations, which write the Depth Buffer and Color Buffer.) [Blythe 2008]
Programmable Shaders
Motivation: more flexible graphics processing
(Figure: the same pipeline, with the Vertex Processing and Fragment Processing stages now programmable.)
GPGPU with Programmable Shaders
• Use a graphics library to gain access to the GPU
• Encode data as color values
• The effect of the fixed-function stages must be accounted for
(Figure: the programmable pipeline repurposed for general-purpose computation.)
Functional Unit Utilization
(Figure: the pipeline contains separate vertex processing and fragment processing units; a workload that stresses one stage leaves the other idle.)
Functional Unit Utilization
(Figure: a vertex-intensive shader leaves the fragment units idle, and a fragment-intensive shader leaves the vertex units idle; a unified shader keeps all units busy under both workloads.)
Unified Shader Architecture
• Exploit parallelism
– Data parallelism
– Task parallelism
• Data-parallel processing (SIMD/SIMT)
• Hide memory latencies
• High bandwidth
The architecture naturally supports GPGPU
(Figure: clusters of streaming processors (SP), each cluster paired with a local memory, fed by a thread scheduler and connected through an interconnect to on-chip memory or cache and to off-chip DRAM memory.)
GPU PROGRAMMING
GPGPU Tool Support
(Figure: timeline 2000–2010. Programmable shaders give way to unified shaders; GPGPU tools appear in succession — Sh, PeakStream, Accelerator, GPU++, CUDA, OpenCL — and the number of GPU papers at the Supercomputing conference rises over the period.)
Compute Unified Device Architecture (CUDA)
• Most code is normal C++ code
• Code that runs on the GPU is organized in kernels
• The CPU sets up and manages the computation

```cuda
__global__ void vector_add(float* a, float* b, float* c) {
    int idx = threadIdx.x;
    c[idx] = a[idx] + b[idx];
}

int main() {
    int N = 512;
    // ...
    vector_add<<<1, N>>>(a_d, b_d, c_d);
    // ...
}
```
Thread/Data Organization
• Hierarchical thread organization
– Grid
– Block
– Thread
• A block can have a maximum of 512 threads
• 1D, 2D and 3D mappings are possible
(Figure: a grid shown as a 1D arrangement of blocks (0)–(1) and as a 2D arrangement of blocks (0,0)–(1,2).)
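The grid/block/thread hierarchy determines which data element each thread works on. A minimal CPU-side sketch (the function wrapper is invented here for illustration; the index formula itself is the standard CUDA idiom) of how a 1D kernel would compute its global index:

```cpp
#include <cassert>

// Sketch: how a CUDA kernel typically derives a thread's global index
// from the grid/block/thread hierarchy for a 1D launch:
//   globalIdx = blockIdx.x * blockDim.x + threadIdx.x
int global_index_1d(int blockIdx_x, int blockDim_x, int threadIdx_x) {
    return blockIdx_x * blockDim_x + threadIdx_x;
}
```

With 256-thread blocks, thread 3 of block 2 maps to element 2 * 256 + 3 = 515, so a grid of blocks tiles the whole input array.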
Vector Addition Example
(Figure: vectors A and B are copied from CPU main memory into GPU global memory, the streaming processors read them through local memory and compute C, and C is copied back to main memory.)
A collection of concurrently processed threads is called a warp
Terminology: Warp
A warp is the group of threads (32 on NVIDIA hardware) that the GPU schedules and executes together
Vector Addition Profile
• Only 11% of GPU time is used to add vectors
• The arithmetic intensity of the problem is too low
• Overlapping data copy and computation could help
(Figure: share of GPU time — memcpyHtoD 58%, memcpyDtoH 32%, vector_add 11%.)
Hardware: NVIDIA NVS 3100M
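Why arithmetic intensity is the culprit can be seen with a back-of-the-envelope calculation. A sketch (assumptions: 4-byte floats, one addition per element, counting only the traffic for a, b and c):

```cpp
// Sketch: arithmetic intensity (flops per byte) of vector addition,
// assuming 4-byte floats, one add per element, and two loads plus one
// store of memory traffic per element.
double vector_add_intensity(long n) {
    double flops = static_cast<double>(n);   // one add per element
    double bytes = 3.0 * n * sizeof(float);  // load a, load b, store c
    return flops / bytes;
}
```

At 1/12 of a flop per byte, memory traffic dominates regardless of how large N grows, which is consistent with the profile above.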
Will GPUs Save the World?
• Careful optimization of both CPU and GPU code reduces the performance difference between GPUs and CPUs substantially [Lee et al., ISCA 2010]
• GPGPU has provided nice speedups for problems that fit the architecture
• Metric challenge: The practitioner needs performance per developer hour
GPU ARCHITECTURE
NVIDIA Tesla Architecture
Figure reproduced from [Lindholm et al.; 2008]
Control Flow
• The threads in a warp share the same instruction
• Branching is efficient if all threads in a warp branch in the same direction
• Divergent branches within a warp cause serial execution of both paths
(Figure: at an IF, the condition-true threads and the condition-false threads run one group after the other rather than in parallel.)
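The serialization cost can be modelled very simply. A hypothetical sketch (names and cost model invented here) of the time a warp spends at a branch:

```cpp
// Sketch: cycles a warp spends at an IF/ELSE. If every thread takes the
// same side, only that side's cost is paid; if the warp diverges, both
// paths are executed serially and both costs are paid.
int warp_branch_cycles(int n_true, int n_false,
                       int then_cycles, int else_cycles) {
    int cycles = 0;
    if (n_true > 0)  cycles += then_cycles;  // some thread takes the IF side
    if (n_false > 0) cycles += else_cycles;  // some thread takes the ELSE side
    return cycles;
}
```

A uniform warp pays only one path; a warp split 16/16 pays the sum of both paths.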
Modern DRAM Interfaces
• Maximize bandwidth with a 3D organization (banks × rows × columns)
• Repeated requests to the row buffer are very efficient
(Figure: within a DRAM bank, the row address selects a row and loads it into the row buffer; the column address then selects data within the buffer.)
Access Coalescing
• Global memory accesses from all threads in a half-warp are combined into a single memory transaction
• All memory elements in a segment are accessed
• The segment size can be halved if only the lower or upper half is used
Assumes Compute Capability 1.2 or higher
(Figure: threads 0–7 accessing consecutive addresses in the range 112–156 span two aligned segments and are therefore served by two transactions.)
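The transaction count can be estimated by bucketing the addresses into aligned segments. A sketch under the assumption that each transaction serves exactly one aligned segment of a fixed size:

```cpp
#include <set>
#include <vector>

// Sketch: number of memory transactions needed for one half-warp's
// accesses, assuming each transaction serves one aligned segment of
// `segment_bytes` bytes.
int count_transactions(const std::vector<int>& byte_addrs, int segment_bytes) {
    std::set<int> segments;                       // distinct segment indices
    for (int a : byte_addrs)
        segments.insert(a / segment_bytes);       // aligned segment of address
    return static_cast<int>(segments.size());
}
```

Eight threads reading floats at addresses 128–156 fall inside one 128-byte segment (one transaction); shifting the same pattern to start at 112 makes it straddle two segments and doubles the traffic.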
Bank Conflicts
• Memory banks can service requests independently
• Bank conflict: more than one thread accesses the same bank concurrently
• Strided access patterns can cause bank conflicts
(Figure: threads 0–7 mapped onto banks 0–7; with a stride of two, the eight threads hit only the even banks, two threads per bank.)
Stride-two accesses give a 2-way bank conflict
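The conflict degree of an access pattern can be computed from the bank mapping. A sketch (assuming successive 32-bit words live in successive banks, with 8 banks as in the figure):

```cpp
#include <map>
#include <vector>

// Sketch: worst-case bank conflict degree for one set of shared-memory
// word accesses, assuming word i lives in bank (i % n_banks).
int bank_conflict_degree(const std::vector<int>& word_addrs, int n_banks) {
    std::map<int, int> per_bank;   // bank index -> number of accesses
    int worst = 1;
    for (int w : word_addrs) {
        int hits = ++per_bank[w % n_banks];
        if (hits > worst) worst = hits;
    }
    return worst;                  // 1 means conflict-free
}
```

With 8 banks, stride-1 accesses from 8 threads are conflict-free (degree 1), while stride-2 accesses hit each even bank twice — the 2-way conflict on the slide.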
NVIDIA Fermi
• Next-generation computing chip from NVIDIA
• Aims to alleviate important bottlenecks
– Improved double-precision floating-point support
– Cache hierarchy
– Concurrent kernel execution
• More problems can be solved efficiently on a GPU
Figure reproduced from [NVIDIA; 2010]
ACHIEVING HIGH GPU PERFORMANCE
Which problems fit the GPU model?
• Fine-grained data parallelism available
• Sufficient arithmetic intensity
• Sufficiently regular data access patterns
It’s all about organizing data: optimized memory system use enables high performance
Increase Computational Intensity
• Memory types:
– On-chip shared memory: small and fast
– Off-chip global memory: large and slow
• Technique: Tiling
– Choose the tile size such that a tile fits in shared memory
– Increases locality by reducing the reuse distance
(Figure: A × B = C computed tile by tile; each tile of A and B is loaded once and reused for many elements of C.)
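The same tiling idea can be sketched on the CPU (this is an illustrative CPU analogue, not the GPU kernel itself; on a GPU the tiles would be staged in shared memory rather than relying on the cache):

```cpp
#include <vector>

// Sketch of tiling: an N×N matrix multiply processed in T×T tiles, so each
// loaded tile of A and B is reused for a whole tile of C before eviction.
// Matrices are row-major; T and N are illustrative parameters.
std::vector<float> tiled_matmul(const std::vector<float>& A,
                                const std::vector<float>& B, int N, int T) {
    std::vector<float> C(N * N, 0.0f);
    for (int ii = 0; ii < N; ii += T)
        for (int kk = 0; kk < N; kk += T)
            for (int jj = 0; jj < N; jj += T)
                // Multiply-accumulate one T×T tile of C.
                for (int i = ii; i < ii + T && i < N; ++i)
                    for (int k = kk; k < kk + T && k < N; ++k)
                        for (int j = jj; j < jj + T && j < N; ++j)
                            C[i * N + j] += A[i * N + k] * B[k * N + j];
    return C;
}
```

The result is identical to the untiled triple loop; only the traversal order (and hence the reuse distance) changes.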
Memory Layout
• Exploit coalescing to achieve high bandwidth
• Linear access is necessary
• Solution: Tiling
(Figure: A × B = C with row-major storage; traversing A along a row is coalesced, while traversing B down a column is not.)
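The coalescing difference follows directly from row-major index arithmetic; a small sketch (function name invented for illustration):

```cpp
// Sketch: row-major addressing. Neighbouring threads walking along a row
// touch consecutive elements (coalescable); walking down a column yields a
// stride of one full row width (not coalescable).
int row_major_index(int row, int col, int n_cols) {
    return row * n_cols + col;
}
```

For a 64-column matrix, threads reading neighbouring elements of a row differ by 1 in address, while threads reading neighbouring elements of a column differ by 64 — which is why tiles are staged through shared memory before the strided direction is traversed.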
Avoid Branching Inside Warps
(Figure: tree reduction of 8 elements, assuming 2 threads per warp (W1–W4). With interleaved pairing, the active threads stay spread across all four warps, so every iteration diverges; with contiguous pairing, the active threads fill whole warps, so only one iteration diverges.)
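Which addressing scheme diverges can be checked by counting warps that contain a mix of active and idle threads. A sketch (assuming, like the slide, a warp size of 2; the function is invented for illustration):

```cpp
#include <vector>

// Sketch: count divergent warps for one step of a tree reduction, given
// which threads are active. A warp diverges when some but not all of its
// threads are active in that step.
int divergent_warps(const std::vector<bool>& active, int warp_size) {
    int divergent = 0;
    for (std::size_t w = 0; w + warp_size <= active.size(); w += warp_size) {
        int on = 0;
        for (int t = 0; t < warp_size; ++t)
            if (active[w + t]) ++on;
        if (on != 0 && on != warp_size) ++divergent;  // mixed warp diverges
    }
    return divergent;
}
```

With 8 threads in warps of 2, interleaved pairing (every second thread active) leaves all four warps divergent, while contiguous pairing (the first half of the threads active) leaves none.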
Automation
• Thread resource usage must be balanced against the number of concurrent threads [Ryoo et al., PPoPP 2008]
– Avoid saturation
– The sweet spot will vary between devices
– The sweet spot varies with problem sizes
• Auto-tuning 3D FFT [Nukada et al.; SC 2009]
– Balances resource consumption against parallelism via kernel radix and ordering
– The best number of thread blocks is chosen automatically
– Inserts padding to avoid shared memory bank conflicts
Case Study: Molecular Dynamics Simulation with NAMD
Simulate the interaction of atoms due to the laws of atomic physics and quantum chemistry [Phillips; SC 2009]
Key Performance Enablers
• Careful division of labor between GPU and CPU
– GPU: short-range non-bonded forces
– CPU: long-range electrostatic forces and coordinate updates
• Overlap CPU and GPU execution through asynchronous kernel execution
• Use event recording to track progress in asynchronously executing streams
[Phillips et al., SC 2008]
CPU/GPU Cooperation in NAMD
(Figure: timeline of one simulation step. The GPU computes remote and then local forces (f) while the CPU overlaps communication; the CPU then performs the coordinate update (x) before the next step, so CPU and GPU work overlaps.) [Phillips et al., SC 2008]
Challenges
• Completely restructuring legacy software systems is prohibitive
• Batch processing software is unaware of GPUs
• Interoperability issues with pinning main memory pages for DMA
[Phillips et al., SC 2008]
FUTURE TRENDS
Accelerator Integration
• The industry is moving towards integrating CPUs and GPUs on the same chip
– AMD Fusion [Brookwood; 2010]
– Intel Sandy Bridge (fixed-function GPU)
• Are other accelerators appropriate?
– Single-chip Heterogeneous Computing: Does the future include Custom Logic, FPGAs, and GPUs? [Chung et al.; MICRO 2010]
AMD Fusion figure reproduced from [Brookwood; 2010]
Vector Addition Revisited
Start-up and shut-down data transfers are the main bottleneck
Fusion eliminates these overheads by storing values in the on-chip cache
Using accelerators becomes more feasible
Memory System Scalability
• Current CPU bottlenecks:
– The number of pins on a chip grows slowly
– Off-chip bandwidth grows slowly
• Integration only helps if there is sufficient on-chip cooperation to avoid a significant increase in bandwidth demand
• Conflicting requirements:
– GPU: high bandwidth, not latency sensitive
– CPU: high bandwidth, can be latency sensitive
CONCLUSIONS
Conclusions
• GPUs can offer a significant speedup for problems that fit the model
• Tool support and flexible architectures increase the number of problems that fit the model
• CPU/GPU on-chip integration can reduce GPU start-up overheads
Thank You
Visit our website: http://research.idi.ntnu.no/multicore/
References
• Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU; Lee et al.; ISCA; 2010
• Programming Massively Parallel Processors; Kirk and Hwu; Morgan Kaufmann; 2010
• NVIDIA’s Next Generation CUDA Compute Architecture: Fermi; White Paper; NVIDIA; 2010
• AMD Fusion Family of APUs: Enabling a Superior Immersive PC Experience; White Paper; AMD; 2010
• Multi-core Programming with OpenCL: Performance and Portability; Fagerlund; Master Thesis; NTNU; 2010
• Complexity Effective Memory Access Scheduling for Many-Core Accelerator Architectures; Yuan et al.; MICRO; 2009
• Auto-Tuning 3-D FFT Library for CUDA GPUs; Nukada and Matsuoka; SC; 2009
• Programming Graphics Processing Units (GPUs); Bakke; Master Thesis; NTNU; 2009
• Adapting a Message-Driven Parallel Application to GPU-Accelerated Clusters; Phillips et al.; SC; 2008
• Rise of the Graphics Processor; Blythe; Proceedings of the IEEE; 2008
• NVIDIA Tesla: A Unified Graphics and Computing Architecture; Lindholm et al.; IEEE Micro; 2008
• Optimization Principles and Application Performance Evaluation of a Multithreaded GPU using CUDA; Ryoo et al.; PPoPP; 2008
EXTRA SLIDES
Complexity-Effective Memory Access Scheduling
• The on-chip interconnect may interleave requests from different thread processors
• Row locality is destroyed
• Solution: an order-preserving interconnect arbitration policy combined with in-order scheduling
[Yuan et al., MICRO 2009]
(Figure: a queue of requests to rows A and B arriving interleaved. In-order scheduling serves them as A, B, A, B and pays three row switches; out-of-order scheduling reorders them to A, A, B, B and pays one.)
Goal: the performance of out-of-order scheduling with the lower complexity of in-order scheduling
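The benefit of preserving row order can be quantified by counting row switches in the request stream. A sketch (the request streams below are invented to match the slide's example):

```cpp
#include <string>
#include <vector>

// Sketch: number of DRAM row switches a scheduler incurs, counting one
// switch every time consecutive requests target different rows.
int count_row_switches(const std::vector<std::string>& rows) {
    int switches = 0;
    for (std::size_t i = 1; i < rows.size(); ++i)
        if (rows[i] != rows[i - 1]) ++switches;
    return switches;
}
```

Serving the interleaved stream A, B, A, B in order costs three row switches, while the reordered stream A, A, B, B costs one — the gap that an order-preserving arbitration policy aims to close without a full out-of-order scheduler.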