GPU Computing with CUDA - Lecture 4: Optimizations
Christopher Cooper, Boston University
August 2011, UTFSM, Valparaíso, Chile
Outline of lecture
‣ Recap of Lecture 3
‣ Control flow
‣ Coalescing
‣ Latency hiding
‣ Occupancy
Recap
‣ Shared memory
- Small on-chip memory
- Fast (on the order of 100x faster than global memory)
- Allows communication among threads within a block
‣ Tiling
- Use shared memory as a cache
Recap
‣ Tiling examples for finite differences (FD)
- Solution 1: fetch directly from global memory at the block boundaries
- Solution 2: use halo nodes; all threads load to shared memory, but only some operate
- Solution 3: use halo nodes; load to shared memory in multiple steps, and all threads operate
Recap
‣ Bank conflicts
- Arrays in shared memory are divided into banks
- A bank conflict occurs when more than one thread in the same warp accesses different data in the same bank
- Bank conflicts serialize the access into the minimum number of conflict-free transactions (see the sketch below)
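For a quick illustration (a hedged sketch, not from the slides; the kernel name and launch size are hypothetical), the stride-2 read below causes a 2-way bank conflict on 1.x hardware with 16 banks of 32-bit words, while the unit-stride read is conflict-free:

```cuda
// Hypothetical demo kernel; launch with 64 threads per block and out[64].
__global__ void bank_conflict_demo(float *out)
{
    __shared__ float tile[64];

    tile[threadIdx.x] = (float) threadIdx.x;   // unit-stride store: conflict-free
    __syncthreads();

    // Stride-2 read: threads t and t+8 of a half warp hit the same bank
    // ((2*t) % 16 on 1.x hardware), so the access is split into two passes.
    float conflicted = tile[(2 * threadIdx.x) % 64];

    // Unit-stride read: consecutive threads hit consecutive banks.
    float conflictFree = tile[threadIdx.x];

    out[threadIdx.x] = conflicted + conflictFree;
}
```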
Control flow
‣ If statement
- Threads are executed in warps
- Within a warp, the hardware cannot execute the if branch and the else branch at the same time!
    __global__ void function()
    {
        ...
        if (condition) { ... } else { ... }
    }
Control flow
‣ How does the hardware deal with an if statement?
[Figure: a warp reaches a branch instruction; the threads that take path A and the threads that take path B are executed in separate passes. Warp divergence!]
Control flow
‣ The hardware serializes the different execution paths
‣ Recommendations
- Try to make every thread in the same warp do the same thing
‣ If the branch boundary falls at a multiple of the warp size, there is no warp divergence and each warp's instructions can be issued in one pass
- Remember that threads are placed consecutively into warps (t0-t31, t32-t63, ...)
‣ But we cannot rely on any execution order within warps
- If you can't avoid branching, try to make as many consecutive threads as possible do the same thing (see the sketch below)
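As a hedged illustration of the last two points (the kernel names and data layout below are assumptions, not from the slides): in the first kernel the branch depends on the parity of the thread index, so every warp executes both paths; in the second the branch boundary falls at a multiple of the warp size, so each warp follows a single path and there is no divergence.

```cuda
__global__ void divergent(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // Odd and even threads live in the same warp, so the warp executes
    // both branches, one after the other.
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

__global__ void divergenceFree(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // The branch boundary (32) is a multiple of the warp size: all threads
    // of a given warp take the same path, so there is no divergence.
    if (threadIdx.x < 32)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}
```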
Control flow - An illustration
‣ Let’s implement a sum reduction
‣ In a serial code, we would just loop over all elements and add
‣ Idea:
- To implement it in parallel, let every other thread add its neighbor's value to its own, in place
- Repeat on the resulting array until only one value remains
    extern __shared__ float partialSum[];   // size set at kernel launch
    int t = threadIdx.x;
    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }
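For context, a minimal complete kernel built around this loop might look like the sketch below; the kernel name, the one-element-per-thread load, and the per-block output are assumptions, not part of the slide.

```cuda
// Hypothetical wrapper: each block reduces blockDim.x elements of `in`
// into one partial result in `out`.
__global__ void reduce_naive(const float *in, float *out)
{
    extern __shared__ float partialSum[];      // blockDim.x floats, set at launch

    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    partialSum[t] = in[i];                     // one element per thread

    for (int stride = 1; stride < blockDim.x; stride *= 2)
    {
        __syncthreads();
        if (t % (2 * stride) == 0)
            partialSum[t] += partialSum[t + stride];
    }

    if (t == 0)                                // thread 0 holds the block's sum
        out[blockIdx.x] = partialSum[0];
}
```

Launched, for example, as `reduce_naive<<<numBlocks, blockSize, blockSize * sizeof(float)>>>(d_in, d_out);`.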
Control flow - An illustration
[Figure: in-place pairwise reduction over the array elements. Iteration 1: threads 0, 2, 4, ... each add a neighboring element (0+1, 2+3, ..., 10+11). Iteration 2 produces partial sums over elements 0...3, 4...7, 8...11. Iteration 3 produces sums over 0...7, 8...15.]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
Control flow - An illustration
‣ Advantages
- Correct result
- Runs in parallel
‣ Disadvantages
- The number of active threads decreases every iteration, but we keep using many more warps than needed
‣ There is warp divergence in every iteration
- No more than half of the threads in each active warp do useful work in any iteration
‣ Let's change the algorithm a little bit...
Control flow - An illustration
‣ Improved version
- Instead of adding neighbors, add pairs of values that are half a section apart
- Halve the stride after each iteration
    extern __shared__ float partialSum[];   // holds 2*blockDim.x partial sums
    int t = threadIdx.x;
    for (int stride = blockDim.x; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }
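Again as a hedged sketch (the kernel name and the two-elements-per-thread load are assumptions): a complete kernel around this loop, with each block reducing 2·blockDim.x input elements.

```cuda
// Hypothetical wrapper: each block reduces 2*blockDim.x elements of `in`.
__global__ void reduce_better(const float *in, float *out)
{
    extern __shared__ float partialSum[];      // 2*blockDim.x floats, set at launch

    int t     = threadIdx.x;
    int start = 2 * blockIdx.x * blockDim.x;
    partialSum[t]              = in[start + t];
    partialSum[t + blockDim.x] = in[start + blockDim.x + t];

    for (int stride = blockDim.x; stride > 0; stride >>= 1)
    {
        __syncthreads();
        if (t < stride)
            partialSum[t] += partialSum[t + stride];
    }

    if (t == 0)
        out[blockIdx.x] = partialSum[0];
}
```

Launched, for example, as `reduce_better<<<numBlocks, blockSize, 2 * blockSize * sizeof(float)>>>(d_in, d_out);`.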
Control flow - An illustration
[Figure: improved reduction over 32 elements. Iteration 1: thread t adds element t+16 to element t (0+16, 1+17, ..., 15+31); the following iterations halve the stride until thread 0 holds the final sum.]
Control flow - An illustration
‣ 512 elements
| Iteration | Exec. threads | Warps |
|-----------|---------------|-------|
| 1         | 256           | 16    |
| 2         | 128           | 8     |
| 3         | 64            | 4     |
| 4         | 32            | 2     |
| 5         | 16            | 1     |
| 6         | 8             | 1     |
| 7         | 4             | 1     |
| 8         | 2             | 1     |

Iterations 1-4: executing threads ≥ warp size, no divergence.
Iterations 5-8: executing threads < warp size, warp divergence!
Control flow - An illustration
‣ We get warp divergence only in the last 5 iterations
‣ Whole warps are shut down as the iterations progress
- This happens much sooner than in the previous version
- Resources are better utilized
- For the last 5 iterations, only 1 warp is still active
Control flow - Loop divergence
‣ The work per thread is data dependent
    __global__ void per_thread_sum(int *indices, float *data, float *sums)
    {
        ...
        for (int j = indices[i]; j < indices[i + 1]; j++) {
            sum += data[j];
        }
        sums[i] = sum;
    }
David Tarjan - NVIDIA
Control flow - Loop divergence
‣ The warp won't finish until its last thread finishes
- The whole warp gets dragged along
‣ Possible solution:
- Try to flatten the peaks by making each thread work on multiple data elements (one simple variant is sketched below)
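One simple way to give each thread multiple pieces of work is a grid-stride loop over the segments, sketched below. This is a hedged variant, not David Tarjan's code; it assumes the kernel is launched with fewer threads than there are segments, so each thread sums several segments and its total work averages over short and long ones.

```cuda
// Hypothetical variant of per_thread_sum: each thread walks over several
// segments, so per-thread work is an average over multiple segments rather
// than the length of a single one.
__global__ void per_thread_sum_strided(int *indices, float *data,
                                       float *sums, int numSegments)
{
    int gridSize = gridDim.x * blockDim.x;

    for (int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < numSegments;
         i += gridSize)
    {
        float sum = 0.0f;
        for (int j = indices[i]; j < indices[i + 1]; j++)
            sum += data[j];
        sums[i] = sum;
    }
}
```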
Memory coalescing
‣ Global memory is accessed in aligned chunks of 32, 64, or 128 bytes
‣ The following protocol is used to issue the memory transactions of a half warp (valid for compute capability 1.x)
- Find memory segment that contains address requested by the lowest numbered active thread
- Find other active threads whose requested address lies in same segment, and reduce transaction size if possible
- Do transaction. Mark serviced threads as inactive.
- Repeat until all threads are serviced
‣ Worst case: fetch 32 bytes but use only 4 bytes: 7/8 of the bandwidth is wasted!
Memory coalescing
‣ Access pattern visualization
- Thread 0 is the lowest-numbered active thread; it accesses address 116
‣ That address belongs to the 128-byte segment 0-127
[Figure: addresses 0 to 288 marked in 32-byte steps; threads t0 ... t15 request consecutive words starting at address 116. Candidate transaction sizes of 128 B, 64 B, and 32 B segments are shown.]
David Tarjan - NVIDIA
Memory coalescing
‣ Simple access pattern
[Figure: aligned, sequential access by a half warp, serviced by one 64-byte transaction]
We will be looking at compute capability 1.x examples.
Memory coalescing
‣ Sequential but misaligned
[Figure: sequential but misaligned access; depending on where the addresses fall, this takes either one 128-byte transaction, or one 64-byte transaction plus one 32-byte transaction]
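For illustration, a hedged sketch of a copy kernel that produces exactly this misaligned pattern (the kernel is an assumption, modeled on the stride example two slides ahead; `offset` shifts every access by a fixed number of elements):

```cuda
// Hypothetical misaligned copy: with offset != 0, a half warp's accesses
// straddle a segment boundary and extra transactions are needed.
__global__ void offsetCopy(float *odata, float *idata, int offset)
{
    int xid = blockIdx.x * blockDim.x + threadIdx.x + offset;
    odata[xid] = idata[xid];
}
```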
Memory coalescing
‣ Strided access
[Figure: stride-2 access; one 128-byte transaction, but half of the bandwidth is wasted]
Memory coalescing
‣ Example: Copy with stride
    __global__ void strideCopy(float *odata, float *idata, int stride)
    {
        int xid = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
        odata[xid] = idata[xid];
    }
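A possible launch (array size and block size are placeholders): `strideCopy<<<N / 256, 256>>>(d_out, d_in, 2);`. With a stride of 2, each 128-byte transaction carries only 64 useful bytes, as in the figure above; larger strides waste proportionally more, until every thread ends up needing its own transaction.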
Memory coalescing
‣ 2.X architecture
- Global memory is cached
‣ Cached in both L1 and L2: 128 byte transaction
‣ Cached only in L2: 32 byte transaction
- Memory transactions are issued per full warp instead of per half warp
Memory coalescing - SoA or AoS?
‣ Array of structures (AoS)
    struct record {
        int key;
        int value;
        int flag;
    };

    record *d_records;
    cudaMalloc((void**)&d_records, ...);
David Tarjan - NVIDIA
Memory coalescing - SoA or AoS?
‣ Structure of arrays (SoA)
    struct SoA {
        int *key;
        int *value;
        int *flag;
    };

    SoA d_SoA_data;
    cudaMalloc((void**)&d_SoA_data.key, ...);
    cudaMalloc((void**)&d_SoA_data.value, ...);
    cudaMalloc((void**)&d_SoA_data.flag, ...);
David Tarjan - NVIDIA
Memory coalescing - SoA or AoS?
‣ cudaMalloc returns aligned memory, so accessing the SoA layout is much more efficient
    __global__ void bar(record *AoS_data, SoA SoA_data)
    {
        int i = blockDim.x * blockIdx.x + threadIdx.x;
        // AoS: wastes bandwidth
        int key = AoS_data[i].key;
        // SoA: efficient use of bandwidth
        int key_better = SoA_data.key[i];
    }
David Tarjan - NVIDIA
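Both layouts can be passed in a single kernel call, e.g. `bar<<<numBlocks, blockSize>>>(d_records, d_SoA_data);` (grid and block sizes are placeholders). Note that the SoA struct is passed by value; only the pointers inside it refer to device memory.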
Latency hiding
‣ A warp is not scheduled until all of its threads have finished the previous instruction
‣ These instructions can have high latency (e.g. a global memory access)
‣ Ideally we want enough resident warps to keep the GPU busy during the waiting time
[Figure: the SM's multithreaded warp scheduler interleaves instructions from different warps over time: warp 8 instruction 11, warp 1 instruction 42, warp 3 instruction 95, ..., warp 8 instruction 12, ..., warp 3 instruction 96]
© David Kirk/NVIDIA and Wen-mei W. Hwu, 2007-2009, ECE 498AL, University of Illinois, Urbana-Champaign
Latency hiding
‣ How many instructions do we need to hide a latency of L clock cycles?
- 1.x: L/4 instructions
‣ The SM issues one instruction per warp over four clock cycles
- 2.0: L instructions
‣ The SM issues one instruction per warp over two clock cycles, for two warps at a time
- 2.1: 2L instructions
‣ The SM issues a pair of instructions per warp over two clock cycles, for two warps at a time
Latency hiding
‣ Doing the exercise
- An instruction takes about 22 clock cycles to complete
‣ 1.x: we need about 6 available warps to hide the latency (22/4, rounded up)
‣ 2.x: we need about 22 available warps to hide the latency
- A fetch from global memory: ~600 cycles!
‣ How many warps that takes depends on the number of arithmetic instructions per memory access (arithmetic intensity)
‣ e.g. if the ratio is 15:
- about 10 warps for 1.x (600/4 = 150 instructions; 150/15 = 10 warps)
- about 40 warps for 2.x (600 instructions; 600/15 = 40 warps)
Occupancy
‣ Occupancy is the ratio of resident warps to the maximum number of resident warps
‣ Occupancy is determined by the resource limits of the SM (a worked example follows this list)
- Maximum number of blocks per SM
- Maximum number of threads per SM
- Shared memory
- Registers
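As a worked example (not from the slides, and using compute capability 2.0 limits as assumptions: 48 resident warps, i.e. 1536 resident threads, and 32768 registers per SM): a kernel launched with 256-thread blocks that uses 40 registers per thread needs 256 × 40 = 10240 registers per block, so at most 3 blocks = 768 threads = 24 warps fit on an SM, for an occupancy of 24/48 = 50% (ignoring register allocation granularity).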
Occupancy
‣ Pass --ptxas-options=-v to the compiler (e.g. nvcc --ptxas-options=-v mykernel.cu) so it reports register and shared memory usage
‣ Use the CUDA Occupancy Calculator
‣ The idea is to find the number of threads per block that gives maximum occupancy
‣ Occupancy != performance: but low-occupancy codes have a hard time hiding latency
‣ Demonstration
Measuring performance
‣ The right measure depends on the program
- Compute bound: operations per second
‣ Performance is given by the ratio of the number of operations to the kernel time
‣ Peak performance depends on the architecture: Fermi ~1 TFLOP/s
- Memory bound: bandwidth
‣ Performance is given by the ratio of the number of bytes read and written in global memory to the kernel time (see the timing sketch below)
‣ Peak bandwidth for the GTX 280 (1.107 GHz memory clock, 512-bit bus width):

1107 × 10^6 (clock, Hz) × (512/8) (bus width, bytes) × 2 (DDR) / 10^9 = 141.7 GB/s
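As a hedged sketch of how such a number is measured in practice (the helper below is an assumption, not from the slides; it times an already-configured kernel launch with CUDA events and divides the bytes moved by the elapsed time):

```cuda
#include <cuda_runtime.h>

// Returns the effective bandwidth in GB/s of whatever `launch` runs,
// given the total number of bytes it reads and writes in global memory.
float effectiveBandwidthGBs(void (*launch)(void), double bytesMoved)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    launch();                              // e.g. a wrapper that calls kernel<<<...>>>(...)
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);

    return (float)((bytesMoved / 1e9) / (ms / 1e3));
}
```

For a simple copy of N floats, bytesMoved would be 2 × N × sizeof(float), one read and one write per element.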