Code GPU with CUDA - Device code optimization principle
Transcript of Code GPU with CUDA - Device code optimization principle
CODE GPU WITH CUDA: DEVICE CODE OPTIMIZATION PRINCIPLE
Created by Marina Kolpakova (cuda.geek) for Itseez
OUTLINE
Optimization principle
Performance limiters
Little's law
TLP & ILP
DEVICE CODE OPTIMIZATION PRINCIPLE
The SIMT architecture makes the GPU an inherently high-latency processor, so
hiding latency is the only GPU-specific optimization principle.
Typical latencies for the Kepler generation:
register writeback: ~10 cycles
L1: ~34 cycles
Texture L1: ~96 cycles
L2: ~160 cycles
Global memory: ~350 cycles
PERFORMANCE LIMITERS
Optimize for GPU ≃ optimize for latency
Factors that prevent latency hiding:
Insufficient parallelism
Inefficient memory accesses
Inefficient control flow
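To make the memory-access limiter concrete, here is a minimal sketch: two hypothetical copy kernels (the names copy_coalesced and copy_strided and the stride parameter are illustrative, not from the slides) whose only difference is the access pattern.

```cuda
// Coalesced: neighbouring threads of a warp touch neighbouring words,
// so one warp's loads collapse into few memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];
}

// Strided: each warp's accesses are spread across many cache lines,
// multiplying the number of transactions and the effective latency.
__global__ void copy_strided(const float* in, float* out, int n, int stride)
{
    int i = (blockDim.x * blockIdx.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}
```

Both kernels do the same work per thread; only the strided one wastes memory bandwidth, which is one way "inefficient memory accesses" defeat latency hiding.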
THROUGHPUT & LATENCY
Throughput is how many operations are performed in one cycle.
Latency is how many cycles the pipeline stalls before another dependent operation can issue.
Inventory is the number of warps in flight, i.e. in the execution stage of the pipeline.
LITTLE'S LAW
L = λ × W
Inventory (L) = Throughput (λ) × Latency (W)
Example: a GPU with 8 operations per clock and an 18-clock latency needs 8 × 18 = 144 operations in flight to keep the pipeline busy.
LITTLE'S LAW: FFMA EXAMPLE
Fermi GF100
Throughput: 32 operations per clock (1 warp)
Latency: ~18 clocks
Maximum resident warps per SM: 24
Inventory: 1 × 18 = 18 warps in flight

Kepler GK110
Throughput: 128 operations per clock if no ILP (4 warps)
Latency: ~10 clocks
Maximum resident warps per SM: 64
Inventory: 4 × 10 = 40 warps in flight

Maxwell GM204
Throughput: 128 operations per clock (4 warps)
Latency: ~6 clocks
Maximum resident warps per SM: 64
Inventory: 4 × 6 = 24 warps in flight
TLP & ILP
Thread Level Parallelism
Enabling factors:
sufficient number of warps in flight per SM
Limiting factors:
bad launch configuration
resource-consuming kernels
poorly parallelized code

Instruction Level Parallelism
Enabling factors:
independent instructions per warp
dual-issue capabilities
Limiting factors:
structural hazards
data hazards
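A minimal sketch of ILP from independent instructions per warp: in the hypothetical kernel below (the name two_chains and the constants are illustrative, not from the slides), each thread carries two data-independent FFMA chains, so the scheduler can issue the second chain while the first is still in the pipeline.

```cuda
__global__ void two_chains(const float* in, float* out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (2 * i + 1 < n) {
        float a = in[2 * i];
        float b = in[2 * i + 1];
        // The a-chain and b-chain have no data dependence on each other,
        // so their FFMAs can overlap in the pipeline (ILP = 2).
        a = a * 2.0f + 1.0f;
        b = b * 2.0f + 1.0f;
        out[2 * i]     = a;
        out[2 * i + 1] = b;
    }
}
```

Data hazards (each instruction consuming the previous result) would serialize such chains; structural hazards (not enough issue ports or cores) cap how much of this independence the hardware can exploit.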
IMPROVING TLP
Occupancy is the actual number of warps running concurrently on a multiprocessor divided by the maximum number of warps the hardware can run concurrently.
Improve occupancy to achieve better TLP.
Modern GPUs can keep up to 64 resident warps belonging to 16 (Kepler) / 32 (Maxwell) blocks, BUT you need resources for them: registers and shared memory. Kepler has 64K 32-bit registers per SM and a 32-lane-wide warp:
65536 registers / 64 warps / 32 threads per warp = 32 registers / thread
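One way to act on this budget is to tell the compiler the intended launch shape so it caps register use accordingly. A sketch using CUDA's __launch_bounds__ qualifier; the 256-threads / 8-blocks-per-SM target is an illustrative choice, not from the slides, picked so that 256 × 8 = 2048 threads = 64 warps, matching the 32-registers-per-thread arithmetic above.

```cuda
// Asking the compiler to keep register pressure low enough that
// 8 blocks of 256 threads (64 warps) can reside on one Kepler SM:
// 65536 registers / 2048 threads = 32 registers per thread.
__global__ void __launch_bounds__(256, 8)
saxpy(int n, float a, const float* x, float* y)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```

If the kernel cannot fit the stated bounds, the compiler spills registers to local memory, so the target is a trade-off, not a free win.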
IMPROVING ILP
Kernel unrolling: process more elements per thread, because operations on different elements are independent.
The device code compiler is reasonably good at instruction reordering.
Unroll loops in device code to increase the number of independent operations.
Other techniques used for increasing ILP on CPUs also apply.
__global__ void unrolled(const float* in, float* out)
{
    const int tid = blockDim.x * blockIdx.x + threadIdx.x;
    const int totalThreads = blockDim.x * gridDim.x;
    out[tid] = process(in[tid]);
    out[tid + totalThreads] = process(in[tid + totalThreads]);
}
#pragma unroll CONST_EXPRESSION
for (int i = 0; i < N_ITERATIONS; i++) { /* ... */ }
ILP ON MODERN GPUS
ILP is a must-have for older architectures, but it still helps to hide pipeline latencies on modern GPUs.
Maxwell: 4 warp schedulers, dual-issue each. 128 compute cores process up to 4 warps each clock. Compute core utilization: 1.0.
Kepler: 4 warp schedulers, dual-issue each. 192 compute cores process up to 6 warps each clock. If there is no ILP, only 128 of 192 cores are used. Compute core utilization: 0.6(6).
Fermi (sm_21): 2 warp schedulers, dual-issue each. 48 compute cores process 3 warps every 2 clocks. If there is no ILP, only 32 of 48 cores are used. Compute core utilization: 0.6(6).
FINAL WORDS
GPU optimization principles:
Principle #1: hide latency
Principle #2: see principle #1
THE END
BY / 2013–2015 cuda.geek