Code GPU with CUDA - Device code optimization principle


Transcript of Code GPU with CUDA - Device code optimization principle

Page 1: Code GPU with CUDA - Device code optimization principle

CODE GPU WITH CUDA: DEVICE CODE OPTIMIZATION PRINCIPLE

Created by Marina Kolpakova (cuda.geek) for Itseez


Page 2: Code GPU with CUDA - Device code optimization principle

OUTLINE

Optimization principle
Performance limiters
Little’s law
TLP & ILP

Page 3: Code GPU with CUDA - Device code optimization principle

DEVICE CODE OPTIMIZATION PRINCIPLE

The specifics of the SIMT architecture make the GPU inherently high-latency, so

Hiding latency is the only GPU-specific optimization principle

Typical latencies for the Kepler generation:

register writeback: ~10 cycles
L1: ~34 cycles
Texture L1: ~96 cycles
L2: ~160 cycles
Global memory: ~350 cycles

Page 4: Code GPU with CUDA - Device code optimization principle

PERFORMANCE LIMITERS

Optimize for GPU ≃ Optimize for latency

Factors that prevent latency hiding:

Insufficient parallelism
Inefficient memory accesses
Inefficient control flow

Page 5: Code GPU with CUDA - Device code optimization principle

THROUGHPUT & LATENCY

Throughput is how many operations are performed in one cycle.
Latency is how many cycles the pipeline stalls before a dependent operation can issue.
Inventory is the number of warps in flight, i.e. in the execution stage of the pipeline.

Page 6: Code GPU with CUDA - Device code optimization principle

LITTLE’S LAW

L = λ × W

Inventory (L) = Throughput (λ) × Latency (W)

Example: a GPU that issues 8 operations per clock with an 18-clock latency needs 8 × 18 = 144 independent operations in flight to hide that latency.

Page 7: Code GPU with CUDA - Device code optimization principle

LITTLE’S LAW: FFMA EXAMPLE

Fermi GF100
Throughput: 32 operations per clock (1 warp)
Latency: ~18 clocks
Maximum resident warps per SM: 48
Inventory: 1 × 18 = 18 warps in flight

Kepler GK110
Throughput: 128 operations per clock (4 warps, if no ILP)
Latency: ~10 clocks
Maximum resident warps per SM: 64
Inventory: 4 × 10 = 40 warps in flight

Maxwell GM204
Throughput: 128 operations per clock (4 warps)
Latency: ~6 clocks
Maximum resident warps per SM: 64
Inventory: 4 × 6 = 24 warps in flight
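The arithmetic above can be checked mechanically. Below is a minimal host-side sketch (not part of the original slides; the struct and figures simply restate the numbers quoted above) that applies Little’s law to each architecture:

#include <cstdio>

// Little's law: inventory (warps in flight) = throughput (warps/clock) × latency (clocks).
int main()
{
    struct Arch { const char* name; int warpsPerClock; int latencyClocks; };
    const Arch archs[] = {
        { "Fermi GF100",   1, 18 },  // 32 FFMA ops/clock = 1 warp/clock
        { "Kepler GK110",  4, 10 },  // 128 ops/clock (no ILP) = 4 warps/clock
        { "Maxwell GM204", 4,  6 },  // 128 ops/clock = 4 warps/clock
    };
    for (const Arch& a : archs)
        printf("%-14s needs %2d warps in flight\n", a.name, a.warpsPerClock * a.latencyClocks);
    return 0;
}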

Page 8: Code GPU with CUDA - Device code optimization principle

TLP & ILP

Thread Level Parallelism
enabling factors: sufficient number of warps per SM in flight
limiting factors: bad launch configuration, resource-consuming kernels, poorly parallelized code

Instruction Level Parallelism
enabling factors: independent instructions per warp, dual-issue capabilities
limiting factors: structural hazards, data hazards

Page 9: Code GPU with CUDA - Device code optimization principle

IMPROVING TLP

Occupancy is the actual number of warps running concurrently on a multiprocessor divided by the maximum number of warps that the hardware can run concurrently.

Improve occupancy to achieve better TLP.
Modern GPUs can keep up to 64 resident warps belonging to 16 (Kepler) / 32 (Maxwell) blocks, BUT you need resources for them: registers, shared memory.
Kepler has 64 K × 32-bit registers per SM and a 32-lane-wide warp:

65536 registers / 64 warps / 32 warp_size = 32 registers / thread

so at full occupancy each thread can use at most 32 registers.
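A minimal sketch of putting this budget into practice (the kernel body, names, and the 256-thread / 8-block bounds are illustrative, not from the slides): __launch_bounds__ tells the compiler the kernel must stay launchable with 64 resident warps per SM, and the runtime occupancy API reports what is actually achieved.

#include <cstdio>
#include <cuda_runtime.h>

// __launch_bounds__(256, 8): require that 8 blocks of 256 threads
// (8 × 256 / 32 = 64 warps) can be resident on one SM, which keeps the
// compiler within the 32-registers-per-thread budget computed above.
__global__ void __launch_bounds__(256, 8)
scale(const float* in, float* out, int n)
{
    const int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];   // placeholder per-element work
}

int main()
{
    // Ask the runtime how many 256-thread blocks of this kernel fit per SM.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, scale, 256, 0);
    printf("resident blocks per SM: %d (%d warps)\n", blocksPerSM, blocksPerSM * 256 / 32);
    return 0;
}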

Page 10: Code GPU with CUDA - Device code optimization principle

IMPROVING ILP

Kernel unrolling: process more elements per thread, because operations on different elements are independent.
The device code compiler is reasonably good at instruction reordering.
Loop unrolling in device code increases the number of independent operations.
Other techniques used for increasing ILP on CPUs are also applicable.

__global__ void unrolled(const float* in, float* out)
{
    const int tid          = blockDim.x * blockIdx.x + threadIdx.x;
    const int totalThreads = blockDim.x * gridDim.x;
    out[tid]                = process(in[tid]);
    out[tid + totalThreads] = process(in[tid + totalThreads]);
}

#pragma unroll CONST_EXPRESSION
for (int i = 0; i < N_ITERATIONS; i++) { /* ... */ }
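Because each thread of the unrolled kernel above handles two elements, the launch only has to cover half of the array. A hedged host-side sketch, assuming it sits in the same .cu file as the kernel above and that a __device__ float process(float) is defined; buffer names, block size, and problem size are illustrative:

#include <cuda_runtime.h>

int main()
{
    const int N = 1 << 20;                 // assumed element count, divisible by 2 * block
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in,  N * sizeof(float));
    cudaMalloc(&d_out, N * sizeof(float));

    const int block = 256;
    const int grid  = (N / 2) / block;     // each thread processes two elements
    unrolled<<<grid, block>>>(d_in, d_out);
    cudaDeviceSynchronize();

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}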

Page 11: Code GPU with CUDA - Device code optimization principle

ILP ON MODERN GPUS

ILP is a must-have on older architectures, but it still helps to hide pipeline latencies on modern GPUs.

Maxwell: 4 warp schedulers, dual-issue each. 128 compute cores process up to 4 warps each clock. Compute core utilization: 1.0
Kepler: 4 warp schedulers, dual-issue each. 192 compute cores process up to 6 warps each clock. If there is no ILP, only 128 of 192 cores are used. Compute core utilization: 0.6(6)
Fermi (sm_21): 2 warp schedulers, dual-issue each. 48 compute cores process 3 warps every 2 clocks. If there is no ILP, only 32 of 48 cores are used. Compute core utilization: 0.6(6)

Page 12: Code GPU with CUDA - Device code optimization principle

FINAL WORDS

GPU optimization principles:

Principle #1: hide latency
Principle #2: see principle #1