Designing Physics Algorithms for GPU Architecture


Transcript of Designing Physics Algorithms for GPU Architecture

Page 1: Designing Physics Algorithms for GPU Architecture

Designing Physics Algorithms for GPU Architecture

Takahiro Harada, AMD (March 1, 2011)

Page 2: Designing Physics Algorithms for GPU Architecture


Narrow phase on GPU

Narrow phase is parallel

– How to solve each pair?

Design it for a specific architecture

Page 3: Designing Physics Algorithms for GPU Architecture


Radeon HD 5870

– 2.72 TFLOPS (single precision), 544 GFLOPS (double precision), 153.6 GB/s memory bandwidth

– Many cores: 20 SIMD engines, each a 64-wide SIMD

CPU SSE: 4-wide SIMD

The instructions of a work item are packed into VLIW bundles, then executed

GPU Architecture

[Diagram: the Radeon HD 5870 drawn as a 4x5 grid of SIMD engines beside a Phenom II X4 drawn as 4 cores; one GPU SIMD engine plays the role of one CPU core plus its SIMD unit]

20 (SIMD engines) x 16 (thread processors) x 5 (stream cores) = 1600 stream cores
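These figures can be checked on a real device. A minimal host-side sketch using the OpenCL C API (assumes a single platform with a GPU device; error checking omitted):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint cus, mhz;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cus), &cus, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                        sizeof(mhz), &mhz, NULL);

        /* A Radeon HD 5870 reports 20 compute units (SIMD engines) at
           850 MHz: 1600 stream cores x 2 FLOP/cycle x 0.85 GHz = 2.72 TFLOPS
           single precision. */
        printf("compute units: %u, clock: %u MHz\n", cus, mhz);
        return 0;
    }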

Page 4: Designing Physics Algorithms for GPU Architecture



Memory

Register

Global memory

– “Main memory”

– Large

– High latency

Local Data Share (LDS)

– Low latency

– High bandwidth

– Like a user-managed cache

– Key to getting high performance

[Diagram: GPU memory hierarchy; global memory (> 1 GB) connected to the SIMD engines at 153.6 GB/s, with a 32 KB Local Data Share on each SIMD]
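A minimal OpenCL kernel sketch of the user-managed-cache pattern (names are illustrative, not from the talk): each work item stages one value into the LDS, the work group synchronizes, and later reads hit low-latency local memory instead of global memory.

    __kernel void collideTile(__global const float4* pos,
                              __global float4* out,
                              __local float4* ldsPos)
    {
        int lid = get_local_id(0);
        int n   = get_local_size(0);

        ldsPos[lid] = pos[get_global_id(0)];   /* one global read per WI */
        barrier(CLK_LOCAL_MEM_FENCE);          /* LDS now acts as a cache */

        float4 acc = (float4)(0.0f);
        for (int i = 0; i < n; i++)
            acc += ldsPos[i];                  /* repeated low-latency reads */
        out[get_global_id(0)] = acc;
    }

On the host, the __local argument is sized with clSetKernelArg(kernel, 2, localBytes, NULL).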

Page 5: Designing Physics Algorithms for GPU Architecture


Narrow phase on CPU

Methods on CPUs (e.g., GJK)

– Handle any convex shape

– Possible to implement on the GPU

– But complicated for the GPU

– Divergence => low ALU utilization

GPUs prefer simpler algorithms with less logic

Why are GPUs bad at complicated logic?

– Wide SIMD architecture

    void Kernel()
    {
        executeX();
        switch (condition) {   /* 'condition' added for legibility */
            case A: { executeA(); break; }
            case B: { executeB(); break; }
            case C: { executeC(); break; }
        }
        finish();
    }

[Diagram: 8 SIMD lanes entering the switch; 25% of lanes take case A, 25% take case B, 50% take case C. The cases execute serially, so ALU utilization never exceeds 50% inside the switch]
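One standard remedy (a general technique, not claimed by the slides) is to trade the branches for arithmetic selection, so every lane runs the same instructions. The computeA/B/C helpers and case tags below are hypothetical stand-ins:

    /* Hypothetical stand-ins for the three divergent paths. */
    #define CASE_A 0
    #define CASE_B 1
    float computeA(float v) { return v + 1.0f; }
    float computeB(float v) { return v * 2.0f; }
    float computeC(float v) { return v - 1.0f; }

    __kernel void branchFree(__global const float* x,
                             __global const int* type,
                             __global float* out)
    {
        int i = get_global_id(0);
        /* Every lane evaluates all three paths... */
        float a = computeA(x[i]);
        float b = computeB(x[i]);
        float c = computeC(x[i]);
        /* ...and select() keeps the one it needs (a nonzero condition
           picks the second value). Uniform control flow, at the cost of
           redundant work, so this pays off only when each path is cheap. */
        float r = select(c, a, type[i] == CASE_A);
        r = select(r, b, type[i] == CASE_B);
        out[i] = r;
    }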

Page 6: Designing Physics Algorithms for GPU Architecture


Narrow phase on GPU

Particles

– Search for neighboring particles

– Collide with all of them

– Accurate shape representation requires increased resolution

Acceleration structure in each shape

– Increases complexity

Exploding number of contacts

Etc.

Can we make it better but keep it simple?

    void Kernel()
    {
        prepare();
        collide(p0);
        collide(p1);
        collide(p2);
        collide(p3);
        collide(p4);
        collide(p5);
        collide(p6);
        collide(p7);
    }

[Diagram: SIMD lanes 0-7, each running the same collide() sequence]
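A slightly fleshed-out sketch of this uniform structure (the fixed-size neighbor layout and the collidePair response are assumptions, not the talk's code): one particle per work item, and every lane runs the same fixed-length loop, so the SIMD never diverges.

    float4 collidePair(float4 a, float4 b)    /* placeholder response */
    {
        return a - b;
    }

    __kernel void collideParticles(__global const float4* pos,
                                   __global const int* neighbors,
                                   __global float4* force,
                                   int neighborsPerParticle)
    {
        int i = get_global_id(0);
        float4 f = (float4)(0.0f);
        /* Same trip count on every lane: uniform control flow. */
        for (int n = 0; n < neighborsPerParticle; n++) {
            int j = neighbors[i * neighborsPerParticle + n];
            f += collidePair(pos[i], pos[j]);
        }
        force[i] = f;
    }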

Page 7: Designing Physics Algorithms for GPU Architecture


A good approach for GPUs, from the architecture

We have to know what GPUs like

– Fewer branches

Less divergence

– Use the LDS on each SIMD

– Latency hiding

Why latency?

Page 8: Designing Physics Algorithms for GPU Architecture


Work group (WG), work item (WI)

[Diagram: the Radeon HD 5870's 20 SIMD engines, with work groups scheduled onto one of them. A work group maps to the 64 lanes of a SIMD, one work item per lane: Work Group 0 processes Particle[0-63], Work Group 1 processes Particle[64-127], Work Group 2 processes Particle[128-191]]
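In OpenCL host code this mapping is just a choice of local size (a sketch; queue and kernel are assumed to exist, and numParticles is assumed to be a multiple of 64):

    /* One work group = 64 work items = one 64-wide wavefront on a SIMD;
       work group g then owns Particle[64*g .. 64*g + 63]. */
    size_t localSize  = 64;
    size_t globalSize = numParticles;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalSize, &localSize, 0, NULL, NULL);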

Page 9: Designing Physics Algorithms for GPU Architecture


How does the GPU hide latency?

Memory access latency

– Does not rely on caches

The SIMD hides latency by switching WGs

The more WGs per SIMD, the better

– 1 WG/SIMD cannot hide latency

– Overlap computation with memory requests

What determines the number of WGs per SIMD?

– Local resource usage

[Diagram: timeline of Work Groups 0-3 interleaved on one SIMD; while one WG waits on a memory request, another is switched in]

    void Kernel()
    {
        readGlobalMem();
        compute();
    }
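The switching itself is done by the hardware scheduler, but a kernel can also overlap a bulk load with independent work explicitly. A sketch using OpenCL's async_work_group_copy (the work done before the wait is a trivial stand-in):

    __kernel void overlapped(__global const float4* in,
                             __global float4* out,
                             __local float4* tile)
    {
        /* Kick off the global->LDS copy for the whole work group... */
        event_t e = async_work_group_copy(
            tile, in + get_group_id(0) * get_local_size(0),
            get_local_size(0), 0);

        /* ...do work that does not need the data yet... */
        float4 acc = (float4)(0.0f);

        /* ...and block only when the data is actually required. */
        wait_group_events(1, &e);
        acc += tile[get_local_id(0)];
        out[get_global_id(0)] = acc;
    }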

Page 10: Designing Physics Algorithms for GPU Architecture


Why reduce resource usage?

Registers are a limited resource

The number of WGs/SIMD is the smaller of

– (SIMD registers) / (kernel register use)

– (SIMD LDS) / (kernel LDS use)

Using fewer registers

– More WGs fit

– Latency is hidden

Register overflow spills to global memory

[Diagram: a SIMD engine with 8 registers. Kernel A (8 regs) fits 1 WG, Kernel B (4 regs) fits 2 WGs, Kernel C (2 regs) fits 4 WGs]
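The slide's rule as a worked calculation (toy numbers from the diagram, not real hardware limits; the LDS figure is chosen so it is not the binding constraint):

    #include <stdio.h>

    static int min_int(int a, int b) { return a < b ? a : b; }

    /* WGs per SIMD = min(register-limited count, LDS-limited count). */
    int wgsPerSimd(int simdRegs, int kernelRegs, int simdLds, int kernelLds)
    {
        return min_int(simdRegs / kernelRegs, simdLds / kernelLds);
    }

    int main(void)
    {
        printf("%d\n", wgsPerSimd(8, 8, 32, 1));  /* Kernel A: 1 WG  */
        printf("%d\n", wgsPerSimd(8, 4, 32, 1));  /* Kernel B: 2 WGs */
        printf("%d\n", wgsPerSimd(8, 2, 32, 1));  /* Kernel C: 4 WGs */
        return 0;
    }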

Page 11: Designing Physics Algorithms for GPU Architecture


Preview of Current Approach

1 WG processes 1 pair

Reduce resource usage

Less branching

– compute() is branch free

– No dependencies

Use of LDS

– No global memory access in compute()

– Random access goes to the LDS

Latency hiding

– Pair data is per WG, not per WI

WIs work together

A unified method for all shapes

[Diagram: 8 SIMD lanes executing the kernel in lockstep]

    void Kernel()
    {
        fetchToLDS();
        BARRIER;
        compute();
        BARRIER;
        workTogether();
        BARRIER;
        writeback();
    }
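In OpenCL, each BARRIER corresponds to barrier(CLK_LOCAL_MEM_FENCE). A sketch of the same skeleton with that spelled out; the four stage functions carry the talk's names, but their signatures and roles here are assumptions:

    __kernel void narrowPhasePair(__global const float4* shapeData,
                                  __global float4* contactsOut,
                                  __local float4* lds)
    {
        fetchToLDS(shapeData, lds);   /* WIs cooperatively load one pair   */
        barrier(CLK_LOCAL_MEM_FENCE);

        compute(lds);                 /* branch-free work, LDS access only */
        barrier(CLK_LOCAL_MEM_FENCE);

        workTogether(lds);            /* e.g., reduce contacts across WIs  */
        barrier(CLK_LOCAL_MEM_FENCE);

        writeback(lds, contactsOut);  /* results back to global memory     */
    }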

Page 12: Designing Physics Algorithms for GPU Architecture

Solver

Page 13: Designing Physics Algorithms for GPU Architecture


Page 14: Designing Physics Algorithms for GPU Architecture


Page 15: Designing Physics Algorithms for GPU Architecture

Fusion

Page 16: Designing Physics Algorithms for GPU Architecture


Choosing a processor

The CPU can do everything

– But it is not as good as the GPU at highly parallel computation

The GPU is a very powerful processor

– But only for parallel computation

Real problems have both kinds of work

The GPU is far from the CPU (a discrete GPU sits across the PCIe bus)

Page 17: Designing Physics Algorithms for GPU Architecture


Fusion

GPU and CPU are close

Faster communication between GPU and CPU

Use both GPU and CPU

– Parallel workload -> GPU

– Serial workload -> CPU

Page 18: Designing Physics Algorithms for GPU Architecture


Collision between large and small particles

Granularity of computation

– Large particles collide more

– Inefficient use of the GPU

[Diagram: SIMD lanes 0-7 with uneven amounts of work per lane]
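A sketch of where the inefficiency comes from (the CSR-style neighbor layout is an assumption, and the response is a placeholder): with one particle per work item, the loop length varies per lane, and the whole 64-wide wavefront runs as long as its slowest lane.

    __kernel void collideVariable(__global const float4* pos,
                                  __global const int* neighborStart,
                                  __global const int* neighborList,
                                  __global float4* force)
    {
        int i = get_global_id(0);
        float4 f = (float4)(0.0f);
        /* Trip count differs per lane: one large particle with many
           neighbors keeps the other lanes of its wavefront idle. */
        for (int n = neighborStart[i]; n < neighborStart[i + 1]; n++)
            f += pos[i] - pos[neighborList[n]];  /* placeholder response */
        force[i] = f;
    }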

Page 19: Designing Physics Algorithms for GPU Architecture


Page 20: Designing Physics Algorithms for GPU Architecture


Q & A