Designing Physics Algorithms for GPU Architecture


Transcript of Designing Physics Algorithms for GPU Architecture

Page 1: Designing Physics Algorithms for GPU Architecture

Designing Physics Algorithms for GPU Architecture

Takahiro Harada, AMD (March 1, 2011)

Page 2: Designing Physics Algorithms for GPU Architecture


Narrow phase on GPU

Narrow phase is parallel

– How to solve each pair?

Design it for a specific architecture

Page 3: Designing Physics Algorithms for GPU Architecture


Radeon HD 5870

– 2.72 TFLOPS (single precision), 544 GFLOPS (double precision), 153.6 GB/s memory bandwidth

– Many cores: 20 SIMD engines, each a 64-wide SIMD

CPU SSE: 4-wide SIMD

The instructions of a work item are packed into VLIW bundles, then executed

GPU Architecture

[Diagram: the Radeon HD 5870 drawn as a 4x5 grid of SIMD engines beside a Phenom II X4 drawn as 4 cores; one GPU SIMD engine plays the role of one CPU core plus its SIMD unit]

20 (SIMD engines) x 16 (thread processors) x 5 (stream cores) = 1600 stream cores
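These figures can be checked on a real device. A minimal host-side sketch using the OpenCL C API (assumes a single platform with a GPU device; error checking omitted):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platform;
        cl_device_id device;
        cl_uint cus, mhz;

        clGetPlatformIDs(1, &platform, NULL);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cus), &cus, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_CLOCK_FREQUENCY,
                        sizeof(mhz), &mhz, NULL);

        /* A Radeon HD 5870 reports 20 compute units (SIMD engines) at
           850 MHz: 1600 stream cores x 2 FLOP/cycle x 0.85 GHz = 2.72 TFLOPS
           single precision. */
        printf("compute units: %u, clock: %u MHz\n", cus, mhz);
        return 0;
    }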

Page 4: Designing Physics Algorithms for GPU Architecture



Memory

Register

Global memory

– “Main memory”

– Large

– High latency

Local Data Share (LDS)

– Low latency

– High bandwidth

– Like a user-managed cache

– Key to getting high performance

[Diagram: GPU memory hierarchy; global memory (> 1 GB) connected to the SIMD engines at 153.6 GB/s, with a 32 KB Local Data Share on each SIMD]
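A minimal OpenCL kernel sketch of the user-managed-cache pattern (names are illustrative, not from the talk): each work item stages one value into the LDS, the work group synchronizes, and later reads hit low-latency local memory instead of global memory.

    __kernel void collideTile(__global const float4* pos,
                              __global float4* out,
                              __local float4* ldsPos)
    {
        int lid = get_local_id(0);
        int n   = get_local_size(0);

        ldsPos[lid] = pos[get_global_id(0)];   /* one global read per WI */
        barrier(CLK_LOCAL_MEM_FENCE);          /* LDS now acts as a cache */

        float4 acc = (float4)(0.0f);
        for (int i = 0; i < n; i++)
            acc += ldsPos[i];                  /* repeated low-latency reads */
        out[get_global_id(0)] = acc;
    }

On the host, the __local argument is sized with clSetKernelArg(kernel, 2, localBytes, NULL).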

Page 5: Designing Physics Algorithms for GPU Architecture


Narrow phase on CPU

Methods on CPUs (e.g., GJK)

– Handle any convex shape

– Possible to implement on the GPU

– But complicated for the GPU

– Divergence => low ALU utilization

GPUs prefer simpler algorithms with less logic

Why are GPUs bad at complicated logic?

– Wide SIMD architecture

    void Kernel()
    {
        executeX();
        switch (condition) {   /* 'condition' added for legibility */
            case A: { executeA(); break; }
            case B: { executeB(); break; }
            case C: { executeC(); break; }
        }
        finish();
    }

[Diagram: 8 SIMD lanes entering the switch; 25% of lanes take case A, 25% take case B, 50% take case C. The cases execute serially, so ALU utilization never exceeds 50% inside the switch]
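One standard remedy (a general technique, not claimed by the slides) is to trade the branches for arithmetic selection, so every lane runs the same instructions. The computeA/B/C helpers and case tags below are hypothetical stand-ins:

    /* Hypothetical stand-ins for the three divergent paths. */
    #define CASE_A 0
    #define CASE_B 1
    float computeA(float v) { return v + 1.0f; }
    float computeB(float v) { return v * 2.0f; }
    float computeC(float v) { return v - 1.0f; }

    __kernel void branchFree(__global const float* x,
                             __global const int* type,
                             __global float* out)
    {
        int i = get_global_id(0);
        /* Every lane evaluates all three paths... */
        float a = computeA(x[i]);
        float b = computeB(x[i]);
        float c = computeC(x[i]);
        /* ...and select() keeps the one it needs (a nonzero condition
           picks the second value). Uniform control flow, at the cost of
           redundant work, so this pays off only when each path is cheap. */
        float r = select(c, a, type[i] == CASE_A);
        r = select(r, b, type[i] == CASE_B);
        out[i] = r;
    }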

Page 6: Designing Physics Algorithms for GPU Architecture


Narrow phase on GPU

Particles

– Search for neighboring particles

– Collide with all of them

– Accurate shape representation requires increased resolution

Acceleration structure in each shape

– Increases complexity

Exploding number of contacts

Etc.

Can we make it better but keep it simple?

    void Kernel()
    {
        prepare();
        collide(p0);
        collide(p1);
        collide(p2);
        collide(p3);
        collide(p4);
        collide(p5);
        collide(p6);
        collide(p7);
    }

[Diagram: SIMD lanes 0-7, each running the same collide() sequence]
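A slightly fleshed-out sketch of this uniform structure (the fixed-size neighbor layout and the collidePair response are assumptions, not the talk's code): one particle per work item, and every lane runs the same fixed-length loop, so the SIMD never diverges.

    float4 collidePair(float4 a, float4 b)    /* placeholder response */
    {
        return a - b;
    }

    __kernel void collideParticles(__global const float4* pos,
                                   __global const int* neighbors,
                                   __global float4* force,
                                   int neighborsPerParticle)
    {
        int i = get_global_id(0);
        float4 f = (float4)(0.0f);
        /* Same trip count on every lane: uniform control flow. */
        for (int n = 0; n < neighborsPerParticle; n++) {
            int j = neighbors[i * neighborsPerParticle + n];
            f += collidePair(pos[i], pos[j]);
        }
        force[i] = f;
    }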

Page 7: Designing Physics Algorithms for GPU Architecture


A good approach for GPUs, from the architecture

We have to know what GPUs like

– Fewer branches

Less divergence

– Use the LDS on each SIMD

– Latency hiding

Why latency?

Page 8: Designing Physics Algorithms for GPU Architecture


Work group (WG), work item (WI)

[Diagram: the Radeon HD 5870's 20 SIMD engines, with work groups scheduled onto one of them. A work group maps to the 64 lanes of a SIMD, one work item per lane: Work Group 0 processes Particle[0-63], Work Group 1 processes Particle[64-127], Work Group 2 processes Particle[128-191]]
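In OpenCL host code this mapping is just a choice of local size (a sketch; queue and kernel are assumed to exist, and numParticles is assumed to be a multiple of 64):

    /* One work group = 64 work items = one 64-wide wavefront on a SIMD;
       work group g then owns Particle[64*g .. 64*g + 63]. */
    size_t localSize  = 64;
    size_t globalSize = numParticles;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &globalSize, &localSize, 0, NULL, NULL);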

Page 9: Designing Physics Algorithms for GPU Architecture


How does the GPU hide latency?

Memory access latency

– Does not rely on caches

The SIMD hides latency by switching WGs

The more WGs per SIMD, the better

– 1 WG/SIMD cannot hide latency

– Overlap computation with memory requests

What determines the number of WGs per SIMD?

– Local resource usage

[Diagram: timeline of Work Groups 0-3 interleaved on one SIMD; while one WG waits on a memory request, another is switched in]

    void Kernel()
    {
        readGlobalMem();
        compute();
    }
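The switching itself is done by the hardware scheduler, but a kernel can also overlap a bulk load with independent work explicitly. A sketch using OpenCL's async_work_group_copy (the work done before the wait is a trivial stand-in):

    __kernel void overlapped(__global const float4* in,
                             __global float4* out,
                             __local float4* tile)
    {
        /* Kick off the global->LDS copy for the whole work group... */
        event_t e = async_work_group_copy(
            tile, in + get_group_id(0) * get_local_size(0),
            get_local_size(0), 0);

        /* ...do work that does not need the data yet... */
        float4 acc = (float4)(0.0f);

        /* ...and block only when the data is actually required. */
        wait_group_events(1, &e);
        acc += tile[get_local_id(0)];
        out[get_global_id(0)] = acc;
    }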

Page 10: Designing Physics Algorithms for GPU Architecture


Why reduce resource usage?

Registers are a limited resource

The number of WGs/SIMD is the smaller of

– (SIMD registers) / (kernel register use)

– (SIMD LDS) / (kernel LDS use)

Using fewer registers

– More WGs fit

– Latency is hidden

Register overflow spills to global memory

[Diagram: a SIMD engine with 8 registers. Kernel A (8 regs) fits 1 WG, Kernel B (4 regs) fits 2 WGs, Kernel C (2 regs) fits 4 WGs]
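The slide's rule as a worked calculation (toy numbers from the diagram, not real hardware limits; the LDS figure is chosen so it is not the binding constraint):

    #include <stdio.h>

    static int min_int(int a, int b) { return a < b ? a : b; }

    /* WGs per SIMD = min(register-limited count, LDS-limited count). */
    int wgsPerSimd(int simdRegs, int kernelRegs, int simdLds, int kernelLds)
    {
        return min_int(simdRegs / kernelRegs, simdLds / kernelLds);
    }

    int main(void)
    {
        printf("%d\n", wgsPerSimd(8, 8, 32, 1));  /* Kernel A: 1 WG  */
        printf("%d\n", wgsPerSimd(8, 4, 32, 1));  /* Kernel B: 2 WGs */
        printf("%d\n", wgsPerSimd(8, 2, 32, 1));  /* Kernel C: 4 WGs */
        return 0;
    }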

Page 11: Designing Physics Algorithms for GPU Architecture


Preview of Current Approach

1 WG processes 1 pair

Reduce resource usage

Less branching

– compute() is branch free

– No dependencies

Use of LDS

– No global memory access in compute()

– Random access goes to the LDS

Latency hiding

– Pair data is per WG, not per WI

WIs work together

A unified method for all shapes

[Diagram: 8 SIMD lanes executing the kernel in lockstep]

    void Kernel()
    {
        fetchToLDS();
        BARRIER;
        compute();
        BARRIER;
        workTogether();
        BARRIER;
        writeback();
    }
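In OpenCL, each BARRIER corresponds to barrier(CLK_LOCAL_MEM_FENCE). A sketch of the same skeleton with that spelled out; the four stage functions carry the talk's names, but their signatures and roles here are assumptions:

    __kernel void narrowPhasePair(__global const float4* shapeData,
                                  __global float4* contactsOut,
                                  __local float4* lds)
    {
        fetchToLDS(shapeData, lds);   /* WIs cooperatively load one pair   */
        barrier(CLK_LOCAL_MEM_FENCE);

        compute(lds);                 /* branch-free work, LDS access only */
        barrier(CLK_LOCAL_MEM_FENCE);

        workTogether(lds);            /* e.g., reduce contacts across WIs  */
        barrier(CLK_LOCAL_MEM_FENCE);

        writeback(lds, contactsOut);  /* results back to global memory     */
    }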

Page 12: Designing Physics Algorithms for GPU Architecture

Solver

Page 13: Designing Physics Algorithms for GPU Architecture


Page 14: Designing Physics Algorithms for GPU Architecture


Page 15: Designing Physics Algorithms for GPU Architecture

Fusion

Page 16: Designing Physics Algorithms for GPU Architecture


Choosing a processor

The CPU can do everything

– But it is not as good as the GPU at highly parallel computation

The GPU is a very powerful processor

– But only for parallel computation

Real problems have both kinds of work

The GPU is far from the CPU (a discrete GPU sits across the PCIe bus)

Page 17: Designing Physics Algorithms for GPU Architecture


Fusion

GPU and CPU are close

Faster communication between GPU and CPU

Use both GPU and CPU

– Parallel workload -> GPU

– Serial workload -> CPU

Page 18: Designing Physics Algorithms for GPU Architecture


Collision between large and small particles

Granularity of computation

– Large particles collide more

– Inefficient use of the GPU

[Diagram: SIMD lanes 0-7 with uneven amounts of work per lane]
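A sketch of where the inefficiency comes from (the CSR-style neighbor layout is an assumption, and the response is a placeholder): with one particle per work item, the loop length varies per lane, and the whole 64-wide wavefront runs as long as its slowest lane.

    __kernel void collideVariable(__global const float4* pos,
                                  __global const int* neighborStart,
                                  __global const int* neighborList,
                                  __global float4* force)
    {
        int i = get_global_id(0);
        float4 f = (float4)(0.0f);
        /* Trip count differs per lane: one large particle with many
           neighbors keeps the other lanes of its wavefront idle. */
        for (int n = neighborStart[i]; n < neighborStart[i + 1]; n++)
            f += pos[i] - pos[neighborList[n]];  /* placeholder response */
        force[i] = f;
    }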

Page 19: Designing Physics Algorithms for GPU Architecture


Page 20: Designing Physics Algorithms for GPU Architecture


Q & A