
SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing

Won-Jong Lee†, Shi-Hwa Lee†, Jae-Ho Nah*, Jin-Woo Kim*,

Youngsam Shin†, Jaedon Lee†, Seok-Yoon Jung†

SAIT, SAMSUNG Electronics†, Yonsei Univ.*, Korea

Talks, ACM SIGGRAPH 2012

Outline

• Introduction

• SGRT Core Architecture

– T&I Engine: H/W Accelerator

– SRP: Programmable DSP

– SMK: Parallelization Framework

• Experimental Results

• Conclusion


Introduction


Graphics Trends

[Figure: graphics trends timeline (’10, ’15, ’20; vertical axis: reality). PC/Console: 3D Game (’04), Realistic 3D Game (’10), Immersive AR/MR. Mobile/CE: Smart Phone (’09), Smart TV (’10), Realistic 3D Game on Mobile/CE, Immersive AR/MR on Mobile/CE.]

• Graphics is becoming more important as the number of smart devices increases

• Graphics is evolving toward greater realism

• Mobile graphics trails PC graphics by about 5~6 years


Mobile SoC


[Figure: Apple A5X die photo. Image courtesy: Chipworks]

Current Mobile GPU for Ray Tracing

• Inadequate Performance

– Flagship mobile GPU: ~256 GFLOPS (ARM Mali T658)

– Real-time ray tracing @HD: >300 Mrays/sec (1~2 TFLOPS)

• Unsuitable Execution Model

– “Multithreaded SIMD” is a poor fit for processing incoherent rays

• Weak Branch Support

– Performance drops with recursion, function calls, control flow…


Need a New Architecture?

• Dedicated, Fixed-Function H/W

– High performance and power efficiency, but weak flexibility

– RPU [Woop, SIGGRAPH 2005]

• Fully Programmable Processor

– Flexible, but inadequate performance and power efficiency

– Reconfigurable stream processor [Kim, CICC 2012]: 1~2 Mrays/sec

– MIMD threaded processor [Spjut, SHAW-3 2012]: ~30 Mrays/sec


Requirements

• Performance for Real-Time Rendering

– 200~300 Mrays/sec

• Reasonable Flexibility

– Programmable shading and ray generation

– Support for various BVHs: SAH, Binned, SBVH, LBVH, ...

– Easy to extend to GI (path tracing, photon mapping, ...)

– Easy to combine a rasterizer (OpenGL ES) with ray tracing

• Low Power & Cost


SGRT


Our Approach

• Combination of CPU, H/W and DSP (Mobile SoC)

– Tree Build: sorting, irregular work → Multi-core CPU (with multi-level cache)

– Refit, Traversal, Intersection: embarrassingly parallel → Dedicated H/W

– Ray Gen. & Shading: need for flexibility → Programmable DSP

[Figure: a multi-core CPU (tree build), dedicated H/W (traversal & intersection), and a programmable DSP (ray generation & shading), together with memory blocks.]

System Architecture

• SGRT (Samsung reconfigurable GPU based on Ray Tracing)

– T&I Engine: fast, compact H/W to accelerate traversal & intersection

– SRP: Samsung Reconfigurable Processor to support flexible shading

– SMK: parallelization framework

[Figure: system block diagram. Labels: SGRT Core #1 to #4; External DRAM; T&I Engine (one intersection unit with an L1 cache, four traversal units each with an L1 cache, and an L2 cache); SRP (VLIW engine, internal SRAM, coarse-grained reconfigurable array, I-Cache, C-Mem, texture unit with an L1 cache); multi-core ARM (Core #1 to #4); Host DRAM; Host System BUS; Refitting Unit; AXI System BUS.]

T&I Engine: A MIMD H/W Accelerator

• Newly designed H/W accelerator based on our previous work, a KD-tree H/W engine [Nah, SIGGRAPH ASIA 2011]

– Single-ray-based MIMD architecture: efficient processing of incoherent rays

– Ray Accumulation Unit (RAU): hardware multithreading

• Optimized restart & short stack algorithm

– Adaptive restart trail [Lee, HPG 2012]

• Early Intersection Test

– Reduces expensive ray-primitive intersection (IST) tests


[Figure: T&I Engine block diagram (MIMD arch.). Labels: Ray Dispatcher; several Traversal Units, each with L1$, RAU, pipe, and stack; Intersection Unit with L1$, RAU, and pipe; L2$; inputs "Rays" and outputs "Hit info".]

Early (Two-Pass) Intersection Test

[Figure: example BVH with inner nodes, leaf nodes, primitive AABBs, and primitives T0 to T7. Conventional IST: ray-node AABB test, then ray-primitive test. Early IST: ray-node AABB test, then ray-primitive AABB test, then ray-primitive test. Both pipelines are labeled Traversal Unit and Intersection Unit.]
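The early (two-pass) intersection test can be illustrated with a minimal software sketch. It assumes triangle primitives, a standard slab test for the ray-primitive-AABB check, and a Möller-Trumbore ray-triangle test; the function names and the on-the-fly AABB computation are illustrative assumptions, not the hardware design from the talk.

```python
# Sketch of an early (two-pass) intersection test at a leaf node.
# Assumption: a cheap ray-vs-primitive-AABB test filters primitives before
# the expensive ray-triangle test. All names here are hypothetical.

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def tri_aabb(tri):
    """Bounding box of a triangle (would be precomputed in a real BVH)."""
    lo = tuple(min(v[i] for v in tri) for i in range(3))
    hi = tuple(max(v[i] for v in tri) for i in range(3))
    return lo, hi

def ray_hits_aabb(orig, dirn, lo, hi):
    """Slab test: cheap pass, no divisions by zero."""
    t_near, t_far = 0.0, float("inf")
    for i in range(3):
        if dirn[i] == 0.0:
            if orig[i] < lo[i] or orig[i] > hi[i]:
                return False
            continue
        t0 = (lo[i] - orig[i]) / dirn[i]
        t1 = (hi[i] - orig[i]) / dirn[i]
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def ray_hits_triangle(orig, dirn, tri, eps=1e-9):
    """Moller-Trumbore test: expensive pass. Returns hit distance or None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(dirn, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None
    inv = 1.0 / det
    s = sub(orig, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(dirn, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None

def intersect_leaf(orig, dirn, tris, early=True):
    """Return (closest hit distance or None, number of full triangle tests)."""
    best, full_tests = None, 0
    for tri in tris:
        if early and not ray_hits_aabb(orig, dirn, *tri_aabb(tri)):
            continue  # rejected by the cheap pass
        full_tests += 1
        t = ray_hits_triangle(orig, dirn, tri)
        if t is not None and (best is None or t < best):
            best = t
    return best, full_tests
```

With a leaf of four well-separated triangles and a ray aimed at only one of them, both modes find the same hit, but the early mode runs a single full triangle test instead of four.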

Ray Accumulation Unit

• Specialized H/W multi-threading for latency hiding [Nah, 2011]

– Rays that miss the cache are accumulated in the RA buffer, so other rays can be processed while the miss is serviced

– Coherence can be increased, because rays that reference the same cache line are accumulated in the same row of the RA buffer

– In experiments, up to 3x performance gain

[Figure: Ray Accumulation Unit datapath. Labels: input buffer; control buffer; non-blocking cache (cache hit / cache miss); RA buffer rows holding rays, cache address, cache data, and an occupation counter; "ray + data" sent to the traversal or intersection pipeline.]
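The accumulation behavior described above can be modeled in a few lines. This is a toy software sketch, not the hardware: the 64-byte line size, the FIFO order in which outstanding lines are filled, and the class and method names are all assumptions made for illustration.

```python
# Toy model of a ray accumulation buffer in front of a non-blocking cache.
# Assumptions (not from the talk): 64-byte cache lines, misses are filled
# in FIFO order, and the cache never evicts.
from collections import OrderedDict

LINE_SIZE = 64  # bytes per cache line (assumed)

class RayAccumulationUnit:
    def __init__(self):
        self.cache = set()         # resident cache lines
        self.rows = OrderedDict()  # missed line -> ray ids waiting on it
        self.memory_requests = 0   # number of line fetches issued
        self.ready = []            # rays handed to the pipeline, in order

    def submit(self, ray_id, address):
        """Present a ray whose next node/primitive lives at `address`."""
        line = address // LINE_SIZE
        if line in self.cache:
            self.ready.append(ray_id)        # hit: straight to the pipeline
        elif line in self.rows:
            self.rows[line].append(ray_id)   # same row, no extra fetch
        else:
            self.rows[line] = [ray_id]       # new row, one fetch
            self.memory_requests += 1

    def fill(self):
        """The oldest outstanding line arrives: release its whole row."""
        if not self.rows:
            return []
        line, waiting = self.rows.popitem(last=False)
        self.cache.add(line)
        self.ready.extend(waiting)
        return waiting
```

Rays submitted while a miss is outstanding do not stall: hits continue to the pipeline, and rays that miss on the same line share a single fetch and are released together.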


Samsung Reconfigurable Processor

• A flexible architecture template [Lee, HPG 2011/2012]

• The ISA implements arithmetic, special-function, and texture instructions.

• The VLIW engine is useful for general-purpose computations (function invocation, control flow).

• The CGRA makes full use of software pipelining for loop acceleration.


[Figure: SRP architecture. A coarse-grained array (CGA) of function units (FU), each paired with a register file (RF), alongside a central register file and a row of four FUs labeled VLIW, with instruction and data inputs. A program is shown alternating between control processing and data processing, where the for loops are the data-processing parts.]

Packet Stream Tracing on SRP

• Remove recursion → job-queue-based streamed iteration

• Jobs are classified according to the type of operation → CGA kernel

• A packet of rays is batched

• Each kernel is mapped onto the CGA and loop-accelerated

– Shows a high IPC, up to the maximum number of FUs in the array


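[Figure: packet stream tracing flow. Starting from the intersection result, the kernels are: classify hit rays and update colors; compute normal vectors; classify secondary rays and texture; generate secondary rays (reflection ray, refraction ray); compute texture color; compute N·L and classify shading; shading and generate shadow rays (shadow ray). Labels distinguish CGA kernels from VLIW code.]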
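The idea of replacing recursion with job-queue-based streamed iteration can be sketched as follows. This is a toy model under stated assumptions: every ray "hits" a surface with a fixed local color and reflectivity, only two kernel types exist, and the names (`shade_kernel`, `secondary_kernel`, `PACKET_SIZE`) are hypothetical rather than taken from the SRP software.

```python
# Toy sketch: recursive ray shading rewritten as streamed iteration over
# per-kernel job queues. Scene model and constants are assumptions.
from collections import deque

MAX_DEPTH = 3        # assumed recursion limit
PACKET_SIZE = 4      # assumed number of rays batched per kernel launch
LOCAL_COLOR = 1.0    # assumed local shading result of every hit
REFLECTIVITY = 0.5   # assumed weight passed to the reflection ray

def shade_recursive(weight=1.0, depth=0):
    """Reference: the recursive formulation the streamed version replaces."""
    color = weight * LOCAL_COLOR
    if depth + 1 < MAX_DEPTH:
        color += shade_recursive(weight * REFLECTIVITY, depth + 1)
    return color

def shade_kernel(packet, framebuffer, queues):
    """Update colors and classify rays that need a secondary ray."""
    for pixel, weight, depth in packet:
        framebuffer[pixel] += weight * LOCAL_COLOR
        if depth + 1 < MAX_DEPTH:
            queues["secondary"].append((pixel, weight, depth))

def secondary_kernel(packet, framebuffer, queues):
    """Generate reflection rays and queue them for shading."""
    for pixel, weight, depth in packet:
        queues["shade"].append((pixel, weight * REFLECTIVITY, depth + 1))

KERNELS = {"shade": shade_kernel, "secondary": secondary_kernel}

def render_streamed(num_pixels):
    framebuffer = [0.0] * num_pixels
    queues = {name: deque() for name in KERNELS}
    queues["shade"].extend((p, 1.0, 0) for p in range(num_pixels))
    launches = 0
    while any(queues.values()):
        for name, kernel in KERNELS.items():
            q = queues[name]
            while q:
                # Batch up to PACKET_SIZE jobs of the same type into a packet.
                packet = [q.popleft() for _ in range(min(PACKET_SIZE, len(q)))]
                kernel(packet, framebuffer, queues)
                launches += 1
    return framebuffer, launches
```

Each kernel body is a simple loop over a packet with no recursion or function-pointer dispatch, which is the kind of loop the slide says is mapped onto the CGA. Under these assumptions the streamed result matches the recursive reference for every pixel.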

Parallelization Framework

• Parallel ray tracing with a multi-tasking system

– Uses an embedded RTOS, SMK (Samsung Multi-Platform Kernel) [Shin, SAC 2011]

– Supports multi-tasking through systematic scheduling of task queues

• The task on each SGRT core is responsible for different pixels (or pixel tiles)

– The scheduler distributes the next task to an idle SGRT core first, giving dynamic load balancing


[Figure: three SGRT cores, each pairing a T&I Engine with an SRP on top of SMK.]
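The "next task to the idle core first" policy can be compared against a static split with a small simulation. This is a sketch under assumptions: tile costs are made-up numbers standing in for per-tile render time, and `schedule_dynamic` / `schedule_static` are hypothetical helpers, not the SMK API.

```python
# Sketch: dynamic tile distribution (next tile goes to whichever core
# becomes idle first) versus a static round-robin split.
# Tile costs are illustrative; the scheduler itself never looks ahead.
import heapq

def schedule_dynamic(tile_costs, num_cores=4):
    """Return (core -> list of tiles, finish time of the slowest core)."""
    cores = [(0, c) for c in range(num_cores)]  # (time the core is free, id)
    heapq.heapify(cores)
    assignment = {c: [] for c in range(num_cores)}
    for tile, cost in enumerate(tile_costs):
        free_at, core = heapq.heappop(cores)    # earliest idle core
        assignment[core].append(tile)
        heapq.heappush(cores, (free_at + cost, core))
    makespan = max(free_at for free_at, _ in cores)
    return assignment, makespan

def schedule_static(tile_costs, num_cores=4):
    """Round-robin split decided up front; returns the slowest core's time."""
    loads = [0] * num_cores
    for tile, cost in enumerate(tile_costs):
        loads[tile % num_cores] += cost
    return max(loads)

# Two expensive tiles (for example, many reflections) among cheap ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
assignment, dynamic_time = schedule_dynamic(costs)
static_time = schedule_static(costs)
```

In this example the static split sends both expensive tiles to the same core, while the dynamic policy hands the second expensive tile to a core that has already finished its cheap tile, so the frame completes sooner.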

Evaluation


Simulation Environment

• Built a cycle-accurate simulator for the T&I Engine, and used an in-house cycle-accurate compiled simulator, called csim, for the SRP

• Test conditions with two benchmark scenes

– Full SAH, cost ratio 5:1 (TRV:IST) for a shallow tree

– Ferrari scene (210K triangles, 1 light source)

– Fairy scene (170K triangles, 2 light sources)

– Shadow, reflection, refraction @WVGA (800x640)
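To see why a 5:1 traversal-to-intersection cost ratio tends to give a shallow tree, the standard surface area heuristic (SAH) cost of a candidate split can be evaluated directly. This sketch uses the textbook SAH formula with the cost ratio from the test conditions; the boxes and primitive counts are made-up inputs, and the talk does not show the builder itself.

```python
# Sketch: textbook SAH split cost with C_trv : C_ist = 5 : 1.
# Boxes are (lo, hi) corner tuples; all example inputs are illustrative.

C_TRV = 5.0  # cost of one traversal step
C_IST = 1.0  # cost of one ray-primitive intersection test

def surface_area(lo, hi):
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def leaf_cost(num_prims):
    """Cost of keeping all primitives in one leaf."""
    return C_IST * num_prims

def split_cost(parent, left, n_left, right, n_right):
    """SAH cost of splitting `parent` into two children."""
    sa_parent = surface_area(*parent)
    p_left = surface_area(*left) / sa_parent    # chance a ray hits the left box
    p_right = surface_area(*right) / sa_parent  # chance a ray hits the right box
    return C_TRV + C_IST * (p_left * n_left + p_right * n_right)

def should_split(parent, left, n_left, right, n_right):
    return split_cost(parent, left, n_left, right, n_right) < leaf_cost(n_left + n_right)

parent = ((0.0, 0.0, 0.0), (2.0, 1.0, 1.0))
left = ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
right = ((1.0, 0.0, 0.0), (2.0, 1.0, 1.0))
```

For this box pair, splitting 8 primitives evenly costs about 9.8 against a leaf cost of 8, so the node stays a leaf; with 16 primitives the split costs about 14.6 against 16 and is taken. A higher traversal cost therefore stops subdivision earlier and leaves more primitives per leaf.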


Preliminary Results

• Architecture configuration

– 4 SGRT cores, traversal units : intersection units = 4:1 per SGRT core

– 1 GHz core clock

• Achieved around 170 MRPS (T&I) and 255 MRPS (RGS) for Fairy

– Recent GPU ray tracer (156~317 MRPS, NVIDIA Kepler) [Aila, HPG 2012]


Scene     # of tri.   # of rays   T&I pipeline usage (%)   T&I TRV $ hit ratio (%)   T&I IST $ hit ratio (%)   T&I MRPS   SRP MRPS   Simulated FPS

Fairy     170K        1.7M        87.27                    93.83                     96.53                     171.32     255.72     87.82

Ferrari   210K        1.5M        79.75                    92.56                     92.92                     122.48     319.56     67.83

FPGA


• We are currently also testing SGRT on an FPGA board

Conclusion


• SGRT: a novel mobile GPU based on ray tracing

– The first mobile GPU to realize real-time ray tracing

• Carefully designed to suit the mobile SoC environment

• Currently implementing the T&I Engine at the RTL level

• Future work

– Analyze cost and power consumption

– Support dynamic scenes with a fast BVH build algorithm optimized for the mobile environment

– Higher-level shading/ecosystem

• Poster (#103) session: 8/7, 8/8 12:15-13:15PM



Acknowledgements

• This project is a collaboration with two universities (Yonsei, National Kongju). The authors thank two professors (Tack-Don Han, Hyun-Sang Park) for their valuable advice.

• Thanks

