SGRT: A Scalable Mobile GPU
Architecture based on Ray Tracing
Won-Jong Lee†, Shi-Hwa Lee†, Jae-Ho Nah*, Jin-Woo Kim*,
Youngsam Shin†, Jaedon Lee†, Seok-Yoon Jung†
SAIT, SAMSUNG Electronics†, Yonsei Univ.*, Korea
Talks, ACM SIGGRAPH 2012
Outline
• Introduction
• SGRT Core Architecture
– T&I Engine: H/W Accelerator
– SRP: Programmable DSP
– SMK: Parallelization Framework
• Experimental Results
• Conclusion
Introduction
Graphics Trends
[Figure: graphics roadmap with a time axis (’10, ’15, ’20) and two tracks, PC/Console and Mobile/CE, trending toward "Reality". Labels: 3D Game (‘04), Smart Phone (‘09), Realistic 3D Game (‘10), Smart TV (‘10), Immersive AR/MR, Realistic 3D Game on Mobile/CE, Immersive AR/MR on Mobile/CE]
• Graphics is becoming more important as smart devices proliferate
• Graphics is evolving toward greater realism
• Mobile graphics follows earlier PC graphics (with a lag of 5~6 years)
Mobile SoC
Apple A5X Die Photo Image Courtesy: Chipworks
Current Mobile GPU for Ray Tracing
• Inadequate Performance
– Flagship mobile GPU: ~256 GFLOPS (ARM Mali T658)
– Real-time ray tracing @HD: >300 Mray/sec (1~2 TFLOPS)
• Unsuitable Execution Model
– “Multithreaded SIMD” is a poor fit for processing incoherent rays
• Weak Branch Support
– Performance drops with recursion, function calls, control flow…
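As a rough sanity check of the gap quoted on this slide, the sketch below converts the >300 Mray/sec target into a per-pixel ray budget and a per-ray FLOP budget. The 1280x720 resolution and the 30 frames/sec target are assumptions for illustration only (the slide just says "HD"), and the function names are hypothetical.

```python
# Back-of-the-envelope budget for real-time ray tracing on a mobile GPU.
# Assumptions (NOT from the slides): "HD" = 1280x720, target = 30 frames/sec.
WIDTH, HEIGHT, FPS = 1280, 720, 30

RAYS_PER_SEC = 300e6                 # ">300 Mray/sec" from the slide
FLOPS_LOW, FLOPS_HIGH = 1e12, 2e12   # "1~2 TFLOPS" from the slide
MOBILE_GPU_FLOPS = 256e9             # "~256 GFLOPS" from the slide


def rays_per_pixel(rays_per_sec, width, height, fps):
    """Average number of rays available per pixel per frame."""
    return rays_per_sec / (width * height * fps)


def flops_per_ray(flops, rays_per_sec):
    """Arithmetic budget implied for one ray."""
    return flops / rays_per_sec


if __name__ == "__main__":
    print("rays per pixel per frame:",
          round(rays_per_pixel(RAYS_PER_SEC, WIDTH, HEIGHT, FPS), 2))
    print("FLOPs per ray:",
          round(flops_per_ray(FLOPS_LOW, RAYS_PER_SEC)), "to",
          round(flops_per_ray(FLOPS_HIGH, RAYS_PER_SEC)))
    print("compute shortfall vs. mobile GPU:",
          FLOPS_LOW / MOBILE_GPU_FLOPS, "x to",
          FLOPS_HIGH / MOBILE_GPU_FLOPS, "x")
```

Under these assumptions the target works out to roughly 11 rays per pixel per frame, and the quoted 1~2 TFLOPS is about 4x to 8x the quoted mobile GPU peak.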
Need a New Architecture?
• Dedicated, Fixed-Function H/W
– High performance and power efficiency, but limited flexibility
– RPU [Woop, SIGGRAPH 2005]
• Fully Programmable Processor
– Flexible, but inadequate performance and high power consumption
– Reconfigurable stream processor [Kim, CICC 2012]: 1~2 Mrays/sec
– MIMD threaded processor [Spjut, SHAW-3 2012]: ~30 Mrays/sec
Requirements
• Performance for Real-Time Rendering
– 200~300 Mray/sec
• Reasonable Flexibility
– Programmable shading and ray generation
– Support for various BVHs: SAH/Binned/SBVH/LBVH..
– Easy to extend to GI (path tracing, photon mapping..)
– Easy to combine a rasterizer (OpenGL ES) with ray tracing
• Low Power & Cost
SGRT
Our Approach
• Combination of CPU, H/W and DSP (Mobile SoC)
– Tree Build: sorting, irregular work → Multi-core CPU (with multi-level $)
– Refit, Traversal, Intersection: embarrassingly parallel → Dedicated H/W
– Ray Gen. & Shading: need for flexibility → Programmable DSP
[Figure: block diagram with Multi-core CPU (Tree Build), Dedicated H/W (Traversal & Intersection), Programmable DSP (Ray Gen. & Shading), and two Memory blocks]
System Architecture
• SGRT (Samsung reconfigurable GPU based on Ray Tracing)
– T&I Engine: fast, compact H/W to accelerate traversal & intersection
– SRP: Samsung Reconfigurable Processor to support flexible shading
– SMK: Parallelization framework
[Figure: system block diagram. Components shown:
– SGRT Cores #1 to #4
– T&I Engine: four Traversal Units and one Intersection Unit, each with a Cache(L1); a Cache(L2); a Refitting Unit
– SRP: VLIW Engine, Coarse Grained Reconfigurable Array, Internal SRAM, I-Cache, C-Mem, Texture Unit with a Cache(L1)
– Multi-core ARM (Cores #1 to #4), Host DRAM, Host System BUS
– AXI System BUS, External DRAM]
T&I Engine : A MIMD H/W Accelerator
• Newly designed H/W accelerator based on our previous work, a kd-tree H/W engine [Nah, SIGGRAPH ASIA 2011]
– Single-ray-based MIMD architecture: efficient processing of incoherent rays
– Ray Accumulation Unit (RAU): hardware multithreading
• Optimized restart & short stack algorithm
– Adaptive restart trail [Lee, HPG 2012]
• Early Intersection Test
– Reduces the number of expensive ray-primitive intersection (IST) tests
[Figure: T&I Engine block diagram (MIMD arch.). A Ray Dispatcher takes in Rays and returns Hit info; multiple Traversal Units, each with an L1$, an RAU, a pipe, and a stack; an Intersection Unit with an L1$; an L2$]
Early (Two-Pass) Intersection Test
[Figure: example tree with legend Inner node / Leaf node / Primitive AABB / Primitive; nodes numbered 1 to 11 and primitives T0 to T7]
• Conventional IST (Traversal Unit, Intersection Unit): Ray-nodeAABB Test → Ray-Primitive Test
• Early IST (Traversal Unit, Intersection Unit): Ray-nodeAABB Test → Ray-primAABB Test → Ray-Primitive Test
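A minimal software sketch of the two-pass idea: a cheap slab test against the primitive's AABB filters rays before the more expensive ray-triangle test. The triangle test used here is the standard Möller-Trumbore algorithm, chosen for illustration; the slides do not say which test the Intersection Unit implements, and all function names and the `stats` counters are hypothetical.

```python
# Sketch of a two-pass ("early") intersection test for one leaf primitive.
# Pass 1: cheap ray vs. primitive-AABB slab test.
# Pass 2: full ray-triangle test (Moller-Trumbore), only if pass 1 succeeds.
# Hypothetical helper names; not the actual T&I Engine logic.

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def tri_aabb(tri):
    """Axis-aligned bounding box (lo, hi) of a triangle."""
    lo = tuple(min(v[i] for v in tri) for i in range(3))
    hi = tuple(max(v[i] for v in tri) for i in range(3))
    return lo, hi

def ray_aabb(orig, dirn, lo, hi):
    """Slab test: True if the ray (t >= 0) overlaps the box."""
    t0, t1 = 0.0, float("inf")
    for i in range(3):
        if dirn[i] == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if orig[i] < lo[i] or orig[i] > hi[i]:
                return False
            continue
        inv = 1.0 / dirn[i]
        near, far = (lo[i] - orig[i]) * inv, (hi[i] - orig[i]) * inv
        if near > far:
            near, far = far, near
        t0, t1 = max(t0, near), min(t1, far)
        if t0 > t1:
            return False
    return True

def ray_triangle(orig, dirn, tri, eps=1e-9):
    """Moller-Trumbore: hit distance t, or None on a miss."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(dirn, e2)
    det = dot(e1, p)
    if abs(det) < eps:
        return None                      # ray parallel to the triangle plane
    inv = 1.0 / det
    s = sub(orig, v0)
    u = dot(s, p) * inv
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(dirn, q) * inv
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > eps else None

def early_ist(orig, dirn, tri, stats):
    """Two-pass test; stats counts how often each pass runs."""
    lo, hi = tri_aabb(tri)
    stats["aabb_tests"] += 1
    if not ray_aabb(orig, dirn, lo, hi):
        return None                      # filtered out before the costly test
    stats["tri_tests"] += 1
    return ray_triangle(orig, dirn, tri)
```

For a ray that misses a primitive's AABB, only the `aabb_tests` counter increases, which is the saving the early test is after.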
Ray Accumulation Unit
• Specialized H/W multi-threading for latency hiding [Nah, 2011]
– Rays that miss the cache are accumulated in the RA buffer, and other rays can be processed in the meantime
– Coherence can increase, because rays that reference the same cache line are accumulated in the same row of the RA buffer
– Experimental results show up to 3x performance gain
[Figure: RAU datapath. Rays enter an Input Buffer and access a non-blocking cache by cache address. On a cache hit, the ray and its cache data go to the traversal or intersection pipeline. On a cache miss, the ray is held in the RA buffer, whose rows store rays, cache address, cache data, and an occupation counter. A Control Buffer is also shown.]
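The accumulation behaviour described on this slide can be modelled in a few lines. This is a toy software model under assumed parameters (a 64-byte cache line, oldest-first fills, an unbounded buffer); the class and method names are hypothetical and do not come from the RAU design.

```python
from collections import OrderedDict


class RayAccumulationBuffer:
    """Toy model: rays that miss the cache wait in a row keyed by cache line."""

    def __init__(self, line_bytes=64):   # 64-byte line is an assumption
        self.line_bytes = line_bytes
        self.cache = set()               # resident cache-line addresses
        self.rows = OrderedDict()        # line address -> waiting ray ids
        self.fetches_issued = 0

    def line_of(self, addr):
        return addr // self.line_bytes

    def request(self, ray_id, addr):
        """Return True if the ray can proceed now (cache hit), else park it."""
        line = self.line_of(addr)
        if line in self.cache:
            return True
        if line not in self.rows:
            # First miss on this line: allocate a row and fetch the line once.
            self.rows[line] = []
            self.fetches_issued += 1
        # Later misses on the same line join the same row (no extra fetch).
        self.rows[line].append(ray_id)
        return False

    def occupation(self, addr):
        """Number of rays currently waiting on the line that holds addr."""
        return len(self.rows.get(self.line_of(addr), []))

    def fill(self):
        """Complete the oldest outstanding fetch and release its whole row."""
        if not self.rows:
            return []
        line, rays = self.rows.popitem(last=False)
        self.cache.add(line)
        return rays
```

Because every ray in a released row needs the same cache line, the rays leave the buffer as a coherent group, and the pipeline is free to process other rays while the fetch is outstanding.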
Samsung Reconfigurable Processor
• A flexible architecture template [Lee, HPG 2011/2012]
• The ISA implements arithmetic, special-function, and texture instructions.
• The VLIW engine is useful for general-purpose (GP) computations (function invocation, control flow).
• The CGRA makes full use of software pipelining for loop acceleration.
[Figure: SRP block diagram. An array of FUs, each with a local RF, plus a row of four FUs attached to a Central RF (register file); instruction and data memories; VLIW and CGA modes. Execution alternates between control processing and data processing, where the data-processing parts are for-loops.]
Packet Stream Tracing on SRP
• Remove recursion → Job-Q based streamed iteration
• Rays are classified according to the type of operation → CGA kernel
• Rays are batched into packets
• Each kernel is mapped onto the CGA and loop-accelerated
– Shows a high IPC, up to the maximum number of FUs in the array
[Figure: shading flow driven by the intersection result and implemented as CGA kernels and VLIW code. Stages: classify hit rays and update colors; compute normal vectors; classify secondary rays and texture; generate secondary rays and compute texture color; compute N·L and classify shading; shading and generate shadow rays. Ray types produced: reflection ray, refraction ray, shadow ray.]
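The idea of replacing recursion with a job queue, then classifying each packet by operation type, can be sketched as follows. The spawning rule (every hit produces one shadow, one reflection, and one refraction ray up to a fixed depth), the packet size, and all names are assumptions made for illustration; the per-kind buckets merely stand in for CGA kernels.

```python
from collections import Counter, deque

MAX_DEPTH = 2     # assumed recursion limit
PACKET_SIZE = 4   # assumed number of rays batched per packet


def spawn(ray):
    """Hypothetical shading rule: a hit spawns shadow/reflection/refraction."""
    kind, depth = ray
    if kind == "shadow" or depth >= MAX_DEPTH:
        return []
    return [("shadow", depth + 1),
            ("reflection", depth + 1),
            ("refraction", depth + 1)]


def trace_recursive(ray, counts):
    """Reference version: classic recursive ray tracing."""
    counts[ray[0]] += 1
    for child in spawn(ray):
        trace_recursive(child, counts)


def trace_streamed(primary_rays):
    """Job-queue version: no recursion, rays handled packet by packet."""
    counts = Counter()
    job_q = deque(primary_rays)
    while job_q:
        # Batch a packet of rays from the job queue.
        packet = [job_q.popleft()
                  for _ in range(min(PACKET_SIZE, len(job_q)))]
        # Classify by operation type; each bucket stands in for one kernel.
        buckets = {}
        for ray in packet:
            buckets.setdefault(ray[0], []).append(ray)
        for kind, rays in buckets.items():
            counts[kind] += len(rays)
            for ray in rays:             # loop a kernel would accelerate
                job_q.extend(spawn(ray)) # secondary rays become new jobs
    return counts
```

Both versions visit the same set of rays; the streamed version simply turns the call tree into a flat loop over homogeneous batches, which is the shape a loop accelerator needs.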
Parallelization Framework
• Parallel ray tracing with a multi-tasking system
– Uses an embedded RTOS, SMK (Samsung Multi-Platform Kernel) [Shin, SAC 2011]
– Supports multi-tasking through systematic scheduling of the task queues
• The individual task on each SGRT core is responsible for different pixels (or pixel tiles)
– The scheduler distributes the next task to an idle SGRT core first (dynamic load balancing)
[Figure: SGRT cores, each pairing a T&I Engine with an SRP and running SMK]
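The "next task goes to the idle core first" policy can be illustrated with a small simulation. This is a generic greedy scheduler written for illustration, not SMK's scheduler; the tile costs and function names are hypothetical.

```python
import heapq


def schedule_tiles(tile_costs, num_cores):
    """Greedy dynamic scheduling: the next tile goes to the first idle core."""
    cores = [(0.0, core_id) for core_id in range(num_cores)]  # (free at, id)
    heapq.heapify(cores)
    assignment = {core_id: [] for core_id in range(num_cores)}
    for tile_id, cost in enumerate(tile_costs):
        free_at, core_id = heapq.heappop(cores)   # earliest-idle core
        assignment[core_id].append(tile_id)
        heapq.heappush(cores, (free_at + cost, core_id))
    makespan = max(free_at for free_at, _ in cores)
    return assignment, makespan


def static_makespan(tile_costs, num_cores):
    """Baseline: tiles dealt round-robin, ignoring how long each one takes."""
    loads = [0.0] * num_cores
    for tile_id, cost in enumerate(tile_costs):
        loads[tile_id % num_cores] += cost
    return max(loads)


if __name__ == "__main__":
    # Two expensive tiles (e.g. many secondary rays) among cheap ones.
    costs = [8, 1, 1, 1, 8, 1, 1, 1]
    assignment, dynamic = schedule_tiles(costs, num_cores=4)
    print("dynamic makespan:", dynamic)
    print("static makespan: ", static_makespan(costs, num_cores=4))
    print("assignment:", assignment)
```

With uneven tile costs, the static split stacks both expensive tiles on one core, while the dynamic policy spreads them out, which is the motivation for distributing work to idle cores first.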
Evaluation
Simulation Environment
• Built a cycle-accurate simulator (T&I Engine) and an in-house cycle-accurate compiled simulator called csim (SRP)
• Test conditions with two benchmarks
– Full SAH, cost ratio 5:1 (TRV:IST) for a shallow tree
– Ferrari scene (210K triangles, 1 light source)
– Fairy scene (170K triangles, 2 light sources)
– Shadow, reflection, refraction @WVGA (800x640)
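To show why a 5:1 (TRV:IST) cost ratio tends to give a shallower tree, the sketch below evaluates the standard surface area heuristic (SAH) for one candidate split of a BVH node. It assumes the ratio means a traversal cost of 5 and an intersection cost of 1, which is an interpretation of the slide rather than something it states; the function names are hypothetical.

```python
def surface_area(lo, hi):
    """Surface area of an axis-aligned box given its min and max corners."""
    dx, dy, dz = (hi[i] - lo[i] for i in range(3))
    return 2.0 * (dx * dy + dy * dz + dz * dx)


def sah_split_cost(parent, left, n_left, right, n_right,
                   c_trv=5.0, c_ist=1.0):
    """Expected cost of splitting a node into two children (standard SAH).

    Each box is a (lo, hi) pair; n_left / n_right are primitive counts.
    c_trv and c_ist are the assumed traversal and intersection costs.
    """
    sa_parent = surface_area(*parent)
    return (c_trv
            + surface_area(*left) / sa_parent * n_left * c_ist
            + surface_area(*right) / sa_parent * n_right * c_ist)


def should_split(parent, left, n_left, right, n_right,
                 c_trv=5.0, c_ist=1.0):
    """Split only if it is cheaper than intersecting every primitive."""
    leaf_cost = (n_left + n_right) * c_ist
    split_cost = sah_split_cost(parent, left, n_left, right, n_right,
                                c_trv, c_ist)
    return split_cost < leaf_cost


if __name__ == "__main__":
    parent = ((0, 0, 0), (2, 1, 1))
    left = ((0, 0, 0), (1, 1, 1))
    right = ((1, 0, 0), (2, 1, 1))
    print(should_split(parent, left, 4, right, 4))             # 5:1 ratio
    print(should_split(parent, left, 4, right, 4, c_trv=1.0))  # 1:1 ratio
```

A higher traversal cost makes small nodes more likely to stay leaves, so the builder stops splitting earlier and the resulting tree is shallower.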
Preliminary Results
• Architecture configuration
– 4 SGRT cores, traversal units : intersection units = 4:1 per SGRT core
– 1 GHz core clock
• Achieved around 170 MRPS (T&I) and 255 MRPS (RGS, ray generation & shading) for Fairy
– Recent GPU ray tracer (156~317 MRPS, NVIDIA Kepler) [Aila, HPG 2012]

Scene   | # of tri. | # of rays | T&I pipeline usage (%) | T&I TRV $ hit ratio (%) | T&I IST $ hit ratio (%) | T&I MRPS | SRP MRPS | Simulated FPS
Fairy   | 170K      | 1.7M      | 87.27                  | 93.83                   | 96.53                   | 171.32   | 255.72   | 87.82
Ferrari | 210K      | 1.5M      | 79.75                  | 92.56                   | 92.92                   | 122.48   | 319.56   | 67.83
FPGA
• Currently, we are also testing the SGRT on an FPGA board
Conclusion
• SGRT: a novel mobile GPU based on ray tracing
– The first mobile GPU to realize real-time ray tracing
• Carefully designed to suit the mobile SoC environment
• Currently implementing the T&I engine at the RTL level
• Future work
– Analyze cost and power consumption
– Support dynamic scenes with a fast BVH build algorithm optimized for the mobile environment
– Higher-level shading/ecosystem
• Poster (#103) session: 8/7, 8/8 12:15-13:15
Acknowledgements
• This project is a collaboration with two universities (Yonsei, National Kongju). The authors thank the two professors (Tack-Don Han, Hyun-Sang Park) for their valuable advice.
• Thanks