Streaming v. Multicore in Graphics Applications

Jared Hoberock, Victor Lu, Sanjay Patel, John C. Hart
Univ. of Illinois

Dynamic Virtual Environments

World of Warcraft
• Social Internet World
• Completely Unconstrained (can build & share things)
• Lower Quality Graphics

Grand Theft Auto IV
• “Sandbox” World
• Free Interaction (within gamespace)
• High Quality Graphics

Halo 3
• First-Person Shooter
• Constrained Interaction
• Photorealistic Graphics (much precomputation)

Dynamic, Flexible “Game” Graphics

Precomputed, Rigid “Film” Graphics

Multicore enables both flexibility and photorealism

Videogame Production

• Costly
  ° Expensive: $10M/title
  ° Slow: 3+ years/title

• Compromises
  ° Precomputed visibility – restricts viewer mobility and environment complexity
  ° Precomputed lighting – restricts scene dynamics, user alterations
  ° Precomputed motion – restricts movement to mocap data, rigging

• Consequences
  ° Significant development effort to achieve realtime rates
  ° Dynamic social gamespace quality lags that of solo/team shooter levels

• Solution
  ° Leverage multicore power to ray trace for dynamic visibility & lighting

How Close Are We?

• Single CPU ray tracing
  ° RTRT Core renders at 1~5 Hz on 2.5 GHz P4
  ° Need 60 Hz for games
  ° 30 GHz CPU needed to ray trace game scenes [Schmittler et al., Realtime Ray Tracing for Current & Future Games, SIGGRAPH 2006 Real Time Ray Tracing Course Notes]

• We won’t see a 30 GHz serial processor (burns too brightly!)

• We will see 16+ cores

• But can we do in parallel what we predict in serial?

Ingo Wald, RTRT Core, SIGGRAPH 2005 Real Time Ray Tracing Course Notes
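The 30 GHz figure is consistent with scaling the measured frame rate linearly with clock speed. The sketch below makes that arithmetic explicit; the linear-scaling model and the function name `required_clock_ghz` are assumptions for illustration, not something the slides state.

```python
def required_clock_ghz(clock_ghz, measured_hz, target_hz):
    """Clock speed needed to reach target_hz, assuming (hypothetically)
    that frame rate scales linearly with clock speed."""
    return clock_ghz * target_hz / measured_hz

# Upper end of the quoted range: 5 Hz on a 2.5 GHz P4, 60 Hz wanted.
best_case = required_clock_ghz(2.5, 5.0, 60.0)
# Lower end of the quoted range: 1 Hz on the same processor.
worst_case = required_clock_ghz(2.5, 1.0, 60.0)
print(best_case, worst_case)
```

Under this model the 30 GHz estimate corresponds to the 5 Hz end of the range; scenes that render at 1 Hz would need proportionally more.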

Spatial Data Structures

Nearest Neighbor Problems in Graphics

• Rendering: Photon Mapping (k-NN)
  ° Find 500 photons nearest to a ray-surface intersection to compute surface’s illumination

• Modeling: Surface Reconstruction (ε-NN)
  ° Surface reconstructed at each point depends on locations of nearest points within a given distance

• Animation: Collision Detection (ε-NN)
  ° Collision between multiple interacting elements accelerated by avoiding all pairs intersections

Built on hierarchical spatial data structures

How can we build, query and maintain on SIMD GPUs?

kD-Tree

• Hierarchy of axis-aligned partitions
  ° 2-D partitions are lines
  ° 3-D partitions are planes

• Axis of partitions alternates wrt depth of the tree

• Average access time is O(log n)

• Worst case O(n) when tree is severely lopsided

• Need to maintain a balanced tree O(n log n)

• Can find k nearest neighbors in O(k + log n) time using a heap
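The kD-tree properties listed above can be sketched in a few lines. This is a minimal serial illustration, not the GPU implementation discussed in the talk: it builds a balanced tree by median split with the axis alternating by depth, and answers a k-NN query with a bounded max-heap. The names `build_kdtree`, `knn` and `k_nearest` are hypothetical, and re-sorting at every level is a simplification that costs more than the O(n log n) construction the slide refers to.

```python
import heapq

def build_kdtree(points, depth=0):
    """Balanced kD-tree by median split; the split axis alternates with depth."""
    if not points:
        return None
    axis = depth % len(points[0])
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2
    return {
        "point": pts[mid],
        "axis": axis,
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }

def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn(node, query, k, heap):
    """Collect the k nearest points in `heap`, a max-heap on distance
    (stored as negated squared distances because heapq is a min-heap)."""
    if node is None:
        return
    d = sq_dist(node["point"], query)
    if len(heap) < k:
        heapq.heappush(heap, (-d, node["point"]))
    elif d < -heap[0][0]:
        heapq.heapreplace(heap, (-d, node["point"]))
    diff = query[node["axis"]] - node["point"][node["axis"]]
    near, far = ((node["left"], node["right"]) if diff < 0
                 else (node["right"], node["left"]))
    knn(near, query, k, heap)
    # Cross the splitting plane only if it could hide a closer point.
    if len(heap) < k or diff * diff < -heap[0][0]:
        knn(far, query, k, heap)

def k_nearest(tree, query, k):
    """Return the k nearest points, closest first."""
    heap = []
    knn(tree, query, k, heap)
    return [p for _, p in sorted(heap, reverse=True)]

points = [(2, 3), (5, 4), (9, 6), (4, 7), (8, 1), (7, 2)]
tree = build_kdtree(points)
print(k_nearest(tree, (9, 2), 2))
```

The heap holds at most k candidates, so the farthest current candidate is always available to decide whether the far side of a split needs to be visited.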

GPU Hierarchy Traversal

• SIMD “stackless” hierarchy traversal
  ° Prethread with hit/miss pointers
  ° Hit pointer points to first child
  ° Miss pointer points to next sibling or, if last sibling, then ancestor’s sibling

• References
  ° Foley & Sugerman, kD-tree Acceleration Structures for a GPU Raytracer, Graphics HW 05
  ° Carr, Hoberock, Crane & Hart, Fast GPU Ray Tracing of Dynamic Meshes Using Geometry Images, Graphics Interface 2006
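The hit/miss threading described above can be illustrated with a small serial sketch. It assumes a toy hierarchy of 1-D intervals stored in depth-first order, which is a simplification chosen for brevity rather than the layout used in the cited papers; `thread` and `traverse` are hypothetical names. In depth-first order the hit pointer is simply the next slot (the first child), and the miss pointer is the first slot after the node's subtree (the next sibling, or an ancestor's sibling).

```python
def thread(tree):
    """Flatten a nested (lo, hi, children) tree into depth-first order and
    precompute a hit and a miss pointer for every node."""
    nodes = []

    def visit(node):
        lo, hi, children = node
        index = len(nodes)
        nodes.append(None)           # reserve this node's slot
        for child in children:
            visit(child)
        miss = len(nodes)            # first slot after this subtree
        hit = index + 1 if children else miss
        nodes[index] = {"lo": lo, "hi": hi, "hit": hit,
                        "miss": miss, "leaf": not children}

    visit(tree)
    return nodes

def traverse(nodes, x):
    """Return indices of leaves whose interval contains x, using only the
    precomputed pointers (no stack, no recursion)."""
    found = []
    i = 0
    while i < len(nodes):            # an index past the end means "done"
        n = nodes[i]
        if n["lo"] <= x <= n["hi"]:
            if n["leaf"]:
                found.append(i)
            i = n["hit"]
        else:
            i = n["miss"]
    return found

hierarchy = (0, 10, [(0, 4, [(0, 2, []), (3, 4, [])]),
                     (5, 10, [(5, 7, []), (8, 10, [])])])
nodes = thread(hierarchy)
print(traverse(nodes, 6))
```

Because each thread only follows one of two pointers per step, the loop body is identical for every query, which is what makes the scheme attractive for SIMD execution.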

GPU Hierarchy Construction

• Recent approaches sort first, then organize into hierarchy
  ° Zhou, Hou, Wang, Guo, “Real-Time KD-Tree Construction on Graphics Hardware,” SIGGRAPH Asia 2008
  ° Godiyal, Hoberock, Hart, Garland, “Rapid Multipole Graph Drawing on the GPU,” Graph Drawing 2008

• Latter uses kD-tree for fast n-body approximation to compute force directed layout

• CPU+GPU
  ° CPU builds kD-tree
  ° GPU performs median selection
  ° Practical when > 50K elements

Incoherent Shader Execution

• Videogame graphics rasterize triangles
  ° Same shader applied to all pixels (fragments) in triangle
  ° Shading & visibility occur simultaneously

• Future videogames will also trace rays
  ° Visibility first, then shading

• Primary eye rays are coherent

• Secondary rays are reflected or scattered into incoherent shader queries

• Different shader (not just different shader data) applied to each ray
  ° e.g. hair, skin, cloth, liquids, foliage

Chris Wyman

GPU Architecture

• GPU = MIMD of SIMD

• MIMD processing
  ° Cell: 8 MIMD nodes
  ° GF8800: 16 MIMD nodes
  ° LRB: 32 MIMD nodes

• SIMD processing
  ° Cell: 4 per MIMD node
  ° GF8800: 8 per MIMD node
  ° LRB: 16 per MIMD node

• Some MIMD nodes have distinct “control” processors, though similar processing could occur via one SIMD node (masking the rest)

• LRB “core” is a MIMD proc., NVIDIA “core” is a SIMD proc.

• NVIDIA “warp” is 32 threads streaming on one MIMD node

[Figure: two MIMD nodes, each containing SIMD nodes]

IBM Cell Architecture

[Figure: Cell block diagram. A Power processor element (Power processor unit with Power execution unit and L1 cache, plus L2 cache; 64-bit Power Architecture with vector media extensions) and eight synergistic processor elements (each with SPU, SXU, local store and SMF) share the element interconnect bus (up to 96 bytes/cycle), which also connects the memory interface controller (dual XDR) and the bus interface controller (Flex I/O). Individual links are labeled 16 or 32 bytes/cycle. Source: Gschwind et al., Synergistic Processing in Cell’s Multicore Architecture, IEEE Micro, 2006]

NVIDIA Tesla Architecture

Conditional Program Flow

• High-performance stuck with low-level streaming SIMD

• Even in multicore

• Problem with SIMD: Conditional Program Flow
  ° If a data-dependent condition leads to two different program flows
  ° Then both program flows must be executed on all SIMD nodes (serialization)
  ° Result masked per SIMD processor by the condition data

MIMD for loop
  SIMD for loop
    if (X) then A else B

[Figure: eight SIMD lanes with X = T T T T T T T F. The branch on X leads to both A and B being executed; masking on X leaves the results A A A A A A A B.]
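The masking behavior in this example can be emulated serially. The sketch below is an illustration only: `simd_if` is a hypothetical helper, Python lists stand in for SIMD lanes, and skipping a branch that no lane takes is an assumption about typical hardware rather than something the slide states.

```python
def simd_if(mask, branch_a, branch_b, lanes):
    """Emulate one SIMD group evaluating `if X then A else B`.

    Every lane runs each branch that at least one lane needs; the mask
    then decides which result each lane keeps. Returns (results, passes).
    """
    result = [None] * len(lanes)
    passes = 0
    if any(mask):                    # some lane takes A, so all lanes run A
        a_out = [branch_a(v) for v in lanes]
        result = [a if m else r for a, m, r in zip(a_out, mask, result)]
        passes += 1
    if not all(mask):                # some lane takes B, so all lanes run B
        b_out = [branch_b(v) for v in lanes]
        result = [r if m else b for b, m, r in zip(b_out, mask, result)]
        passes += 1
    return result, passes

X = [True] * 7 + [False]             # T T T T T T T F, as in the figure
values, passes = simd_if(X, lambda v: v + 1, lambda v: v * 10, list(range(8)))
print(values, passes)
```

A single divergent lane is enough to double the number of passes for the whole group, which is the serialization cost the slide describes.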

Deferred Shading

• Handle visibility first
  ° Intersect rays w/scene
  ° Store result for later shading

• Shade ray intersections

• If different rays in the same MIMD node need different shaders, then shaders are serialized

• O(NS) performance
  ° N = # of rays
  ° S = # of shaders (per MIMD node)
  ° O(S) when distributed across N processes

MIMD for all rays
  SIMD for all rays
    intersect ray with scene
    set mask to shader #

  for all shaders in SIMD ray warp
    shader(ray) if mask == shader
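The deferred shading loop above can be sketched serially to show where the serialization cost comes from. This is a simplified model, not the authors' implementation: the visibility pass is assumed to have already produced one shader id per ray (`hit_shader_ids`), and `deferred_shade` and the toy shaders are hypothetical names.

```python
def deferred_shade(hit_shader_ids, shaders, warp_size=32):
    """Shade rays warp by warp; each warp runs one masked pass per distinct
    shader it contains. Returns (colors, total_passes)."""
    colors = [None] * len(hit_shader_ids)
    passes = 0
    for start in range(0, len(hit_shader_ids), warp_size):
        warp = range(start, min(start + warp_size, len(hit_shader_ids)))
        for sid in sorted({hit_shader_ids[i] for i in warp}):
            passes += 1              # the whole warp is occupied for this pass
            for i in warp:
                if hit_shader_ids[i] == sid:      # mask == shader
                    colors[i] = shaders[sid](i)
    return colors, passes

# Toy shaders that just tag the ray index with a material name.
shaders = {
    0: lambda ray: ("wall", ray),
    1: lambda ray: ("glass", ray),
    2: lambda ray: ("light", ray),
}
ids = [0, 0, 1, 0, 2, 2, 2, 2]
colors, passes = deferred_shade(ids, shaders, warp_size=4)
print(passes)
```

The first warp mixes two shaders and needs two passes, while the coherent second warp needs one, so the pass count grows with the number of distinct shaders per warp.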

Process Sorting

• Need to bucket computations to move those with identical control flows onto the SIMD processors of the same MIMD node

• When is it worth the trouble?

Scan (Prefix Sum)

1 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0

0 1 1 1 1 1 1 1 2 2 3 3 3 3 4 4
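The second row above is the exclusive prefix sum of the first: each output is the count of flagged elements before that position. The sketch below reproduces it serially and uses the result for stream compaction; GPU implementations compute the same values in parallel. `exclusive_scan` and `compact` are hypothetical names.

```python
def exclusive_scan(flags):
    """Exclusive prefix sum: out[i] is the sum of flags[0..i-1]."""
    out, total = [], 0
    for f in flags:
        out.append(total)
        total += f
    return out

def compact(items, flags):
    """Stream compaction: scatter each flagged item to the output slot
    given by the exclusive scan of the flags."""
    slots = exclusive_scan(flags)
    out = [None] * sum(flags)
    for item, flag, slot in zip(items, flags, slots):
        if flag:
            out[slot] = item
    return out

flags = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0]
print(exclusive_scan(flags))
print(compact(list("abcdefghijklmnop"), flags))
```

Each flagged element learns its destination index without communicating with the others, which is why scan is the building block for compaction and sorting on SIMD hardware.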

Shader Scheduling

• Sort jobs based on shader request
  ° Radix sort
  ° Segmented scan
  ° Global v. local sort

• Load MIMD nodes only with rays requesting the same shader

• Still O(NS)
  ° Performing O(N) scan on each of S shaders

• Can we scan on all shaders simultaneously?


intersect ray with scene

MIMD for all shaders
  scan rays needing that shader
  MIMD for all rays needing that shader
    SIMD for all rays
      shader(ray)
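One way to picture this scheduling step is to bucket ray indices by shader id and cut each bucket into warps, so that every warp requests a single shader. The sketch below uses a serial counting sort, which amounts to a single radix pass over the shader id; it is a stand-in for the parallel radix sort and segmented scan named above, and `schedule_by_shader` is a hypothetical name.

```python
def schedule_by_shader(shader_ids, num_shaders, warp_size=32):
    """Group ray indices by shader id, then split each group into warps.
    Returns a list of (shader_id, [ray indices]) pairs."""
    # Histogram of rays per shader.
    counts = [0] * num_shaders
    for sid in shader_ids:
        counts[sid] += 1
    # Exclusive scan of the histogram gives each bucket's start offset.
    offsets, total = [], 0
    for c in counts:
        offsets.append(total)
        total += c
    # Scatter ray indices into their buckets (stable within a bucket).
    order = [None] * len(shader_ids)
    cursor = list(offsets)
    for ray, sid in enumerate(shader_ids):
        order[cursor[sid]] = ray
        cursor[sid] += 1
    # Cut each bucket into warps that each run exactly one shader.
    warps = []
    for sid in range(num_shaders):
        bucket = order[offsets[sid]:offsets[sid] + counts[sid]]
        for start in range(0, len(bucket), warp_size):
            warps.append((sid, bucket[start:start + warp_size]))
    return warps

print(schedule_by_shader([0, 1, 0, 2, 1, 0], num_shaders=3, warp_size=2))
```

Partially filled warps at the end of each bucket still waste lanes, so the benefit depends on having enough rays per shader relative to the warp size.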

Stanford Bunny in Cornell Box

• Three shaders: wall, glass, light

• Shaders simple

• Warp size: 32

Hit   Incoherence   Branches   Eff.
1     0.6%          1.15       87%
2     30%           2.40       42%
3     38%           2.55       39%
4     40%           2.65       37%
5     40%           2.67       37%

Incoherence: how often ray’s shader differs from previous ray’s

Branches: average # of branches per warp
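The Incoherence and Branches columns can be computed from the per-ray shader ids. The sketch below follows the two definitions given in the callouts; treating a "branch" as a distinct shader within a warp, and efficiency as the reciprocal of branches per warp, are assumptions. The slides do not define efficiency, although the Eff. column is close to 1/Branches (for example 1/1.15 ≈ 87%). All function names are hypothetical.

```python
def incoherence(shader_ids):
    """Fraction of rays whose shader differs from the previous ray's."""
    if len(shader_ids) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(shader_ids, shader_ids[1:]))
    return changes / (len(shader_ids) - 1)

def branches_per_warp(shader_ids, warp_size=32):
    """Average number of distinct shaders requested within a warp."""
    if not shader_ids:
        return 0.0
    warps = [shader_ids[i:i + warp_size]
             for i in range(0, len(shader_ids), warp_size)]
    return sum(len(set(w)) for w in warps) / len(warps)

def efficiency(shader_ids, warp_size=32):
    """Assumed model: each extra branch serializes the warp once more."""
    branches = branches_per_warp(shader_ids, warp_size)
    return 1.0 / branches if branches else 0.0

# One coherent warp followed by one warp alternating between two shaders.
ids = [0] * 32 + [0, 1] * 16
print(incoherence(ids), branches_per_warp(ids), efficiency(ids))
```

Under this model, incoherence measured between neighboring rays and branches measured per warp can diverge: a warp that alternates between two shaders is highly incoherent but only costs two passes.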

Automotive/CAD Viz

• DJ_Designs via Google 3D WH

• 16 simple shaders

• Small parts ameliorate their shader’s impact on overall efficiency

Bounce   Incoherence   Efficiency
1        1.6%          28%
2        40%           13%
3        30%           14%
4        22%           15%
5        17%           17%

Angel in Cornell Box

• Four shaders:
  ° wall, light simple
  ° marble, wood are more expensive, procedural

Bounce   Incoherence   Efficiency
1        1.2%          77%
2        52%           23%
3        53%           21%
4        47%           22%
5        40%           23%

Siebel Center Staircase

• Six shaders
  ° Copper, glass, girder, chrome, marble, light

• Efficiency bump due to smooth glass/chrome coherence and rays exiting the scene

Bounce   Incoherence   Efficiency
1        3%            68%
2        34%           34%
3        36%           33%
4        33%           32%
5        30%           34%

Efficiency Images

Branching Penalties

Warp size: 32

All 32 SIMD threads must follow the same control flow

[Figure labels: 16 shaders, one shader]

• Shader execution
  ° Serial: one at a time
  ° SIMD: as a “big switch”

• Serialized
  ° Slower, wastes processors
  ° Avoids locks
  ° Can conserve memory

• Compare w/ & w/o stream compaction

Memory Coherence

[Figure: three diagrams, each mapping a row of processes to a row of memory locations]

Scheduling Approaches

• Five Options
  ° Serial Unsorted
  ° Serial Global Compaction
  ° Parallel Unsorted
  ° Parallel Local Compaction
  ° Parallel Global Compaction

• Each variation involves bookkeeping overhead

[Figure: bar chart comparing Serialized (Unsorted, Sorted Global) and SIMD Parallel (Unsorted, Sorted Local, Sorted Global) scheduling; vertical axis from 0 to 3500, units not given]

Observations

• Even for these modest scenes there are significant performance gains

• Local per-node compaction doesn't work

• Even zero-time sort would not improve most cases

• Local per-node workloads hindered by too many shaders to schedule

• Faster stream compaction: Prefix sum, Scatter/Gather

Conclusions

• Stream compaction
  ° Not practical for simple shaders
  ° Practical for procedural textures (wood, marble)
  ° Probably for complex shaders (hair, cloth, skin)

• Warp coherence nevertheless leads to data incoherence
  ° Even when all threads in a MIMD node run the same shader, their data is still distributed across memory, outside of cache boundaries

• Static tuning ok, but run-time better

• Broader implication to object polymorphism
  ° Streaming same objects with different virtual function tables
