Real-Time Graphics Architecture
Transcript of Real-Time Graphics Architecture
4/18/2007
1
Real-Time Graphics Architecture
Lecture 4: Parallelism and Communication
p
Kurt Akeley
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Kurt Akeley
Pat Hanrahanhttp://graphics.stanford.edu/cs448-07-spring/
Topics
1. Frame buffers
2. Types of parallelism
3. Communication patterns and requirements
4. Sorting classification for parallel rendering (with examples)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
4/18/2007
2
Frame Buffers
Raster vs. calligraphic
Raster (image order)
dominant choice
Calligraphic (object order)
Earliest choice (Sketchpad)
E&S terminals in the 70s and 80s
Works with light pens
Scene complexity affects frame rate
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Scene complexity affects frame rate
Monitors are expensive
Still required for FAA simulationIncreases absolute brightness of light points
4/18/2007
3
Frame buffer definitions
What is a frame buffer?
What can we learn by considering different definitions?
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Frame buffer definition #1
Storage for commands that are executed to refresh the display
Allows for raster or calligraphic display (e g Megatech)Allows for raster or calligraphic display (e.g. Megatech)
“Frame buffer” for calligraphic display is a “display list”
OpenGL “render list”?
Key point: frame buffer contents are interpreted
Color mapping
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Image scaling, warping
Window system (overlay, separate windows, …)
Address Recalculation Pipeline
4/18/2007
4
Frame buffer definition #2
Image memory used to decouple the render frame rate from the display frame rate
Meets common understanding of frame buffer as imageMeets common understanding of frame buffer as image
Leads naturally to double buffering
One render buffer, one display buffer, swap
n-buffering also possible, can control latency
Key idea: decoupling enables general-purpose GPU
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Visual simulation has high render frame rate
MCAD has low render frame rate
Window manager has no frame rate
Frame buffer definition #3
All pixel-assigned memory used to assemble and display the images being rendered
Key point: frame buffer is active participant in renderingLeads to non-color buffers: depth, stencil, window control
OpenGL treats these buffers as part of frame bufferSome reserve “frame buffer” for color imagesShould be n-buffered in some cases (sort last)RealityEngine frame buffer can be deeper than wide or high
History cycles through this definition
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
2-D manipulation3-D painters algorithm3-D depth, stencil, accumulation, multi-passProgrammable shading
4/18/2007
5
Frame buffer is optional
Calligraphic display
If we don’t define display list as frame buffer
“Follow-the-beam” rendering
Minimizes latency
Saves cost if frames are never “dropped”
Talisman-like image assembly (3-D sprites)
Old idea (visual simulation, window systems)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Old idea (visual simulation, window systems)
GigaPixel render tile
Frame buffer stores color images only
Depth, stencil, etc. in small tile
Dominant architecture is consistent
SGI architectures look like
ATI architectures, which look like
NVIDIA architectures
Details are evolving, but big picture remains the same
Why is this?
Simplicity of design
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Simplicity of design
Simplicity of algorithms
Simplicity of immediate-mode approach
4/18/2007
6
Simplicity of design
Frame buffer operationsBlending: merge fragment and pixel colorDepth Buffering: save nearest fragmentp g gStencil Buffering: simple pixel state machineAccumulation Buffering: high-resolution color arithmeticAntialiasing: (to be covered later)….
All frame buffer operations:Combine fragment and pixel data (not just a replace)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Combine fragment and pixel data (not just a replace)But replace operation is optimized, e.g., no parity/ECC
Are local (no intra-pixel dependencies)
Why aren’t fragment operations programmable?
Simplicity of algorithms
Frame buffer employs brute-force simplicity
Hidden surface elimination: Depth-buffer vs. sort/painter
Capping: Stencil-based vs object calculationsCapping: Stencil based vs. object calculations
Image-space algorithm is efficientJust samples, never “object” information, locality
Just-in-time calculation, steady cost function
Accumulation Buffer (high-resolution color arithmetic)
The Accumulation Buffer, Haeberli and Akeley, Proceedings of SIGGRAPH ‘90
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Proceedings of SIGGRAPH 90
Volume rendering using 3D textures
Multi-pass rendering
Interactive Multi-pass Programmable Shading, Peercy, Olano, Airey, and Ungar, Proceedings of SIGGRAPH ‘00
4/18/2007
7
Simplicity of immediate-mode
Frame buffer contents are “context”
Matches 2D/window-rendering model
Rendering
System
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Frame buffer: most graphics
state here
Little graphics state here
Decreasing display bandwidth burden
Historically display bandwidth was a limiting factor
Hence “Sproull’s Rule”: fill rate >= display rate
Now display bandwidth is almost inconsequential
Year System FB (GB) Disp (GB) Disp / FB
1984 SGI 2000-series 0.3 0.14 1/21988 SGI GTX 1.8 * 0.29 1/61996 SGI InfiniteReality 12 8 0 60 1/20
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
1996 SGI InfiniteReality 12.8 0.60 1/202006 NVIDIA 7900 GTX 51.2 0.75 1/70
* VRAM provided separate video bandwidth
4/18/2007
8
Parallelism and Communication
Parallelism and communication
Parallelism – using multiple computational units to processes work in parallel
Communication – connecting the computational units to Communication – connecting the computational units to allow work to be distributed and aggregated
Issues
DependenciesOrdering
Sorting
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sorting
ScalabilityComputation
Bandwidth
Load balancing
4/18/2007
9
Parallelism taxonomy
Hardware parallelism(simultaneous execution on multiple processors)
Virtual parallelism(time sharing a single processor, usually with hardware support)
Data parallelism[aka “parallelism”](same task on similardata sets)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Task parallelism(different tasks on similar OR differing data sets)
Parallelism taxonomy
Hardware parallelism(simultaneous execution on multiple processors)
Virtual parallelism(time sharing a single processor, usually with hardware support)
Data parallelism[aka “parallelism”](same task on similardata sets)
Frame-parallelism(batch, SGI N-clops)
Object-parallelism(geometry)
Image-parallelism(fragment/pixel)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Task parallelism(different tasks on similar OR differing data sets)
4/18/2007
10
Parallelism taxonomy
Hardware parallelism(simultaneous execution on multiple processors)
Virtual parallelism(time sharing a single processor, usually with hardware support)
Data parallelism[aka “parallelism”](same task on similardata sets)
Frame-parallelism(batch, SGI N-clops)
Object-parallelism(geometry)
Image-parallelism(fragment/pixel)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Task parallelism(different tasks on similar OR differing data sets)
Multi-processing(on multiple CPUs)
Pipelining(the graphics pipeline)
Parallelism taxonomy
Hardware parallelism(simultaneous execution on multiple processors)
Virtual parallelism(time sharing a single processor, usually with hardware support)
Data parallelism[aka “parallelism”](same task on similardata sets)
Frame-parallelism(batch, SGI N-clops)
Object-parallelism(geometry)
Image-parallelism(fragment/pixel)
Multi-processing(graphics context switching)
Multi-threading(almost defines a GPU-like processor)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Task parallelism(different tasks on similar OR differing data sets)
Multi-processing(on multiple CPUs)
Pipelining(the graphics pipeline)
4/18/2007
11
Parallelism taxonomy
Hardware parallelism(simultaneous execution on multiple processors)
Virtual parallelism(time sharing a single processor, usually with hardware support)
Data parallelism[aka “parallelism”](same task on similardata sets)
Frame-parallelism(batch, SGI N-clops)
Object-parallelism(geometry)
Image-parallelism(fragment/pixel)
Multi-processing(graphics context switching)
Multi-threading(almost defines a GPU-like processor)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Task parallelism(different tasks on similar OR differing data sets)
Multi-processing(on multiple CPUs)
Pipelining(the graphics pipeline)
Multi-processing(time sharing a single CPU)
Multi-threading (Direct3D-10 “common-core”)
Graphics is embarrassingly parallel
Ample self-similar data sets …
Frames, vertexes, fragments, texels, pixels
With minimal dependenciesWith minimal dependencies
Few intra-set dependenciesPixels (in the frame buffer) are the significant exception
Inter-set dependencies are purely sequential
“Graphics pipeline” is designed to minimize dependencies
Other graphics architectures have more dependencies
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
E.g., for global lighting effects
But graphics pipeline has huge redundanciesHence many opportunities for optimization …
How hard should we work to do things wrong ?
4/18/2007
12
Geometry parallelism trend (SGI)
20
0
5
10
15ModelTransform LengthTransform Width
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
0
1000
2000 G
GTXVGX RE IR
Image parallelism trend (SGI)
Rasterization
100
200
300
400
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
01000 2000 G GTX VGX RE IR
4/18/2007
13
The clear trend
Shorter and wider
Why ?
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Why ?
Communication taxonomy
Sorting
Object ImageDistribution Routing(Introduced by
parallelism)(Introduced by
parallelism)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Texturing
Fundamental
4/18/2007
14
Sorting is fundamentalSorting
Object ImageDistribution Routing
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
TexturingI. E. Sutherland, R. F. Sproull, and R. A. Schumacher, A characterization of ten hidden surface algorithms
Classified by order of x, y, z radix sorts
Pipelining vs. parallelism
Issue Task Parallelism(pipelining) Data Parallelism
Orderingdependencies Easy Challenging
Sortingdependencies Easy Challenging
Computationscalability
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Bandwidthscalability
Load balancingscalability
4/18/2007
15
Pipelining vs. parallelism
Issue Task Parallelism(pipelining) Data Parallelism
Orderingdependencies Easy Challenging
Sortingdependencies Easy Challenging
Computationscalability Poor Challenging
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Bandwidthscalability Poor Challenging
Load balancingscalability (Nearly) impossible Challenging
Ordering challenges
Fundamental:
Frame buffer operations’ l hPainter’s algorithm
Memory hazards
Texture writesRender Copy to texture Render
Readback
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
From pipelining:
Changes to graphics state
4/18/2007
16
Sorting taxonomy
Application
Command Sort First
Sort-MiddleGeometry
Rasterization
Texture
Command
Sort-Last Fragment
Sort-First
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-Last Image CompositionFragment
Display
Sort-First
4/18/2007
17
Sort-first
Cmd
App
Cmd
App Pre-transformation
Geom
Rast
CmdPoint-to-point communication scales
Coarse tiling incurs load imbalanceTex
Geom
Rast
Cmd
Tex
SORT
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Frag
Disp
Princeton Display Wall, Stanford WireGL
Frag
Disp
Sort-first
Order
Automatic (conceptually)
Sort
Pre-stage (cheat ☺)
Compute scalability
Good
Bandwidth scalability
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Bandwidth scalability
Good
Load balance scalability
Poor
4/18/2007
18
Sort-first
Cmd
App
Cmd
App
Geom
Rast
Cmd
Tex
Geom
Rast
Cmd
Tex
SORT
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
FragFrag
Disp
ROUTE
DIST
Ring parallelism
Cmd
App
Cmd Cmd CmdDIST DIST DIST…Geom
Rast
Cmd
Tex
Geom
Rast
Cmd
Tex
Geom
Rast
Cmd
Tex
Geom
Rast
Cmd
Tex
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Disp
3DLABs
FragROUTE
Frag Frag Frag
4/18/2007
19
Sort-Middle
Image-space work distribution
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Fuchs - InterleavedParke - Tiled
4/18/2007
20
Sort-middle interleaved
App
CmdCmd Geometry work load-balanced,
DISTRIBUTE
Broadcast communication does not scale, but supports ordering
Finely interleaved screen tilingensures excellent load balance
Geom
Rast
SORT
Tex
Cmd
Geom
Rast
Tex
Cmd y ,except clipping and tesselation
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
e su es e celle t load bala ce
SGI Graphics Workstations: RealityEngine, InfiniteReality
Frag
Disp
ROUTE
Frag
Sort-middle interleaved
Order
Force sequence at triangle sort
Sort
Broadcast
Compute scalability
Good
Bandwidth scalability
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Bandwidth scalability
Limited by sort broadcast
Load balance scalability
Good
4/18/2007
21
SGI RealityEngine
240 MB/s
1600 MB/s
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
3200 MB/s
Sort-middle tiled
App
Cmd
App
Cmd
Geom
Rast
SORTPoint-to-point communication scales
Coarse tiling incurs load imbalanceTex
Cmd
Geom
Rast
Tex
Cmd
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Frag
Disp
ROUTE
UNC PixelPlanes, Stanford Argus
Frag
Disp
4/18/2007
22
Sort-middle tiled (immediate mode)
Order
Force sequence at triangle sort
Sort
Can approach point-to-point
Compute scalability
Good
Bandwidth scalability
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Bandwidth scalability
Good
Load balance scalability
Poor for rasterization (due to large triangles)
Sort-middle tiled (chunked)
OrderForce sequence at triangle sortFull frame delay render to texture difficultiesFull-frame delay, render to texture difficulties
SortCan approach point-to-point
Compute scalabilityGood
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Bandwidth scalabilityGood
Load balance scalabilityGood
4/18/2007
23
UNC Pixel-Planes5 (1990)
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Sort-Last
4/18/2007
24
Sort-last fragment
App
Cmd
App
Cmd Improved texture locality
Point-to-point communication scales, but requires more bw
Geom
Rast
Tex
Cmd
Geom
Rast
Tex
Cmd
SORT
Exposes rasterization load imbalance to application
No redundant work in FG
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007Kubota Denali, E&S Freedom 3000
Finely interleaved screen tilinginsures excellent load balance
Frag
Disp
ROUTE
Frag
Disp Possible, but difficult, to maintain ordering
Sort-last fragment
Order
Force sequence at fragment sort
Sort
Point-to-point, high bandwidth
Compute scalability
Good
Bandwidth scalability
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Bandwidth scalability
OK (sorting is the bottleneck)
Load balance scalability
OK (exposed to application)
4/18/2007
25
Kubota Denali (1993)
TEM48 5 X6
24 X10
FBM
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
FBM
Denali Technical Overview 1.0
Kubota Pacific Computer, 1993
Image composition
Z comp
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Z comp
Other combiners possible
4/18/2007
26
Sort-last image composition
Exposes rasterization load imbalance to application
App
Cmd
App
CmdPoint-to-point ring interconnect scalesGeom
Rast
F
Tex
Cmd
Geom
Rast
F
Tex
Cmd
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007UNC/HP PixelFlow, Aizu VC-1, Stanford Lightning-2
Frag
Disp
Frag
Disp
SORT
Two-stage image compositionloses ordering
Sort-last image composition
Order
Not fully supported !
Sort
One to many for each pipeline
Compute scalability
Excellent
Bandwidth scalability
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Bandwidth scalability
Excellent
Load balance scalability
OK (exposed to application)
4/18/2007
27
UNC Pixel Flow
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
From J. Poulton, J. Eyles, S. Molnar, H. Fuchs,
Pixel Flow: The Realization
4/18/2007
28
Sort-Everywhere
Sort-everywhere: Pomegranate
Cmd
App
Cmd
App
Cmd
Geom
Rast
TexMem
Cmd
Geom
Rast
Tex Mem
DISTRIBUTE
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Frag
Disp
Mem Frag
Disp
Mem
SORT
ROUTE
4/18/2007
29
Architecture comparison
inte
rlea
ved
tile
d (i
mm
d)
tile
d (c
hunk
)
gmen
t
age
com
p.
hereX indicates
an issue
Ordered (X) X
Compute
Sort
-fir
st
Sort
-mid
dle
i
Sort
-mid
dle
t
Sort
-mid
dle
t
Sort
-las
t fr
ag
Sort
-las
t im
a
Sort
-eve
ryw
han issue
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
ComputescalabilityBandwidthscalability X X
Load balancescalability X X X X
Summary
GPU architecture trend
Pipeline hardware-parallel virtual-parallel
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
4/18/2007
30
Readings
Required
1. S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel renderingclassification of parallel rendering
2. Fuchs et al., A heterogenous multiprocessor graphics system using processor-enhanced memories (PP5).
3. Eyles et al., PixelFlow: The Realization
Recommended
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
1. F. I. Parke, Simulation and expected performance analysis of multiple processor z-buffer systems
2. H. Fuchs, Distributing a visible surface algorithm over multiple processors
Real-Time Graphics Architecture
Lecture 4: Parallelism and Communication
p
Kurt Akeley
CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007
Kurt Akeley
Pat Hanrahanhttp://graphics.stanford.edu/cs448-07-spring/