Real-Time Graphics Architecture

4/18/2007

1

Real-Time Graphics Architecture

Lecture 4: Parallelism and Communication

p

Kurt Akeley

CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007

Kurt Akeley

Pat Hanrahanhttp://graphics.stanford.edu/cs448-07-spring/

Topics

1. Frame buffers

2. Types of parallelism

3. Communication patterns and requirements

4. Sorting classification for parallel rendering (with examples)


4/18/2007

2

Frame Buffers

Raster vs. calligraphic

Raster (image order)

dominant choice

Calligraphic (object order)

Earliest choice (Sketchpad)

E&S terminals in the 70s and 80s

Works with light pens

Scene complexity affects frame rate


Scene complexity affects frame rate

Monitors are expensive

Still required for FAA simulationIncreases absolute brightness of light points

4/18/2007

3

Frame buffer definitions

What is a frame buffer?

What can we learn by considering different definitions?


Frame buffer definition #1

Storage for commands that are executed to refresh the display

Allows for raster or calligraphic display (e g Megatech)Allows for raster or calligraphic display (e.g. Megatech)

“Frame buffer” for calligraphic display is a “display list”

OpenGL “render list”?

Key point: frame buffer contents are interpreted

Color mapping


Image scaling, warping

Window system (overlay, separate windows, …)

Address Recalculation Pipeline

4/18/2007

4


Image memory used to decouple the render frame rate from the display frame rate

Meets common understanding of frame buffer as imageMeets common understanding of frame buffer as image

Leads naturally to double buffering

One render buffer, one display buffer, swap

n-buffering also possible, can control latency

Key idea: decoupling enables general-purpose GPU


Visual simulation has high render frame rate

MCAD has low render frame rate

Window manager has no frame rate


All pixel-assigned memory used to assemble and display the images being rendered

Key point: frame buffer is active participant in renderingLeads to non-color buffers: depth, stencil, window control

OpenGL treats these buffers as part of frame bufferSome reserve “frame buffer” for color imagesShould be n-buffered in some cases (sort last)RealityEngine frame buffer can be deeper than wide or high

History cycles through this definition


2-D manipulation3-D painters algorithm3-D depth, stencil, accumulation, multi-passProgrammable shading

4/18/2007

5

Frame buffer is optional

Calligraphic display

If we don’t define display list as frame buffer

“Follow-the-beam” rendering

Minimizes latency

Saves cost if frames are never “dropped”

Talisman-like image assembly (3-D sprites)

Old idea (visual simulation, window systems)


Old idea (visual simulation, window systems)

GigaPixel render tile

Frame buffer stores color images only

Depth, stencil, etc. in small tile

Dominant architecture is consistent

SGI architectures look like

ATI architectures, which look like

NVIDIA architectures

Details are evolving, but big picture remains the same

Why is this?

Simplicity of design



Simplicity of algorithms

Simplicity of immediate-mode approach

4/18/2007

6


Frame buffer operationsBlending: merge fragment and pixel colorDepth Buffering: save nearest fragmentp g gStencil Buffering: simple pixel state machineAccumulation Buffering: high-resolution color arithmeticAntialiasing: (to be covered later)….

All frame buffer operations:Combine fragment and pixel data (not just a replace)


Combine fragment and pixel data (not just a replace)But replace operation is optimized, e.g., no parity/ECC

Are local (no intra-pixel dependencies)

Why aren’t fragment operations programmable?

Simplicity of algorithms

Frame buffer employs brute-force simplicity

Hidden surface elimination: Depth-buffer vs. sort/painter

Capping: Stencil-based vs object calculationsCapping: Stencil based vs. object calculations

Image-space algorithm is efficientJust samples, never “object” information, locality

Just-in-time calculation, steady cost function

Accumulation Buffer (high-resolution color arithmetic)

The Accumulation Buffer, Haeberli and Akeley, Proceedings of SIGGRAPH ‘90


Proceedings of SIGGRAPH 90

Volume rendering using 3D textures

Multi-pass rendering

Interactive Multi-pass Programmable Shading, Peercy, Olano, Airey, and Ungar, Proceedings of SIGGRAPH ‘00

4/18/2007

7

Simplicity of immediate-mode

Frame buffer contents are “context”

Matches 2D/window-rendering model

Rendering

System


Frame buffer: most graphics

state here

Little graphics state here

Decreasing display bandwidth burden

Historically display bandwidth was a limiting factor

Hence “Sproull’s Rule”: fill rate >= display rate

Now display bandwidth is almost inconsequential

Year System FB (GB) Disp (GB) Disp / FB

1984 SGI 2000-series 0.3 0.14 1/21988 SGI GTX 1.8 * 0.29 1/61996 SGI InfiniteReality 12 8 0 60 1/20


1996 SGI InfiniteReality 12.8 0.60 1/202006 NVIDIA 7900 GTX 51.2 0.75 1/70

* VRAM provided separate video bandwidth

4/18/2007

8

Parallelism and Communication

Parallelism and communication

Parallelism – using multiple computational units to processes work in parallel

Communication – connecting the computational units to Communication – connecting the computational units to allow work to be distributed and aggregated

Issues

DependenciesOrdering

Sorting


Sorting

ScalabilityComputation

Bandwidth

Load balancing

4/18/2007

9

Parallelism taxonomy

Hardware parallelism(simultaneous execution on multiple processors)

Virtual parallelism(time sharing a single processor, usually with hardware support)

Data parallelism[aka “parallelism”](same task on similardata sets)


Task parallelism(different tasks on similar OR differing data sets)





Frame-parallelism(batch, SGI N-clops)

Object-parallelism(geometry)

Image-parallelism(fragment/pixel)



4/18/2007

10










Multi-processing(on multiple CPUs)

Pipelining(the graphics pipeline)








Multi-processing(graphics context switching)

Multi-threading(almost defines a GPU-like processor)





4/18/2007

11








Multi-processing(graphics context switching)

Multi-threading(almost defines a GPU-like processor)





Multi-processing(time sharing a single CPU)

Multi-threading (Direct3D-10 “common-core”)

Graphics is embarrassingly parallel

Ample self-similar data sets …

Frames, vertexes, fragments, texels, pixels

With minimal dependenciesWith minimal dependencies

Few intra-set dependenciesPixels (in the frame buffer) are the significant exception

Inter-set dependencies are purely sequential

“Graphics pipeline” is designed to minimize dependencies

Other graphics architectures have more dependencies


E.g., for global lighting effects

But graphics pipeline has huge redundanciesHence many opportunities for optimization …

How hard should we work to do things wrong ?

4/18/2007

12

Geometry parallelism trend (SGI)

20

0

5

10

15ModelTransform LengthTransform Width


0

1000

2000 G

GTXVGX RE IR

Image parallelism trend (SGI)

Rasterization

100

200

300

400


01000 2000 G GTX VGX RE IR

4/18/2007

13

The clear trend

Shorter and wider

Why ?


Why ?

Communication taxonomy

Sorting

Object ImageDistribution Routing(Introduced by

parallelism)(Introduced by

parallelism)


Texturing

Fundamental

4/18/2007

14

Sorting is fundamentalSorting

Object ImageDistribution Routing


TexturingI. E. Sutherland, R. F. Sproull, and R. A. Schumacher, A characterization of ten hidden surface algorithms

Classified by order of x, y, z radix sorts

Pipelining vs. parallelism

Issue Task Parallelism(pipelining) Data Parallelism

Orderingdependencies Easy Challenging

Sortingdependencies Easy Challenging

Computationscalability


Bandwidthscalability

Load balancingscalability

4/18/2007

15

Pipelining vs. parallelism

Issue Task Parallelism(pipelining) Data Parallelism

Orderingdependencies Easy Challenging

Sortingdependencies Easy Challenging

Computationscalability Poor Challenging


Bandwidthscalability Poor Challenging

Load balancingscalability (Nearly) impossible Challenging

Ordering challenges

Fundamental:

Frame buffer operations’ l hPainter’s algorithm

Memory hazards

Texture writesRender Copy to texture Render

Readback


From pipelining:

Changes to graphics state

4/18/2007

16

Sorting taxonomy

Application

Command Sort First

Sort-MiddleGeometry

Rasterization

Texture

Command

Sort-Last Fragment

Sort-First


Sort-Last Image CompositionFragment

Display

Sort-First

4/18/2007

17

Sort-first

Cmd

App

Cmd

App Pre-transformation

Geom

Rast

CmdPoint-to-point communication scales

Coarse tiling incurs load imbalanceTex

Geom

Rast

Cmd

Tex

SORT


Frag

Disp

Princeton Display Wall, Stanford WireGL

Frag

Disp

Sort-first

Order

Automatic (conceptually)

Sort

Pre-stage (cheat ☺)

Compute scalability

Good

Bandwidth scalability



Good

Load balance scalability

Poor

4/18/2007

18

Sort-first

Cmd

App

Cmd

App

Geom

Rast

Cmd

Tex

Geom

Rast

Cmd

Tex

SORT


FragFrag

Disp

ROUTE

DIST

Ring parallelism

Cmd

App

Cmd Cmd CmdDIST DIST DIST…Geom

Rast

Cmd

Tex

Geom

Rast

Cmd

Tex

Geom

Rast

Cmd

Tex

Geom

Rast

Cmd

Tex


Disp

3DLABs

FragROUTE

Frag Frag Frag

4/18/2007

19

Sort-Middle

Image-space work distribution


Fuchs - InterleavedParke - Tiled

4/18/2007

20

Sort-middle interleaved

App

CmdCmd Geometry work load-balanced,

DISTRIBUTE

Broadcast communication does not scale, but supports ordering

Finely interleaved screen tilingensures excellent load balance

Geom

Rast

SORT

Tex

Cmd

Geom

Rast

Tex

Cmd y ,except clipping and tesselation


e su es e celle t load bala ce

SGI Graphics Workstations: RealityEngine, InfiniteReality

Frag

Disp

ROUTE

Frag

Sort-middle interleaved

Order

Force sequence at triangle sort

Sort

Broadcast

Compute scalability

Good




Limited by sort broadcast


Good

4/18/2007

21

SGI RealityEngine

240 MB/s

1600 MB/s


3200 MB/s

Sort-middle tiled

App

Cmd

App

Cmd

Geom

Rast

SORTPoint-to-point communication scales

Coarse tiling incurs load imbalanceTex

Cmd

Geom

Rast

Tex

Cmd


Frag

Disp

ROUTE

UNC PixelPlanes, Stanford Argus

Frag

Disp

4/18/2007

22

Sort-middle tiled (immediate mode)

Order

Force sequence at triangle sort

Sort

Can approach point-to-point

Compute scalability

Good




Good


Poor for rasterization (due to large triangles)

Sort-middle tiled (chunked)

OrderForce sequence at triangle sortFull frame delay render to texture difficultiesFull-frame delay, render to texture difficulties

SortCan approach point-to-point

Compute scalabilityGood


Bandwidth scalabilityGood

Load balance scalabilityGood

4/18/2007

23

UNC Pixel-Planes5 (1990)


Sort-Last

4/18/2007

24

Sort-last fragment

App

Cmd

App

Cmd Improved texture locality

Point-to-point communication scales, but requires more bw

Geom

Rast

Tex

Cmd

Geom

Rast

Tex

Cmd

SORT

Exposes rasterization load imbalance to application

No redundant work in FG

CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007Kubota Denali, E&S Freedom 3000

Finely interleaved screen tilinginsures excellent load balance

Frag

Disp

ROUTE

Frag

Disp Possible, but difficult, to maintain ordering

Sort-last fragment

Order

Force sequence at fragment sort

Sort

Point-to-point, high bandwidth

Compute scalability

Good




OK (sorting is the bottleneck)


OK (exposed to application)

4/18/2007

25

Kubota Denali (1993)

TEM48 5 X6

24 X10

FBM


FBM

Denali Technical Overview 1.0

Kubota Pacific Computer, 1993

Image composition

Z comp


Z comp

Other combiners possible

4/18/2007

26

Sort-last image composition

Exposes rasterization load imbalance to application

App

Cmd

App

CmdPoint-to-point ring interconnect scalesGeom

Rast

F

Tex

Cmd

Geom

Rast

F

Tex

Cmd

CS448 Lecture 4 Kurt Akeley, Pat Hanrahan, Spring 2007UNC/HP PixelFlow, Aizu VC-1, Stanford Lightning-2

Frag

Disp

Frag

Disp

SORT

Two-stage image compositionloses ordering

Sort-last image composition

Order

Not fully supported !

Sort

One to many for each pipeline

Compute scalability

Excellent




Excellent


OK (exposed to application)

4/18/2007

27

UNC Pixel Flow


From J. Poulton, J. Eyles, S. Molnar, H. Fuchs,

Pixel Flow: The Realization

4/18/2007

28

Sort-Everywhere

Sort-everywhere: Pomegranate

Cmd

App

Cmd

App

Cmd

Geom

Rast

TexMem

Cmd

Geom

Rast

Tex Mem

DISTRIBUTE


Frag

Disp

Mem Frag

Disp

Mem

SORT

ROUTE

4/18/2007

29

Architecture comparison

inte

rlea

ved

tile

d (i

mm

d)

tile

d (c

hunk

)

gmen

t

age

com

p.

hereX indicates

an issue

Ordered (X) X

Compute

Sort

-fir

st

Sort

-mid

dle

i

Sort

-mid

dle

t

Sort

-mid

dle

t

Sort

-las

t fr

ag

Sort

-las

t im

a

Sort

-eve

ryw

han issue


ComputescalabilityBandwidthscalability X X

Load balancescalability X X X X

Summary

GPU architecture trend

Pipeline hardware-parallel virtual-parallel


4/18/2007

30

Readings

Required

1. S. Molnar, M. Cox, D. Ellsworth, H. Fuchs, A sorting classification of parallel renderingclassification of parallel rendering

2. Fuchs et al., A heterogenous multiprocessor graphics system using processor-enhanced memories (PP5).

3. Eyles et al., PixelFlow: The Realization

Recommended


1. F. I. Parke, Simulation and expected performance analysis of multiple processor z-buffer systems

2. H. Fuchs, Distributing a visible surface algorithm over multiple processors

Real-Time Graphics Architecture

Lecture 4: Parallelism and Communication

p

Kurt Akeley


Kurt Akeley

Pat Hanrahanhttp://graphics.stanford.edu/cs448-07-spring/

Real-Time Graphics Architecture

Documents

Transcript of Real-Time Graphics Architecture