Polygon Rendering on a Stream Architecture
John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, Ben Mowery
Concurrent VLSI Architecture Group
Computer Systems Laboratory
Stanford University
Today’s Best Hardware
• Commercial hardware: fast, cheap, ubiquitous
• Flexibility limited
• OpenGL scenes: Programmable streams deliver comparable performance.
Frame from Quake 3 Arena, © id Software
Today’s Best Software
• Today’s software solutions: powerful and flexible, but slow
• OpenGL scenes: Streams deliver 20x performance.
Frame from A Bug’s Life, © Pixar Animation Studios, 1998
The Vision
• Performance of a special-purpose processor
• Programmability of a general-purpose processor
• “Real-Time Renderman”
Outline
• What is stream processing?
• The Imagine architecture
• Polygon rendering on a stream architecture
• Results
• Conclusions
Kernels and Streams
• A stream is a set of elements of an arbitrary datatype.
• A computational kernel operates on streams.
[Figure: a Transform kernel with its input and output streams]
Stream Processing
• All data is streams!
• 2 levels of programming: stream-level code and kernel-level code
[Figure: stream pipeline with Transform, Shader, and Zcompare kernels, the ZBuffer and ColorBuffer, and streams labeled offset, z, and z,color]
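The two levels of programming can be sketched roughly as follows. This is illustrative Python, not Imagine's actual stream or kernel languages; `run_kernel`, `transform`, `shade`, and `pipeline` are hypothetical names, and the toy arithmetic inside each kernel is an assumption made only to keep the example runnable.

```python
from typing import Callable, Iterable, List, Tuple

Vertex = Tuple[float, float, float]

def run_kernel(kernel: Callable, stream: Iterable) -> List:
    """Apply one kernel to every element of an input stream, producing an output stream."""
    return [kernel(element) for element in stream]

# Kernel-level code: the work done on a single stream element.
def transform(v: Vertex) -> Vertex:
    """Toy 'transform' kernel: translate a vertex by (1, 0, 0)."""
    x, y, z = v
    return (x + 1.0, y, z)

def shade(v: Vertex) -> Tuple[float, float]:
    """Toy 'shader' kernel: emit (z, color), with color set to z clamped to [0, 1]."""
    _, _, z = v
    return (z, min(max(z, 0.0), 1.0))

# Stream-level code: sequence kernels by passing streams between them.
def pipeline(vertices: List[Vertex]) -> List[Tuple[float, float]]:
    transformed = run_kernel(transform, vertices)
    return run_kernel(shade, transformed)

fragments = pipeline([(0.0, 0.0, 0.25), (1.0, 2.0, 3.0)])
```

The intermediate stream `transformed` is produced by one kernel and consumed immediately by the next, which is the producer-consumer pattern the stream model is built around.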
Media Apps and Streams
• Producer-consumer locality
• High arithmetic requirements
• Homogeneous computation: efficient control, data parallelism
… poor match for microprocessors
[Figure: stream pipeline with Transform, Shader, and Zcompare kernels, the ZBuffer and ColorBuffer, and streams labeled offset, z, and z,color]
The Imagine Architecture
[Figure: Imagine Stream Processor block diagram. Host Processor, Stream Controller, Network Interface (to the network), Streaming Memory System (four SDRAMs), Stream Register File, Microcontroller, and ALU Clusters 0 to 7]
Bandwidth Hierarchy
[Figure: four SDRAMs feeding the Stream Register File, which feeds eight ALU clusters under SIMD/VLIW control]
Peak BW: 4 GB/s (SDRAM), 32 GB/s (Stream Register File), 544 GB/s (within the ALU clusters)
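To make the hierarchy concrete, the short calculation below expresses each level's peak bandwidth relative to SDRAM. It assumes the three figures on the slide correspond, in order, to SDRAM, the stream register file, and the registers inside the ALU clusters; the dictionary keys are labels chosen for this sketch.

```python
# Peak bandwidths quoted on the slide, in GB/s.
# The mapping of each figure to a level is an assumption of this sketch.
peak_bw_gb_s = {"SDRAM": 4, "SRF": 32, "clusters": 544}

def ratios(bw: dict) -> dict:
    """Return each level's peak bandwidth as a multiple of SDRAM bandwidth."""
    base = bw["SDRAM"]
    return {level: value / base for level, value in bw.items()}

r = ratios(peak_bw_gb_s)
for level, ratio in r.items():
    print(f"{level}: {ratio:g}x SDRAM")
```

Under that assumption the SRF offers 8 times the SDRAM bandwidth and the clusters 136 times, which is why keeping intermediate streams out of memory matters.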
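Cluster Organization
[Figure: one ALU cluster. Functional units (+ + + * * /) and a communication unit (CU) on the intercluster network, each with a local register file, connected through a cross point; data arrives from the SRF and returns to the SRF]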
Imagine Stats & Status
• 0.59 cm² CMOS chip, 500 MHz
• Circuits/Logic: expected completion 9/15/00
• Tapeout: expected Q4/2000
• Fab: TI GS30KA process (0.15 µm drawn)
Polygon Rendering Outline
• Overview of OpenGL pipeline
• How we map OpenGL into streams & kernels
• How stream operations are sequenced
• How kernels are mapped onto Imagine: use of stream recirculation; detail of 3 steps in the pipeline (matrix transformation, scan conversion, enforcing ordering in composite stage)
OpenGL Pipeline Overview
[Figure: Application → Geometry → Rasterization → Composite → Image]
OpenGL:
• Has state
• Requires immediate mode
• Respects ordering
Pipeline Detail
[Figure: kernel pipeline from Input Data to Image]
• Geometry: Transform, GLShader, PrimitiveAssembly, Cull, Project
• Rasterization: Spanprep, Spangen, Spanrast, TextureLookup
• Composite: Hash, Sort/Merge, Z Lookup, Zcompare, Compact, Color/Z Write
Pipeline Stream Datatypes
[Figure: the kernel pipeline (Geometry, Rasterization, Composite, Image) annotated with the stream datatypes passed between kernels: vertices, triangles, spans, fragments, offsets, depths]
Most data is floating point.
Stream Recirculation
[Figure: the Transform, Shader, and Zcompare pipeline laid out across Memory, SRF, and Clusters, with the ZBuffer and ColorBuffer and streams labeled offset, z, and z,color]
• Strip-mining
• Memory accesses: initial load of vertices; lookup of color/z/texture; writeback of color/z
• All other data accesses are local to the SRF
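Strip-mining here means cutting a long input stream into batches small enough that the intermediate streams of one batch fit on chip. A minimal sketch, assuming a fixed batch size and a caller-supplied `pipeline` function (both hypothetical; the slides do not say how batch sizes are chosen):

```python
from typing import Callable, Iterator, List, Sequence

def strip_mine(stream: Sequence, batch_size: int) -> Iterator[Sequence]:
    """Yield consecutive batches of at most batch_size elements."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(stream), batch_size):
        yield stream[start:start + batch_size]

def render(vertices: Sequence, batch_size: int,
           pipeline: Callable[[Sequence], List]) -> List:
    """Run the whole kernel pipeline on one batch at a time."""
    out: List = []
    for batch in strip_mine(vertices, batch_size):
        # Only the batch's input load and final writeback would touch memory;
        # intermediate streams inside pipeline() are meant to stay on chip.
        out.extend(pipeline(batch))
    return out

batches = list(strip_mine(list(range(10)), 4))
doubled = render(list(range(10)), 4, lambda batch: [2 * v for v in batch])
```

The last batch may be shorter than the others; results are appended in batch order, so the output order matches the input order.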
Stream and Kernel Flow
[Figure: execution timeline, excerpt from ADVS-1 run. Columns: CLUSTERS, MEM STR 0, MEM STR 1. Kernels on the clusters: xform, project, assemble, rasterize, zcompare, then xform for the next batch. Memory streams: texture load, Z load, Z store, color store, vertex load for next batch]
Mapping Xform to Imagine
[Figure: the Transform kernel mapped across Memory (four RAMs), the SRF, and the clusters]
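One way to picture the data-parallel mapping of the transform is sketched below: every cluster applies the same 4x4 matrix to its own share of the vertex stream. The round-robin assignment of vertices to clusters and the names `xform` and `xform_stream` are assumptions of this sketch, not details taken from the slides.

```python
from typing import List, Optional, Sequence

NUM_CLUSTERS = 8  # the Imagine block diagram shows ALU clusters 0 to 7

Vec4 = Sequence[float]
Mat4 = Sequence[Sequence[float]]

def xform(m: Mat4, v: Vec4) -> List[float]:
    """Multiply a 4x4 matrix by a homogeneous vertex (x, y, z, w)."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def xform_stream(m: Mat4, vertices: Sequence[Vec4]) -> List[Optional[List[float]]]:
    """Apply the same matrix to every vertex, one slice of the stream per cluster."""
    out: List[Optional[List[float]]] = [None] * len(vertices)
    for cluster in range(NUM_CLUSTERS):
        # Assumed distribution: cluster k handles vertices k, k + 8, k + 16, ...
        for i in range(cluster, len(vertices), NUM_CLUSTERS):
            out[i] = xform(m, vertices[i])
    return out

translate = [
    [1.0, 0.0, 0.0, 2.0],
    [0.0, 1.0, 0.0, 3.0],
    [0.0, 0.0, 1.0, 4.0],
    [0.0, 0.0, 0.0, 1.0],
]
moved = xform_stream(translate, [(1.0, 1.0, 1.0, 1.0)] * 10)
```

In this Python version the clusters run one after another; on SIMD hardware the eight slices would be processed in lockstep by a single instruction stream.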
Mapping Spanrast to Imagine
[Figure: the Spanrast kernel mapped across Memory, the SRF, and the clusters]
Enforcing ordering
• General sort possible, but too expensive
• Hash much cheaper! Hash function: 12 bits (low 6 bits of x, low 6 bits of y). Hash table: 2^12 entries, 2 bits/entry, 16 words/scratchpad/cluster
• Compact: enforces ordering constraint
[Figure: composite-stage kernels Hash, Sort, Merge, Compact, and Zcompare]
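The hash described above can be sketched as follows. The bit layout (x in the low six bits, y in the next six) and the use of the hash to separate colliding fragments within a batch are assumptions of this sketch; the slide's table holds 2 bits per entry, whereas this version keeps only a single seen flag per entry. `pixel_hash` and `split_conflicts` are hypothetical names.

```python
from typing import List, Tuple

BITS_PER_AXIS = 6
MASK = (1 << BITS_PER_AXIS) - 1  # 0x3F, keeps the low 6 bits
TABLE_SIZE = 1 << (2 * BITS_PER_AXIS)  # 2^12 = 4096 entries

def pixel_hash(x: int, y: int) -> int:
    """12-bit hash: low 6 bits of x in bits 0-5, low 6 bits of y in bits 6-11."""
    return (x & MASK) | ((y & MASK) << BITS_PER_AXIS)

Pixel = Tuple[int, int]

def split_conflicts(fragments: List[Pixel]) -> Tuple[List[Pixel], List[Pixel]]:
    """Split a batch, in stream order, into fragments whose hash is new
    and fragments whose hash collides with an earlier fragment."""
    seen = [False] * TABLE_SIZE
    first: List[Pixel] = []
    conflicting: List[Pixel] = []
    for x, y in fragments:
        h = pixel_hash(x, y)
        if seen[h]:
            conflicting.append((x, y))
        else:
            seen[h] = True
            first.append((x, y))
    return first, conflicting

first, conflicting = split_conflicts([(1, 1), (2, 2), (1, 1), (65, 1)])
```

Because only the low bits are kept, distinct pixels such as (1, 1) and (65, 1) alias to the same entry; treating such aliases as conflicts is conservative but never reorders two fragments that hit the same pixel.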
Image Composition
[Figure: image composition mapped across Memory (four RAMs), the SRF, and the eight clusters. Zcompare consumes an offset,z,color stream; the offset stream addresses the Z Buffer and Color Buffer, a z stream is read from the Z Buffer, and a z,color stream is written back]
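The depth test at the heart of the composite stage can be sketched as below. This collapses the separate Z lookup, compare, and write steps into one loop for clarity, and it assumes a "less than" depth test in which smaller z is closer; neither choice is stated on the slides, and `zcompare` here is a hypothetical Python function rather than the actual kernel.

```python
from typing import List, Tuple

Color = Tuple[int, int, int]
Fragment = Tuple[int, float, Color]  # (offset into the buffers, z, color)

def zcompare(fragments: List[Fragment],
             z_buffer: List[float],
             color_buffer: List[Color]) -> int:
    """Depth-test fragments in stream order; return how many were written."""
    written = 0
    for offset, z, color in fragments:
        if z < z_buffer[offset]:  # assumed convention: smaller z is closer
            z_buffer[offset] = z
            color_buffer[offset] = color
            written += 1
    return written

z_buffer = [1.0] * 4
color_buffer: List[Color] = [(0, 0, 0)] * 4
fragments: List[Fragment] = [
    (2, 0.5, (255, 0, 0)),  # passes: 0.5 < 1.0
    (2, 0.7, (0, 255, 0)),  # fails: 0.7 is behind 0.5
    (2, 0.3, (0, 0, 255)),  # passes: 0.3 < 0.5
    (0, 1.0, (9, 9, 9)),    # fails: not strictly closer than 1.0
]
written = zcompare(fragments, z_buffer, color_buffer)
```

Processing the fragments strictly in order is what makes the result well defined when several fragments target the same offset, which is the ordering constraint the composite stage has to preserve.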
Benchmarks
• ADVS-1: 62k vertices as point-sampled polygons (SPECviewperf 6.1.1 Advanced Visualizer)
• ADVS-8: mipmapped version of ADVS-1
• Sphere: 82k lit, Gouraud-shaded triangles; 3 positional lights
• Fill: 20k mipmapped 25-pixel triangles
[Figures: frames from ADVS and Sphere]
Experimental setup
• Comparison systems: Microsoft opengl32.dll (sustained); NVIDIA Quadro (sustained); NVIDIA Quadro (peak)
• Test system: 450 MHz PIII Xeon, NT 4.0
• For comparison: low-overhead trace player (no application overhead); average over 100s of frames (no startup costs); disabled vsync
Results Summary
[Figure: bar chart on a log scale from 0.01 to 100. Y axis: frame rate normalized to Imagine. Series: Software (opengl32.dll), Imagine sustained, NVIDIA sustained, NVIDIA peak. Scenes: advs-1, sphere, advs-8, fill]
Stream-level Performance
• Computation, not memory, bound: highest memory system occupancy 58.7%
• Cluster occupancy: 94.3% - 98.8%; reuse
• 5.6 GOPS on Sphere
[Figure: execution timeline with columns CLUSTERS, MEM STR 0, MEM STR 1]
Imagine Kernel Breakdown
[Figure: bar chart for ADVS-8. Y axis: millions of Imagine cycles, 0 to 4. Kernels grouped by stage: geometry (xform, project, assemblepoly, backfacecull), rasterization (spanprep, spangen, spanrast, texfilter), composite (hash, sort, compact, zcompare)]
• Majority of time is in rasterization
• ADVS-8 has 2.5x the ops/frame of ADVS-1
Future Directions
• Extend generality of OpenGL pipeline; add more complex scenes
• Programmable shading and lighting: straightforward to add per-vertex/per-fragment ops; eliminate multipass; goal: “toolbox” of flexible elements
• Non-polygon rendering: raytracing, IBR, …
• Scalability: multi-Imagine implementations
Conclusions
• Streams: Powerful primitive
• Stream architectures: Enable high performance
• Flexibility of general-purpose processor: 20x better frame rates than commercial software
• Performance of special-purpose processor: comparable frame rates to commercial hardware
Acknowledgements
• DARPA
• Industrial sponsors: Texas Instruments, Intel Corporation
• Matthew Eldridge and Kekoa Proudfoot
• Brian Towles and Brucek Khailany
• Anonymous reviewers for helpful comments
• The US Passport Office: same-day turnaround!