Polygon Rendering on a Stream Architecture John D. Owens, William J. Dally, Ujval J. Kapasi, Scott...

Polygon Rendering on a Stream Architecture

John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, Ben Mowery

Concurrent VLSI Architecture Group

Computer Systems Laboratory

Stanford University

Today’s Best Hardware

• Commercial hardware: fast, cheap, ubiquitous

• Flexibility limited

• OpenGL scenes: Programmable streams deliver comparable performance.

Frame from Quake 3 Arena, © id Software

Today’s Best Software

• Today’s software solutions: powerful and flexible, but slow

• OpenGL scenes: Streams deliver 20x performance.

Frame from A Bug’s Life, © Pixar Animation Studios, 1998

The Vision

• Performance of a special-purpose processor

• Programmability of a general-purpose processor

• “Real-Time Renderman”

Outline

• What is stream processing?

• The Imagine architecture

• Polygon rendering on a stream architecture

• Results

• Conclusions

Kernels and Streams

• A stream is a set of elements of an arbitrary datatype.

• A computational kernel operates on streams.

[Figure: a Transform kernel with its input and output streams]

Stream Processing

• All data is streams!

• 2 levels of programming: stream-level code and kernel-level code

[Figure: stream pipeline. Kernels: Transform, Shader, Zcompare. Buffers: ZBuffer, ColorBuffer. Stream labels: offset; z; z,color.]
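The two levels of programming can be sketched in Python, used here only as illustrative notation. Kernel-level code is the work done on each stream element; stream-level code wires kernels together by passing streams between them. The names `transform`, `shade`, and `run_pipeline`, the vertex layout, and the toy shading rule are all hypothetical; this is not Imagine's actual stream or kernel language.

```python
from typing import Iterable, Iterator, List, Tuple

Vertex = Tuple[float, float, float]          # hypothetical (x, y, z) record
Shaded = Tuple[float, float, float, float]   # hypothetical (x, y, z, intensity)

# Kernel-level code: the computation applied to each element of a stream.
def transform(vertices: Iterable[Vertex], scale: float) -> Iterator[Vertex]:
    for x, y, z in vertices:
        yield (x * scale, y * scale, z * scale)

def shade(vertices: Iterable[Vertex]) -> Iterator[Shaded]:
    for x, y, z in vertices:
        # Toy rule (an assumption): intensity falls off with depth.
        yield (x, y, z, 1.0 / (1.0 + abs(z)))

# Stream-level code: connects kernels; the output stream of one kernel
# is the input stream of the next.
def run_pipeline(vertices: Iterable[Vertex]) -> List[Shaded]:
    transformed = transform(vertices, scale=2.0)
    return list(shade(transformed))

result = run_pipeline([(1.0, 0.0, 0.0), (0.0, 1.0, 1.0)])
print(result)
```

The intermediate `transformed` stream is consumed as soon as it is produced, which is the producer-consumer pattern the pipeline diagram depicts.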

Media Apps and Streams

• Producer-consumer locality

• High arithmetic requirements

• Homogeneous computation: efficient control, data parallelism

… poor match for microprocessors

[Figure: the Transform, Shader, Zcompare stream pipeline with ZBuffer and ColorBuffer, repeated on this slide]

The Imagine Architecture

[Figure: Imagine block diagram. The Imagine Stream Processor contains a Stream Register File, ALU Clusters 0 to 7, a Microcontroller, a Stream Controller, a Network Interface, and a Streaming Memory System. External connections: Host Processor, Network, and four SDRAMs.]

Bandwidth Hierarchy

[Figure: bandwidth hierarchy across four SDRAMs, the Stream Register File, and eight ALU clusters under SIMD/VLIW control. Peak BW: 4 GB/s, 32 GB/s, 544 GB/s.]
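The three peak bandwidths on this slide can be compared with a little arithmetic. The sketch below assumes, from the diagram layout, that 4 GB/s is the SDRAM-to-SRF level, 32 GB/s the SRF-to-cluster level, and 544 GB/s the bandwidth inside the clusters; the level names are labels chosen for illustration.

```python
# Peak bandwidths from the slide, in GB/s. The assignment of each figure
# to a level is an assumption based on the diagram layout.
peak_bw = {
    "SDRAM <-> SRF": 4,
    "SRF <-> clusters": 32,
    "within clusters": 544,
}

base = peak_bw["SDRAM <-> SRF"]
ratios = {level: bw / base for level, bw in peak_bw.items()}

for level, ratio in ratios.items():
    print(f"{level}: {ratio:.0f}x the memory bandwidth")
```

Under that assumption each step away from memory offers 8x and then 136x the off-chip bandwidth, which is why keeping intermediate streams out of memory matters.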

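Cluster Organization

[Figure: ALU cluster. Functional units: + + + * * / and a CU attached to the Intercluster Network. Inputs arrive from the SRF and outputs go to the SRF. A Cross Point connects the units to the Local Register File.]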

Imagine Stats & Status

• 0.59 cm² CMOS chip, 500 MHz

• Circuits/Logic: expected completion 9/15/00

• Tapeout: expected Q4/2000; fab: TI GS30KA process (0.15 µm drawn)


Polygon Rendering Outline

• Overview of OpenGL pipeline

• How we map OpenGL into streams & kernels

• How stream operations are sequenced

• How kernels are mapped onto Imagine: use of stream recirculation; detail of 3 steps in the pipeline (matrix transformation, scan conversion, enforcing ordering in composite stage)

OpenGL Pipeline Overview

[Figure: pipeline stages: Application, Geometry, Rasterization, Composite, Image]

OpenGL:

• Has state

• Requires immediate mode

• Respects ordering

Pipeline Detail

[Figure: kernels in each pipeline stage, from Input Data to Image]

Geometry: Transform, GLShader, Primitive Assembly, Cull, Project

Rasterization: Spanprep, Spangen, Spanrast, Texture Lookup

Composite: Hash, Sort/Merge, Z Lookup, Zcompare, Compact, Color/Z Write

Pipeline Stream Datatypes

[Figure: the pipeline detail diagram annotated with stream datatypes: vertices, triangles, spans, fragments, offsets, depths]

Most data is floating point.

Stream Recirculation

[Figure: the Transform, Shader, Zcompare pipeline with ZBuffer and ColorBuffer, laid out across three columns: Memory, SRF, Clusters]

• Strip-mining

• Memory accesses: initial load of vertices; lookup of color/z/texture; writeback of color/z

• All other data accesses are local to the SRF
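Strip-mining can be sketched as follows: the input is cut into batches small enough to stay in on-chip storage, and every intermediate stream for a batch is produced and consumed locally before the next batch is loaded. The function names, the batch size, and the two stand-in kernels are hypothetical; Python lists stand in for the SRF.

```python
from typing import Iterator, List, Sequence, Tuple

def strip_mine(data: Sequence[float], batch_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive batches of at most batch_size elements."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]

def render(vertices: Sequence[float], batch_size: int) -> Tuple[List[float], int]:
    output: List[float] = []
    loads = 0
    for batch in strip_mine(vertices, batch_size):
        loads += 1                            # one memory load per batch
        stage1 = [v * 2.0 for v in batch]     # stand-in kernel 1
        stage2 = [v + 1.0 for v in stage1]    # stand-in kernel 2, reads stage1 locally
        output.extend(stage2)                 # writeback of final results only
    return output, loads

out, loads = render([0.0, 1.0, 2.0, 3.0, 4.0], batch_size=2)
print(out, loads)
```

Only the initial load and the final writeback touch `vertices` and `output`; `stage1` never leaves the loop body, mirroring the point that all other data accesses are local to the SRF.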

Stream and Kernel Flow

[Figure: timeline excerpt from an ADVS-1 run with three columns: CLUSTERS, MEM STR 0, MEM STR 1. Kernels: xform, project, assemble, rasterize, zcompare, then xform again. Memory streams: Texture load, Z load, Z store, Color store, Vertex load for next batch.]

Mapping Xform to Imagine

[Figure: the Transform kernel mapped onto Imagine. Columns: Memory (four RAMs), SRF, Clusters (eight clusters).]
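A minimal sketch of the matrix transformation under SIMD control, assuming one vertex per cluster per iteration (the slide does not state the exact data layout). `NUM_CLUSTERS` comes from the block diagram; `xform` and `xform_stream` are hypothetical names, and the Python loop only models what the hardware would do in parallel.

```python
from typing import List, Sequence

NUM_CLUSTERS = 8  # eight ALU clusters, per the Imagine block diagram

Vec4 = Sequence[float]
Mat4 = Sequence[Sequence[float]]

def xform(matrix: Mat4, v: Vec4) -> List[float]:
    """Kernel body: multiply a 4x4 matrix by a homogeneous vertex."""
    return [sum(matrix[r][c] * v[c] for c in range(4)) for r in range(4)]

def xform_stream(matrix: Mat4, vertices: Sequence[Vec4]) -> List[List[float]]:
    out: List[List[float]] = []
    # Each outer step models one SIMD iteration: up to NUM_CLUSTERS vertices,
    # one per cluster (an assumption), all running the same kernel body.
    for start in range(0, len(vertices), NUM_CLUSTERS):
        group = vertices[start:start + NUM_CLUSTERS]
        out.extend(xform(matrix, v) for v in group)
    return out

# Translate by 5 along x.
translate = [
    [1.0, 0.0, 0.0, 5.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
verts = [[float(i), 0.0, 0.0, 1.0] for i in range(10)]
result = xform_stream(translate, verts)
print(result[0], result[9])
```

Ten vertices take two iterations here; the second uses only two of the eight lanes, which is one reason long streams keep clusters busier than short ones.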

Mapping Spanrast to Imagine

[Figure: the Spanrast kernel mapped onto Imagine. Columns: Memory, SRF, Clusters.]
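To make the span-to-fragment step concrete, here is a hedged sketch of span rasterization: one span record goes in and a variable number of fragments comes out. The `Span` and `Fragment` layouts, the inclusive right edge, and linear interpolation of z are assumptions for illustration, not the actual Spanrast kernel.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Span:           # hypothetical record layout
    y: int
    x_left: int
    x_right: int      # inclusive (an assumption)
    z_left: float
    z_right: float

@dataclass
class Fragment:       # hypothetical record layout
    x: int
    y: int
    z: float

def spanrast(span: Span) -> List[Fragment]:
    """Emit one fragment per pixel in the span, interpolating z linearly."""
    width = span.x_right - span.x_left
    fragments = []
    for i in range(width + 1):
        t = i / width if width > 0 else 0.0
        z = span.z_left + t * (span.z_right - span.z_left)
        fragments.append(Fragment(span.x_left + i, span.y, z))
    return fragments

frags = spanrast(Span(y=3, x_left=10, x_right=14, z_left=0.0, z_right=1.0))
print([(f.x, f.z) for f in frags])
```

The output length depends on the span width, so this kernel produces a different number of elements than it consumes.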

Enforcing ordering

• General sort possible, but too expensive

• Hash much cheaper! Hash function: 12 bits (low 6 bits of x, low 6 bits of y). Hash table: 2^12 entries, 2 bits/entry, 16 words/scratchpad/cluster

• Compact: Enforces ordering constraint

[Figure: composite-stage kernels: Hash, Sort, Merge, Compact, Zcompare]
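The 12-bit hash can be written down directly; how it is used to enforce ordering is less clear from the slide, so the grouping below is only one plausible illustration and not the authors' Hash or Compact kernels. Assumptions: the y bits occupy the high half of the hash, fragments are `(x, y, payload)` tuples, and the 2 bits/entry of the real hash table are not modeled.

```python
from typing import Dict, List, Tuple

Fragment = Tuple[int, int, str]  # hypothetical (x, y, payload)

def hash12(x: int, y: int) -> int:
    """12-bit hash from the low 6 bits of x and the low 6 bits of y.
    Which coordinate lands in the high half is an assumption."""
    return ((y & 0x3F) << 6) | (x & 0x3F)

def conflict_free_groups(fragments: List[Fragment]) -> List[List[Fragment]]:
    """Split a batch into groups with no repeated hash inside a group.
    The k-th fragment with a given hash goes to group k, so fragments that
    may touch the same pixel keep their submission order across groups."""
    seen: Dict[int, int] = {}
    groups: List[List[Fragment]] = []
    for frag in fragments:
        h = hash12(frag[0], frag[1])
        k = seen.get(h, 0)
        if k == len(groups):
            groups.append([])
        groups[k].append(frag)
        seen[h] = k + 1
    return groups

batch = [(1, 1, "a"), (2, 1, "b"), (1, 1, "c"), (65, 65, "d")]
groups = conflict_free_groups(batch)
print(groups)
```

Fragment `d` is a different pixel from `a` and `c` but shares their hash, so it is treated as a conflict anyway: a hash is cheaper than a full sort at the cost of such false conflicts.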

Image Composition

[Figure: image composition mapped onto Imagine. Columns: Memory (four RAMs), SRF, Clusters (eight clusters). The Zcompare kernel sits between the Z Buffer and the Color Buffer. Stream labels: offset,z,color; offset; z; z,color.]
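The composite step in the diagram can be sketched as a depth test over a stream of `(offset, z, color)` fragments: look up the stored z at each offset, compare, and write back z and color for the fragments that pass. The "smaller z is closer" convention, the buffer sizes, and the function name `zcompare` are assumptions for illustration.

```python
from typing import List, Sequence, Tuple

Fragment = Tuple[int, float, int]  # (offset, z, color), following the stream labels

def zcompare(fragments: Sequence[Fragment],
             z_buffer: List[float],
             color_buffer: List[int]) -> None:
    """Depth-test each fragment against the stored z and update both buffers.
    Assumes smaller z means closer to the viewer."""
    for offset, z, color in fragments:
        if z < z_buffer[offset]:        # z lookup and compare
            z_buffer[offset] = z        # z writeback
            color_buffer[offset] = color  # color writeback

z_buffer = [1.0] * 4     # initialized to the far plane
color_buffer = [0] * 4
frags = [(2, 0.5, 0xFF0000), (2, 0.7, 0x00FF00), (0, 0.9, 0x0000FF)]
zcompare(frags, z_buffer, color_buffer)
print(z_buffer, color_buffer)
```

This loop is sequential, so two fragments at the same offset are resolved in submission order; a parallel version has to preserve that property, which is the ordering constraint discussed on the previous slide.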

Benchmarks

• ADVS-1: 62k vertices as point-sampled polygons (SPECviewperf 6.1.1 Advanced Visualizer)

• ADVS-8: mipmapped version of ADVS-1

• Sphere: 82k lit, Gouraud-shaded triangles; 3 positional lights

• Fill: 20k mipmapped 25-pixel triangles

[Figures: frames from the ADVS and Sphere benchmarks]

Experimental setup

• Comparison systems: Microsoft opengl32.dll (sustained); NVIDIA Quadro (sustained); NVIDIA Quadro (peak)

• Test system: 450 MHz PIII Xeon, NT 4.0

• For comparison: low-overhead trace player (no application overhead); average over 100s of frames (no startup costs); disabled vsync

Results Summary

[Figure: bar chart on a log scale from 0.01 to 100. Y axis: Frame Rate Normalized to Imagine. Systems: Software (opengl32.dll), Imagine sustained, NVIDIA sustained, NVIDIA peak. Scenes: advs-1, sphere, advs-8, fill.]

Stream-level Performance

• Computation, not memory, bound: highest memory system occupancy is 58.7%

• Cluster occupancy: 94.3% - 98.8% Reuse

• 5.6 GOPS on Sphere

[Figure: stream-level timeline with columns CLUSTERS, MEM STR 0, MEM STR 1]

Imagine Kernel Breakdown

[Figure: bar chart for ADVS-8. Y axis: Millions of Imagine cycles, 0 to 4. Kernels by stage: geometry (xform, project, assemblepoly, backfacecull), rasterization (spanprep, spangen, spanrast, texfilter), composite (hash, sort, compact, zcompare).]

• Majority of time is in rasterization

• ADVS-8 has 2.5x the ops/frame of ADVS-1

Future Directions

• Extend generality of OpenGL pipeline: add more complex scenes

• Programmable shading and lighting: straightforward to add per-vertex/per-fragment ops; eliminate multipass; goal: “Toolbox” of flexible elements

• Non-polygon rendering: raytracing, IBR, …

• Scalability: multi-Imagine implementations

Conclusions

• Streams: Powerful primitive

• Stream architectures: Enable high performance

• Flexibility of general-purpose processor: 20x better frame rates than commercial software

• Performance of special-purpose processor: comparable frame rates to commercial hardware

Acknowledgements

• DARPA

• Industrial sponsors: Texas Instruments, Intel Corporation

• Matthew Eldridge and Kekoa Proudfoot

• Brian Towles and Brucek Khailany

• Anonymous reviewers for helpful comments

• The US Passport Office: same-day turnaround!
