Page 1: Polygon Rendering on a Stream Architecture

John D. Owens, William J. Dally, Ujval J. Kapasi, Scott Rixner, Peter Mattson, Ben Mowery

Concurrent VLSI Architecture Group

Computer Systems Laboratory

Stanford University

Page 2: Today’s Best Hardware

• Commercial hardware: fast, cheap, ubiquitous

• Flexibility limited

• OpenGL scenes: Programmable streams deliver comparable performance.

Frame from Quake 3 Arena, © id Software

Page 3: Today’s Best Software

• Today’s software solutions: powerful and flexible, but slow

• OpenGL scenes: Streams deliver 20x performance.

Frame from A Bug’s Life, © Pixar Animation Studios, 1998

Page 4: The Vision

• Performance of a special-purpose processor

• Programmability of a general-purpose processor

• “Real-Time RenderMan”

Page 5: Outline

• What is stream processing?

• The Imagine architecture

• Polygon rendering on a stream architecture

• Results

• Conclusions

Page 6: Kernels and Streams

• A stream is a set of elements of an arbitrary datatype.

• A computational kernel operates on streams.

[Figure: streams flowing into and out of a kernel (Transform).]

Page 7: Stream Processing

• All data is streams!

• 2 levels of programming: stream-level code, kernel-level code

[Figure: stream diagram with blocks Transform, Shader, ZBuffer, Zcompare, ColorBuffer and stream labels "z,color", "z", "offset".]
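The two levels of programming can be illustrated with a small sketch. This is plain Python, not Imagine's actual kernel or stream languages; the names `transform_kernel`, `shade_kernel`, and `render_batch`, the use of lists as streams, and the toy arithmetic inside each kernel are all assumptions made for illustration.

```python
from typing import List, Tuple

Vertex = Tuple[float, float, float]

# Kernel-level code: per-element computation over the records of a stream.
def transform_kernel(vertices: List[Vertex], scale: float) -> List[Vertex]:
    """Hypothetical kernel: uniformly scale every vertex in the stream."""
    return [(x * scale, y * scale, z * scale) for (x, y, z) in vertices]

def shade_kernel(vertices: List[Vertex]) -> List[Tuple[float, float]]:
    """Hypothetical kernel: emit one (z, color) record per vertex.
    The 'color' is a stand-in value that falls off with depth."""
    return [(z, 1.0 / (1.0 + z)) for (_, _, z) in vertices]

# Stream-level code: sequences kernels and routes streams between them.
def render_batch(vertices: List[Vertex]) -> List[Tuple[float, float]]:
    transformed = transform_kernel(vertices, scale=2.0)
    return shade_kernel(transformed)
```

The stream-level function contains no per-element arithmetic; it only names which kernel consumes which stream, which is the division of labor the slide describes.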

Page 8: Media Apps and Streams

• Producer-consumer locality

• High arithmetic requirements

• Homogeneous computation: efficient control, data parallelism

… poor match for microprocessors

[Figure: the Transform / Shader / Zcompare stream diagram from Page 7, repeated.]

Page 9: The Imagine Architecture

[Figure: block diagram of the Imagine Stream Processor. Labeled blocks: Host Processor, Stream Controller, Network Interface, Network, Streaming Memory System (four SDRAMs), Stream Register File, Microcontroller, ALU Clusters 0 through 7.]

Page 10: Bandwidth Hierarchy

[Figure: SDRAM (x4) → Stream Register File → ALU Clusters (x8, under SIMD/VLIW control). Peak BW: 4 GB/s, 32 GB/s, 544 GB/s.]
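The ratios between levels follow directly from the peak figures on the slide. The quick calculation below assumes, from the figure layout, that 4 GB/s belongs to the SDRAM level, 32 GB/s to the stream register file, and 544 GB/s to the ALU clusters; the dictionary keys are illustrative names.

```python
# Peak bandwidths in GB/s. The assignment of each figure to a level
# is an assumption read off the layout of the slide.
peak_bw = {"sdram": 4, "srf": 32, "clusters": 544}

srf_vs_memory = peak_bw["srf"] / peak_bw["sdram"]          # 8.0
clusters_vs_srf = peak_bw["clusters"] / peak_bw["srf"]     # 17.0
clusters_vs_memory = peak_bw["clusters"] / peak_bw["sdram"]  # 136.0
```

Under that reading, each step toward the ALUs multiplies the available peak bandwidth, so an application sustains the arithmetic rate only if most of its data traffic stays at the inner levels.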

Page 11: Cluster Organization

[Figure: one ALU cluster. Labels: From SRF, To SRF, Intercluster Network, CU, functional units (+, +, +, *, *, /), Cross Point, Local Register File.]

Page 12: Imagine Stats & Status

• 0.59 cm² CMOS chip, 500 MHz

• Circuits/Logic: expected completion 9/15/00

• Tapeout: expected Q4/2000; fab: TI GS30KA process (0.15 µm drawn)

[Figure: Imagine Stream Processor block diagram, repeated from Page 9.]

Page 13: Polygon Rendering Outline

• Overview of OpenGL pipeline

• How we map OpenGL into streams & kernels

• How stream operations are sequenced

• How kernels are mapped onto Imagine: use of stream recirculation; detail of 3 steps in the pipeline (matrix transformation, scan conversion, enforcing ordering in composite stage)

Page 14: OpenGL Pipeline Overview

[Figure: pipeline stages. Labels: Application, Geometry, Rasterization, Image Composite.]

OpenGL:

• Has state

• Requires immediate mode

• Respects ordering

Page 15: Pipeline Detail

[Figure: pipeline from Input Data to Image, with kernels grouped by stage. Kernel labels as they appear in the figure:]

• Geometry: Transform, GLShader, Primitive Assembly, Cull, Project

• Rasterization: Spanprep, Spangen, Spanrast, Texture Lookup

• Composite: Hash, Z Lookup, Zcompare, Compact, Color/Z Write, Sort/Merge

Page 16: Pipeline Stream Datatypes

[Figure: the pipeline from Page 15, annotated with the stream datatypes passed between kernels: vertices, triangles, spans, fragments, offsets, depths.]

Most data is floating point.

Page 17: Stream Recirculation

[Figure: stream diagram (Transform, Shader, Zcompare, ZBuffer, ColorBuffer; streams "z,color", "z", "offset") laid out across three columns: Memory, SRF, Clusters.]

• Strip-mining

• Memory accesses: initial load of vertices; lookup of color/z/texture; writeback of color/z

• All other data accesses are local to the SRF
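Strip-mining can be sketched as follows: the input is cut into batches small enough that every intermediate stream fits on chip, so only the initial load and the final writeback touch memory. This is a minimal Python illustration, not Imagine code; `strip_mine`, `run_pipeline`, the batch size, and the two toy arithmetic steps are assumptions.

```python
from typing import Iterator, List, Sequence

def strip_mine(stream: Sequence[float], batch_size: int) -> Iterator[Sequence[float]]:
    """Yield consecutive batches of at most batch_size elements."""
    for start in range(0, len(stream), batch_size):
        yield stream[start:start + batch_size]

def run_pipeline(vertices: Sequence[float], batch_size: int) -> List[float]:
    """Process the input one batch at a time.

    Only the batch load and the final extend() stand in for memory traffic;
    the intermediate lists stand in for streams held in the SRF.
    """
    framebuffer: List[float] = []
    for batch in strip_mine(vertices, batch_size):  # initial load from memory
        intermediate = [v * 2.0 for v in batch]     # stays in the "SRF"
        result = [v + 1.0 for v in intermediate]    # stays in the "SRF"
        framebuffer.extend(result)                  # writeback to memory
    return framebuffer
```

For example, `run_pipeline([1.0, 2.0, 3.0, 4.0, 5.0], 2)` processes three batches and produces the same result as running both steps over the whole input at once.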

Page 18: Stream and Kernel Flow

[Figure: execution timeline, excerpt from an ADVS-1 run. Columns: CLUSTERS, MEM STR 0, MEM STR 1. Kernels: xform, project, assemble, rasterize, zcompare, then xform for the next batch. Memory operations: Z load, Z store, Color store, Texture load, Vertex load for next batch.]

Page 19: Mapping Xform to Imagine

[Figure: the Transform kernel mapped across Memory (four RAMs), the SRF, and eight clusters.]
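A minimal sketch of how a transform kernel might spread over the eight clusters in SIMD fashion: every cluster applies the same 4x4 matrix to a different vertex in each step. The round-robin assignment of vertices to clusters and the names `mat_vec4` and `xform_simd` are assumptions for illustration, not details taken from the slide.

```python
NUM_CLUSTERS = 8  # the Imagine block diagram shows ALU clusters 0 through 7

def mat_vec4(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vertex."""
    return tuple(sum(m[r][c] * v[c] for c in range(4)) for r in range(4))

def xform_simd(matrix, vertices):
    """Transform a vertex stream, NUM_CLUSTERS vertices per SIMD step.

    Vertex i is handled by cluster i % NUM_CLUSTERS (assumed mapping).
    """
    out = [None] * len(vertices)
    for base in range(0, len(vertices), NUM_CLUSTERS):  # one SIMD step
        for cluster in range(NUM_CLUSTERS):             # same op on every cluster
            i = base + cluster
            if i < len(vertices):
                out[i] = mat_vec4(matrix, vertices[i])
    return out
```

Because each vertex is independent, no cluster needs data from another during this kernel, which is what makes the transform a natural fit for data-parallel execution.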

Page 20: Mapping Spanrast to Imagine

[Figure: the Spanrast kernel mapped onto Imagine (columns: Memory, SRF, Clusters); the SRF and four clusters are drawn.]

Page 21: Enforcing ordering

• General sort possible, but too expensive

• Hash much cheaper! Hash function: 12 bits (low 6 bits of x, low 6 bits of y). Hash table: 2^12 entries, 2 bits/entry, 16 words/scratchpad/cluster

• Compact: enforces ordering constraint

[Figure: kernel diagram with blocks Compact, Sort, Hash, Merge, Zcompare.]
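The 12-bit hash can be sketched in a few lines. The slide gives only the inputs (low 6 bits of x and of y); placing y in the upper half of the hash, using a Python set in place of the 2-bit-per-entry table, and the way `split_conflicts` partitions a batch are all assumptions for illustration, not the pipeline's documented behavior.

```python
def pixel_hash(x: int, y: int) -> int:
    """12-bit hash from the low 6 bits of x and the low 6 bits of y.
    The bit layout (y above x) is an assumption."""
    return ((y & 0x3F) << 6) | (x & 0x3F)

def split_conflicts(fragments):
    """Partition a batch of (x, y) fragments, preserving order.

    Returns (first, conflicting): fragments whose hash is new in this batch,
    and fragments whose hash matches an earlier fragment in the batch.
    """
    seen = set()  # stand-in for the on-chip hash table
    first, conflicting = [], []
    for (x, y) in fragments:
        h = pixel_hash(x, y)
        if h in seen:
            conflicting.append((x, y))
        else:
            seen.add(h)
            first.append((x, y))
    return first, conflicting
```

Two fragments at the same pixel always collide, which is what the ordering check needs. Pixels 64 apart in x or y also collide, so the hash can report false conflicts but never misses a real one.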

Page 22: Image Composition

[Figure: Zcompare mapped across Memory (four RAMs, with the Z Buffer and Color Buffer), the SRF, and eight clusters. Stream labels: "offset,z,color", "z", "z,color", "offset".]
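The data flow in the figure (offsets index the Z buffer, old depths stream into Zcompare, surviving z and color values stream back out) can be sketched as a gather, a compare, and a scatter. This Python version is illustrative only: `composite`, the less-than depth test, and the requirement that no two fragments in a batch share an offset are assumptions.

```python
def composite(zbuffer, colorbuffer, fragments):
    """Composite one batch of (offset, z, color) fragments.

    Assumes offsets within the batch are unique and that a smaller z
    is closer to the viewer. Returns the number of fragments written.
    """
    offsets = [f[0] for f in fragments]
    old_z = [zbuffer[o] for o in offsets]  # Z lookup: gather from memory

    # Zcompare kernel: keep only fragments that pass the depth test.
    passed = [(o, z, c) for (o, z, c), zo in zip(fragments, old_z) if z < zo]

    for o, z, c in passed:                 # writeback: scatter to memory
        zbuffer[o] = z
        colorbuffer[o] = c
    return len(passed)
```

The uniqueness assumption matters: with duplicate offsets in one batch, both fragments would be compared against the same stale depth, which is the hazard the ordering step on the previous slide addresses.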

Page 23: Benchmarks

• ADVS-1: 62k vertices as point-sampled polygons (SPECviewperf 6.1.1 Advanced Visualizer)

• ADVS-8: mipmapped version of ADVS-1

• Sphere: 82k lit, Gouraud-shaded triangles; 3 positional lights

• Fill: 20k mipmapped 25-pixel triangles

[Images: ADVS, Sphere]

Page 24: Experimental setup

• Comparison systems: Microsoft opengl32.dll (sustained), NVIDIA Quadro (sustained), NVIDIA Quadro (peak)

• Test system: 450 MHz PIII Xeon, NT 4.0

• For comparison: low-overhead trace player (no application overhead); average over 100s of frames (no startup costs); vsync disabled

Page 25: Results Summary

[Figure: bar chart on a log scale from 0.01 to 100. Y axis: frame rate normalized to Imagine. Systems: Software (opengl32.dll), Imagine sustained, NVIDIA sustained, NVIDIA peak. Benchmarks: advs-1, sphere, advs-8, fill.]

Page 26: Stream-level Performance

• Computation, not memory, bound. Highest memory system occupancy: 58.7%

• Cluster occupancy: 94.3% - 98.8% Reuse

• 5.6 GOPS on Sphere

[Figure: execution timeline with columns CLUSTERS, MEM STR 0, MEM STR 1.]

Page 27: Imagine Kernel Breakdown

[Figure: bar chart for ADVS-8. Y axis: millions of Imagine cycles, 0 to 4. Kernels by stage: geometry (xform, project, assemblepoly, backfacecull), rasterization (spanprep, spangen, spanrast, texfilter), composite (hash, sort, compact, zcompare).]

• Majority of time is in rasterization. ADVS-8 has 2.5x the ops/frame of ADVS-1

Page 28: Future Directions

• Extend generality of OpenGL pipeline; add more complex scenes

• Programmable shading and lighting: straightforward to add per-vertex/per-fragment ops; eliminate multipass. Goal: “Toolbox” of flexible elements

• Non-polygon rendering: raytracing, IBR, …

• Scalability: multi-Imagine implementations

Page 29: Conclusions

• Streams: powerful primitive

• Stream architectures: enable high performance

• Flexibility of general-purpose processor: 20x better frame rates than commercial software

• Performance of special-purpose processor: comparable frame rates to commercial hardware

Page 30: Acknowledgements

• DARPA

• Industrial sponsors: Texas Instruments, Intel Corporation

• Matthew Eldridge and Kekoa Proudfoot

• Brian Towles and Brucek Khailany

• Anonymous reviewers for helpful comments

• The US Passport Office: same-day turnaround!