GeForce4 - old.hotchips.org

34
GeForce4 GeForce4 John Montrym Henry Moreton

Transcript of GeForce4 - old.hotchips.org

Page 1: GeForce4 - old.hotchips.org

GeForce4GeForce4John MontrymHenry Moreton

Page 2: GeForce4 - old.hotchips.org

1

Architectural Drivers

ProgrammabilityParallelismMemory bandwidth

Page 3: GeForce4 - old.hotchips.org

2

another ligh ting im age

Recent History: GeForce 1&2First integrated geometry engine & 4 pixels/clkFixed-function transform, lighting, and pixel pipelines25M transistors : 0.18um/6LM : 250MHz25M polygons/sec : 1G pixels/sec

Page 4: GeForce4 - old.hotchips.org

3

Rendering in Transition

Pre-2001: pixel “painting”Image complexity and richness from LOTS of pixelsEach pixel derived from 1-2 textures & blendingDetail added by transparency and layers

Post-2001 fork in the road:Paint more simple pixels, faster - embedded DRAM ORUse Programmable Shading to render “better” pixels - but, must reduce depth complexity

Page 5: GeForce4 - old.hotchips.org

4

Examples

Page 6: GeForce4 - old.hotchips.org

5

GPUs vs. CPUs

More independent calculationsEnables wide and deep parallelism

API churnshorter development cycles -> ASIC

Blend of general- and special- purpose compute resourcesBoth transistor-bound for the foreseeable future

Page 7: GeForce4 - old.hotchips.org

6

Special-Purpose Hardware

Most efficient implementations ofCube environment mapShadow calculationsAnisotropic filteringClippingRasterizationLog, exp, dot-product

More programmability won’t change this

Page 8: GeForce4 - old.hotchips.org

7

Managing DRAM Bandwidth

Very large working setsAffordable caches cannot support long-term reuseTarget is effective streaming with local reuseSupercomputer techniques apply

Latency hidingVector operations

Page 9: GeForce4 - old.hotchips.org

8

Fit the Machine to DRAM Characteristics

Page localityDRAMs pages are 1-DGraphics accesses are 1-D, 2-D, 3-D in memory

Page misses and read/write turns getting more expensiveGranularity

Page 10: GeForce4 - old.hotchips.org

9

Unified Memory

Traffic comprises:Commands primitives,stateVertex data {x,y,z,w}Texture samplesDepth valuesColors

Relative amounts vary widelyPowerful programming model

Page 11: GeForce4 - old.hotchips.org

10

Eliminate Redundant Traffic

We cache dataTextures, vertices

We cache workPixel fragments

Designed for 80 - 90% hit rate (not 99.9%)Leverage coherence

Engines traverse locallyex: rasterization order

Page 12: GeForce4 - old.hotchips.org

11

Amplify Peak Bandwidth

Lossless compressionLossy compression

Conservative (undetectable)Else application must be given the choice

Small-grained random access favorsfixed compression atoms

Page 13: GeForce4 - old.hotchips.org

12

Reduce Dead Cycles

QueuingAmortize read/write turns over longer runs

Multi-bank DRAM scheduling

Page 14: GeForce4 - old.hotchips.org

13

Embedded DRAM

Tempting to embed megabytes of DRAM.But ..

Cannot fit the whole problemCosts are huge

Will be “just around the corner” for a long time

Page 15: GeForce4 - old.hotchips.org

14

A Tour of the GeForce4

Texture

Host / Front End / Vertex Processor

Fram

e B

uffe

r Con

trol

ler

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

process commandsconvert to FP

transform verticesto screen-space

generate pixels

delete pixels thatcannot be seen

determine the colors, transparenciesand depth of the pixel

combine colors and transparencies toproduce final output RGBA

do final hidden surface test, blendand write out color and new depth

Page 16: GeForce4 - old.hotchips.org

15

Host / Front End / Vertex Processor

Protocol and physical interface to PCI/AGP

Command “ABI” interpreter

Context switch

DMA gather

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Page 17: GeForce4 - old.hotchips.org

16

Transform and Lighting

Handles persistent attributesDispatchHides latency from the programmerFixed-function modes driven by APIsMultiple vector floating point processors

256 x 128 context RAM12 x 128 temp regs16 x128 input and output

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Page 18: GeForce4 - old.hotchips.org

17

Vertex Program Examples

AnimationMorphingInterpolation

Lens Effects

Range-based FogElevation-based Fog

DeformationWarpingProcedural Animation

Page 19: GeForce4 - old.hotchips.org

18

Primitive Assembly, Setup & Rasterizer

Per-triangle parameter setupTile walkingSample inclusion determinationTiles are traversed in memory page friendly order

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Page 20: GeForce4 - old.hotchips.org

19

Occlusion Culling & Programmable Shading

Occlusion Culling reduces Depth ComplexityCalculate Z and determine visible pixelsEliminate invisible pixels

Programmable Shading enables richer visual qualityAccurately model: reflections, shadows, materialsMore textures/pixelMore calculations/pixel – consumes many cycles

Programmable Shading impractical without Occlusion Culling

Page 21: GeForce4 - old.hotchips.org

20

Occlusion Strategies

Possibilities:Maintain local conservative data structureUse actual depth buffer dataOr combine the techniques

A coherence problem no matter how you slice it.API depth test is at the far end of the pipe!

Must preserve semantics

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Page 22: GeForce4 - old.hotchips.org

21

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Pixel Shading / Texturing

A pixel shader converts texture coordinates into a color using a shader program.

Floating point mathTexture lookupsResults of previous pixel shaders

4 stages, 1 texture address op per stageCompressed, mipmapped 3-D texturesTrue reflective bump mappingTrue dependent textures (lookup tables)Full 3×3 transform with cubemap or 3-D texture lookup16-bit-per-component normal maps

Page 23: GeForce4 - old.hotchips.org

22

Pixel Shader

Input: values interpolated across triangleIEEE floating point operationsLookup functions using textures

Large, multi-dimensional tablesFiltered

Outputs an ARGB value that register combiners can read

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Page 24: GeForce4 - old.hotchips.org

23

Input Input Input Input

Sum

Output Output Output

! !

! !

A* BA• B

! !

! !

A* BA• B

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Register Combiners

1–8 stages, plus a final combinerUp to 4 inputs from texture stages, interpolators, constant registers, earlier combinersFixed set of operations:

Each stage can evaluate A*B+C*D and output result, along with A*B, C*DAlternatively, each stage can evaluate dot products instead of multipliesCan conditionally select A*B or C*D

Page 25: GeForce4 - old.hotchips.org

24

Pixel Shading effects

Multi-texturingDot products for per pixel lighting calculationsReflectionsShadowingCustom effectsPixel math

Page 26: GeForce4 - old.hotchips.org

25

Texture

Deeply pipelined cacheMany hits and misses in flight

Compression4:1 ratioPalettesLossy small-grained fixed ratio scheme

FilteringBilinear, tri-linear, 8:1 anisotropic

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Page 27: GeForce4 - old.hotchips.org

26

Pixel Engines (ROP)

Coalesces shader pixels into memory access grain

Performs visibility and blending / transparency calculations Balanced processing power vs. bandwidth

Bandwidth is amplified by compression

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Page 28: GeForce4 - old.hotchips.org

27

Multisample Antialiasing

Transparent to the application2 & 4 subsamples per pixel2, 4, 5, and 9-tap reconstruction filters

Page 29: GeForce4 - old.hotchips.org

28

Intuitive user interfaceEasy set-upApplication ManagementMulti Desktop Support

Flexible display combinations+

+

+

+

+

Window ManagementWindow EffectsCustom User Profiles

nView Display Technology

Page 30: GeForce4 - old.hotchips.org

29

Framebuffer Controller

128-pin DDRSchedules requests from all enginesTransparent compress/decompressMaps from pixel-linear address to page & partition tilingFlexible in:

WidthDepthFrequencyBanks

Texture

Host / Front End / Vertex Processor

Fram

e Bu

ffer C

ontr

olle

r

Transform and Lighting

Primitive Assembly, Setup & Rasterizer

Occlusion Culler

Pixel Shader

Register Combiners

Pixel Engines (ROP)

Page 31: GeForce4 - old.hotchips.org

30

Statistics

136M vertices per second60M triangles per second4.8G samples/sec1.2T ops/sec83.2 GB/sec clear BW63M transistorsTSMC 0.15u300 MHz pipeline / 325 MHz memory clk

Page 32: GeForce4 - old.hotchips.org

31

Conclusions

Page 33: GeForce4 - old.hotchips.org

32

Questions

Page 34: GeForce4 - old.hotchips.org

33

Related reading

Stanford special topics course Real-Time Graphics Architectureswith Kurt Akeley & Pat Hanrahan

Reading list:http://graphics.stanford.edu/courses/cs448a-01-fall/readings.html