GeForce4 - old.hotchips.org
GeForce4
John Montrym
Henry Moreton
1
Architectural Drivers
Programmability
Parallelism
Memory bandwidth
2
[Image: another lighting image]
Recent History: GeForce 1&2
First integrated geometry engine & 4 pixels/clk
Fixed-function transform, lighting, and pixel pipelines
25M transistors : 0.18um/6LM : 250MHz
25M polygons/sec : 1G pixels/sec
3
Rendering in Transition
Pre-2001: pixel “painting”
  Image complexity and richness from LOTS of pixels
  Each pixel derived from 1-2 textures & blending
  Detail added by transparency and layers
Post-2001 fork in the road:
  Paint more simple pixels, faster - embedded DRAM
  OR
  Use Programmable Shading to render “better” pixels - but, must reduce depth complexity
4
Examples
5
GPUs vs. CPUs
More independent calculations
  Enables wide and deep parallelism
API churn
  shorter development cycles -> ASIC
Blend of general- and special-purpose compute resources
  Both transistor-bound for the foreseeable future
6
Special-Purpose Hardware
Most efficient implementations of
  Cube environment map
  Shadow calculations
  Anisotropic filtering
  Clipping
  Rasterization
  Log, exp, dot-product
More programmability won’t change this
7
Managing DRAM Bandwidth
Very large working sets
Affordable caches cannot support long-term reuse
Target is effective streaming with local reuse
Supercomputer techniques apply
  Latency hiding
  Vector operations
8
Fit the Machine to DRAM Characteristics
Page locality
  DRAM pages are 1-D
  Graphics accesses are 1-D, 2-D, 3-D in memory
Page misses and read/write turns getting more expensive
Granularity
9
Unified Memory
Traffic comprises:
  Commands, primitives, state
  Vertex data {x,y,z,w}
  Texture samples
  Depth values
  Colors
Relative amounts vary widely
Powerful programming model
10
Eliminate Redundant Traffic
We cache data
  Textures, vertices
We cache work
  Pixel fragments
Designed for 80 - 90% hit rate (not 99.9%)
Leverage coherence
  Engines traverse locally
  ex: rasterization order
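To see why an 80 - 90% hit rate is already worthwhile, here is a minimal sketch of the traffic arithmetic. It assumes a simple model in which every miss costs one DRAM fetch of the same size as the request; the function names are illustrative and not from the talk.

```python
def dram_traffic(requested_bytes: float, hit_rate: float) -> float:
    """Bytes still fetched from DRAM when hit_rate of requests hit the cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return requested_bytes * (1.0 - hit_rate)


def traffic_reduction(hit_rate: float) -> float:
    """Factor by which the cache cuts DRAM traffic (simplified model)."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return 1.0 / (1.0 - hit_rate)


if __name__ == "__main__":
    for rate in (0.8, 0.9):
        print(rate, traffic_reduction(rate))
```

In this model an 80% hit rate cuts DRAM traffic by about 5x and a 90% hit rate by about 10x, which is the kind of gain a small cache exploiting local coherence can deliver.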
11
Amplify Peak Bandwidth
Lossless compression
Lossy compression
  Conservative (undetectable)
  Else application must be given the choice
Small-grained random access favors fixed compression atoms
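The point about fixed compression atoms can be illustrated with a short sketch: when every block compresses to the same size, the address of the block holding any texel is a direct calculation, with no index table or sequential decode. The 4x4 texel block and 8-byte atom below are assumptions chosen for illustration, not figures from the talk.

```python
def block_offset(x: int, y: int, width: int,
                 block_w: int = 4, block_h: int = 4,
                 block_bytes: int = 8) -> int:
    """Byte offset of the fixed-size compressed block containing texel (x, y).

    Block dimensions and block_bytes are hypothetical example values.
    """
    blocks_per_row = (width + block_w - 1) // block_w  # round up
    bx, by = x // block_w, y // block_h
    return (by * blocks_per_row + bx) * block_bytes


if __name__ == "__main__":
    # Any texel of a 64-texel-wide image can be located in constant time.
    print(block_offset(5, 0, 64), block_offset(0, 4, 64))
```

A variable-ratio scheme would need extra lookups to find a block, which costs the bandwidth the compression was meant to save.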
12
Reduce Dead Cycles
Queuing
Amortize read/write turns over longer runs
Multi-bank DRAM scheduling
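A rough sketch of why longer runs help: if each read/write turnaround wastes a fixed number of cycles, the fraction of useful cycles rises with the length of the run between turns. The cycle counts are hypothetical and the model ignores page misses and bank conflicts.

```python
def bus_efficiency(run_cycles: int, turn_cycles: int) -> float:
    """Fraction of cycles moving data when each run pays one turnaround."""
    if run_cycles <= 0 or turn_cycles < 0:
        raise ValueError("run_cycles must be positive, turn_cycles non-negative")
    return run_cycles / (run_cycles + turn_cycles)


if __name__ == "__main__":
    # Hypothetical 4-cycle turnaround: queuing requests into longer runs
    # spreads that fixed cost over more useful transfers.
    for run in (4, 12, 36):
        print(run, bus_efficiency(run, 4))
```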
13
Embedded DRAM
Tempting to embed megabytes of DRAM.
But...
  Cannot fit the whole problem
  Costs are huge
Will be “just around the corner” for a long time
14
A Tour of the GeForce4
Host / Front End / Vertex Processor: process commands, convert to FP
Transform and Lighting: transform vertices to screen-space
Primitive Assembly, Setup & Rasterizer: generate pixels
Occlusion Culler: delete pixels that cannot be seen
Pixel Shader: determine the colors, transparencies and depth of the pixel
Register Combiners: combine colors and transparencies to produce final output RGBA
Pixel Engines (ROP): do final hidden surface test, blend and write out color and new depth
Also in the block diagram: Texture, Frame Buffer Controller
15
Host / Front End / Vertex Processor
Protocol and physical interface to PCI/AGP
Command “ABI” interpreter
Context switch
DMA gather
16
Transform and Lighting
Handles persistent attributes
Dispatch
Hides latency from the programmer
Fixed-function modes driven by APIs
Multiple vector floating point processors
  256 x 128 context RAM
  12 x 128 temp regs
  16 x 128 input and output
17
Vertex Program Examples
Animation
Morphing
Interpolation
Lens Effects
Range-based Fog
Elevation-based Fog
Deformation
Warping
Procedural Animation
18
Primitive Assembly, Setup & Rasterizer
Per-triangle parameter setup
Tile walking
Sample inclusion determination
Tiles are traversed in memory page friendly order
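Tile walking and sample inclusion can be sketched as follows. This is an illustrative software model, not the hardware algorithm: it uses the common edge-function test at pixel centres, walks tiles in plain row-major order (the page-friendly order used by the chip is not described in the slides), and counts samples exactly on an edge as inside rather than applying a fill rule.

```python
def edge(ax, ay, bx, by, px, py):
    """Signed area term: which side of edge A->B the point P lies on."""
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax)


def sample_inside(tri, px, py):
    """True if P is inside or on the triangle, for either winding order."""
    (x0, y0), (x1, y1), (x2, y2) = tri
    w0 = edge(x1, y1, x2, y2, px, py)
    w1 = edge(x2, y2, x0, y0, px, py)
    w2 = edge(x0, y0, x1, y1, px, py)
    return (w0 >= 0 and w1 >= 0 and w2 >= 0) or \
           (w0 <= 0 and w1 <= 0 and w2 <= 0)


def rasterize(tri, width, height, tile=4):
    """Visit the screen tile by tile and list pixels whose centre is covered."""
    covered = []
    for ty in range(0, height, tile):
        for tx in range(0, width, tile):
            for y in range(ty, min(ty + tile, height)):
                for x in range(tx, min(tx + tile, width)):
                    if sample_inside(tri, x + 0.5, y + 0.5):
                        covered.append((x, y))
    return covered


if __name__ == "__main__":
    print(rasterize(((0, 0), (4, 0), (0, 4)), 8, 8))
```

Emitting pixels one tile at a time keeps consecutive accesses close together in two dimensions, which is what lets the memory system map them to few DRAM pages.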
19
Occlusion Culling & Programmable Shading
Occlusion Culling reduces Depth Complexity
  Calculate Z and determine visible pixels
  Eliminate invisible pixels
Programmable Shading enables richer visual quality
  Accurately model: reflections, shadows, materials
  More textures/pixel
  More calculations/pixel – consumes many cycles
Programmable Shading impractical without Occlusion Culling
20
Occlusion Strategies
Possibilities:
  Maintain local conservative data structure
  Use actual depth buffer data
  Or combine the techniques
A coherence problem no matter how you slice it.
API depth test is at the far end of the pipe!
Must preserve semantics
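One way to picture a "local conservative data structure" is a per-tile record of the farthest depth currently stored in that tile. The sketch below is an assumption-laden illustration, not the GeForce4 design: it assumes a less-than depth test, depth cleared to 1.0, an 8x8 tile, and it refreshes the summary by rescanning the tile, which hardware would not do. The class and method names are hypothetical.

```python
class TileDepth:
    """Depth buffer plus a per-tile 'farthest depth' summary (hypothetical)."""

    def __init__(self, width, height, tile=8):
        self.width, self.height, self.tile = width, height, tile
        self.depth = [[1.0] * width for _ in range(height)]  # cleared to far
        self.tiles_x = (width + tile - 1) // tile
        tiles_y = (height + tile - 1) // tile
        self.tile_far = [1.0] * (self.tiles_x * tiles_y)

    def _tile_id(self, x, y):
        return (y // self.tile) * self.tiles_x + (x // self.tile)

    def may_be_visible(self, x, y, z):
        """Early test: False only if the fragment is certain to fail."""
        return z < self.tile_far[self._tile_id(x, y)]

    def depth_test_and_write(self, x, y, z):
        """Exact less-than test at the end of the pipe; updates the summary."""
        if z >= self.depth[y][x]:
            return False
        self.depth[y][x] = z
        t = self.tile
        x0, y0 = (x // t) * t, (y // t) * t
        self.tile_far[self._tile_id(x, y)] = max(
            self.depth[yy][xx]
            for yy in range(y0, min(y0 + t, self.height))
            for xx in range(x0, min(x0 + t, self.width))
        )
        return True
```

The early test may pass fragments that the exact test later rejects, but it never rejects a fragment the exact test would accept, so the API's depth semantics are preserved.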
21
Pixel Shading / Texturing
A pixel shader converts texture coordinates into a color using a shader program.
Floating point math
Texture lookups
Results of previous pixel shaders
4 stages, 1 texture address op per stage
Compressed, mipmapped 3-D textures
True reflective bump mapping
True dependent textures (lookup tables)
Full 3×3 transform with cubemap or 3-D texture lookup
16-bit-per-component normal maps
22
Pixel Shader
Input: values interpolated across triangle
IEEE floating point operations
Lookup functions using textures
  Large, multi-dimensional tables
  Filtered
Outputs an ARGB value that register combiners can read
23
[Diagram: register combiner stage with four inputs, two multiply (A*B) or dot-product (A•B) units, a sum, and three outputs]
Register Combiners
1–8 stages, plus a final combiner
Up to 4 inputs from texture stages, interpolators, constant registers, earlier combiners
Fixed set of operations:
  Each stage can evaluate A*B+C*D and output result, along with A*B, C*D
  Alternatively, each stage can evaluate dot products instead of multiplies
  Can conditionally select A*B or C*D
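The fixed set of operations can be modelled in a few lines. This is a simplified sketch of one general combiner stage on RGB triples: the function name and the `select_ab` parameter are hypothetical, the selection condition is passed in directly instead of being read from a register, and the input mappings, scaling, and clamping that real hardware applies are left out.

```python
def combiner_stage(a, b, c, d, dot_ab=False, dot_cd=False, select_ab=None):
    """Return (AB, CD, third) for one simplified combiner stage.

    third is AB+CD when select_ab is None, otherwise AB or CD.
    """
    def mul(u, v):
        return tuple(x * y for x, y in zip(u, v))

    def dot(u, v):
        s = sum(x * y for x, y in zip(u, v))
        return (s, s, s)  # the scalar result is replicated to all components

    ab = dot(a, b) if dot_ab else mul(a, b)
    cd = dot(c, d) if dot_cd else mul(c, d)
    if select_ab is None:
        third = tuple(x + y for x, y in zip(ab, cd))
    else:
        third = ab if select_ab else cd
    return ab, cd, third


if __name__ == "__main__":
    normal, light = (0.0, 0.0, 1.0), (0.0, 0.0, 1.0)
    # Per-pixel diffuse term N.L from the dot-product mode.
    print(combiner_stage(normal, light, (0, 0, 0), (0, 0, 0), dot_ab=True)[0])
```

Chaining several such stages, each reading the outputs of earlier ones, is how the per-pixel lighting effects on the next slide are built from a fixed operation set.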
24
Pixel Shading effects
Multi-texturing
Dot products for per pixel lighting calculations
Reflections
Shadowing
Custom effects
Pixel math
25
Texture
Deeply pipelined cache
  Many hits and misses in flight
Compression
  4:1 ratio
  Palettes
  Lossy small-grained fixed ratio scheme
Filtering
  Bilinear, tri-linear, 8:1 anisotropic
26
Pixel Engines (ROP)
Coalesces shader pixels into memory access grain
Performs visibility and blending / transparency calculations
Balanced processing power vs. bandwidth
Bandwidth is amplified by compression
27
Multisample Antialiasing
Transparent to the application
2 & 4 subsamples per pixel
2, 4, 5, and 9-tap reconstruction filters
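As a rough illustration of the reconstruction step, the sketch below resolves one pixel by averaging its subsamples with equal weights. The slides do not give the tap positions or weights of the 2, 4, 5, and 9-tap filters, so the equal-weight box filter here is an assumption, and the function name is illustrative.

```python
def resolve_pixel(subsamples):
    """Average a pixel's RGB subsamples into one displayed colour.

    Equal weights are an assumption; the actual filter weights are not
    stated in the source.
    """
    if not subsamples:
        raise ValueError("need at least one subsample")
    n = len(subsamples)
    return tuple(sum(s[i] for s in subsamples) / n for i in range(3))


if __name__ == "__main__":
    white, black = (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)
    # An edge pixel where a white triangle covers 1 of 4 subsamples.
    print(resolve_pixel([white, black, black, black]))
```

Because coverage is tracked per subsample while shading is done per pixel, the application sees ordinary rendering and only the edges are smoothed.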
28
nView Display Technology
Intuitive user interface
Easy set-up
Application Management
Multi Desktop Support
Flexible display combinations
Window Management
Window Effects
Custom User Profiles
29
Framebuffer Controller
128-pin DDR
Schedules requests from all engines
Transparent compress/decompress
Maps from pixel-linear address to page & partition tiling
Flexible in:
  Width
  Depth
  Frequency
  Banks
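The mapping from pixel-linear addresses to a tiled layout can be sketched as below. The tile size, pixel size, and 1 KiB page are hypothetical values picked so that one tile fills one page; the real page and partition mapping is not described in the slides, and the surface width is assumed to be a multiple of the tile width.

```python
TILE_W, TILE_H = 16, 16      # hypothetical tile dimensions in pixels
BYTES_PER_PIXEL = 4          # hypothetical 32-bit pixels
PAGE_BYTES = TILE_W * TILE_H * BYTES_PER_PIXEL  # one tile per page (1 KiB)


def linear_address(x, y, width):
    """Conventional row-major address."""
    return (y * width + x) * BYTES_PER_PIXEL


def tiled_address(x, y, width):
    """Address when each 2-D tile occupies one contiguous block."""
    tiles_per_row = width // TILE_W
    tile_index = (y // TILE_H) * tiles_per_row + (x // TILE_W)
    within_tile = (y % TILE_H) * TILE_W + (x % TILE_W)
    return (tile_index * TILE_W * TILE_H + within_tile) * BYTES_PER_PIXEL


def pages_touched(address_fn, x0, y0, w, h, width):
    """Number of distinct pages hit by a w x h block of pixels."""
    return len({address_fn(x, y, width) // PAGE_BYTES
                for y in range(y0, y0 + h)
                for x in range(x0, x0 + w)})


if __name__ == "__main__":
    print(pages_touched(linear_address, 0, 0, 8, 8, 1024),
          pages_touched(tiled_address, 0, 0, 8, 8, 1024))
```

For an 8x8 pixel block on a 1024-pixel-wide surface, the row-major layout touches 8 pages in this model while the tiled layout touches 1, which is the page-locality benefit of fitting 2-D accesses to 1-D DRAM pages.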
30
Statistics
136M vertices per second
60M triangles per second
4.8G samples/sec
1.2T ops/sec
83.2 GB/sec clear BW
63M transistors
TSMC 0.15u
300 MHz pipeline / 325 MHz memory clk
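A few of these figures can be cross-checked with simple arithmetic. The sketch below assumes that "128-pin DDR" means a 128-bit data bus transferring on both clock edges; that reading is an assumption, as are the variable names.

```python
CORE_CLOCK_HZ = 300e6          # pipeline clock from the slide
MEMORY_CLOCK_HZ = 325e6        # memory clock from the slide
SAMPLES_PER_SEC = 4.8e9        # sample rate from the slide
CLEAR_BYTES_PER_SEC = 83.2e9   # clear bandwidth from the slide

BUS_BITS = 128                 # assumption: 128 data pins
TRANSFERS_PER_CLOCK = 2        # double data rate

samples_per_clock = SAMPLES_PER_SEC / CORE_CLOCK_HZ
raw_bytes_per_sec = (BUS_BITS // 8) * TRANSFERS_PER_CLOCK * MEMORY_CLOCK_HZ
clear_amplification = CLEAR_BYTES_PER_SEC / raw_bytes_per_sec

if __name__ == "__main__":
    print(samples_per_clock, raw_bytes_per_sec / 1e9, clear_amplification)
```

Under these assumptions the sample rate works out to 16 samples per pipeline clock, the raw memory bandwidth to 10.4 GB/sec, and the quoted clear bandwidth to 8 times the raw rate, which is consistent with clears benefiting from compression.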
31
Conclusions
32
Questions
33
Related reading
Stanford special topics course Real-Time Graphics Architectures
with Kurt Akeley & Pat Hanrahan
Reading list: http://graphics.stanford.edu/courses/cs448a-01-fall/readings.html