Modern GPU Architecture

Modern GPU Architecture CSE 694G Game Design and Project Prof. Roger Crawfis


Page 1: Modern GPU Architecture

8/13/2019 Modern GPU Architecture

http://slidepdf.com/reader/full/modern-gpu-architecture 1/93

Modern GPU Architecture

CSE 694G Game Design and Project

Prof. Roger Crawfis


GPU vs CPU

A GPU is tailored for highly parallel operation, while a CPU executes programs serially.
For this reason, GPUs have many parallel execution units and higher transistor counts, while CPUs have few execution units and higher clock speeds.
A GPU is for the most part deterministic in its operation (though this is quickly changing).
GPUs have much deeper pipelines (several thousand stages vs 10-20 for CPUs).
GPUs have significantly faster and more advanced memory interfaces, as they need to shift around a lot more data than CPUs.


The GPU pipeline

The GPU receives geometry information from the CPU as an input and provides a picture as an output.
Let’s see how that happens.

host interface → vertex processing → triangle setup → pixel processing → memory interface


Host Interface

The host interface is the communication bridge between the CPU and the GPU.
It receives commands from the CPU and also pulls geometry information from system memory.
It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color etc.).


Vertex Processing

The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space.
This may be a simple linear transformation, or a complex operation involving morphing effects.
Normals, texcoords etc. are also transformed.
No new vertices are created in this stage, and no vertices are discarded (input/output has a 1:1 mapping).
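As a rough illustration of the "simple linear transformation" case, the sketch below pushes each vertex through a 4x4 matrix, a perspective divide and a viewport mapping. The names (`mat_vec`, `to_screen`, `IDENTITY`) and the 640x480 viewport are assumptions made for this example, not part of any graphics API, and the identity matrix stands in for a real model-view-projection matrix.

```python
# Sketch of the vertex stage: one vertex in, one vertex out (1:1 mapping).
# All names are illustrative; no real graphics API is used.

def mat_vec(m, v):
    """Multiply a 4x4 matrix (list of rows) by a 4-component vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def to_screen(position, mvp, width, height):
    """Object space -> clip space -> normalized device coords -> pixels."""
    x, y, z, w = mat_vec(mvp, [position[0], position[1], position[2], 1.0])
    ndc_x, ndc_y = x / w, y / w            # perspective divide
    sx = (ndc_x + 1.0) * 0.5 * width       # viewport transform
    sy = (1.0 - ndc_y) * 0.5 * height      # flip y: screen origin is top-left
    return (sx, sy, z / w)

# Identity stands in for a real model-view-projection matrix (assumption).
IDENTITY = [[1.0, 0.0, 0.0, 0.0],
            [0.0, 1.0, 0.0, 0.0],
            [0.0, 0.0, 1.0, 0.0],
            [0.0, 0.0, 0.0, 1.0]]

vertices = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (-1.0, -1.0, 0.0)]
screen = [to_screen(v, IDENTITY, 640, 480) for v in vertices]
```

The list comprehension makes the 1:1 property visible: the output list always has exactly as many entries as the input list.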


Triangle setup

In this stage geometry information becomes raster information (screen-space geometry is the input, pixels are the output).
Prior to rasterization, triangles that are backfacing or are located outside the viewing frustum are rejected.
Some GPUs also do some hidden surface removal at this stage.
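The rejection step can be sketched with a signed-area test for backfacing triangles and a trivial bounding test against the viewport. This is a simplified stand-in for what the hardware does: the winding convention (counter-clockwise is front-facing, y pointing up) and all function names are assumptions for the example.

```python
def signed_area(a, b, c):
    """Twice the signed area of screen-space triangle abc (x right, y up)."""
    return (b[0] - a[0]) * (c[1] - a[1]) - (c[0] - a[0]) * (b[1] - a[1])

def is_backfacing(a, b, c):
    """Assumes counter-clockwise winding marks a front face."""
    return signed_area(a, b, c) <= 0

def outside_viewport(tri, width, height):
    """Trivial reject: all three vertices lie beyond the same screen edge."""
    xs = [p[0] for p in tri]
    ys = [p[1] for p in tri]
    return (max(xs) < 0 or min(xs) > width or
            max(ys) < 0 or min(ys) > height)

def survives_setup(tri, width, height):
    """True if the triangle should go on to rasterization."""
    return not is_backfacing(*tri) and not outside_viewport(tri, width, height)

ccw = ((0, 0), (4, 0), (0, 4))              # front-facing, on screen
cw = ((0, 0), (0, 4), (4, 0))               # same triangle, reversed winding
far = ((700, 10), (720, 10), (700, 30))     # front-facing, right of a 640-wide screen
```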


Triangle Setup (cont)

A fragment is generated if and only if its center is inside the triangle.

Every fragment generated has its attributes computed as the perspective-correct interpolation of the attributes of the three vertices that make up the triangle.
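Both rules can be sketched in a few lines: test each pixel center against the triangle with barycentric weights, then interpolate attribute/w and 1/w linearly and divide. This is a minimal sketch, not how any particular GPU implements it; the function names are made up for the example, and centers lying exactly on an edge are simply accepted here, whereas real rasterizers apply a tie-breaking rule.

```python
def edge(a, b, p):
    """Signed area term: positive when p is to the left of the edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def barycentric(a, b, c, p):
    """Barycentric weights of p in triangle abc, or None if p is outside."""
    area = edge(a, b, c)
    if area == 0:
        return None                      # degenerate triangle
    w0 = edge(b, c, p) / area
    w1 = edge(c, a, p) / area
    w2 = edge(a, b, p) / area
    if w0 < 0 or w1 < 0 or w2 < 0:
        return None
    return (w0, w1, w2)

def perspective_correct(weights, attrs, ws):
    """Interpolate attr/w and 1/w linearly in screen space, then divide."""
    num = sum(b * a / w for b, a, w in zip(weights, attrs, ws))
    den = sum(b / w for b, w in zip(weights, ws))
    return num / den

def rasterize(a, b, c, attrs, ws, width, height):
    """Emit (x, y, attribute) for every pixel whose center is covered."""
    fragments = []
    for y in range(height):
        for x in range(width):
            weights = barycentric(a, b, c, (x + 0.5, y + 0.5))
            if weights is not None:
                fragments.append((x, y, perspective_correct(weights, attrs, ws)))
    return fragments

frags = rasterize((0, 0), (4, 0), (0, 4), [0.0, 1.0, 0.0], [1.0, 1.0, 1.0], 4, 4)
```

With equal w at all three vertices the result reduces to plain linear interpolation; the divide only matters when the vertices sit at different depths.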


Fragment Processing

Each fragment provided by triangle setup is fed into fragment processing as a set of attributes (position, normal, texcoord etc.), which are used to compute the final color for this pixel.
The computations taking place here include texture mapping and math operations.
This stage is typically the bottleneck in modern applications.


Memory Interface

Fragment colors provided by the previous stage are written to the framebuffer.
This used to be the biggest bottleneck before fragment processing took over.
Before the final write occurs, some fragments are rejected by the z-buffer, stencil and alpha tests.
On modern GPUs, z and color are compressed to reduce framebuffer bandwidth (but not size).
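The z-buffer rejection can be illustrated with a tiny software framebuffer. This sketch covers only the depth test (stencil, alpha test and compression are left out), and the fragment tuple layout and function name are assumptions for the example.

```python
def write_fragments(fragments, width, height, far=1.0):
    """Resolve fragments (x, y, depth, color) into a framebuffer with a z-test.

    Smaller depth means closer to the viewer; 'far' is the cleared depth value.
    """
    depth = [[far] * width for _ in range(height)]
    color = [[None] * width for _ in range(height)]
    rejected = 0
    for x, y, z, c in fragments:
        if z < depth[y][x]:          # closer than what is stored: keep it
            depth[y][x] = z
            color[y][x] = c
        else:                        # hidden: no color write happens
            rejected += 1
    return color, rejected

fragments = [(0, 0, 0.5, "red"), (0, 0, 0.8, "blue"), (1, 0, 0.9, "green")]
color, rejected = write_fragments(fragments, width=2, height=1)
```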


Programmability in the GPU

Vertex and fragment processing, and now triangle setup, are programmable.
The programmer can write programs that are executed for every vertex as well as for every fragment.
This allows fully customizable geometry and shading effects that go well beyond the generic look and feel of older 3D applications.


Diagram of a modern GPU

[Diagram: input from the CPU flows through the host interface, vertex processing, triangle setup, pixel processing and the memory interface, which connects to memory over four 64-bit channels.]


CPU/GPU interaction

The CPU and GPU inside the PC work in parallel with each other.
There are two "threads" going on, one for the CPU and one for the GPU, which communicate through a command buffer:

[Diagram: a command buffer of pending GPU commands; the CPU writes commands at one end and the GPU reads commands from the other.]


CPU/GPU interaction (cont)

If this command buffer is drained empty, we are CPU limited and the GPU will spin around waiting for new input. All the GPU power in the universe isn’t going to make your application faster!
If the command buffer fills up, the CPU will spin around waiting for the GPU to consume it, and we are effectively GPU limited.
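The two regimes can be seen in a toy simulation of a bounded command buffer. This is only a sketch under simple assumptions: time advances in ticks, the CPU tries to submit a fixed number of commands per tick and the GPU tries to retire a fixed number. The function name and the rates are made up for the example.

```python
from collections import deque

def simulate(cpu_rate, gpu_rate, capacity, ticks):
    """Return (cpu_stalls, gpu_stalls) for a bounded command buffer.

    Each tick the CPU tries to submit cpu_rate commands and the GPU tries
    to retire gpu_rate commands.
    """
    buffer = deque()
    cpu_stalls = gpu_stalls = 0
    for _ in range(ticks):
        for _ in range(cpu_rate):
            if len(buffer) < capacity:
                buffer.append("cmd")
            else:
                cpu_stalls += 1      # buffer full: CPU waits on the GPU
        for _ in range(gpu_rate):
            if buffer:
                buffer.popleft()
            else:
                gpu_stalls += 1      # buffer empty: GPU waits on the CPU
    return cpu_stalls, gpu_stalls

cpu_limited = simulate(cpu_rate=1, gpu_rate=3, capacity=8, ticks=100)
gpu_limited = simulate(cpu_rate=3, gpu_rate=1, capacity=8, ticks=100)
```

In the first run only the GPU stalls (a faster GPU would change nothing); in the second only the CPU stalls.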


CPU/GPU interaction (cont)

Another important point to consider is that programs that use the GPU do not follow the traditional sequential execution model.
In the CPU program below, the object is not drawn after statement A and before statement B:

•Statement A
•API call to draw object
•Statement B

Instead, all the API call does is add the command to draw the object to the GPU command buffer.
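A minimal sketch of this deferred behaviour, using a hypothetical `CommandBuffer` class (not a real API): the draw call only enqueues work, and the actual drawing shows up in the log after statement B, when the queue is finally consumed.

```python
class CommandBuffer:
    """Toy stand-in for a GPU command queue (hypothetical, for illustration)."""

    def __init__(self):
        self.pending = []   # commands submitted but not yet executed
        self.log = []       # order in which things actually happen

    def draw(self, name):
        self.pending.append(name)        # the API call only enqueues

    def flush(self):
        while self.pending:              # the "GPU" consumes commands later
            self.log.append("GPU drew " + self.pending.pop(0))

gpu = CommandBuffer()
gpu.log.append("statement A")
gpu.draw("object")                       # returns immediately
gpu.log.append("statement B")
gpu.flush()
```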


Synchronization issues

This leads to a number of synchronization considerations.
In the figure below, the CPU must not overwrite the data in the "yellow" block until the GPU is done with the "black" command, which references that data:

[Diagram: a command buffer in which the CPU writes commands at one end and the GPU reads them from the other; one pending command references a separate data block.]


Synchronization issues (cont)

Modern APIs implement semaphore-style operations to keep this from causing problems.
If the CPU attempts to modify a piece of data that is being referenced by a pending GPU command, it will have to spin around waiting until the GPU is finished with that command.
While this ensures correct operation, it is not good for performance, since there are a million other things we’d rather do with the CPU instead of spinning.
The GPU will also drain a big part of the command buffer, thereby reducing its ability to run in parallel with the CPU.


Inlining data

One way to avoid these problems is to inline all data into the command buffer and avoid references to separate data:

[Diagram: a command buffer with the data stored inline between commands; the CPU writes at one end and the GPU reads from the other.]

However, this is also bad for performance, since we may need to copy several megabytes of data instead of merely passing around a pointer.


Renaming data

A better solution is to allocate a new data block and initialize that one instead; the old block will be deleted once the GPU is done with it.
Modern APIs do this automatically, provided you initialize the entire block (if you only change a part of the block, renaming cannot occur).

Better yet, allocate all your data at startup and don’t change it for the duration of execution (not always possible, however).

[Diagram: data blocks before and after renaming.]
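The difference between a full overwrite (which can be renamed) and a partial write (which cannot) can be sketched with a hypothetical `RenamingBuffer` class. This is an illustration of the idea, not how any driver implements it, and the "stall" is just a counter standing in for the CPU spinning.

```python
class RenamingBuffer:
    """Toy resource that renames on full update (hypothetical, for illustration)."""

    def __init__(self, data):
        self.current = list(data)   # block the next command will reference
        self.in_flight = []         # blocks still referenced by pending commands
        self.stalls = 0             # times the CPU had to wait for the GPU

    def reference(self):
        """A queued GPU command captures the block that is current right now."""
        self.in_flight.append(self.current)
        return self.current

    def gpu_done(self, block):
        """The GPU has finished every command that referenced this block."""
        self.in_flight = [b for b in self.in_flight if b is not block]

    def update_all(self, data):
        """Full overwrite: switch to a fresh block, leave the old one to the GPU."""
        self.current = list(data)

    def update_part(self, index, value):
        """Partial write: must touch the live block, so wait if it is in flight."""
        if any(b is self.current for b in self.in_flight):
            self.stalls += 1                 # stands in for spinning on the GPU
            self.gpu_done(self.current)
        self.current[index] = value

buf = RenamingBuffer([1, 2, 3])
seen_by_gpu = buf.reference()
buf.update_all([4, 5, 6])      # renamed: no stall, GPU still sees [1, 2, 3]
buf.reference()
buf.update_part(0, 9)          # partial write to an in-flight block: stall
```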


GPU readbacks

The output of a GPU is a rendered image on the screen. What will happen if the CPU tries to read it?

The GPU must be synchronized with the CPU, i.e. it must drain its entire command buffer, and the CPU must wait while this happens.

[Diagram: the command buffer of pending GPU commands, with the CPU write position and the GPU read position.]


GPU readbacks (cont)

We lose all parallelism, since first the CPU waits for the GPU, then the GPU waits for the CPU (because the command buffer has been drained).
Both CPU and GPU performance take a nosedive.

Bottom line: the image the GPU produces is for your eyes, not for the CPU (treat the CPU -> GPU highway as a one-way street).


Some more GPU tips

Since the GPU is highly parallel and deeply pipelined, try to dispatch large batches with each drawing call.
Sending just one triangle at a time will not occupy all of the GPU’s several vertex/pixel processors, nor will it fill its deep pipelines.
Since all GPUs today use the z-buffer algorithm to do hidden surface removal, rendering objects front-to-back is faster than back-to-front (painter's algorithm) or random ordering.
Of course, there is no point in front-to-back sorting if you are already CPU limited.
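Why submission order matters under a z-buffer can be shown by counting color writes for a single pixel. This is a deliberately simplified model (one pixel, one depth per layer, and it ignores early-z hardware details); the function name and the sample depths are assumptions for the example.

```python
def color_writes(depths):
    """Count framebuffer color writes for one pixel.

    'depths' lists fragment depths in submission order; smaller is closer.
    A fragment is written only if it passes the depth test.
    """
    nearest = float("inf")
    writes = 0
    for z in depths:
        if z < nearest:
            nearest = z
            writes += 1
    return writes

layers = [0.9, 0.1, 0.5, 0.3]
front_to_back = color_writes(sorted(layers))                 # nearest first
back_to_front = color_writes(sorted(layers, reverse=True))   # painter's order
as_submitted = color_writes(layers)                          # arbitrary order
```

Front-to-back needs one write, back-to-front writes every layer, and an arbitrary order lands somewhere in between.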


Graphics Hardware Abstraction

OpenGL and DirectX provide an abstraction of the hardware.


CS248 Lecture 14 Kurt Akeley, Fall 2007

Trend from pipeline to data parallelism

[Figure: geometry-stage block diagrams of three systems: Clark "Geometry Engine" (1983), SGI 4D/GTX (1988) and SGI RealityEngine (1992). Stage labels in the figure: command processor; round-robin; aggregation; coord, normal transform; lighting; clip testing; clipping state; divide by w (clipping); viewport; prim. assy.; backface cull; coordinate transform; 6-plane frustum clipping; divide by w; viewport.]


CS248 Lecture 14 Kurt Akeley, Fall 2007

Queueing

FIFO buffering (first-in, first-out) is provided between task stages

Accommodates variation in execution time

Provides elasticity to allow unified load balancing to work

FIFOs can also be unified

Share a single large memory with multiple head-tail pairs

Allocate as required

[Diagram: pipeline stages (application, vertex assembly, vertex operations, primitive assembly) separated by FIFOs.]


CS248 Lecture 14 Kurt Akeley, Fall 2007

Data locality

Prior to texture mapping:

Vertex pipeline was a stream processor

Each work element (vertex, primitive, fragment) carried all the state it needed

Modal state was local to the pipeline stage

Assembly stages operated on adjacent work elements

Data locality was inherent in this model

Post texture mapping:

All application-programmable stages have memory access (and use it)
So the vertex pipeline is no longer a stream processor

Data locality must be fought for …

[Diagram: pipeline stages Application, Vertex assembly, Vertex operations, Primitive assembly, Primitive operations, Rasterization, Fragment operations, Framebuffer, Display.]


CS248 Lecture 14 Kurt Akeley, Fall 2007

Post-texture mapping data locality

(simplified)

Modern memory (DRAM) operates in large blocks

Memory is a 2-D array

Access is to an entire row

To make efficient use of memory bandwidth, all the data in a block must be used

Two things can be done:

Aggregate read and write requests

Memory controller and cache
Complex part of GPU design

Organize memory contents coherently (blocking)


The NVIDIA G80 GPU
► 128 streaming floating-point processors @ 1.5 GHz
► 1.5 GB shared RAM with 86 GB/s bandwidth
► 500 GFLOPS on one chip (single precision)


Why are GPUs so fast?

The entertainment industry has driven the economy of these chips:
Males aged 15-35 buy $10B in video games per year

Moore’s Law ++
Simplified design (stream processing)
Single-chip designs


A modern GPU has more ALUs


NVIDIA G80 GPU Architecture Overview

•16 multiprocessor (MP) blocks
•Each MP block has:
  •8 streaming processors (IEEE 754 single-precision floating-point compliant)
  •16K shared memory
  •64K constant cache
  •8K texture cache
•Each processor can access all of the memory at 86 GB/s, but with different latencies:
  •Shared: 2-cycle latency
  •Device: 300-cycle latency


A Specialized Processor

Very Efficient For
Fast Parallel Floating Point Processing
Single Instruction Multiple Data Operations
High Computation per Memory Access

Not As Efficient For
Double Precision
Logical Operations on Integer Data
Branching-Intensive Operations
Random Access, Memory-Intensive Operations


CS248 Lecture 14 Kurt Akeley, Fall 2007

Implementation = abstraction (from lecture 2)

[Figure: block diagram of the NVIDIA GeForce 8800 shown beside the OpenGL pipeline. Hardware blocks: Data Assembler; Vtx, Prim and Frag Thread Issue; Setup / Rstr / ZCull; Thread Processor; eight groups of streaming processors (SP), each with an L1 cache and a TF unit; six L2 cache and framebuffer (FB) partitions. Pipeline stages: Application, Vertex assembly, Vertex operations, Primitive assembly, Primitive operations, Rasterization, Fragment operations, Framebuffer. Source: NVIDIA]


CS248 Lecture 14 Kurt Akeley, Fall 2007

Correspondence (by color)

[Figure: the same GeForce 8800 block diagram and OpenGL pipeline, color-coded by correspondence: application-programmable parallel processor, fixed-function assembly processors, and fixed-function framebuffer operations. The Rasterization stage is relabeled "Rasterization (fragment assembly)", and one block is annotated "this was missing".]


CS248 Lecture 14 Kurt Akeley, Fall 2007

Texture Blocking

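4x4 texels
Cache Line Size
Cache Size
6D Organization

[Figure: a texture divided into 4x4 blocks; a texel address is formed from the base address followed by the coordinate fields s1 t1 s2 t2 s3 t3, where (s1,t1), (s2,t2) and (s3,t3) are the three coordinate pairs of the 6D organization.]

Source: Pat Hanrahan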
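The benefit of blocking can be sketched with a single level of 4x4 blocks (a "4D" layout); the 6D organization on the slide adds a further level of blocking on top of the same idea. The function names and the 64-texel texture width are assumptions for this example, and the width is assumed to be a multiple of the block size.

```python
BLOCK = 4  # 4x4 texels per block, as on the slide

def linear_address(s, t, width):
    """Row-major layout: vertically adjacent texels are 'width' apart."""
    return t * width + s

def blocked_address(s, t, width):
    """Texels of one 4x4 block are contiguous; blocks are stored row-major.

    Assumes width is a multiple of BLOCK.
    """
    blocks_per_row = width // BLOCK
    block_index = (t // BLOCK) * blocks_per_row + (s // BLOCK)
    within_block = (t % BLOCK) * BLOCK + (s % BLOCK)
    return block_index * BLOCK * BLOCK + within_block

# A 2x2 bilinear filter footprint in a 64-texel-wide texture.
footprint = [(0, 0), (1, 0), (0, 1), (1, 1)]
linear = [linear_address(s, t, 64) for s, t in footprint]
blocked = [blocked_address(s, t, 64) for s, t in footprint]
```

In the linear layout the four texels span 66 addresses; in the blocked layout they fall within 6 consecutive addresses, so they are far more likely to share a cache line.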


Direct3D 10 System and NVIDIA GeForce 8800 GPU


Overview

Before graphics-programming APIs were introduced, 3D applications issued their commands directly to the graphics hardware

Fast
Became infeasible with the increasing variety of graphics hardware

Graphics APIs like DirectX and OpenGL act as a middle layer between the application and the graphics hardware
Using this model, applications write one set of code and the API does the job of translating this code to instructions that can be understood by the underlying hardware

A product of detailed collaboration among

Application developers
Hardware designers
API/runtime architects


Problems with Earlier Versions

High state-change overhead
Changing state (in terms of vertex formats, textures, shaders, shader parameters, blending modes etc.) incurs a high overhead

Excessive variation in hardware accelerator capabilities

Frequent CPU and GPU synchronization
Generating new vertex data or building a cube map requires more communication, reducing efficiency

Instruction type and data type limitations
Neither vertex nor pixel shader supports integer instructions
Pixel shader accuracy for FP arithmetic can be improved

Resource limitations
The resource sizes were modest
Algorithms had to be scaled back or broken into several passes


Main Features of DirectX 10

Main objective is to reduce CPU overhead
Some of the key changes are:

Faster and cleaner runtime
The programmable pipeline is directed using a low-level abstraction layer called the runtime. It hides the differences between varying applications and provides device-independent resource management
The runtime of DirectX 10 has been redesigned to work more closely with the graphics hardware
The GeForce 8800 architecture has been designed keeping in mind the changes in the runtime
Treatment of validation enhances performance


Main Features of DirectX 10

New Data Structure for Texture
Switching between multiple textures causes high state-change cost
DirectX 9 used texture atlases, but the approach was limited to 4096x4096 and resulted in incorrect filtering at texture boundaries
DirectX 10 uses texture arrays: up to 512 textures can be stored sequentially
Texture resolution extended to 8192x8192
Maximum number of textures a shader can use is 128 (was 16)
The instructions handling this array are executed by the GPU

Predicated Draw
Complex objects are first drawn using a simple box approximation. If drawing the box has no effect on the final image, the complex object is not drawn at all. This is also known as an occlusion query
With DirectX 10, this process is done entirely on the GPU, eliminating all CPU intervention


Main Features of DirectX 10

Stream Out
The vertex or geometry shader can output its results directly into graphics memory, bypassing the pixel shader
Results can be iteratively processed on the GPU only

State Object
State management must be low cost
The huge range of states in DirectX 9 is consolidated into 5 state objects: InputLayout, Sampler, Rasterizer, DepthStencil, Blend
State changes that previously required multiple commands now need a single call


Main Features of DirectX 10

Constant Buffers
Constants are pre-defined values used as parameters in all shader programs
Constants often require updating to reflect world changes
Constant updates produce significant API overhead
DirectX 10 updates constants in batch mode

New HDR Formats
R11G11B10
RGBE
Offer the same dynamic range as FP16, but take half the storage
Max limit is 32 bits per color component: the 8800 supports this high-precision rendering


Quick Comparison to DirectX 9


The Pipeline

Input Assembler
Vertex Shader
Geometry Shader
Stream Output
Set-up and Rasterization stage
Pixel Shader
Output Merger


A Simplified Diagram


Relation to 8800 GPU

The pipeline can make efficient use of the Unified Shader Architecture of the 8800 GPU


8800 GPU Architecture


Unified Shader Architecture


Unified Shader Architecture

Fixed Shader Unified Shader


Back to Pipeline : Input Assembler

Takes in 1D vertex data from up to 8 input streams
Converts data to a canonical format
Supports a mechanism that allows the IA to effectively replicate an object n times (instancing)


Vertex Shader

Used to transform vertices from object space to clip space
Reads a single vertex and produces a single vertex as output

The VS and other programmable stages share a common feature set (the common core) that includes an expanded set of floating-point, integer, control, and memory read instructions, allowing access to up to 128 memory buffers (textures) and 16 parameter (constant) buffers


Geometry Shader

Takes the vertices of a single primitive (point, line segment, or triangle) as input and generates the vertices of zero or more primitives

Triangles and lines are output as connected strips of vertices
Additional vertices can be generated on-the-fly, allowing displacement mapping
The geometry shader has the ability to access adjacency information

This enables implementation of some powerful new algorithms:

Realistic fur rendering
NPR rendering


Stream Output

Copies a subset of the vertex information to up to 4 1D output buffers in sequential order
Ideally the output data format of SO should be identical to the input data format of IA
But in practice SO writes 32-bit data types while IA reads 8- or 16-bit data types
Data conversion and packing can be implemented by a GS program


Relation to 8800 GPU

Key to the GeForce 8800 architecture is the use of numerous scalar stream processors (SPs)
Stream processors are highly efficient computing engines that perform calculations on an input stream and produce an output stream that can be used by other stream processors
Stream processors can be grouped in close proximity, and in large numbers, to provide immense parallel processing power.


Stream Processing Architecture


Unified FP Processor


Set-up and Rasterization Stage

Input to this stage is vertices
Output from this stage is a series of pixel fragments
Handles the following operations:

Clipping
Culling
Perspective divide
Viewport transform
Primitive set-up
Scissoring
Depth offset
Depth processing like hierarchical-z
Fragment generation


Pixel Shader

Input is a single pixel fragment
Produces a single output fragment consisting of 1-8 attribute values and an optional depth value
If the fragment is supposed to be rendered, it is output to up to 8 render targets
Each target represents a different representation of the scene


Output Merger

Input is a fragment from the pixel shader
Performs traditional stencil and depth testing
Uses a single unified depth/stencil buffer to specify the bind points for this buffer and 8 other render targets

Degree of multiple rendering enhanced to 8


Shader Model 4.0


Architectural Changes in Shader Model 4.0

In previous models, each programmable stage of the pipeline used separate virtual machines
Each VM had its own

Instruction set
General purpose registers
I/O registers for inter-stage communication
Resource binding points for attaching memory resources


Architectural Changes in Shader Model 4.0

Direct3D 10 defines a single common core virtual machine as the base for each of the programmable stages
In addition to the previous resources, it also has:

32-bit integer (arithmetic, bitwise, and conversion) instructions
Unified pool of general purpose and indexable registers (4096x4)
Separate unfiltered and filtered memory read instructions (load and sample instructions)
Decoupled texture bind points (128) and sampler state (16)
Shadow map sampling support
Multiple banks (16) of constant (parameter) buffers (4096x4)


Diagram


Advantages of Shader Model 4.0

The VM is close to providing all of the arithmetic, logic and flow control constructs available on a CPU
Resources have been substantially increased to meet the market demand for several years
With increasing resource consumption, hardware implementations are expected to degrade linearly, not fall rapidly
Can handle an increase in constant storage as well as efficient update of constants
This follows from the observation that groups of constants are updated at different frequencies, so the constant store is partitioned into different buffers
The data representation, arithmetic accuracy and behavior are more rigorously specified: they follow the IEEE 754 single-precision floating-point representation where possible


Power of DirectX 10

Next-Generation Effects
Next-Generation Instancing
Per-pixel Displacement Mapping
Procedural Growth Simulation


Conclusions

A large step forward
The geometry shader and stream output in particular should become a rich source of new ideas
Future work is directed at handling the growing bottleneck in content production


Introduction to the graphics pipeline of the PS3

Cedric Perthuis


Introduction

An overview of the hardware architecture with a focus on the graphics pipeline, and an introduction to the related software APIs

Aimed to be a high-level overview for academics and game developers

No announcements and no sneak previews of PS3 games in this presentation


Outline

Platform Overview
Graphics Pipeline
APIs and tools
Cell Computing example
Conclusion


Platform overview

Processing
3.2 GHz Cell: PPU and 7 SPUs
PPU: PowerPC based, 2 hardware threads
SPUs: dedicated vector processing units
RSX®: high-end GPU

Data flow
IO: Blu-ray, HDD, USB, Memory Cards, Gigabit Ethernet
Memory: main 256 MB, video 256 MB
SPUs, PPU and RSX® access main memory via a shared bus
RSX® pulls from main to video


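PS3 Architecture

[Diagram: block diagram of the PS3. Components: Cell (3.2 GHz), XDRAM (256 MB), RSX®, GDDR3 (256 MB), I/O bridge, HD/SD AV out, BD/DVD/CD-ROM drive (54 GB), USB 2.0 x 6, Gbit Ethernet/WiFi, removable storage (Memory Stick, SD, CF), BT controller. Link bandwidths labeled in the figure: 20 GB/s, 15 GB/s, 25.6 GB/s, 2.5 GB/s, 2.5 GB/s, 22.4 GB/s.]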


Focus on the Cell SPUs

The key strength of the PS3
Similar to PS2 Vector Units, but an order of magnitude more powerful
Main memory access via DMA: needs a software cache to do generic processing
Programmable in C/C++ or assembly
Programs: standalone executables or jobs

Ideal for sound, physics, graphics data preprocessing, or simply to offload the PPU


The Cell Processor

[Diagram: the PPE (L1: 32 KB I/D, L2: 512 KB) and seven SPEs (SPE 0 to SPE 6), each with a 256 KB local store (LS) and a DMA engine, together with the memory interface controller (MIC), XIO, and two Flex-IO interfaces (Flex-IO0, Flex-IO1) for I/O.]

The RSX® Graphics Processor

Based on a high end NVidia chip
Fully programmable pipeline: shader model 3.0
Floating point render targets
Hardware anti-aliasing (2x, 4x)
256 MB of dedicated video memory
PULL from the main memory at 20 GB/s
HD Ready (720p/1080p)
  720p = 921 600 pixels
  1080p = 2 073 600 pixels
A high end GPU adapted to work with the Cell Processor and HD displays
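
The pixel counts quoted above follow directly from the display dimensions. A minimal check, assuming the usual definitions of 720p as 1280x720 and 1080p as 1920x1080 (the slide gives only the totals; the names below are illustrative):

```python
# Assumed display dimensions; the slide states only the resulting pixel counts.
RESOLUTIONS = {"720p": (1280, 720), "1080p": (1920, 1080)}


def pixel_count(mode):
    width, height = RESOLUTIONS[mode]
    return width * height


pixels_720p = pixel_count("720p")
pixels_1080p = pixel_count("1080p")
ratio = pixels_1080p / pixels_720p    # how much more per-pixel work 1080p needs
```

Under these assumptions 1080p carries 2.25 times as many pixels as 720p, so per-pixel work scales by the same factor.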

The RSX® parallel pipeline

Command processing
  Fifo of commands, flip and sync
Texture management
  System or video memory
  Storage mode, compression
Vertex Processing
  Attribute fetch, vertex program
Fragment Processing
  Zcull, Fragment program, ROP

Xbox 360

512 MB system memory
IBM 3-way symmetric core processor
ATI GPU with embedded EDRAM
12x DVD
Optional hard disk

The Xbox 360 GPU

Custom silicon designed by ATI Technologies Inc.
500 MHz, 338 million transistors, 90 nm process
Supports vertex and pixel shader version 3.0+
  Includes some Xbox 360 extensions

The Xbox 360 GPU

10 MB embedded DRAM (EDRAM) for extremely high-bandwidth render targets
  Alpha blending, Z testing, multisample antialiasing are all free (even when combined)
Hierarchical Z logic and dedicated memory for early Z/stencil rejection
GPU is also the memory hub for the whole system
  22.4 GB/sec to/from system memory
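
To get a feel for what 10 MB of EDRAM holds, the sketch below estimates render target footprints. Everything beyond the 10 MB figure is an assumption made for illustration: 10 MB is read as 10 MiB, each sample is taken to need 4 bytes of color plus 4 bytes of depth/stencil, and the target is 1280x720. The slides do not state these formats.

```python
EDRAM_BYTES = 10 * 1024 * 1024    # 10 MB from the slide, read here as 10 MiB
BYTES_PER_SAMPLE = 4 + 4          # assumed: 32-bit color + 32-bit depth/stencil


def render_target_bytes(width, height, msaa_samples=1):
    """Footprint of a color + depth/stencil target under the assumptions above."""
    return width * height * msaa_samples * BYTES_PER_SAMPLE


def fits_in_edram(width, height, msaa_samples=1):
    return render_target_bytes(width, height, msaa_samples) <= EDRAM_BYTES


footprints = {n: render_target_bytes(1280, 720, n) for n in (1, 2, 4)}
fits = {n: fits_in_edram(1280, 720, n) for n in (1, 2, 4)}
```

Under these assumptions a 1280x720 target without MSAA fits, while the 2x and 4x MSAA versions exceed 10 MB; the slides do not say how such targets are handled.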

More About the Xbox 360 GPU

48 shader ALUs shared between pixel and vertex shading (unified shaders)
  Each ALU can co-issue one float4 op and one scalar op each cycle
  Non-traditional architecture
16 texture samplers
Dedicated branch instruction execution

More About the Xbox 360 GPU

2x and 4x hardware multi-sample anti-aliasing (MSAA)
Hardware tessellator
  N-patches, triangular patches, and rectangular patches
Can render to 4 render targets and a depth/stencil buffer simultaneously

GPU: Work Flow

Consumes instructions and data from a command buffer
  Ring buffer in system memory
  Managed by Direct3D, user configurable size (default 2 MB)
  Supports indirection for vertex data, index data, shaders, textures, render state, and command buffers
Up to 8 simultaneous contexts in-flight at once
  Changing shaders or render state is inexpensive, since a new context can be started up easily

GPU: Work Flow

Threads work on units of 64 vertices or pixels at once
Dedicated triangle setup, clipping, etc.
Pixels processed in 2x2 quads
Back buffers/render targets stored in EDRAM
  Alpha, Z, stencil test, and MSAA expansion done in EDRAM module
EDRAM contents copied to system memory by "resolve" hardware

GPU: Operations Per Clock

Write 8 pixels or 16 Z-only pixels to EDRAM
  With MSAA, up to 32 samples or 64 Z-only samples
Reject up to 64 pixels that fail Hierarchical Z testing
Vertex fetch sixteen 32-bit words from up to two different vertex streams

GPU: Operations Per Clock

16 bilinear texture fetches
48 vector and scalar ALU operations
Interpolate 16 float4 shader interpolants
32 control flow operations
Process one vertex, one triangle
Resolve 8 pixels to system memory from EDRAM
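
The per-clock figures can be turned into theoretical peak rates by multiplying by the 500 MHz clock given on the earlier Xbox 360 GPU slide. The sketch below does only that arithmetic; the dictionary keys are illustrative names, and real workloads will not sustain every peak at once.

```python
CLOCK_HZ = 500_000_000    # 500 MHz GPU clock, from the earlier slide

# Per-clock figures copied from the "Operations Per Clock" slides.
PER_CLOCK = {
    "pixel_writes": 8,
    "z_only_pixel_writes": 16,
    "hier_z_rejects": 64,
    "bilinear_fetches": 16,
    "alu_ops": 48,
    "triangles": 1,
    "resolved_pixels": 8,
}


def peak_per_second(operation):
    """Theoretical peak rate: operations per clock times clocks per second."""
    return PER_CLOCK[operation] * CLOCK_HZ


peak_rates = {name: peak_per_second(name) for name in PER_CLOCK}
```

For example, 8 pixel writes per clock at 500 MHz gives a peak of 4 billion pixels per second, and 16 bilinear fetches per clock gives 8 billion fetches per second.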

GPU: Hierarchical Z

Rough, low-resolution representation of Z/stencil buffer contents
Provides early Z/stencil rejection for pixel quads
11 bits of Z and 1 bit of stencil per block

GPU: Hierarchical Z

NOT tied to compression
  EDRAM BW advantage
Separate memory buffer on GPU
  Enough memory for 1280x720 2x MSAA
Provides a big performance boost when drawing complex scenes
  Draw opaque objects front to back
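
The advice to draw opaque objects front to back can be illustrated with a toy model of hierarchical Z. This is a sketch under stated assumptions, not the hardware algorithm: the block size is hypothetical, depth is a float with a "less than" test, stencil and the 11-bit quantization are ignored, and each block simply stores the farthest depth currently in it. A 2x2 quad is rejected early when it cannot be closer than anything already in its block.

```python
BLOCK = 4    # hypothetical block edge in pixels; the slides do not give the real size
FAR = 1.0    # cleared depth value; smaller depth means closer to the camera


class HiZBuffer:
    def __init__(self, width, height):
        # width and height are assumed to be multiples of BLOCK
        self.width, self.height = width, height
        self.depth = [[FAR] * width for _ in range(height)]
        self.block_max = [[FAR] * (width // BLOCK) for _ in range(height // BLOCK)]
        self.rejected_quads = 0
        self.shaded_pixels = 0

    def draw_quad(self, x, y, z):
        """Draw a 2x2 quad at constant depth z; (x, y) is its even top-left corner."""
        bx, by = x // BLOCK, y // BLOCK
        if z >= self.block_max[by][bx]:       # coarse test: nothing can pass
            self.rejected_quads += 1
            return
        for py in (y, y + 1):
            for px in (x, x + 1):
                if z < self.depth[py][px]:    # fine per-pixel depth test
                    self.depth[py][px] = z
                    self.shaded_pixels += 1
        # Keep the coarse value conservative: farthest depth left in the block.
        self.block_max[by][bx] = max(
            self.depth[py][px]
            for py in range(by * BLOCK, (by + 1) * BLOCK)
            for px in range(bx * BLOCK, (bx + 1) * BLOCK)
        )

    def draw_layer(self, z):
        """Cover the whole buffer with quads at depth z."""
        for y in range(0, self.height, 2):
            for x in range(0, self.width, 2):
                self.draw_quad(x, y, z)


def render(layer_depths, size=8):
    buf = HiZBuffer(size, size)
    for z in layer_depths:
        buf.draw_layer(z)
    return buf.rejected_quads, buf.shaded_pixels


front_to_back = render([0.2, 0.8])    # near layer first
back_to_front = render([0.8, 0.2])    # far layer first
```

In this model, drawing the near layer first lets every quad of the far layer be rejected before any per-pixel work, whereas the opposite order shades every pixel twice.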

GPU: Textures

16 bilinear texture samples per clock
  64bpp runs at half rate, 128bpp at quarter rate
  Trilinear at half rate
Unlimited dependent texture fetching
DXT decompression has 32 bit precision
  Better than Xbox (16-bit precision)

GPU: Resolve

Copies surface data from EDRAM to a texture in system memory
Required for render-to-texture and presentation to the screen
Can perform MSAA sample averaging or resolve individual samples
Can perform format conversions and biasing
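
The two resolve modes mentioned above, averaging MSAA samples or picking an individual sample, can be sketched in a few lines. This is an illustrative model only: the flat sample layout (all samples of a pixel stored contiguously), the RGB float tuples and the function names are assumptions, and format conversion and biasing are not modeled.

```python
def resolve_average(samples, pixel_count, samples_per_pixel):
    """Average each pixel's samples into one RGB value (box filter)."""
    resolved = []
    for pixel in range(pixel_count):
        start = pixel * samples_per_pixel
        group = samples[start:start + samples_per_pixel]
        resolved.append(tuple(
            sum(sample[channel] for sample in group) / samples_per_pixel
            for channel in range(3)
        ))
    return resolved


def resolve_sample(samples, pixel_count, samples_per_pixel, index):
    """Copy a single chosen sample per pixel instead of averaging."""
    return [samples[pixel * samples_per_pixel + index] for pixel in range(pixel_count)]


WHITE, BLACK = (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)
# Two pixels at 4x MSAA: an edge pixel half covered by white, and a fully black pixel.
msaa_samples = [WHITE, WHITE, BLACK, BLACK, BLACK, BLACK, BLACK, BLACK]

averaged = resolve_average(msaa_samples, pixel_count=2, samples_per_pixel=4)
first_samples = resolve_sample(msaa_samples, pixel_count=2, samples_per_pixel=4, index=0)
```

The half-covered edge pixel resolves to mid grey when averaged, which is the smoothing that MSAA is meant to provide.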

Direct3D 9+ on Xbox 360

Similar API to PC Direct3D 9.0
Optimized for Xbox 360 hardware
  No abstraction layers or drivers: it's direct to the metal
Exposes all Xbox 360 custom hardware features
  New state enums
  New APIs for finer-grained control and completely new features

Direct3D 9+ on Xbox 360

Communicates with GPU via a command buffer
  Ring buffer in system memory
Direct Command Buffer Playback support

Direct3D: Command Buffer

Ring buffer that allows the CPU to safely send commands to the GPU
Buffer is filled by CPU, and the GPU consumes the data
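
The producer/consumer relationship described above can be sketched as a single-threaded model. This is a minimal illustration, not the Direct3D implementation: the class and method names are hypothetical, commands are plain strings, and real hardware would need synchronization between the two sides. The CPU advances a write pointer, the GPU advances a read pointer, and the CPU must stall when the buffer is full.

```python
class CommandRingBuffer:
    """Fixed-size ring buffer with a CPU write pointer and a GPU read pointer."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.write = 0    # total commands written by the CPU
        self.read = 0     # total commands consumed by the GPU

    def try_write(self, command):
        """CPU side: returns False when the GPU has not yet freed a slot."""
        if self.write - self.read == self.capacity:
            return False
        self.slots[self.write % self.capacity] = command
        self.write += 1
        return True

    def try_read(self):
        """GPU side: returns the next command, or None if the buffer is empty."""
        if self.read == self.write:
            return None
        command = self.slots[self.read % self.capacity]
        self.read += 1
        return command


ring = CommandRingBuffer(capacity=4)
accepted = [ring.try_write(f"draw {i}") for i in range(5)]   # fifth write finds it full
consumed = ring.try_read()                                   # GPU frees one slot
accepted_after_read = ring.try_write("draw 4")               # CPU can continue
remaining = [ring.try_read() for _ in range(4)]
```

Keeping the pointers as running totals and taking the modulus only when indexing makes the full and empty cases easy to tell apart.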

[Figure: command ring buffer showing the CPU write pointer and the GPU read pointer. Labels in the diagram: Code Execution, Draw (four entries), Rendering.]

Shaders

Two options for writing shaders
  HLSL (with Xbox 360 extensions)
  GPU microcode (specific to the Xbox 360 GPU, similar to assembly but direct to hardware)
Recommendation: Use HLSL
  Easy to write and maintain
  Replace individual shaders with microcode if performance analysis warrants it