Modern GPU Architecture


CSE 694G Game Design and Project

Prof. Roger Crawfis



GPU vs CPU

A GPU is tailored for highly parallel operation, while a CPU executes programs serially.
For this reason, GPUs have many parallel execution units and higher transistor counts, while CPUs have few execution units and higher clock speeds.
A GPU is for the most part deterministic in its operation (though this is quickly changing).
GPUs have much deeper pipelines (several thousand stages vs 10-20 for CPUs).
GPUs have significantly faster and more advanced memory interfaces, as they need to shift around a lot more data than CPUs.



The GPU pipeline

The GPU receives geometry information from the CPU as an input and provides a picture as an output.
Let’s see how that happens.

[Pipeline diagram: host interface -> vertex processing -> triangle setup -> pixel processing -> memory interface]



Host Interface

The host interface is the communication bridge between the CPU and the GPU.
It receives commands from the CPU and also pulls geometry information from system memory.
It outputs a stream of vertices in object space with all their associated information (normals, texture coordinates, per-vertex color, etc.).




Vertex Processing

The vertex processing stage receives vertices from the host interface in object space and outputs them in screen space.
This may be a simple linear transformation, or a complex operation involving morphing effects.
Normals, texcoords, etc. are also transformed.
No new vertices are created in this stage, and no vertices are discarded (input/output has 1:1 mapping).
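As a rough illustration of the object-space to screen-space step, here is a minimal Python sketch. It folds the perspective divide and viewport mapping into the same function, following the simplified model on this slide. The helper names, the identity matrix and the 640x480 viewport are assumptions chosen for the example, not part of the slides.

```python
# Hypothetical helpers; matrix and viewport values are illustrative only.

def mat_vec(m, v):
    """Multiply a 4x4 row-major matrix by a 4-component column vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

def to_screen(vertex, mvp, width, height):
    """Transform one object-space vertex to (x, y) in pixels and z in [0, 1]."""
    x, y, z, w = mat_vec(mvp, [vertex[0], vertex[1], vertex[2], 1.0])
    ndc_x, ndc_y, ndc_z = x / w, y / w, z / w        # perspective divide
    sx = (ndc_x * 0.5 + 0.5) * width                 # viewport transform
    sy = (1.0 - (ndc_y * 0.5 + 0.5)) * height        # flip y: origin at top-left
    return (sx, sy, ndc_z * 0.5 + 0.5)

def process_vertices(vertices, mvp, width, height):
    """One output vertex per input vertex: none created, none discarded."""
    return [to_screen(v, mvp, width, height) for v in vertices]

IDENTITY = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
triangle = [(-1.0, -1.0, 0.0), (1.0, -1.0, 0.0), (0.0, 1.0, 0.0)]
screen = process_vertices(triangle, IDENTITY, 640, 480)
```

The list comprehension in process_vertices makes the 1:1 input/output mapping explicit.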




Triangle setup

In this stage geometry information becomes raster information (screen-space geometry is the input, pixels are the output).
Prior to rasterization, triangles that are backfacing or are located outside the viewing frustum are rejected.
Some GPUs also do some hidden surface removal at this stage.




Triangle Setup (cont)

A fragment is generated if and only if its center is inside the triangle.

Every fragment generated has its attributes computed as the perspective-correct interpolation of the attributes of the three vertices that make up the triangle.
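The two rules above can be sketched in Python as follows. This is a minimal illustration, not how any particular GPU implements it: the function names and the vertex layout (x, y, w, attr) are assumptions, and real hardware applies a tie-breaking fill rule for centers that land exactly on an edge, which this sketch ignores by treating edges as inside.

```python
def edge(a, b, p):
    """Signed area term: positive when p lies to the left of the edge a->b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def rasterize(v0, v1, v2, width, height):
    """Yield (x, y, attr) for every pixel whose center lies inside the triangle.

    Each vertex is (x, y, w, attr): screen position, clip-space w, one scalar attribute.
    """
    area = edge(v0, v1, v2)
    if area == 0:
        return                                   # degenerate triangle: no fragments
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)               # pixel center
            b0 = edge(v1, v2, p) / area          # barycentric weights
            b1 = edge(v2, v0, p) / area
            b2 = edge(v0, v1, p) / area
            if b0 < 0 or b1 < 0 or b2 < 0:
                continue                         # center outside: no fragment
            # Perspective-correct: interpolate attr/w and 1/w linearly, then divide.
            inv_w = b0 / v0[2] + b1 / v1[2] + b2 / v2[2]
            num = b0 * v0[3] / v0[2] + b1 * v1[3] / v1[2] + b2 * v2[3] / v2[2]
            yield (x, y, num / inv_w)
```

When all three w values are equal, the division cancels and the result reduces to plain linear interpolation.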




Fragment Processing

Each fragment provided by triangle setup is fed into fragment processing as a set of attributes (position, normal, texcoord, etc.), which are used to compute the final color for this pixel.
The computations taking place here include texture mapping and math operations.
This stage is typically the bottleneck in modern applications.




Memory Interface

Fragment colors provided by the previous stage are written to the framebuffer.
This used to be the biggest bottleneck before fragment processing took over.
Before the final write occurs, some fragments are rejected by the z-buffer, stencil and alpha tests.
On modern GPUs, z and color are compressed to reduce framebuffer bandwidth (but not size).




Programmability in the GPU

Vertex and fragment processing, and now triangle setup, are programmable.
The programmer can write programs that are executed for every vertex as well as for every fragment.
This allows fully customizable geometry and shading effects that go well beyond the generic look and feel of older 3D applications.




Diagram of a modern GPU

[Diagram: input from CPU -> host interface -> vertex processing -> triangle setup -> pixel processing -> memory interface, with four 64-bit channels to memory]



CPU/GPU interaction

The CPU and GPU inside the PC work in parallel with each other.
There are two "threads" going on, one for the CPU and one for the GPU, which communicate through a command buffer:

[Diagram: command buffer of pending GPU commands; the CPU writes commands at one end and the GPU reads commands from the other]



CPU/GPU interaction (cont)

If this command buffer is drained empty, we are CPU limited and the GPU will spin around waiting for new input. All the GPU power in the universe isn’t going to make your application faster!
If the command buffer fills up, the CPU will spin around waiting for the GPU to consume it, and we are effectively GPU limited.
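A toy simulation can make the two cases concrete. The sketch below is an assumption-laden model, not a description of any real driver: time advances in abstract ticks, every command costs the same, and the names simulate, cpu_cost and gpu_cost are invented for the example.

```python
from collections import deque

def simulate(cpu_cost, gpu_cost, capacity, ticks):
    """Return (cpu_stall_ticks, gpu_idle_ticks) for a bounded command buffer.

    cpu_cost / gpu_cost: ticks needed to produce / execute one command.
    """
    buffer = deque()
    cpu_left, gpu_left = cpu_cost, 0     # ticks left on the command in progress
    cpu_stall = gpu_idle = 0
    for _ in range(ticks):
        # CPU side: work on the next command, then try to enqueue it.
        if cpu_left > 0:
            cpu_left -= 1
        if cpu_left == 0:
            if len(buffer) < capacity:
                buffer.append("draw")
                cpu_left = cpu_cost
            else:
                cpu_stall += 1           # buffer full: CPU spins (GPU limited)
        # GPU side: fetch the next command when idle, then execute one tick of it.
        if gpu_left == 0:
            if buffer:
                buffer.popleft()
                gpu_left = gpu_cost
            else:
                gpu_idle += 1            # buffer empty: GPU spins (CPU limited)
        if gpu_left > 0:
            gpu_left -= 1
    return cpu_stall, gpu_idle

cpu_limited = simulate(cpu_cost=4, gpu_cost=1, capacity=8, ticks=100)
gpu_limited = simulate(cpu_cost=1, gpu_cost=4, capacity=8, ticks=100)
```

In the first run the GPU idles and the CPU never stalls; in the second the buffer fills and only the CPU stalls. Making the idle side faster changes nothing in either case.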



CPU/GPU interaction (cont)

Another important point to consider is that programs that use the GPU do not follow the traditional sequential execution model.
In the CPU program below, the object is not drawn after statement A and before statement B:

• Statement A
• API call to draw object
• Statement B

Instead, all the API call does is add the command to draw the object to the GPU command buffer.



Synchronization issues

This leads to a number of synchronization considerations.
In the figure below, the CPU must not overwrite the data in the "yellow" block until the GPU is done with the "black" command, which references that data:

[Figure: command buffer in which the "black" command references a separate "yellow" data block]



Synchronization issues (cont)

Modern APIs implement semaphore-style operations to keep this from causing problems.
If the CPU attempts to modify a piece of data that is being referenced by a pending GPU command, it will have to spin around waiting until the GPU is finished with that command.
While this ensures correct operation, it is not good for performance, since there are a million other things we’d rather do with the CPU instead of spinning.
The GPU will also drain a big part of the command buffer, thereby reducing its ability to run in parallel with the CPU.



Inlining data

One way to avoid these problems is to inline all data to the command buffer and avoid references to separate data:

[Figure: command buffer with the data block stored inline]

However, this is also bad for performance, since we may need to copy several Mbytes of data instead of merely passing around a pointer.



Renaming data

A better solution is to allocate a new data block and initialize that one instead; the old block will be deleted once the GPU is done with it.
Modern APIs do this automatically, provided you initialize the entire block (if you only change a part of the block, renaming cannot occur).

Better yet, allocate all your data at startup and don’t change it for the duration of execution (not always possible, however).

[Figure: command buffer referencing several separate data blocks]
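The renaming rule can be modelled in a few lines of Python. This is a toy sketch under stated assumptions: the class and method names are hypothetical, "GPU is done" is reduced to a single gpu_finished() call, and a real API would block on a partial update rather than return a string.

```python
class RenamedBuffer:
    """Toy model of buffer renaming; every name here is hypothetical."""

    def __init__(self, size):
        self.size = size
        self.current = bytearray(size)
        self.in_flight = []              # blocks referenced by pending GPU commands

    def draw(self):
        """Record a GPU command that references the current block."""
        self.in_flight.append(self.current)
        return self.current

    def update(self, data, offset=0):
        """Write data; return 'in_place', 'renamed', or 'stall'."""
        busy = any(block is self.current for block in self.in_flight)
        if not busy:
            self.current[offset:offset + len(data)] = data
            return "in_place"
        if offset == 0 and len(data) == self.size:
            self.current = bytearray(data)   # fresh block; old one stays alive for the GPU
            return "renamed"
        return "stall"                   # partial update of a busy block: must wait

    def gpu_finished(self):
        """The GPU has consumed all pending commands; old blocks can be freed."""
        self.in_flight.clear()
```

Only a full overwrite of a busy block can be renamed, because the new block would otherwise need the old contents that the GPU may still be reading.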



GPU readbacks

The output of a GPU is a rendered image on the screen. What will happen if the CPU tries to read it?

The GPU must be synchronized with the CPU, i.e. it must drain its entire command buffer, and the CPU must wait while this happens.

[Figure: command buffer of pending GPU commands]



GPU readbacks (cont)

We lose all parallelism, since first the CPU waits for the GPU, then the GPU waits for the CPU (because the command buffer has been drained).
Both CPU and GPU performance take a nosedive.

Bottom line: the image the GPU produces is for your eyes, not for the CPU (treat the CPU -> GPU highway as a one-way street).



Some more GPU tips

Since the GPU is highly parallel and deeply pipelined, try to dispatch large batches with each drawing call.
Sending just one triangle at a time will not occupy all of the GPU’s several vertex/pixel processors, nor will it fill its deep pipelines.
Since all GPUs today use the z-buffer algorithm to do hidden surface removal, rendering objects front-to-back is faster than back-to-front (painter’s algorithm) or random ordering.
Of course, there is no point in front-to-back sorting if you are already CPU limited.
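The benefit of front-to-back ordering can be seen by counting how many fragments survive the depth test. The sketch below assumes the depth test runs before shading (early Z), which the slide does not state explicitly, and uses a hypothetical count_shaded helper on axis-aligned rectangles.

```python
def count_shaded(quads, width, height):
    """Draw rectangles (x0, y0, x1, y1, depth) in the given order.

    Returns how many fragments pass the depth test, i.e. how many must be
    shaded and written, assuming the test happens before shading.
    """
    zbuffer = [[float("inf")] * width for _ in range(height)]
    shaded = 0
    for x0, y0, x1, y1, depth in quads:
        for y in range(y0, y1):
            for x in range(x0, x1):
                if depth < zbuffer[y][x]:    # smaller depth = closer to the viewer
                    zbuffer[y][x] = depth
                    shaded += 1
    return shaded

layers = [(0, 0, 8, 8, float(d)) for d in range(1, 5)]   # four full-screen layers
front_to_back = count_shaded(layers, 8, 8)
back_to_front = count_shaded(list(reversed(layers)), 8, 8)
```

With four overlapping layers, front-to-back shades each pixel once, while back-to-front shades each pixel four times.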



Graphics Hardware Abstraction

OpenGL and DirectX provide an abstraction of the hardware.



CS248 Lecture 14 Kurt Akeley, Fall 2007

Trend from pipeline to data parallelism

[Figure: geometry pipelines of the Clark "Geometry Engine" (1983), SGI 4D/GTX (1988) and SGI RealityEngine (1992). Stage labels in the figure: command processor; round-robin aggregation; coordinate and normal transform; lighting; clip testing; clipping state; divide by w (clipping); viewport; primitive assembly; backface cull; coordinate transform; 6-plane frustum clipping; divide by w; viewport.]




Queueing

FIFO buffering (first-in, first-out) is provided between task stages
Accommodates variation in execution time
Provides elasticity to allow unified load balancing to work

FIFOs can also be unified
Share a single large memory with multiple head-tail pairs
Allocate as required

[Figure: application -> vertex assembly -> vertex operations -> primitive assembly, with a FIFO between consecutive stages]




Data locality

Prior to texture mapping:
Vertex pipeline was a stream processor
Each work element (vertex, primitive, fragment) carried all the state it needed
Modal state was local to the pipeline stage
Assembly stages operated on adjacent work elements
Data locality was inherent in this model

Post texture mapping:
All application-programmable stages have memory access (and use it)
So the vertex pipeline is no longer a stream processor
Data locality must be fought for …

[Figure: application -> vertex assembly -> vertex operations -> primitive assembly -> primitive operations -> rasterization -> fragment operations -> framebuffer -> display]




Post-texture mapping data locality

(simplified)

Modern memory (DRAM) operates in large blocks
Memory is a 2-D array
Access is to an entire row

To make efficient use of memory bandwidth, all the data in a block must be used

Two things can be done:
Aggregate read and write requests
Memory controller and cache
Complex part of GPU design
Organize memory contents coherently (blocking)



The nVidia G80 GPU
► 128 streaming floating point processors @ 1.5 GHz
► 1.5 GB shared RAM with 86 GB/s bandwidth
► 500 GFLOPS on one chip (single precision)



Why are GPUs so fast?

The entertainment industry has driven the economy of these chips
Males age 15-35 buy $10B in video games / year

Moore’s Law ++
Simplified design (stream processing)
Single-chip designs

A modern GPU has more ALUs



nVidia G80 GPU Architecture Overview

• 16 multiprocessor blocks
• Each MP block has:
  • 8 streaming processors (IEEE 754 single-precision floating-point compliant)
  • 16 KB shared memory
  • 64 KB constant cache
  • 8 KB texture cache
• Each processor can access all of the memory at 86 GB/s, but with different latencies:
  • Shared: 2-cycle latency
  • Device: 300-cycle latency



A Specialized Processor

Very efficient for:
Fast parallel floating point processing
Single Instruction Multiple Data operations
High computation per memory access

Not as efficient for:
Double precision
Logical operations on integer data
Branching-intensive operations
Random access, memory-intensive operations




Implementation = abstraction (from lecture 2)

[Figure: NVIDIA GeForce 8800 OpenGL pipeline (source: NVIDIA). Hardware blocks: data assembler; vertex, primitive and fragment thread issue; setup / raster / ZCull; a thread processor feeding eight groups of streaming processors (SP), each group with an L1 cache and a TF unit; six L2 cache and framebuffer (FB) partitions. Shown beside the abstract pipeline: application -> vertex assembly -> vertex operations -> primitive assembly -> rasterization -> fragment operations -> framebuffer.]




Correspondence (by color)

[Figure: the same NVIDIA GeForce 8800 OpenGL pipeline diagram, color-coded to match hardware blocks to abstract pipeline stages. Legend: application-programmable parallel processor; fixed-function assembly processors; fixed-function framebuffer operations. The rasterization stage is labelled "Rasterization (fragment assembly)", and an annotation reads "this was missing".]




Texture Blocking

4x4 texels
Cache line size
Cache size
6D organization

[Figure: texel address layout in 4x4 blocks, with address fields baseAddress, s1, t1, s2, t2, s3, t3 and coordinate pairs (s1,t1), (s2,t2), (s3,t3)]

Source: Pat Hanrahan
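To see why blocking helps, compare a row-major address with a blocked one. The sketch below uses a single level of 4x4 blocks (a 4D layout); the 6D organization on the slide adds one more level of blocks on top of this. Function names are invented for the example, addresses are in texels, and the texture width is assumed to be a multiple of 4.

```python
BLOCK = 4  # 4x4 texel blocks, as on the slide

def linear_address(s, t, width):
    """Row-major layout: vertically adjacent texels are `width` texels apart."""
    return t * width + s

def blocked_address(s, t, width, base_address=0):
    """Blocked layout: all 16 texels of a 4x4 block are contiguous."""
    blocks_per_row = width // BLOCK
    block_index = (t // BLOCK) * blocks_per_row + (s // BLOCK)
    within_block = (t % BLOCK) * BLOCK + (s % BLOCK)
    return base_address + block_index * BLOCK * BLOCK + within_block

# Address span touched by a 2x2 bilinear footprint at (8, 8) in a 256-wide texture.
footprint = [(8, 8), (9, 8), (8, 9), (9, 9)]
linear = [linear_address(s, t, 256) for s, t in footprint]
blocked = [blocked_address(s, t, 256) for s, t in footprint]
linear_span = max(linear) - min(linear) + 1
blocked_span = max(blocked) - min(blocked) + 1
```

In the row-major layout the four texels straddle two rows of memory, while in the blocked layout they sit within one 16-texel block, which is what lets a single cache line or DRAM access serve the whole footprint.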



Direct3D 10 System and NVIDIA GeForce 8800 GPU



Overview

Before graphics-programming APIs were introduced, 3D applications issued their commands directly to the graphics hardware
Fast
Became infeasible with the increasing variety of graphics hardware

Graphics APIs like DirectX and OpenGL act as a middle layer between the application and the graphics hardware
Using this model, applications write one set of code and the API does the job of translating this code to instructions that can be understood by the underlying hardware

A product of detailed collaboration among
Application developers
Hardware designers
API/runtime architects



Problems with Earlier Versions

High state-change overhead
Changing state (in terms of vertex formats, textures, shaders, shader parameters, blending modes, etc.) incurs a high overhead

Excessive variation in hardware accelerator capabilities

Frequent CPU and GPU synchronization
Generating new vertex data or building a cube map requires more communication, reducing efficiency

Instruction type and data type limitations
Neither vertex nor pixel shaders support integer instructions
Pixel shader accuracy for FP arithmetic can be improved

Resource limitations
Resource sizes were modest
Algorithms had to be scaled back or broken into several passes



Main Features of DirectX 10

Main objective is to reduce CPU overhead
Some of the key changes are:

Faster and cleaner runtime
The programmable pipeline is directed using a low-level abstraction layer called the runtime. It hides the differences between varying applications and provides device-independent resource management
The runtime of DirectX 10 has been redesigned to work more closely with the graphics hardware
The GeForce 8800 architecture has been designed with the changes in the runtime in mind
Treatment of validation enhances performance



Main Features of DirectX 10 (cont)

New data structure for textures
Switching between multiple textures causes high state-change cost
DirectX 9 used texture atlases, but the approach was limited to 4096x4096 and resulted in incorrect filtering at texture boundaries
DirectX 10 uses texture arrays: up to 512 textures can be stored sequentially
Texture resolution extended to 8192x8192
Maximum number of textures a shader can use is 128 (was 16)
The instructions handling this array are executed by the GPU

Predicated draw
Complex objects are first drawn using a simple box approximation. If drawing the box has no effect on the final image, the complex object is not drawn at all. This is also known as an occlusion query
With DirectX 10, this process is done entirely on the GPU, eliminating all CPU intervention




Stream Out
The vertex or geometry shader can output its results directly into graphics memory, bypassing the pixel shader
Results can be iteratively processed on the GPU only

State Objects
State management must be low cost
The huge range of states in DirectX 9 is consolidated into 5 state objects: InputLayout, Sampler, Rasterizer, DepthStencil, Blend
State changes that previously required multiple commands now need a single call




Constant Buffers
Constants are pre-defined values used as parameters in all shader programs
Constants often require updating to reflect world changes
Constant updates produce significant API overhead
DirectX 10 updates constants in batch mode

New HDR Formats
R11G11B10
RGBE
Offer the same dynamic range as FP16, but take half the storage
Max limit is 32 bits per color component: the 8800 supports this high-precision rendering



Quick Comparison to DirectX 9



The Pipeline

Input Assembler
Vertex Shader
Geometry Shader
Stream Output
Set-up and Rasterization stage
Pixel Shader
Output Merger



A Simplified Diagram



Relation to 8800 GPU

The pipeline can make efficient use of the Unified Shader Architecture of the 8800 GPU



8800 GPU Architecture



Unified Shader Architecture



Unified Shader Architecture (cont)

[Figure: fixed shader vs unified shader]



Back to Pipeline : Input Assembler

Takes in 1D vertex data from up to 8 input streams
Converts data to a canonical format
Supports a mechanism that allows the IA to effectively replicate an object n times (instancing)



Vertex Shader

Used to transform vertices from object space to clip space
Reads a single vertex and produces a single vertex as output

The VS and other programmable stages share a common feature set (the common core) that includes an expanded set of floating-point, integer, control, and memory read instructions, allowing access to up to 128 memory buffers (textures) and 16 parameter (constant) buffers



Geometry Shader

Takes the vertices of a single primitive (point, line segment, or triangle) as input and generates the vertices of zero or more primitives
Triangles and lines are output as connected strips of vertices
Additional vertices can be generated on the fly, allowing displacement mapping
The geometry shader can access adjacency information

This enables implementation of some powerful new algorithms:
Realistic fur rendering
Non-photorealistic rendering (NPR)
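The "one primitive in, zero or more primitives out" contract can be sketched in Python. This is only an analogy for the data flow: real geometry shaders are written in a shading language and emit strips, whereas this hypothetical geometry_shader function returns a plain list of triangles, and the midpoint subdivision is an arbitrary example of generating new vertices.

```python
def geometry_shader(triangle, min_area=0.0):
    """Take one triangle (three (x, y) vertices); return a list of output triangles.

    Returns [] to discard the primitive, or four triangles from midpoint subdivision.
    """
    a, b, c = triangle
    area = abs((b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])) / 2.0
    if area <= min_area:
        return []                                    # zero primitives out

    def mid(p, q):
        return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)

    ab, bc, ca = mid(a, b), mid(b, c), mid(c, a)     # vertices created on the fly
    return [(a, ab, ca), (ab, b, bc), (ca, bc, c), (ab, bc, ca)]

subdivided = geometry_shader(((0, 0), (2, 0), (0, 2)))
discarded = geometry_shader(((0, 0), (1, 1), (2, 2)))   # degenerate: zero area
```

Contrast this with the vertex shader, which can neither create nor discard vertices.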



Stream Output

Copies a subset of the vertex information to up to 4 1D output buffers in sequential order
Ideally the output data format of SO should be identical to the input data format of IA
In practice, SO writes 32-bit data types while IA reads 8- or 16-bit data
Data conversion and packing can be implemented by a GS program



Relation to 8800 GPU

Key to the GeForce 8800 architecture is the use of numerous scalar stream processors (SPs)
Stream processors are highly efficient computing engines that perform calculations on an input stream and produce an output stream that can be used by other stream processors
Stream processors can be grouped in close proximity, and in large numbers, to provide immense parallel processing power



Stream Processing Architecture



Unified FP Processor



Set-up and Rasterization Stage

Input to this stage is vertices
Output from this stage is a series of pixel fragments
Handles the following operations:
Clipping
Culling
Perspective divide
Viewport transform
Primitive set-up
Scissoring
Depth offset
Depth processing such as hierarchical Z
Fragment generation



Pixel Shader

Input is a single pixel fragment
Produces a single output fragment consisting of 1-8 attribute values and an optional depth value
If the fragment is supposed to be rendered, it is output to up to 8 render targets
Each target represents a different representation of the scene



Output Merger

Input is a fragment from the pixel shader
Performs traditional stencil and depth testing
Uses a single unified depth/stencil buffer to specify the bind points for this buffer and 8 other render targets
Degree of multiple rendering enhanced to 8



Shader Model 4.0

Architectural Changes in Shader Model 4.0

In previous models, each programmable stage of the pipeline used a separate virtual machine
Each VM had its own:
Instruction set
General purpose registers
I/O registers for inter-stage communication
Resource binding points for attaching memory resources

Architectural Changes in Shader Model 4.0 (cont)

Direct3D 10 defines a single common core virtual machine as the base for each of the programmable stages
In addition to the previous resources, it also has:
32-bit integer (arithmetic, bitwise, and conversion) instructions
Unified pool of general purpose and indexable registers (4096x4)
Separate unfiltered and filtered memory read instructions (load and sample instructions)
Decoupled texture bind points (128) and sampler state (16)
Shadow map sampling support
Multiple banks (16) of constant (parameter) buffers (4096x4)



Diagram



Advantages of Shader Model 4.0

The VM is close to providing all of the arithmetic, logic and flow control constructs available on a CPU
Resources have been substantially increased to meet market demand for several years
With increasing resource consumption, hardware implementations are expected to degrade linearly, not fall off rapidly
Can handle an increase in constant storage as well as efficient update of constants
This follows from the observation that groups of constants are updated at different frequencies, so the constant store is partitioned into different buffers
Data representation, arithmetic accuracy and behavior are more rigorously specified: they follow the IEEE 754 single-precision floating-point representation where possible



Power of DirectX 10

Next-generation effects
Next-generation instancing
Per-pixel displacement mapping
Procedural growth simulation



Conclusions

A large step forward
The geometry shader and stream output in particular should become a rich source of new ideas
Future work is directed at handling the growing bottleneck in content production



Introduction to the graphics pipeline of the PS3

Cedric Perthuis



Introduction

An overview of the hardware architecture with a focus on the graphics pipeline, and an introduction to the related software APIs

Aimed to be a high-level overview for academics and game developers

No announcements and no sneak previews of PS3 games in this presentation



Outline

Platform overview
Graphics pipeline
APIs and tools
Cell computing example
Conclusion



Platform overview

Processing
3.2 GHz Cell: PPU and 7 SPUs
PPU: PowerPC based, 2 hardware threads
SPUs: dedicated vector processing units
RSX®: high-end GPU

Data flow
IO: Blu-ray, HDD, USB, memory cards, Gigabit Ethernet
Memory: main 256 MB, video 256 MB
SPUs, PPU and RSX® access main memory via a shared bus
RSX® pulls from main memory to video memory



[Figure: PS3 architecture. Blocks: Cell (3.2 GHz), RSX®, XDRAM (256 MB), GDDR3 (256 MB), I/O bridge, AV out, BD/DVD/CD-ROM drive (54 GB), USB 2.0 x 6, Gbit Ethernet/WiFi, removable storage (Memory Stick, SD, CF), BT controller. Link bandwidths shown: 25.6 GB/s, 22.4 GB/s, 20 GB/s, 15 GB/s, 2.5 GB/s, 2.5 GB/s.]



Focus on the Cell SPUs

The key strength of the PS3
Similar to PS2 Vector Units, but an order of magnitude more powerful
Main memory access via DMA: needs a software cache to do generic processing
Programmable in C/C++ or assembly
Programs: standalone executables or jobs

Ideal for sound, physics, graphics data preprocessing, or simply to offload the PPU



[Figure: the Cell processor. Seven SPEs (SPE 0 to SPE 6), each with a 256 KB local store (LS) and a DMA engine; the PPE with 32 KB L1 instruction/data caches and a 512 KB L2 cache; the memory interface controller (MIC) with XIO; and the I/O interfaces Flex-IO0 and Flex-IO1.]



The RSX® Graphics Processor

Based on a high-end NVidia chip
Fully programmable pipeline: shader model 3.0
Floating point render targets
Hardware anti-aliasing (2x, 4x)
256 MB of dedicated video memory
PULL from the main memory at 20 GB/s
HD Ready (720p/1080p)
720p = 921 600 pixels
1080p = 2 073 600 pixels

A high-end GPU adapted to work with the Cell processor and HD displays



The RSX® parallel pipeline

Command processing
FIFO of commands, flip and sync

Texture management
System or video memory
Storage mode, compression

Vertex processing
Attribute fetch, vertex program

Fragment processing
Zcull, fragment program, ROP



Xbox 360

512 MB system memory
IBM 3-way symmetric core processor
ATI GPU with embedded EDRAM
12x DVD
Optional hard disk



The Xbox 360 GPU

Custom silicon designed by ATI Technologies Inc.
500 MHz, 338 million transistors, 90 nm process
Supports vertex and pixel shader version 3.0+
Includes some Xbox 360 extensions



The Xbox 360 GPU

10 MB embedded DRAM (EDRAM) for extremely high-bandwidth render targets
Alpha blending, Z testing, multisample antialiasing are all free (even when combined)
Hierarchical Z logic and dedicated memory for early Z/stencil rejection
GPU is also the memory hub for the whole system
22.4 GB/sec to/from system memory
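A quick sizing exercise shows how far 10 MB of EDRAM goes. The sketch below assumes 4 bytes of color and 4 bytes of depth/stencil per sample, and assumes that a target larger than the EDRAM is handled in EDRAM-sized pieces; neither assumption comes from the slides, and the function names are invented.

```python
EDRAM_BYTES = 10 * 1024 * 1024           # 10 MB of embedded DRAM, from the slide

def render_target_bytes(width, height, samples=1, color_bytes=4, depth_bytes=4):
    """Bytes for color plus depth/stencil; the 4-byte formats are an assumption."""
    return width * height * samples * (color_bytes + depth_bytes)

def pieces_needed(width, height, samples=1):
    """How many EDRAM-sized pieces the render target would occupy."""
    size = render_target_bytes(width, height, samples)
    return -(-size // EDRAM_BYTES)        # ceiling division

no_msaa = pieces_needed(1280, 720)
msaa_2x = pieces_needed(1280, 720, samples=2)
msaa_4x = pieces_needed(1280, 720, samples=4)
```

Under these assumptions a 1280x720 target fits in EDRAM without MSAA, but no longer fits once multisampling multiplies the per-pixel storage.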



More About the Xbox 360 GPU

48 shader ALUs shared between pixel and vertex shading (unified shaders)
Each ALU can co-issue one float4 op and one scalar op each cycle
Non-traditional architecture
16 texture samplers
Dedicated branch instruction execution



More About the Xbox 360 GPU

2x and 4x hardware multi-sample anti-aliasing (MSAA)
Hardware tessellator
N-patches, triangular patches, and rectangular patches
Can render to 4 render targets and a depth/stencil buffer simultaneously



GPU: Work Flow

Consumes instructions and data from a command buffer
Ring buffer in system memory
Managed by Direct3D, user-configurable size (default 2 MB)
Supports indirection for vertex data, index data, shaders, textures, render state, and command buffers

Up to 8 simultaneous contexts in flight at once
Changing shaders or render state is inexpensive, since a new context can be started up easily



GPU: Work Flow

Threads work on units of 64 vertices or pixels at once
Dedicated triangle setup, clipping, etc.
Pixels processed in 2x2 quads
Back buffers/render targets stored in EDRAM
Alpha, Z, stencil test, and MSAA expansion done in EDRAM module
EDRAM contents copied to system memory by "resolve" hardware



GPU: Operations Per Clock

Write 8 pixels or 16 Z-only pixels to EDRAM
With MSAA, up to 32 samples or 64 Z-only samples
Reject up to 64 pixels that fail hierarchical Z testing
Vertex fetch sixteen 32-bit words from up to two different vertex streams



GPU: Operations Per Clock

16 bilinear texture fetches
48 vector and scalar ALU operations
Interpolate 16 float4 shader interpolants
32 control flow operations
Process one vertex, one triangle
Resolve 8 pixels to system memory from EDRAM
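Multiplying the per-clock figures by the 500 MHz clock quoted earlier gives theoretical peak rates. The sketch below does only that arithmetic; the dictionary keys paraphrase the slides, peak_rate is a hypothetical helper, and sustained rates in a real workload will be lower.

```python
CLOCK_HZ = 500_000_000                    # 500 MHz GPU clock, from the slides

PER_CLOCK = {                             # per-clock figures quoted on the slides
    "pixels written to EDRAM": 8,
    "Z-only pixels written to EDRAM": 16,
    "bilinear texture fetches": 16,
    "vector and scalar ALU operations": 48,
    "pixels resolved to system memory": 8,
}

def peak_rate(per_clock, clock_hz=CLOCK_HZ):
    """Theoretical peak operations per second."""
    return per_clock * clock_hz

for name, per_clock in PER_CLOCK.items():
    print(f"{name}: {peak_rate(per_clock) / 1e9:.1f} billion per second")
```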



GPU: Hierarchical Z

Rough, low-resolution representation of Z/stencil buffer contents
Provides early Z/stencil rejection for pixel quads
11 bits of Z and 1 bit of stencil per block



GPU: Hierarchical Z

NOT tied to compression
EDRAM BW advantage
Separate memory buffer on GPU
Enough memory for 1280x720 2x MSAA
Provides a big performance boost when drawing complex scenes
Draw opaque objects front to back



GPU: Textures

16 bilinear texture samples per clock
64bpp runs at half rate, 128bpp at quarter rate
Trilinear at half rate
Unlimited dependent texture fetching
DXT decompression has 32-bit precision
Better than Xbox (16-bit precision)



GPU: Resolve

Copies surface data from EDRAM to a texture in system memory
Required for render-to-texture and presentation to the screen
Can perform MSAA sample averaging or resolve individual samples
Can perform format conversions and biasing



Direct3D 9+ on Xbox 360

Similar API to PC Direct3D 9.0
Optimized for Xbox 360 hardware
No abstraction layers or drivers: it’s direct to the metal
Exposes all Xbox 360 custom hardware features
New state enums
New APIs for finer-grained control and completely new features



Direct3D 9+ on Xbox 360

Communicates with GPU via a command buffer
Ring buffer in system memory
Direct Command Buffer Playback support



Direct3D: Command Buffer

Ring buffer that allows the CPU to safely send commands to the GPU
Buffer is filled by CPU, and the GPU consumes the data

[Figure: ring buffer of Draw commands, with the CPU write pointer advancing as code executes and the GPU read pointer advancing as rendering proceeds]
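The write-pointer/read-pointer mechanics can be sketched as a small Python class. This is a single-threaded illustration with hypothetical names; it is not the Direct3D implementation, and a real ring buffer would need memory fences or similar synchronization between the two processors.

```python
class CommandRing:
    """Fixed-size ring buffer with a CPU write pointer and a GPU read pointer."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.write = 0                    # next slot the CPU will fill
        self.read = 0                     # next slot the GPU will consume
        self.count = 0                    # commands currently pending

    def cpu_write(self, command):
        """Return False when the ring is full and the CPU would have to wait."""
        if self.count == len(self.slots):
            return False
        self.slots[self.write] = command
        self.write = (self.write + 1) % len(self.slots)   # wrap around
        self.count += 1
        return True

    def gpu_read(self):
        """Return the oldest pending command, or None when the ring is empty."""
        if self.count == 0:
            return None
        command = self.slots[self.read]
        self.read = (self.read + 1) % len(self.slots)     # wrap around
        self.count -= 1
        return command
```

The CPU can never overwrite a command the GPU has not yet read, because cpu_write refuses to advance past the read pointer when the ring is full.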



Shaders

Two options for writing shaders
HLSL (with Xbox 360 extensions)
GPU microcode (specific to the Xbox 360 GPU, similar to assembly but direct to hardware)

Recommendation: use HLSL
Easy to write and maintain
Replace individual shaders with microcode if performance analysis warrants it
