CS 354 GPU Architecture


Mark Kilgard
University of Texas
March 6, 2012

CS 354 2

Today’s material

In-class quiz
Lecture topic
  Architecture of Graphics Processing Units (GPUs)
Course work
  Homework #4 due today
  Review textbook reading
    Chapters 5, 6, and 7
  Project #2 on texturing, shading, & lighting is coming
  Remember: Midterm in-class on March 8

CS 354 3

My Office Hours

Tuesday, before class
  Painter (PAI) 5.35
  8:45 a.m. to 9:15

Thursday, after class
  ACE 6.302
  11:00 a.m. to 12

Randy’s office hours
  Monday & Wednesday
  11 a.m. to 12:00
  Painter (PAI) 5.33

CS 354 4

Last time, this time

Last lecture, we discussed
  Programmable shading
  Graphics hardware shading languages

This lecture
  How do GPUs work?

CS 354 5

Daily Quiz

1. Pick the best choice: Shade trees are
   a) fractal trees with shadows
   b) OpenGL commands
   c) hierarchical arrangements of shading computations
   d) fractal patterns of all sorts

2. Name one general purpose programming language that GLSL borrows from.

3. Multiple choice: The GLSL standard has built-in data types for
   a) vectors
   b) matrices
   c) texture samplers
   d) floating-point values
   e) pointers to malloc’ed memory
   f) a through e
   g) a through d

On a sheet of paper
• Write your EID, name, and date
• Write #1, #2, #3, #4 followed by its answer

CS 354 6

Key Trend in OpenGL Evolution

Direct3D follows the same trend
Also reflects trend in GPU architecture
  API and hardware co-evolving

[Figure: trend from fixed-function to programmable. Stages labeled: simple configurability, complex configurability, shaders!, high-level languages]

CS 354 7

Programming Shaders inside GPU

Multiple programmable domains within the GPU
Can be programmed in high-level languages
  Cg, HLSL, or OpenGL Shading Language (GLSL)

[Figure: OpenGL 3.3 pipeline. Labels: 3D Application or Game, OpenGL API, CPU – GPU Boundary, GPU Front End, Vertex Assembly, Vertex Shader, Primitive Assembly, Geometry Program, Clipping, Setup, and Rasterization, Fragment Shader, Raster Operations, Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access, Memory Interface. Legend: programmable, fixed-function]

CS 354 8

Complex OpenGL Data Flow

CS 354 9

Six Years of GPU Architecture

Year | Product | New Features | OpenGL Version | Direct3D Version
2000 | GeForce 256 | Hardware transform & lighting, configurable fixed-point shading, cube maps, texture compression, anisotropic texture filtering | 1.3 | DX7
2001 | GeForce3 | Programmable vertex transformation, 4 texture units, dependent textures, 3D textures, shadow maps, multisampling, occlusion queries | 1.4 | DX8
2002 | GeForce4 Ti 4600 | Early Z culling, dual-monitor | 1.4 | DX8.1
2003 | GeForce FX | Vertex program branching, floating-point fragment programs, 16 texture units, limited floating-point textures, color and depth compression | 1.5 | DX9
2004 | GeForce 6800 Ultra | Vertex textures, structured fragment branching, non-power-of-two textures, generalized floating-point textures, floating-point texture filtering and blending | 2.0 | DX9c
2005 | GeForce 7800 GTX | Transparency antialiasing | 2.0 | DX9c

CS 354 10

GeForce Peak Vertex Processing Trends

[Chart: millions of vertices per second (axis 0 to 1,400) for GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX]
Vertex units: 1, 1, 2, 3, 6, 8
Rate for trivial 4x4 vertex transform
Exceeds peak setup rates: allows excess vertex processing

CS 354 11

GeForce Peak Memory Bandwidth Trends

[Chart: gigabytes per second (axis 0 to 200), from GeForce2 GTS to GeForce 7800 GTX. Series: raw bandwidth; effective raw bandwidth with compression; exponential trend lines for both. Annotations: 128-bit interface, 256-bit interface]

CS 354 12

Effective GPU Memory Bandwidth

Compression schemes
  Lossless depth and color (when multisampling) compression
  Lossy texture compression (S3TC / DXTC)
    Typically assumes 4:1 compression
Avoidance of useless work
  Early killing of fragments (Z cull)
  Avoiding useless blending and texture fetches
Very clever memory controller designs
  Combining memory accesses for improved coherency
  Caches for texture fetches

CS 354 13

GeForce Core and Memory Clock Rates

[Chart: megahertz (MHz), axis 0 to 1,400. Series: core clock, memory clock]
DDR memory transition: memory rates double physical clock rate

CS 354 14

GeForce Peak Triangle Setup Trends

[Chart: millions of triangles per second (axis 0 to 300), from GeForce2 GTS to GeForce 7800 GTX]
Assumes 50% face culling

CS 354 15

GeForce Peak Texture Fetch Trends

[Chart: millions of texture fetches per second (axis 0 to 12,000), from GeForce2 GTS to GeForce 7800 GTX]
Texture units: 2×4, 2×4, 2×4, 2×4, 16, 24
Assuming no texture cache misses

CS 354 16

GeForce Peak Depth/Stencil-only Fill

[Chart: millions of depth/stencil pixel updates per second (axis 0 to 18,000), from GeForce2 GTS to GeForce 7800 GTX]
Raster Op units: 4, 4, 4, 4+4, 16+16, 16+16
Assuming no read-modify-write
Double-speed depth-stencil only

CS 354 17

GeForce Transistor Count and Semiconductor Process

[Chart: millions of transistors (axis 0 to 450) for Riva ZX, Riva TNT2, GeForce2 GTS, GeForce3, GeForce4 Ti 4600, GeForce FX, GeForce 6800 Ultra, and GeForce 7800 GTX]
Process (µm), in chart order: 0.35, 0.22, 0.18, 0.18, 0.15, 0.13, 0.13, 0.11

CS 354 18

Hardware Unit | GeForce FX 5900 | GeForce 6800 Ultra | GeForce 7800 GTX
Vertex | 3 | 6 | 8
Fragment + 2nd Texture Fetch | 4+4 | 16 | 24
Raster Color + Raster Depth | 4+4 | 16+16 | 16+16

CS 354 19

GeForce 7800 GTX Board Details

256MB/256-bit DDR3 600 MHz
8 pieces of 8Mx32
16x PCI-Express
SLI Connector
DVI x 2
sVideo TV Out
Single slot cooling
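The board figures above are enough to check the memory capacity and estimate raw bandwidth. The sketch below is a back-of-the-envelope calculation, not a vendor specification: it assumes that 600 MHz is the physical memory clock, that DDR memory transfers data twice per clock (as the clock-rate chart notes), and that "8M" means 8 × 2^20 words. The function and variable names are made up for illustration.

```python
def raw_bandwidth_gb_per_s(bus_width_bits, clock_mhz, transfers_per_clock=2):
    """Peak raw memory bandwidth in GB/s (1 GB = 1e9 bytes).

    transfers_per_clock=2 models DDR memory (an assumption stated above).
    """
    bytes_per_transfer = bus_width_bits / 8
    transfers_per_second = clock_mhz * 1e6 * transfers_per_clock
    return bytes_per_transfer * transfers_per_second / 1e9

# 8 pieces of 8Mx32: eight chips, each 8M words of 32 bits.
chips, words_per_chip, bits_per_word = 8, 8 * 2**20, 32
capacity_mb = chips * words_per_chip * bits_per_word // 8 // 2**20
bus_width_bits = chips * bits_per_word

bandwidth = raw_bandwidth_gb_per_s(bus_width_bits, clock_mhz=600)
print(capacity_mb, "MB,", bus_width_bits, "bit bus,", round(bandwidth, 1), "GB/s")
```

Under these assumptions the eight chips account for both the 256MB capacity and the 256-bit interface, and the raw bandwidth comes to 38.4 GB/s.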

CS 354 20

GeForce 7800 GTX GPU Details

302 million transistors
430 MHz core clock
256-bit memory interface

Notable Functionality
• Non-power-of-two textures with mipmaps
• Floating-point (fp16) blending and filtering
• sRGB color space texture filtering and frame buffer blending
• Vertex textures
• 16x anisotropic texture filtering
• Dynamic vertex and fragment branching
• Double-rate depth/stencil-only rendering
• Early depth/stencil culling
• Transparency antialiasing
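The peak rates in the trend charts can be approximated as unit count times clock rate. The sketch below is an estimate only: it assumes each unit completes one operation per core clock, and it takes the unit counts (24 texture units, 16+16 raster op units) from the trend charts earlier in the deck. The names are hypothetical.

```python
CORE_CLOCK_MHZ = 430  # GeForce 7800 GTX core clock from this slide

def peak_rate_millions(units, clock_mhz=CORE_CLOCK_MHZ):
    """Peak operations per second, in millions, assuming one op per unit per clock."""
    return units * clock_mhz

texture_fetches = peak_rate_millions(24)           # 24 texture units
depth_stencil_updates = peak_rate_millions(16 + 16)  # double-speed depth/stencil-only

print(texture_fetches, "million texture fetches per second")
print(depth_stencil_updates, "million depth/stencil updates per second")
```

Both estimates (10,320 and 13,760 million per second) fall inside the axis ranges of the corresponding charts, but the charts themselves remain the authoritative numbers.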

CS 354 21

GeForce 7800 GTX Parallelism

[Diagram labels:]
8 Vertex Engines
Triangle Setup/Raster
Z-Cull
Shader Instruction Dispatch
24 Fragment Shaders
Fragment Crossbar
16 Raster Operation Pipelines
Memory Partition

CS 354 22

GeForce Graphics Pipeline

[Diagram labels: CPU, Vertex Engine, Setup, Raster, Fragment Shader, Texture, Raster Ops, Frame Buffer, Z Cull]

Separate dedicated units

CS 354 23

GeForce Graphics Pipeline: Vertex Engine

Vertex pulling

Vector floating-point instructions

Dynamic branching

Vertex texture

Vertex stream frequency

CS 354 24

GeForce Graphics Pipeline: Setup

Prepare triangle for rasterization

215M triangles/sec setup

CS 354 25

GeForce Graphics Pipeline: Raster

Compute coverage

Points, lines, and triangles

Rotated grid multisampling

CS 354 26

GeForce Graphics Pipeline: Z Cull

Discard fragments early based on Z

Up to 64 pixels/clock

Multisampled: 256 samples/clock

CS 354 27

GeForce Graphics Pipeline: Fragment Shader

User-programmed fragment coloring

Dynamic branching

Long shaders

Multiple render targets

fp16 and fp32 vectors

CS 354 28

GeForce Graphics Pipeline: Texture

fp16 and sRGB filtering

16x anisotropic filtering

Non-power-of-two mipmapping

Shadow maps, cube maps, and 3D

Floating-point textures

CS 354 29

GeForce Graphics Pipeline: Raster Ops

2x and 4x multisampling

fp16 and sRGB blending

Multiple render targets

Color and depth compression

Double-speed depth/stencil only

CS 354 30

Single GeForce 7800 Vertex Unit

[Diagram labels: Primitive Assembly + Attribute Processing, Vertex Texture Fetch, Texture Cache, FP32 Scalar Unit, FP32 Vector Unit, Branch Unit, Viewport Processing, To Setup]

Vertex Processing Engine
• MIMD Architecture
• Dual Issue
• Low-penalty branching
• Shader Model 3.0
• 32 vector registers
• 512 static instructions per program
• Indexed input and output registers

Vertex Texture Fetch
• Non-stalling
• Up to 4 texture units
• Unlimited fetches
• Mipmapping, no filtering

CS 354 31

Vertex Texturing Example

[Diagram: a flat tessellated mesh and a height field texture feed the vertex program, which outputs the displaced mesh]
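The displacement step can be illustrated on the CPU. The following is a minimal sketch in Python rather than a real vertex program: it assumes each vertex carries a position plus (u, v) texture coordinates, displaces along +Y, and uses a nearest-texel lookup because the vertex texture fetch described earlier does no filtering. All function names and the sample data are invented for illustration.

```python
def sample_height(height_field, u, v):
    """Nearest-texel lookup (no filtering) with u, v in [0, 1]."""
    rows, cols = len(height_field), len(height_field[0])
    x = min(int(u * cols), cols - 1)
    y = min(int(v * rows), rows - 1)
    return height_field[y][x]

def displace(vertices, height_field, scale=1.0):
    """Move each flat-mesh vertex (x, y, z, u, v) along +Y by the sampled height."""
    return [(x, y + scale * sample_height(height_field, u, v), z)
            for (x, y, z, u, v) in vertices]

# A 2x2 height field texture and a flat quad lying in the XZ plane.
height_field = [[0.0, 0.5],
                [1.0, 0.25]]
flat_mesh = [(0.0, 0.0, 0.0, 0.0, 0.0),
             (1.0, 0.0, 0.0, 0.9, 0.0),
             (0.0, 0.0, 1.0, 0.0, 0.9),
             (1.0, 0.0, 1.0, 0.9, 0.9)]

displaced_mesh = displace(flat_mesh, height_field, scale=2.0)
for vertex in displaced_mesh:
    print(vertex)
```

On a GPU the same per-vertex function runs in the vertex program, with the height field bound as a vertex texture.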

CS 354 32

Vertex Textures for Dynamic Displacement Mapping

Images used with permission from Pacific Fighters. © 2004 Developed by 1C:Maddox Games. All rights reserved. © 2004 Ubi Soft Entertainment.

Without Vertex Textures With Vertex Textures

CS 354 33

Vertex Textures to Drive Particle Systems

Render-to-texture
  Simulation runs in floating-point frame buffer, also usable as texture

Vertex textures
  Determines particle location with vertex texture fetch

CS 354 34

Single GeForce 7800 Fragment Shader Pipeline

[Diagram labels: Input Fragment Data, Texture Processor, Texture Cache, Texture Data, FP32 Shader Unit 1, FP32 Shader Unit 2, Branch Processor, Fixed-function Fog Unit, Output Shaded Fragments]

Texture Processor
• 16 texture units
• 1 texture fetch at full speed
• Bilinear or tri-linear filtering
• 16x anisotropic filtering
• Floating-point (fp16) texture filtering

Shader Unit 1
• 4 MULs + RCP
• Dual Issue
• Texture address calculation
• Fast fp16 normalize
• Free: negate, abs, condition codes

Shader Unit 2
• 4 MADs or DP4
• Dual Issue
• Free: negate, abs, condition codes

CS 354 35

Operations Per Fragment Shader Pass

Shader Unit 1: 4 components, 1 op / component, 4 ops / fragment per pass
  or
Texture: 1 texture / fragment at full speed per pass
Shader Unit 2: 4 components, 1 op / component, 4 ops / fragment per pass

8 operations / fragment per pass
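The per-pass figures can be scaled up to a chip-wide estimate. This sketch assumes one pass per fragment pipeline per core clock and uses the 24 fragment shaders and 430 MHz core clock quoted for the GeForce 7800 GTX elsewhere in the deck; it is an upper bound under those assumptions, not a measured rate.

```python
FRAGMENT_PIPELINES = 24   # GeForce 7800 GTX fragment shaders
CORE_CLOCK_MHZ = 430      # core clock in MHz
SHADER_UNITS = 2          # Shader Unit 1 and Shader Unit 2
OPS_PER_UNIT = 4          # 4 components x 1 op per component

ops_per_fragment_per_pass = SHADER_UNITS * OPS_PER_UNIT

# Millions of component operations per second across the whole chip,
# assuming every pipeline completes one pass per clock.
peak_mops = FRAGMENT_PIPELINES * ops_per_fragment_per_pass * CORE_CLOCK_MHZ

print(ops_per_fragment_per_pass, "ops per fragment per pass")
print(peak_mops / 1000, "billion component ops per second")
```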

CS 354 36

Fragment Shader Component Co-issue

Use 4 components various ways
  RGBA all together
  RGB and A
  RG and BA
Both shader units
  Two operations per shader unit

[Diagram: Shader Unit 1 and Shader Unit 2 each split R G B A across two operations, Operations 1 through 4 in total]

CS 354 37

Single GeForce 7800 Raster Operations Pipeline

[Diagram labels: Input Shaded Fragment Data, Pixel Crossbar Interconnect, Multisample Antialiasing, Depth Compression, Depth Raster Operations, Color Compression, Color Raster Operations, Frame Buffer Partition, Memory]

Functionality
• OpenEXR floating-point blending
• sRGB blending
• 4x rotated grid multisampling
• Lossless color and depth compression
• Multiple render targets

CS 354 38

GeForce 7800 Transparency Antialiasing

Conventional 4x antialiasing with alpha tested context

Transparency antialiasing with alpha tested context

CS 354 39

Scalable Link Interface (SLI)

Gang two GeForce 6600, 6800, or 7800 graphics boards together
  Can almost double your performance

[Photo: two 6800 Ultras pictured, with the SLI Connector labeled]

CS 354 40

SLI Rendering Modes

Split Frame Rendering (SFR)
  One GPU renders top of screen; other renders the bottom
  Scales fragment processing but not vertex processing

Alternate Frame Rendering (AFR)
  Scales both vertex and fragment processing
  Adds frame latency
  Rendering must be free of CPU synchronization

SLI Antialiasing: SLI8x and SLI16x
  Better antialiasing quality rather than performance
  Each card renders with slightly different sub-pixel offset
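The difference between the first two modes comes down to how work is assigned to GPUs. The sketch below shows one simple assignment policy for each mode; it is illustrative only (a real driver may move the split line to balance load), and the function names are hypothetical.

```python
def sfr_scanlines(height, num_gpus=2):
    """Split Frame Rendering: each GPU gets a contiguous band of scanlines, top to bottom."""
    band = -(-height // num_gpus)  # ceiling division
    return [range(g * band, min((g + 1) * band, height)) for g in range(num_gpus)]

def afr_gpu_for_frame(frame_index, num_gpus=2):
    """Alternate Frame Rendering: whole frames are dealt out round-robin."""
    return frame_index % num_gpus

bands = sfr_scanlines(1080)
frame_owners = [afr_gpu_for_frame(f) for f in range(4)]

print("SFR bands:", bands)          # every GPU still processes all vertices
print("AFR frame owners:", frame_owners)
```

In SFR both GPUs must transform the full scene to find out which triangles land in their band, which is why only fragment processing scales. In AFR each GPU handles a complete frame, so both vertex and fragment work scale, at the cost of an extra frame in flight.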

CS 354 41

PC Graphics Hardware Evolution

[Timeline: 1997, 2000, 2005, 2010]
  RIVA 128: 3M transistors
  GeForce® 256: 23M transistors
  GeForce 3: 60M transistors
  GeForce FX: 125M transistors
  GeForce 8800: 681M transistors
  GeForce 580 GTX: 3B transistors

Viable economics: 650 million GeForce GPUs since 1999

1,000x complexity since 1995

Moore’s Law at work
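The transistor counts on the timeline give a rough doubling period. This sketch uses the two endpoints on the slide (RIVA 128 at 3M transistors, GeForce 580 GTX at 3B) and assumes they sit at the timeline's 1997 and 2010 marks; the helper name is made up for illustration.

```python
import math

def doubling_period_years(count_start, count_end, year_start, year_end):
    """Average time for the transistor count to double between two data points."""
    doublings = math.log2(count_end / count_start)
    return (year_end - year_start) / doublings

period = doubling_period_years(3e6, 3e9, 1997, 2010)
print(f"1,000x growth is about {math.log2(1000):.1f} doublings")
print(f"doubling period: about {period:.2f} years")
```

A 1,000x increase is roughly ten doublings, so over thirteen years the count doubles about every 1.3 years under these assumptions.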

CS 354 42

Current High-end “Fermi” GPU

Current high-end graphics card
  512 graphics “cores”
  1.5 GB memory
  System power: 600W
  OpenGL 4.2 / DirectX 11 functionality

CS 354 43

High-level “Fermi” Architecture

GF100
  Four Graphics Processor Clusters (GPCs)
    Each is self-contained graphics pipeline
    Smaller chips have fewer GPCs
  Shared L2 cache
  6 Memory Controllers
    1.5 GB

CS 354 44

Inside Each Graphics Processing Cluster

Raster engine
Four SMs
  Streaming Multiprocessor
  Texture fetch resources
Tessellation and vertex processing resources
  Polymorph Engine

CS 354 45

Streaming Multiprocessor (SM)

Multi-processor execution unit
  32 scalar processor cores
  Warp is a unit of thread execution of up to 32 threads

Two workloads
  Graphics
    Vertex shader
    Tessellation
    Geometry shader
    Fragment shader
  Compute
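The core count and the warp size quoted in these slides lead to two quick calculations. The sketch below multiplies out the hierarchy (four GPCs, four SMs each, 32 cores per SM) and shows how a thread count rounds up to whole warps; the helper names are hypothetical, and the idle-lane figure assumes a partially filled warp still occupies a full warp slot.

```python
GPCS = 4            # Graphics Processor Clusters per chip
SMS_PER_GPC = 4     # Streaming Multiprocessors per GPC
CORES_PER_SM = 32   # scalar processor cores per SM
WARP_SIZE = 32      # up to 32 threads per warp

total_cores = GPCS * SMS_PER_GPC * CORES_PER_SM

def warps_needed(num_threads, warp_size=WARP_SIZE):
    """Number of warps required to hold num_threads (ceiling division)."""
    return -(-num_threads // warp_size)

def idle_lanes(num_threads, warp_size=WARP_SIZE):
    """Thread slots left unused in the last, partially filled warp."""
    return warps_needed(num_threads, warp_size) * warp_size - num_threads

print(total_cores, "cores in total")
print(warps_needed(100), "warps for 100 threads, with", idle_lanes(100), "idle lanes")
```

The product matches the 512 graphics "cores" quoted for the high-end Fermi card.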

CS 354 46

OpenGL Pipeline Programmable Domains run on Unified Hardware

Unified Streaming Processor Array (SPA) architecture means same capabilities for all domains
  Plus tessellation + compute (not shown below)

[Diagram labels: GPU Front End, Vertex Assembly, Vertex Program, Primitive Assembly, Primitive Program, Clipping, Setup, and Rasterization, Fragment Program, Raster Operations, Attribute Fetch, Parameter Buffer Read, Texture Fetch, Framebuffer Access, Memory Interface. Callout: Can be unified hardware!]

CS 354 47

Dual Warp Scheduling

32 threads launch!

CS 354 48

Shader or CUDA Core, Same Unit but Two Personalities

Execution unit
  Scalar floating-point
  Scalar integer

CS 354 49

Levels of Caching in Fermi GPU

12 KB L1 Texture cache
  Per texture unit

SM 64 K cache
  Split into dedicated 16K or 48K Load/Store cache
  Shared memory 48K or 16K

L2 unifies texture cache, raster operation cache, and internal buffering in prior generation
  768 K
  Read / write
  Fully coherent

CS 354 50

Cache Use Strategies in Fermi GPU

Pipeline stages can communicate efficiently through GPU’s L1 and L2 caches
  Buffering between stages stays all on chip
  Only vertex, texel, and pixel read/writes need to go to DRAM

CS 354 51

Vertex and Tessellation Processing Tasks

Fixed-function graphics engines
  Pull attributes and assemble vertex
  Manage tessellation control and domain shader evaluation
  Viewport transform
  Attribute setup of plane equations for rasterization
  Stream out vertices into buffers

CS 354 52

Rasterization Tasks

Turns primitives into fragments
  Computes edge equations
  Two-stage rasterization
    Coarse raster finds tiles the primitive could be in
    Fine raster evaluates sample positions within tiles
  Zcull efficiently eliminates occluded fragments
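The two-stage idea can be sketched in software. The code below is a simplified illustration, not how the hardware is implemented: the coarse stage conservatively accepts every tile that overlaps the triangle's bounding box, the fine stage evaluates three edge equations at each pixel center, and there is no fill rule for samples exactly on shared edges. The 4-pixel tile size and all names are assumptions.

```python
import math

def edge(a, b, p):
    """Edge equation: >= 0 when p is on or to the left of the directed edge a -> b."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def covers(tri, p):
    """True if all three edge equations are non-negative (assumes that vertex order)."""
    v0, v1, v2 = tri
    return edge(v0, v1, p) >= 0 and edge(v1, v2, p) >= 0 and edge(v2, v0, p) >= 0

def coarse_raster(tri, width, height, tile=4):
    """Conservatively list tiles the triangle could be in (bounding-box overlap)."""
    xs, ys = [v[0] for v in tri], [v[1] for v in tri]
    tx0 = max(math.floor(min(xs)) // tile, 0)
    ty0 = max(math.floor(min(ys)) // tile, 0)
    tx1 = min(math.floor(max(xs)) // tile, (width - 1) // tile)
    ty1 = min(math.floor(max(ys)) // tile, (height - 1) // tile)
    return [(tx, ty) for ty in range(ty0, ty1 + 1) for tx in range(tx0, tx1 + 1)]

def fine_raster(tri, tiles, width, height, tile=4):
    """Test the pixel-center sample of every pixel inside the candidate tiles."""
    fragments = []
    for tx, ty in tiles:
        for y in range(ty * tile, min((ty + 1) * tile, height)):
            for x in range(tx * tile, min((tx + 1) * tile, width)):
                if covers(tri, (x + 0.5, y + 0.5)):
                    fragments.append((x, y))
    return fragments

triangle = [(0, 0), (8, 0), (0, 8)]
tiles = coarse_raster(triangle, width=8, height=8)
fragments = fine_raster(triangle, tiles, width=8, height=8)
print(len(tiles), "candidate tiles,", len(fragments), "fragments")
```

In this example the coarse stage accepts four tiles, but the fine stage finds no covered samples in tile (1, 1): the coarse test only says where the primitive could be, and the fine test decides actual coverage.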

CS 354 54


Apply Phong Tessellation

CS 354 55

Apply Displacement Mapping


Add Displacement Mapping

CS 354 56

GPUs as Compute Nodes

Architecture of GPU has evolved into a high-performance, high-bandwidth compute node

Workstations: 2 to 4 Tesla GPUs
Integrated CPU-GPU Servers & Blades
OEM CPU Server + Compute 1U
Small form factor Compute

CS 354 57

Compute Programming Model

Cooperative Thread Array (CTA)
  Single Program, Multiple Data
  Organized in 1D, 2D, or 3D

Programming APIs
  CUDA, OpenCL, DirectCompute
  APIs + language = parallel processing system

OpenGL or Direct3D through shaders
  Cg, HLSL, GLSL
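The single-program, multiple-data model can be mimicked with a serial loop. The sketch below is a toy emulation in Python, not the CUDA, OpenCL, or DirectCompute API: `launch` and `scale_kernel` are invented names, and the threads run one after another here whereas a GPU runs them in parallel. It shows the key idea that every thread runs the same program and uses its CTA and thread IDs to pick its data.

```python
from itertools import product

def launch(kernel, grid_dim, cta_dim, *args):
    """Run `kernel` once per thread of every CTA (serially, for illustration)."""
    for cta_id in product(*(range(n) for n in grid_dim)):
        for thread_id in product(*(range(n) for n in cta_dim)):
            kernel(cta_id, thread_id, cta_dim, *args)

def scale_kernel(cta_id, thread_id, cta_dim, data, factor):
    """Same program for every thread; the IDs select which element to process."""
    i = cta_id[0] * cta_dim[0] + thread_id[0]  # 1D global thread index
    if i < len(data):                          # surplus threads do nothing
        data[i] *= factor

data = [1.0, 2.0, 3.0, 4.0, 5.0]
launch(scale_kernel, (2,), (4,), data, 10.0)   # 2 CTAs of 4 threads each
print(data)
```

Two CTAs of four threads give eight threads for five elements, so the bounds check keeps the last three threads idle. Passing 2D or 3D tuples for `grid_dim` and `cta_dim` enumerates the higher-dimensional organizations in the same way.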

CS 354 58

Now in World’s Fastest Supercomputers

Tianhe-1A

2.507 Petaflop

7168 Tesla M2050 GPUs

National Supercomputing Center in Tianjin

CS 354 59

Opposite direction: Consumer mobile devices

CS 354 60

Low-power Mobile System on a Chip (SoC)

Complete system on a chip
  4 ARM cores
  Integrated graphics
    OpenGL ES 2.0
  Power <1W

CS 354 61

Mid-term Next Class

Mid-term
  Similar in format to the homeworks
  15% of your final grade
  Arrive on time
  Open textbook. Open notes, including lecture slides.
  Calculators allowed/encouraged.
  No smart phones, no computers, no Internet access.
  Show your work to justify your answer and provide a basis for partial credit.

What to study
  All material in lecture slides
  Review in-class quiz questions
  Study homeworks
  Responsible for textbook readings

Have a relaxing spring break
  Next lecture: Shadows
  Come back to Project 2
