Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader...

49
1 Winter 2011 Beyond Programmable Shading Introduction to Parallel Programming For Real-Time Graphics (CPU + GPU) Aaron Lefohn, Intel / University of Washington Mike Houston, AMD / Stanford

Transcript of Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader...

Page 1: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

1Winter 2011 – Beyond Programmable Shading

Introduction to Parallel ProgrammingFor Real-Time Graphics

(CPU + GPU)Aaron Lefohn, Intel / University of Washington

Mike Houston, AMD / Stanford

Page 2: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

2Winter 2011 – Beyond Programmable Shading

What’s In This Talk?

• Overview of parallel programming models used in real-time graphics products and research

– Abstraction, execution, synchronization

– Shaders, task systems, conventional threads, graphics pipeline, “GPU” compute languages

• Parallel programming models

– Vertex shaders

– Conventional thread programming

– Task systems

– Graphics pipeline

– GPU compute languages (ComputeShader, OpenCL, CUDA)

• Discuss

– Strengths/weaknesses of different models

– How each model maps to the architectures

Page 3: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

3Winter 2011 – Beyond Programmable Shading

What Goes into a Game Frame? (2 years ago)

Page 4: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

4Winter 2011 – Beyond Programmable Shading

Computation graph for Battlefied: Bad Company provided by DICE

Page 5: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

5Winter 2011 – Beyond Programmable Shading

Data Parallelism

Page 6: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

6Winter 2011 – Beyond Programmable Shading

Task Parallelism

Page 7: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

7Winter 2011 – Beyond Programmable Shading

Graphics Pipelines

PipelineFlow

Input Assembly

Vertex Shading

Primitive Setup

Geometry Shading

Rasterization

Pixel Shading

Output Merging

Page 8: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

8Winter 2011 – Beyond Programmable Shading

Remember: “Our Enthusiast Chip”

Figure by Kayvon Fatahalian

Page 9: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

9Winter 2011 – Beyond Programmable Shading

Hardware Resources (from Kayvon’s Talk)

• Core

• Execution Context

• SIMD functional units

• On-chip memory

Figure by Kayvon Fatahalian

Page 10: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

10Winter 2011 – Beyond Programmable Shading

Abstraction

• Abstraction enables portability and system optimization

– E.g., dynamic load balancing, producer-consumer, SIMD utilization

• Lack of abstraction enables arch-specific user optimization

– E.g., multiple execution contexts jointly building on-chip data structure

• Remember:

– When a parallel programming model abstracts a HW resource, code written in that

programming model scales across architectures with varying amounts of that resource

Page 11: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

11Winter 2011 – Beyond Programmable Shading

Definitions: Execution

• Task– A logically related set of instructions executed in a single execution context

(aka shader, instance of a kernel, task)

• Concurrent execution

– Multiple tasks that may execute simultaneously

(because they are logically independent)

• Parallel execution

– Multiple tasks whose execution contexts are guaranteed to be live simultaneously

(because you want them to be for locality, synchronization, etc)

Page 12: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

12Winter 2011 – Beyond Programmable Shading

Synchronization

• Synchronization

– Restricting when tasks are permitted to execute

• Granularity of permitted synchronization determines at which granularity system allows user to control scheduling

Page 13: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

13Winter 2011 – Beyond Programmable Shading

Vertex Shaders: “Pure Data Parallelism”

• Execution

– Concurrent execution of identical per-vertex tasks

• What is abstracted?

– Cores, execution contexts, SIMD functional units, memory hierarchy

• What synchronization is allowed?

– Between draw calls

Page 14: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

14Winter 2011 – Beyond Programmable Shading

Shader (Data-parallel) Pseudocode

concurrent_for( i = 0 to numVertices - 1)

{

… execute vertex shader …

}

• SPMD = Single Program Multiple Data

– This type of programming is sometimes called SPMD

– Instance the same program multiple times and run on different data

– Many names: shader-style, kernel-style, SPMD

Page 15: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

15Winter 2011 – Beyond Programmable Shading

Conventional Thread Parallelism

(e.g., pthreads)

• Execution

– Parallel execution of N tasks with N execution contexts

• What is abstracted?

– Nothing (ignoring preemption)

• Where is synchronization allowed?

– Between any execution context at various granularities

Page 16: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

16Winter 2011 – Beyond Programmable Shading

Conventional Thread Parallelism

• Directly program:

– N execution contexts

– N SIMD ALUs / execution context

– …

• To use SIMD ALUs:

__m128 a_line, b_line, r_line;

r_line = _mm_mul_ps(a_line, b_line);

• Powerful, but dangerous...

Figure by Kayvon Fatahalian

Page 17: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

17Winter 2011 – Beyond Programmable Shading

Game Workload Example

Page 18: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

18Winter 2011 – Beyond Programmable Shading

Typical Game Workload

• Subsystems given % of overall time “budget”

• Input, Miscellaneous: 5%

• Physics: 30%

• AI, Game Logic: 10%

• Graphics: 50%

• Audio: 5%

• GPU Workload:

I AAIPhysics Graphics

“Rendering”

Slide by Tim Foley

Page 19: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

19Winter 2011 – Beyond Programmable Shading

• Assign each subsystems to a SW thread

• Problems

– Communication/synchronization

– Load imbalance

– Preemption leads to thrashing

• Don’t do this!

thread 2

thread 3

Parallelism Anti-Pattern #1

I

AI

Physics

Graphics

I

AI

Physics

Graphics

thread 0

thread 1

frame N

Slide by Tim Foley

Page 20: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

20Winter 2011 – Beyond Programmable Shading

Parallelism Anti-Pattern #2

• Group subsystems into HW threads

• Problems

– Communication/synchronization

– Load imbalance

– Poor scalability (4, 8, … HW threads)

• Don’t do this either!

I AAIPhysics

Graphics

I AAIPhysics

Graphics

thread 0

thread 1

frame N

Slide by Tim Foley

Page 21: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

21Winter 2011 – Beyond Programmable Shading

Better Solution: Find Concurrency…

• Identify where ordering constraints are needed and run concurrently between constraints

• Visualize as a graph

I AAIP P P P P G G G G G G G G G G

I

A

AI

P

P

P

P

P

G

G

G

G

G

G

G

G

G

G

Slide by Tim Foley

Page 22: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

22Winter 2011 – Beyond Programmable Shading

…And Distribute Work to Threads

• Dynamically distribute medium-grained concurrent tasks to hardware threads

• (Virtualize/abstract the threads”

thread 3

thread 2

thread 1

thread 0 A A A P P GG

AA A A P P G GG

AA PP G GG

P P PP PPP G G G G

Slide by Tim Foley

Page 23: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

23Winter 2011 – Beyond Programmable Shading

“Task Systems” (Cilk, TBB, ConcRT, GCD, …)

• Execution

– Concurrent execution of many (likely different) tasks

• What is abstracted?

– Cores and execution contexts

– Does not abstract: SIMD functional units or memory hierarchy

• Where is synchronization allowed?

– Between tasks

Page 24: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

24Winter 2011 – Beyond Programmable Shading

Mental Model: Task Systems

• Think of task as asynchronous function call

– “Do F() at some point in the future…”

– Optionally “… after G() is done”

• Can be implemented in HW or SW

– Launching/spawning task is nearly as fast as function call

– Usage model: “Spawn 1000s of tasks and let scheduler map tasks to execution contexts”

• Usually cooperative, not preemptive

– Task runs to completion – no blocking

– Can create new tasks, but not wait

F()

G()

Slide by Tim Foley

Page 25: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

25Winter 2011 – Beyond Programmable Shading

Task Parallel Code (Cilk)

void myTask(…some arguments…)

{

}

void main()

{

for( i = 0 to NumTasks - 1 )

{

cilk_spawn myTask(…);

}

cilk_sync;

}

Page 26: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

26Winter 2011 – Beyond Programmable Shading

Task Parallel Code (Cilk)

void myTask(…some arguments…)

{

}

void main()

{

cilk_for( i = 0 to NumTasks - 1 )

{

myTask(…);

}

cilk_sync;

}

Page 27: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

27Winter 2011 – Beyond Programmable Shading

Nested Task Parallel Code (Cilk)void barTask(…some parameters…)

{

}

void fooTask(…some parameters…)

{

if (someCondition) {cilk_spawn barTask(…);

}else {

cilk_spawn fooTask(…);}

// Implicit cilk_sync at end of function

}

void main()

{

cilk_for( i = 0 to NumTasks - 1 ) {

fooTask(…);

}

cilk_sync;

… More code …

}

Page 28: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

28Winter 2011 – Beyond Programmable Shading

“Task Systems” Review

• Execution

– Concurrent execution of many (likely different) tasks

• What is abstracted?

– Cores and execution contexts

– Does not abstract: SIMD functional units or memory hierarchy

• Where is synchronization allowed?

– Between tasks

Figure by Kayvon Fatahalian

Page 29: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

29Winter 2011 – Beyond Programmable Shading

DirectX/OpenGL Rendering Pipeline

(Combination of multiple models)

• Execution

– Data-parallel concurrent execution of identical task within each shading stage

– Task-parallel concurrent execution of different shading stages

– No parallelism exposed to user

• What is abstracted?

– (just about everything)

– Cores, execution contexts, SIMD functional units, memory hierarchy, and fixed-function graphics units (tessellator, rasterizer, ROPs, etc)

• Where is synchronization allowed?

– Between draw calls

Page 30: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

30Winter 2011 – Beyond Programmable Shading

GPU Compute Languages

(Combination of Multiple Models)• DX11 DirectCompute

• OpenCL

• CUDA

• There are multiple possible usage models. We’ll start with the “text

book” hierarchical data-parallel usage model

Page 31: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

31Winter 2011 – Beyond Programmable Shading

GPU Compute Languages

• Execution

– Hierarchical model

– Lower level is parallel execution of identical tasks (work-items) within work-group

– Upper level is concurrent execution of identical work-groups

• What is abstracted?

– Work-group abstracts a core’s execution contexts, SIMD functional units

– Set of work-groups abstracts cores

– Does not abstract core-local memory

• Where is synchronization allowed?

– Between work-items in a work-group

– Between “passes” (set of work-groups)

Page 32: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

32Winter 2011 – Beyond Programmable Shading

GPU Compute Pseudocode

void myWorkGroup()

{

parallel_for(i = 0 to NumWorkItems - 1)

{

… GPU Kernel Code … (This is where you write GPU compute code)

}

}

void main()

{

concurrent_for( i = 0 to NumWorkGroups - 1)

{

myWorkGroup();

}

sync;

}

Page 33: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

33Winter 2011 – Beyond Programmable Shading

DX CS/OCL/CUDA Execution Model

• Fundamental unit is work-item

– Single instance of “kernel” program (i.e., “task” using the definitions in this talk)

– Each work-item executes in single SIMD lane

• Work items collected in work-groups

– Work-group scheduled on single core

– Work-items in a work-group

– Execute in parallel

– Can share R/W on-chip scratchpad memory

– Can wait for other work-items in work-group

• Users launch a grid of work-groups

– Spawn many concurrent work-groups

void f(...) {

int x = ...;

...;

...;

if(...) {

...

}

}

Figure by Tim Foley

Page 34: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

34Winter 2011 – Beyond Programmable Shading

GPU Compute Models

…barrier barrier

Work-group Work-group

Slide by Tim Foley

Page 35: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

35Winter 2011 – Beyond Programmable Shading

GPU Compute Use Cases

• 1:1 Mapping

• Simple Fork/Join

• Switching Axes of Parallelism

IncreasingSophistication

Page 36: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

36Winter 2011 – Beyond Programmable Shading

1:1 Mapping

• One work item per ray / per pixel / per matrix element

• Every work item executes the same kernel

• Often first, most obvious solution to a problem

• “Pure data parallelism”

void saxpy( int i,float a,const float* x,const float* y,float* result )

{result[i] = a * x[i] + y[ i ];

}

Slide by Tim Foley

Page 37: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

37Winter 2011 – Beyond Programmable Shading

Simple Fork/Join

• Some code must run at work-group granularity

– Example: work items cooperate to compute output structure size

– Atomic operation to allocate output must execute once

• Idiomatic solution

– Barrier, then make work item #0 do the group-wide operation

void subdividePolygon(...){

shared int numOutputPolygons = 0;

// in parallel, every work item doesatomic_add( numOutputPolygons, 1);barrier();

Polygon* output = NULL;if( workItemID == 0 ) {

output = allocateMemory( numOutputPolygons );}barrier();...

}

Slide by Tim Foley

Page 38: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

38Winter 2011 – Beyond Programmable Shading

Multiple Axes of Parallelism

• Deferred rendering with DX11 Compute Shader

– Example from Johan Andersson (DICE)

– 1000+ dynamic lights

• Multiple phases of execution

– Work group responsible for a screen-space tile

– Each phase exploits work items differently:

– Phase 1: pixel-parallel computation of tile depth bounds

– Phase 2: light-parallel test for intersection with tile

– Phase 3: pixel-parallel accumulation of lighting

• Exploits producer-consumer locality between phases

Slide by Tim Foley

Picture by Johan Andersson

Page 39: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

39Winter 2011 – Beyond Programmable Shading

Terminology Decoder Ring

Direct Compute

CUDA OpenCLPthreads+

SSEThis talk

thread thread work-item SIMD lane work-item

- warp - threadexecution context

threadgroup threadblock Work-group - work-group

-streaming

multiprocessor

compute unit

core core

- grid N-D range -Set of work-

groups

Slide by Tim Foley

Page 40: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

40Winter 2011 – Beyond Programmable Shading

When Use GPU Compute vs Pixel Shader?

• Use GPU compute language if your algorithm needs on-chip memory

– Reduce bandwidth by building local data structures

• Otherwise, use pixel shader

– All mapping, decomposition, and scheduling decisions automatic

– (Easier to reach peak performance)

Page 41: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

41Winter 2011 – Beyond Programmable Shading

GPU Compute Languages Review

• “Write code from within two nested concurrent/parallel loops”

• Abstracts

– Cores, execution contexts, and SIMD ALUs

• Exposes

– Parallel execution contexts on same core

– Fast R/W on-core memory shared by the execution contexts on same core

• Synchronization

– Fine grain: between execution contexts on same core

– Very coarse: between large sets of concurrent work

– No medium-grain synchronization “between function calls” like task systems provide

Page 42: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

42Winter 2011 – Beyond Programmable Shading

Conventional Thread Parallelism on GPUs

• Also called “persistent threads”

• “Expert” usage model for GPU compute

– Defeat abstractions over cores, execution contexts, and SIMD functional units

– Defeat system scheduler, load balancing, etc.

– Code not portable between architectures

Page 43: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

43Winter 2011 – Beyond Programmable Shading

• Execution

– Two-level parallel execution model

– Lower level: parallel execution of M identical tasks on M-wide SIMD

functional unit

– Higher level: parallel execution of N different tasks on N execution contexts

• What is abstracted?

– Nothing (other than automatic mapping to SIMD lanes)

• Where is synchronization allowed?

– Lower-level: between any task running on same SIMD functional unit

– Higher-level: between any execution context

Conventional Thread Parallelism on GPUs

Page 44: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

44Winter 2011 – Beyond Programmable Shading

Why Persistent Threads?

• Enable alternate programming models that require different scheduling and synchronization rules than the default model provides

• Example alternate programming models

– Task systems (esp. nested task parallelism)

– Producer-consumer rendering pipelines

– (See references at end of this slide deck for more details)

Page 45: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

45Winter 2011 – Beyond Programmable Shading

Summary of Concepts

• Abstraction

– When a parallel programming model abstracts a HW resource, code written in that

programming model scales across architectures with varying amounts of that resource

• Execution

– Concurrency versus parallelism

• Synchronization

– Where is user allowed to control scheduling?

Page 46: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

46Winter 2011 – Beyond Programmable Shading

“Ideal Parallel Programming Model”

• Combine the best of CPU and GPU programming models

– Task systems are great for scheduling (from CPUs)

– “Asynchronous function call” is easy to understand and use

– Great load balancing and scalability (with cores, execution contexts)

– SPMD programming is great for utilizing SIMD (from GPUs)

– “Write sequential code that is instanced N times across N-wide SIMD”

– Intuitive: only slightly different from sequential programming

• Why not just “launch tasks that run fine-grain SPMD code?”

– The future on CPU and GPU?

Page 47: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

47Winter 2011 – Beyond Programmable Shading

Conclusions

• Task-, data- and pipeline-parallelism

– Three proven approaches to scalability

– Plentiful of concurrency with little exposed parallelism

– Applicable to many problems in visual computing

• Current real-time rendering programming uses a mix of data-, task-, and pipeline-parallel programming (and conventional threads as means to an end)

• Current GPU compute models designed for data-parallelism but can be abused to implement all of these other models

Page 48: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

48Winter 2011 – Beyond Programmable Shading

References

• GPU-inspired compute languages

– DX11 DirectCompute, OpenCL (CPU+GPU+…), CUDA

• Task systems (CPU and CPU+GPU+…)

– Cilk, Thread Building Blocks (TBB), Grand Central Dispatch (GCD), ConcRT, Task Parallel Library, OpenCL (limited in 1.0)

• Conventional CPU thread programming

– Pthreads

• GPU task systems and “persistent threads” (i.e., conventional thread programming on GPU)

– Aila et al, “Understanding the Efficiency of Ray Traversal on GPUs,” High Performance Graphics 2009

– Tzeng et al, “Task Management for Irregular-Parallel Workloads on the GPU,” High Performance Graphics 2010

– Parker et al, “OptiX: A General Purpose Ray Tracing Engine,” SIGGRAPH 2010

• Additional input (concepts, terminology, patterns, etc)

– Foley, “Parallel Programming for Graphics,”

– Beyond Programmable Shading SIGGRAPH 2009

– Beyond Programmable Shading CS448s Stanford course

– Fatahalian, “Running Code at a Teraflop: How a GPU Shader Core Works,” Beyond Programmable Shading SIGGRAPH 2009-2010

– Keutzer et al, “A Design Pattern Language for Engineering (Parallel) Software: Merging the PLPP and OPL projects, “ ParaPLoP2010

Page 49: Introduction to Parallel Programming for Real-Time Graphics … · 2011-02-02 · Shader (Data-parallel) Pseudocode concurrent_for( i = 0 to numVertices - 1) {… execute vertex shader

49Winter 2011 – Beyond Programmable Shading

Questions?

http://www.cs.washington.edu/education/courses/cse558/11wi/