Many-Core Programming with GRAMPS
Jeremy Sugerman, Stanford University
September 12, 2008
Background, Outline
Stanford Graphics / Architecture Research
– Collaborators: Kayvon Fatahalian, Solomon Boulos, Kurt Akeley, Pat Hanrahan
To appear in ACM Transactions on Graphics
CPU, GPU trends… and collision?
Two research areas:
– HW/SW Interface, Programming Model
– Future Graphics API
Problem Statement
Drive efficient development and execution in many-/multi-core systems.
Support homogeneous, heterogeneous cores. Inform future hardware.
Status Quo:
– GPU Pipeline (good for GL, otherwise hard)
– CPU (no guidance, fast is hard)
GRAMPS
Software-defined graphs
– Producer-consumer, data-parallelism
– Initial focus on rendering

[Figure: two example GRAMPS graphs. Rasterization Pipeline: Input → Rasterize → Shade → FB Blend, connected by a Fragment Queue and an Output Fragment Queue. Ray Tracing Graph: Camera → Intersect → Shade → FB Blend, connected by a Ray Queue, a Ray Hit Queue, and a Fragment Queue. Legend: Thread Stage, Shader Stage, Fixed-func Stage, Queue, Stage Output.]
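To make the "software-defined graph" idea concrete, here is a minimal mock in Python of the ray tracing graph above: stages communicate only through queues, so the graph is just stages plus the queues connecting them. All names and the fake hit test are invented for illustration; this is not the GRAMPS API.

```python
from collections import deque

def camera(ray_queue, n_rays):
    # Thread-style producer stage: generates ray packets.
    for i in range(n_rays):
        ray_queue.append(("ray", i))

def intersect(ray_queue, hit_queue):
    # Shader-style stage: one packet in, one packet out.
    while ray_queue:
        _, i = ray_queue.popleft()
        hit_queue.append(("hit", i, i % 2 == 0))  # placeholder hit test

def shade(hit_queue, fragment_queue):
    # Only rays that hit something produce fragments.
    while hit_queue:
        _, i, hit = hit_queue.popleft()
        if hit:
            fragment_queue.append(("frag", i))

# Wire the graph: Camera -> Intersect -> Shade, queue by queue.
ray_q, hit_q, frag_q = deque(), deque(), deque()
camera(ray_q, 8)
intersect(ray_q, hit_q)
shade(hit_q, frag_q)
print(len(frag_q))  # 4: rays 0, 2, 4, 6 "hit"
```

A real runtime would run the stages concurrently and schedule them off queue occupancy; running them to completion in sequence keeps the sketch deterministic.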
As a Graphics Evolution
Not (too) radical for ‘graphics’
Like fixed → programmable shading
– Pipeline undergoing massive shake-up
– Diversity of new parameters and use cases
Bigger picture than ‘graphics’
– Rendering is more than GL/D3D
– Compute is more than rendering
– Some ‘GPUs’ are losing their innate pipeline
As a Compute Evolution (1)
Sounds like streaming: execution graphs, kernels, data-parallelism
Streaming: “squeeze out every FLOP”
– Goals: bulk transfer, arithmetic intensity
– Intensive static analysis, custom chips (mostly)
– Bounded space, data access, execution time
As a Compute Evolution (2)
GRAMPS: “interesting apps are irregular”
– Goals: dynamic, data-dependent code
– Aggregate work at run-time
– Heterogeneous commodity platforms
Naturally allows streaming when applicable
GRAMPS’ Role
A ‘graphics pipeline’ is now an app! GRAMPS models parallel state machines.
Compared to status quo:
– More flexible than a GPU pipeline
– More guidance than bare metal
– Portability in between
– Not domain specific
GRAMPS Interfaces
Host/Setup: Create execution graph
Thread: Stateful, singleton
Shader: Data-parallel, auto-instanced
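A minimal sketch of the distinction between the two stage interfaces, with all names invented for illustration: a thread stage is a stateful singleton (one long-lived instance owns private state), while a shader stage is a pure per-element kernel the runtime is free to auto-instance widely.

```python
class ThreadStage:
    """Thread interface: stateful, singleton."""
    def __init__(self):
        self.count = 0  # private state persists across packets

    def run(self, packet):
        self.count += 1
        return packet * 2

def shader_stage(element):
    """Shader interface: stateless, data-parallel, auto-instanced."""
    return element + 1

# Host/Setup role: "create" a two-stage graph by composing the stages.
stage_a = ThreadStage()
out = [shader_stage(stage_a.run(p)) for p in [1, 2, 3]]
print(out, stage_a.count)  # [3, 5, 7] 3
```

Because `shader_stage` carries no state, the runtime could run one instance per input element; `ThreadStage` must remain a single instance, which is why its work is serialized.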
GRAMPS Entities (1)
Accessed via windows
Queues: Connect stages, dynamically sized
– Ordered or unordered
– Fixed max capacity or spill to memory
Buffers: Random access, pre-allocated
– RO, RW Private, RW Shared (not supported)
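One way to read "fixed max capacity or spill to memory" is a queue that keeps a small bounded in-core window and overflows everything else to backing memory. The sketch below is an invented illustration of that behavior, not the GRAMPS implementation.

```python
from collections import deque

class SpillQueue:
    """Bounded in-core window; overflow spills to a backing store."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.window = deque()  # bounded, "in-core" portion
        self.spill = deque()   # unbounded backing memory

    def push(self, item):
        if len(self.window) < self.capacity:
            self.window.append(item)
        else:
            self.spill.append(item)

    def pop(self):
        item = self.window.popleft()
        if self.spill:  # refill the window from the spill area
            self.window.append(self.spill.popleft())
        return item

q = SpillQueue(capacity=2)
for i in range(5):
    q.push(i)
print(len(q.window), len(q.spill))      # 2 3
print([q.pop() for _ in range(5)])      # [0, 1, 2, 3, 4]
```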
GRAMPS Entities (2)
Queue Sets: Independent sub-queues
– Instanced parallelism plus mutual exclusion
– Hard to fake with just multiple queues
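A sketch of the queue-set idea, with structure inferred from the slide and names invented: one logical queue is split into independent sub-queues (say, one per screen-space bucket), so different buckets can be consumed by different stage instances in parallel, while each bucket is drained by at most one consumer at a time (the mutual exclusion part).

```python
from collections import defaultdict, deque

class QueueSet:
    """One logical queue partitioned into independent sub-queues."""
    def __init__(self, num_buckets):
        self.num_buckets = num_buckets
        self.subqueues = defaultdict(deque)

    def push(self, key, item):
        # Route each item to a sub-queue, e.g. by screen-space bucket.
        self.subqueues[key % self.num_buckets].append(item)

    def reserve(self, bucket):
        # A real runtime would hand each bucket to exactly one consumer
        # instance; here we just drain it sequentially.
        q = self.subqueues[bucket]
        items = list(q)
        q.clear()
        return items

qs = QueueSet(num_buckets=4)
for frag in range(10):
    qs.push(frag, ("frag", frag))
print([len(qs.reserve(b)) for b in range(4)])  # [3, 3, 2, 2]
```

With plain multiple queues the producer would have to pick a queue itself and the runtime would lose the guarantee that items for one bucket never run concurrently, which is what makes queue sets hard to fake.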
What We’ve Built (System)
GRAMPS Scheduler
Tiered scheduler:
– ‘Fat’ cores: per-thread, per-core
– ‘Micro’ cores: shared hw scheduler
– Top level: tier N
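A toy illustration of the tiered split, with the dispatch policy invented for this sketch: stateful thread-stage work is pinned per-core on the ‘fat’ cores, while data-parallel shader instances all land in one pool that the ‘micro’ cores’ shared scheduler drains.

```python
from collections import deque

fat_core_runqueues = {0: deque(), 1: deque()}  # per-core, for thread stages
micro_shared_pool = deque()                    # shared, for shader instances

def dispatch(task):
    kind, stage_id = task
    if kind == "thread":
        # Pin each thread stage to a fat core (round-robin by id here).
        fat_core_runqueues[stage_id % 2].append(task)
    else:
        # Shader work is fungible: any micro core may pick it up.
        micro_shared_pool.append(task)

for t in [("thread", 0), ("shader", 0), ("shader", 1),
          ("thread", 1), ("shader", 2)]:
    dispatch(t)

print(len(fat_core_runqueues[0]), len(fat_core_runqueues[1]),
      len(micro_shared_pool))  # 1 1 3
```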
What We’ve Built (Apps)

[Figure: Direct3D Pipeline (with ray-tracing extension): IA 1 … IA N → VS 1 … VS N → RO → Rast → PS → OM, connected by Input Vertex Queues 1…N, Primitive Queues 1…N, a Sample Queue Set, and a Fragment Queue; the ray-tracing extension adds Trace and PS2 stages with a Ray Queue and a Ray Hit Queue. Ray-tracing Graph: Camera/Sampler → Intersect → Shade → FB Blend, with a Tiler feeding the Sampler via a Tile Queue and a Sample Queue; queues include a Ray Queue, a Ray Hit Queue, and a Fragment Queue. Legend: Thread Stage, Shader Stage, Fixed-func, Queue, Stage Output, Push Output.]
Initial Results
Queues are small, utilization is good
GRAMPS Visualization
GRAMPS Portability
Portability really means performance.
Less portable than GL/D3D
– GRAMPS graph is (more) hardware sensitive
More portable than bare metal
– Enforces modularity
– Best case, just works
– Worst case, saves boilerplate
High-level Challenges
Is GRAMPS a suitable GPU evolution?
– Enable pipelines competitive with bare metal?
– Enable innovation: advanced / alternative methods?
Is GRAMPS a good parallel compute model?
– Map well to hardware, hardware trends?
– Support important apps?
– Concepts influence developers?
What’s Next: Implementation
Better scheduling
– Less bursty, better slot filling
– Dynamic priorities
– Handle graphs with loops better
More detailed costs
– Bill for scheduling decisions
– Bill for (internal) synchronization
More statistics
What’s Next: Programming Model
Yes: Graph modification (state change)
Probably: Data sharing / ref-counting
Maybe: Blocking inter-stage calls (join)
Maybe: Intra/inter-stage synchronization primitives
What’s Next: Possible Workloads
REYES, hybrid graphics pipelines
Image / video processing
Game physics
– Collision detection or particles
Physics and scientific simulation
AI, finance, sort, search or database query, …
Heavy dynamic data manipulation
– k-D tree / octree / BVH build
– Lazy/adaptive/procedural tree or geometry