Department of Computer Science
Beyond CUDA/GPUs and Future Graphics Architectures
Karu Sankaralingam, University of Wisconsin-Madison
Adapted from "Toward A Multicore Architecture for Real-time Raytracing," MICRO-41, 2008, by Venkatraman Govindaraju, Peter Djeu, Karthikeyan Sankaralingam, Mary Vernon, and William R. Mark.
Real-time Graphics Rendering
Today
Real-time Graphics Rendering
Today and Future
Real-time Graphics Rendering
What are the problems? How can we get there?
What is wrong with this picture?
GPU/CUDA
Z-buffer
"Ptolemaic" Graphics Universe
(diagram: Architecture and Application revolve around the Z-buffer)
– Architecture and application all optimized for the Z-buffer
– Difficult to render images with realistic effects
  – self-reflection, soft shadows, ambient occlusion
– Problems:
  – scene constraints; artist and programmer productivity
Current Graphics Architectures
Courtesy: ACM Queue
How did we get here?
– Hardware rasterizers and perspective-correct texture mapping (RIVA 128)
– Single-pass multitexture (TNT / TNT2)
– Register combiners: a generalization of multitexture (GeForce 256)
– Per-pixel shading (GeForce 2 GTS)
– Programmable hardware pixel shading
– Programmable vertex shading
– CUDA
"Copernican" Graphics Universe
(diagram: Architecture and Application revolve around the Algorithm: Ray-tracing)
– Architecture and application revolve around the algorithm
– More general-purpose algorithm
– Easier to provide realistic effects
– Architecture can support other applications
Future Graphics Architectures
Courtesy: ACM Queue
Executive Summary: Copernicus System
– Co-designed application, architecture, and analysis framework
– A path from a specialized graphics architecture to a more general-purpose architecture
– A detailed characterization and analysis framework
– Real-time frame rates possible for high-quality dynamic scenes
Outline
Motivation
Copernicus system
– Graphics Algorithm: Razor
– Architecture
– Evaluation and Results
Summary
Ray-tracing
(figure: full scene with a cube and a cylinder)
Simulating the behavior of light rays through a 3D scene.
– Rays from the eye into the scene (primary rays)
– Rays from the hit point to the light (secondary rays)
– Acceleration structure (e.g., a BSP tree) for efficiency
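The primary/secondary-ray split above can be sketched in a few lines. This is a hypothetical minimal tracer over sphere primitives with no acceleration structure at all, just to make the two ray kinds concrete; it is not Razor.

```python
import math

def intersect_sphere(origin, direction, center, radius):
    """Return the nearest positive distance t to the sphere, or None.
    Assumes `direction` is unit length."""
    oc = tuple(o - c for o, c in zip(origin, center))
    b = 2.0 * sum(d * x for d, x in zip(direction, oc))
    c = sum(x * x for x in oc) - radius * radius
    disc = b * b - 4.0 * c
    if disc < 0:
        return None
    for t in ((-b - math.sqrt(disc)) / 2.0, (-b + math.sqrt(disc)) / 2.0):
        if t > 1e-6:          # ignore hits at/behind the ray origin
            return t
    return None

def trace(origin, direction, spheres, light):
    """Primary ray from the eye; secondary (shadow) ray from the hit point."""
    hits = [(t, s) for s in spheres
            if (t := intersect_sphere(origin, direction, *s)) is not None]
    if not hits:
        return "miss"
    t, _ = min(hits)
    hit = tuple(o + t * d for o, d in zip(origin, direction))
    to_light = tuple(l - h for l, h in zip(light, hit))
    dist = math.sqrt(sum(x * x for x in to_light))
    shadow_dir = tuple(x / dist for x in to_light)
    blocked = any((ts := intersect_sphere(hit, shadow_dir, *s)) is not None
                  and ts < dist for s in spheres)
    return "shadowed" if blocked else "lit"
```

A real tracer replaces the linear scan over `spheres` with the acceleration-structure traversal the slide mentions; that is where the irregular memory accesses discussed later come from.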
Disadvantages of Raytracing
– The acceleration structure must be rebuilt every frame for dynamic scenes.
– Irregular data accesses while traversing the acceleration structure.
– Secondary-ray computation grows at higher resolutions.
Razor: A Dynamic Multiresolution Raytracer
(figure: cube-and-cylinder scene split between Thread 1 and Thread 2)
– Packet ray-tracer: traces a beam of rays instead of a single ray
  – Opportunity for data-level parallelism
– Each thread lazily builds its own acceleration structure (KD-tree)
  – Builds only the portion of the structure it needs
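A toy sketch of the lazy-build idea, assuming simple median splits over point primitives and made-up depth/leaf limits (Razor's real builder is more sophisticated): a node's children are constructed only when a traversal first visits them, so subtrees a thread never enters are never built.

```python
class KDNode:
    """KD-tree node whose children are built only when first visited."""
    def __init__(self, prims, depth=0):
        self.prims = prims        # primitives in this node's region
        self.depth = depth
        self.children = None      # None = not yet expanded (lazy)

    def expand(self, max_depth=8, leaf_size=4):
        """On first visit, split the primitives at the median of the
        cycling axis; small or deep nodes become leaves."""
        if self.children is not None:
            return
        if self.depth >= max_depth or len(self.prims) <= leaf_size:
            self.children = []    # leaf
            return
        axis = self.depth % 3
        order = sorted(self.prims, key=lambda p: p[axis])
        mid = len(order) // 2
        self.children = [KDNode(order[:mid], self.depth + 1),
                         KDNode(order[mid:], self.depth + 1)]

def descend(node, choose_child):
    """Walk toward a leaf, expanding lazily; count the nodes built."""
    built = 0
    while True:
        if node.children is None:
            node.expand()
            built += 1
        if not node.children:     # reached a leaf
            return node, built
        node = choose_child(node)
```

Descending one path through 16 primitives builds only the three nodes on that path; the sibling subtree's `children` stays `None` until some other packet needs it.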
Razor: A Dynamic Multiresolution Raytracer
– Multi-level resolution to reduce secondary-ray computation
– Replicates the KD-tree to reduce synchronization across threads
  – Hypothesis: duplication across threads will be limited
Razor Implementation
– Linux/x86
  – Implemented Razor on an Intel Clovertown system
  – Parallelized using pthreads
  – Optimized with SSE instructions
– Sustains 1 FPS on this prototype system
– Helps develop the algorithm
– Designed with future hardware in mind
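The SSE optimization exploits the packet structure from the previous slides: four rays can share one traversal decision whenever they agree, which is exactly what a 4-wide SIMD comparison tests in one instruction. A scalar sketch of that packet decision (a hypothetical helper, not Razor's actual code):

```python
def packet_child(points, split_axis, split_value):
    """Decide which KD child a ray packet visits next.  If every ray in
    the packet lies on one side of the split plane, the packet descends
    into a single child, amortizing one traversal step over all rays;
    otherwise both children must be visited."""
    sides = [p[split_axis] < split_value for p in points]
    if all(sides):
        return "left"
    if not any(sides):
        return "right"
    return "both"
```

In the SSE version, `sides` would be the movemask of one packed compare over the packet's coordinates.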
Razor’s Memory Usage
(chart: memory footprint vs. number of threads)
Parallel Scalability
(chart: speedup vs. number of threads, 1–8, for Courtyard, Fairyforest, Forest, Juarez, and Saloon)
Outline
Motivation
Copernicus system
– Graphics Algorithm: Razor
– Architecture
– Evaluation and Results
Summary
Architecture: Core
• In-order core
• Private L1 data and instruction caches
• Supports SIMD instructions
• SMT threads to hide memory latency
Architecture: Tile
• Shared L2 cache
• Shared accelerator for specialized instructions
Architecture: Chip
Architecture: Razor Mapping
(diagram: Razor work assigned to tiles and to cores)
Outline
Motivation
Copernicus system
– Graphics Algorithm: Razor
– Architecture
– Evaluation and Results
Summary
Benchmark Scenes
Courtyard Fairyforest Forest
Juarez Saloon
Evaluation Methodology
– Simulation with Multifacet/GEMS
  – Simulates SSE instructions
  – Simulates a full tile
  – Validated against prototype data
    • Pin-based and PAPI-based performance counters
  – Randomly selected regions of scenes
– Full chip
  – Simulating the full chip is too slow
  – Built a customized analytic model
Analytical Model
– Core level
  – Pipeline stalls
  – Multiple threads
– Tile level
  – L2 contention
– Chip level
  – Main-memory contention
Compared with our simulation results.
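The paper's model is far more detailed than this, but the shape of such a hierarchy can be sketched with invented parameters: SMT utilization at the core level, summed across cores and tiles, capped by a main-memory bandwidth limit at the chip level.

```python
def core_utilization(threads, cpi_compute, stall_cpi):
    """Fraction of cycles a core issues, assuming SMT perfectly overlaps
    one thread's memory stalls with other threads' compute (an
    idealized upper bound, not the paper's exact equations)."""
    demand = threads * cpi_compute
    return min(1.0, demand / (cpi_compute + stall_cpi))

def chip_rays_per_cycle(tiles, cores_per_tile, threads, cpi_compute,
                        stall_cpi, insts_per_ray, mem_cap_ipc):
    """Chip-level throughput in rays/cycle: per-core IPC scaled up,
    capped by a memory-bandwidth limit expressed here as a maximum
    sustainable chip-wide IPC."""
    ipc_core = core_utilization(threads, cpi_compute, stall_cpi) / cpi_compute
    chip_ipc = min(tiles * cores_per_tile * ipc_core, mem_cap_ipc)
    return chip_ipc / insts_per_ray
```

Even this toy version reproduces the qualitative behavior of the results that follow: throughput scales with tiles until the memory cap bites, and enough SMT threads drive core utilization to 1.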
Single Core Performance (Single Issue)
(chart: IPC, 0–2, for Courtyard, Fairyforest, Forest, Juarez, and Saloon; no SMT, 2 SMT, and 4 SMT)
Single Core Performance (Dual Issue)
(chart: IPC, 0–2, for Courtyard, Fairyforest, Forest, Juarez, and Saloon; no SMT, 2 SMT, and 4 SMT)
Single Tile Performance
(chart: IPC, 0–8, for Courtyard, Fairyforest, Forest, Juarez, and Saloon; no SMT, 2 SMT, and 4 SMT)
Full Chip Performance
(chart: million rays per second vs. number of tiles, 0–16, for ideal memory and 1, 2, 3, and 4 DIMMs)
So, Are we there yet?
Results
– Goal: 100 million rays per second
– Achieved: 50 million rays per second
  – With 16 tiles and 4 DIMMs
– Insights:
  – 4-way SMT with single issue is ideal for this workload
  – Good parallel scalability
  – Razor’s physically motivated optimizations work
– Potential for further architectural optimizations:
  – Shared accelerator
  – Wide SIMD bundles
Outline
Motivation
Copernicus system
– Graphics Algorithm: Razor
– Architecture
– Evaluation and Results
Summary
Summary
– A transformation path to ray-tracing
  – From the Ptolemaic to the Copernican graphics universe
– A unique architecture design point
  – Trades data redundancy and re-computation for reduced synchronization
– An evaluation methodology interesting in its own right
  – Prototype, simulation, and analytical framework to design and evaluate future systems
– Future work
  – Instruction specialization and shared accelerator design
  – Tradeoffs with SIMD width and area
  – Memory system
Other Questions?