Task Management for Irregular-Parallel Workloads on the GPU
Stanley Tzeng, Anjul Patney, and John D. Owens University of California, Davis
Introduction – Parallelism in Graphics Hardware

[Figure: two pipelines, each Input Queue → Process → Output Queue. Regular workloads (vertex processor, fragment processor, etc.) show a good matching of input size to output size; for irregular workloads (geometry processor, recursion), the output is hard to estimate given the input.]
Motivation – Programmable Pipelines

• Increased programmability on GPUs allows different programmable pipelines on the GPU.
• We want to explore how pipelines can be efficiently mapped onto the GPU:
  • What if your pipeline has irregular stages?
  • How should data between pipeline stages be stored?
  • What about load balancing across all parallel units?
  • What if your pipeline is geared more towards task parallelism than data parallelism?
• Our paper addresses these issues!
In Other Words…

• Imagine that these pipeline stages were actually bricks.
• Then we are providing the mortar between the bricks.
[Figure: a brick wall; the bricks are the pipeline stages, the mortar is us.]
Related Work

• Alternative pipelines on the GPU:
  • Renderants [Zhou et al. 2009]
  • Freepipe [Liu et al. 2009]
  • OptiX [NVIDIA 2010]
• Distributed queuing on the GPU:
  • GPU dynamic load balancing [Cederman et al. 2008]
  • Multi-CPU work
• Reyes on the GPU:
  • Subdivision [Patney et al. 2008]
  • DiagSplit [Fisher et al. 2009]
  • Micropolygon rasterization [Fatahalian et al. 2009]
Ingredients for Mortar

Questions that we need to address:
• What is the proper granularity for tasks? → Warp-size work granularity
• How many threads to launch? → Persistent threads
• How to avoid global synchronizations? → Uberkernels
• How to distribute tasks evenly? → Task donation
Warp Size Work Granularity

• Problem: we want to emulate task-level parallelism on the GPU without loss in efficiency.
• Solution: we choose block sizes of 32 threads / block.
  • Removes messy synchronization barriers.
  • We can view each block as a MIMD thread. We call these blocks processors.
[Figure: several 32-thread processors P mapped onto each physical processor.]
Uberkernel Processor Utilization

• Problem: we want to eliminate global kernel barriers for better processor utilization.
• Uberkernels pack multiple execution routes into one kernel.
[Figure: two separate kernels (Pipeline Stage 1 → Pipeline Stage 2, with data flowing between them) become one uberkernel containing Stage 1 and Stage 2, with the data flow kept inside it.]
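The uberkernel idea can be sketched on the CPU as a single dispatch loop over tagged work units; `Stage`, `WorkUnit`, and the split/dice routines here are illustrative stand-ins, not the paper's implementation:

```cpp
#include <queue>

// Hypothetical work unit: a tag selects which pipeline stage handles it.
enum Stage { SPLIT, DICE };
struct WorkUnit { Stage stage; int size; };

// CPU sketch of an uberkernel: one entry point branches to the
// per-stage routines, so no global kernel barrier separates stages.
void uberkernel(std::queue<WorkUnit>& q) {
    while (!q.empty()) {
        WorkUnit w = q.front(); q.pop();
        switch (w.stage) {
        case SPLIT:
            // A "split" enqueues smaller work units for a later pass.
            if (w.size > 1) {
                q.push({SPLIT, w.size / 2});
                q.push({DICE,  w.size / 2});
            }
            break;
        case DICE:
            // Terminal stage: consume the work unit.
            break;
        }
    }
}
```

Because both stages live in one kernel, a work unit produced by `SPLIT` can be consumed in the same launch, rather than waiting for a second kernel invocation.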
Persistent Thread Scheduler Emulation

• Problem: if the input is irregular, how many threads do we launch?
• Solution: launch enough to fill the GPU, and keep them alive so they keep fetching work.

Life of a thread: Spawn → Fetch Data → Process Data → Write Output → Death
Life of a persistent thread: Spawn → (Fetch Data → Process Data → Write Output, repeated) → Death

How do we know when to stop? When there is no more work left.
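A minimal CPU analogue of the persistent-thread loop: a fixed pool of workers (standing in for "enough to fill the GPU") repeatedly fetches work from a shared cursor until nothing is left. The names and the trivial "process" step are illustrative, not the paper's code:

```cpp
#include <atomic>
#include <thread>
#include <vector>

std::atomic<int> next_item{0};   // shared fetch cursor
std::atomic<long> total{0};      // accumulated "output"

// A persistent worker: spawn once, then loop fetch -> process -> write
// until the work pool is exhausted, at which point the thread "dies".
void persistent_worker(const std::vector<int>& work) {
    for (;;) {
        int i = next_item.fetch_add(1);      // fetch the next work unit
        if (i >= (int)work.size()) return;   // no work left: retire
        total.fetch_add(work[i]);            // "process" it (here: sum)
    }
}

long run(const std::vector<int>& work, int num_workers) {
    std::vector<std::thread> pool;
    for (int p = 0; p < num_workers; ++p)
        pool.emplace_back([&] { persistent_worker(work); });
    for (auto& t : pool) t.join();
    return total.load();
}
```

The worker count is fixed up front regardless of how much work the input generates, which is exactly the property the slide is after for irregular inputs.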
Memory Management System

• Problem: we need to ensure that our processors are constantly working, not idle.
• Solution: design a software memory management system.
• How each processor fetches work is based on our queuing strategy.
• We look at 4 strategies:
  • Block queues
  • Distributed queues
  • Task stealing
  • Task donation
A Word About Locks

• To obtain exclusive access to a queue, each queue has a lock.
• Our current implementation uses spin locks, which are very slow on GPUs.
• We want to use as few locks as possible.

while (atomicCAS(lock, 0, 1) == 1);
Block Queuing

• One deque shared by all processors: read from one end, write back to the other.
• Until the last element, fetching work from the queue requires no lock.
• Writing back to the queue is serial: each processor obtains the lock before writing.
[Figure: processors P1–P4 reading from and writing to a single shared queue.]
Pro: excellent load balancing
Con: horrible lock contention
Distributed Queuing

• Each processor has its own deque (called a bin) that it reads from and writes to.
[Figure: processors P1–P4 each with a private bin; a processor that has finished all its work can only idle.]
Pro: eliminates locks
Con: idle processors
Task Stealing

• Same as the distributed queuing scheme, but now processors can steal work from another bin.
[Figure: a processor that has finished all its work steals from a neighboring processor's bin.]
Pro: very low idle time
Con: big bin sizes
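The stealing policy can be sketched as a deterministic, single-threaded simulation (the real scheme is concurrent; the round-robin order and first-non-empty victim choice here are illustrative):

```cpp
#include <deque>
#include <vector>

// Each processor pops from its own bin; when its bin is empty it
// steals from the first non-empty bin instead of idling. Returns the
// total number of work units processed.
int run_with_stealing(std::vector<std::deque<int>> bins) {
    int processed = 0;
    for (bool work_left = true; work_left; ) {
        for (std::size_t p = 0; p < bins.size(); ++p) {
            if (!bins[p].empty()) {
                bins[p].pop_front();          // work from own bin
                ++processed;
            } else {
                // Steal one unit from some other processor's bin,
                // taking from the opposite end the owner uses.
                for (std::size_t v = 0; v < bins.size(); ++v)
                    if (!bins[v].empty()) {
                        bins[v].pop_back();
                        ++processed;
                        break;
                    }
            }
        }
        work_left = false;
        for (auto& b : bins) if (!b.empty()) work_left = true;
    }
    return processed;
}
```

Even with all the work loaded into one bin, every processor stays busy until the last unit is gone, which is the "very low idle" behavior the slide claims.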
Task Donation

• When a bin is full, its processor can give work to someone else.
[Figure: a processor whose bin is full donates its work to someone else's bin.]
Pro: smaller memory usage
Con: more complicated
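A sketch of the donation side of the policy; the fixed bin capacity and the least-loaded victim choice are assumptions for illustration, not the paper's exact rule:

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// A processor emitting new work writes to its own bin while there is
// room; once its bin is full, it donates the work to whichever other
// bin currently has the most free space, so no bin needs to be sized
// for the worst case.
void emit_work(std::vector<std::deque<int>>& bins, std::size_t self,
               int item, std::size_t capacity) {
    if (bins[self].size() < capacity) {
        bins[self].push_back(item);          // normal case: own bin
        return;
    }
    std::size_t victim = self;               // own bin full: pick a victim
    for (std::size_t p = 0; p < bins.size(); ++p)
        if (p != self &&
            (victim == self || bins[p].size() < bins[victim].size()))
            victim = p;
    bins[victim].push_back(item);            // donate the work unit
}
```

Compared with stealing, donation bounds each bin's size at the cost of extra coordination on the write path, matching the pro/con above.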
Evaluating the Queues

• Main measure to compare: how many iterations a processor is idle, due either to lock contention or to waiting for other processors to finish.
• We use a synthetic work generator to precisely control the conditions.
Average Idle Iterations Per Processor

[Chart: idle iterations, split into lock contention and idle waiting, for the block queue, distributed queue, task stealing, and task donation; the axis runs from 200 to 25000. Task stealing and task donation show about the same performance.]
How it All Fits Together

[Figure: the thread life cycle Spawn → Fetch Data → Process Data → Write Output → Death, with Fetch Data and Write Output going through global memory.]
Our Version

[Figure: the life cycle Spawn → Fetch Data → Write Output → Death, driven by persistent threads, with uberkernels processing each work unit and the task donation memory system backing Fetch Data and Write Output. The annotations below follow the animation.]

• We spawn 32 threads per thread group / block in our grid. These are our processors.
• Each processor grabs work to process.
• The uberkernel decides how to process the current work unit.
• Once work is processed, thread execution returns to fetching more work.
• When there is no work left in the queue, the threads retire.
APPLICATION: REYES
Pipeline Overview

Scene → Subdivision / Tessellation → Shading → Rasterization / Sampling → Composition and Filtering → Image

• Subdivision / Tessellation: start with smooth surfaces, obtain micropolygons (irregular input and output).
• Shading: shade the micropolygons (regular input and output).
• Rasterization / Sampling: map micropolygons to screen space (regular input, irregular output).
• Composition and Filtering: reconstruct pixels from the obtained samples (irregular input, regular output).
Split and Dice

• We combine the patch split and dice stages into one kernel.
• Bins are loaded with the initial patches.
• One processor works on one patch at a time. A processor can write split patches back into the bins.
• Output is a buffer of micropolygons.
Split and Dice

• 32 threads on 16 CPs: 16 threads each work in u and v.
• Calculate the u and v thresholds, then take the uberkernel branch decision:
  • Branch 1 splits the patch again.
  • Branch 2 dices the patch into micropolygons.
[Figure: a patch from Bin 0 is either split and written back into Bin 0, or diced into the µpoly buffer.]
Sampling

• Stamp out samples for each micropolygon; one processor per micropolygon patch.
• Since only the output is irregular, use a block queue.
• Write out to a sample buffer.
[Figure: processors P1–P4 read and stamp micropolygons from the queue, reserve space in the sample buffer in global memory with an atomic add on a counter (indices 0 to 15000), then write out.]

Q: Why are you using a block queue? Didn't you just show block queues have high contention when processors write back into it?
A: True. However, we're not writing into this queue! We're just reading from it, so there is no writeback idle time.
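The write-out step above can be sketched as follows, assuming the counter works like CUDA's `atomicAdd`; the per-micropolygon sample counts and names are made up for illustration:

```cpp
#include <atomic>
#include <cstddef>
#include <vector>

std::atomic<int> write_cursor{0};   // shared counter into the sample buffer

// Each worker stamps a variable number of samples per micropolygon in
// its range [lo, hi): it reserves n contiguous slots in the shared
// sample buffer with one atomic add, then fills them. The atomic
// reservation is what lets many workers write irregular-sized output
// into one buffer without a lock.
void stamp(const std::vector<int>& counts, std::size_t lo, std::size_t hi,
           std::vector<int>& samples) {
    for (std::size_t m = lo; m < hi; ++m) {
        int n = counts[m];                    // samples this µpoly produces
        int base = write_cursor.fetch_add(n); // reserve n contiguous slots
        for (int s = 0; s < n; ++s)
            samples[base + s] = (int)m;       // "stamp" a sample
    }
}
```

In a real run, several workers would call `stamp` concurrently on disjoint ranges; the `fetch_add` keeps their reservations non-overlapping even though each micropolygon emits a different number of samples.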
Smooth Surfaces, High Detail

16 samples per pixel, >15 frames per second on a GeForce GTX 280.
What's Next

• What other (and better) abstractions are there for programmable pipelines?
• How is future GPU design going to affect software schedulers?
• For Reyes: what is the right model for real-time micropolygon shading on the GPU?
Acknowledgments

• Matt Pharr, Aaron Lefohn, and Mike Houston
• Jonathan Ragan-Kelley
• Shubho Sengupta
• Bay Raitt and Headus Inc. for models
• National Science Foundation
• SciDAC
• NVIDIA Graduate Fellowship
Thank You