![Page 1: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/1.jpg)
On Dynamic Load Balancing on Graphics Processors
Daniel Cederman and Philippas Tsigas
Chalmers University of Technology
![Page 2: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/2.jpg)
Overview
• Motivation
• Methods
• Experimental evaluation
• Conclusion
![Page 3: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/3.jpg)
The problem setting
• Work is divided into tasks
• Load balancing can be done offline or online
![Page 4: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/4.jpg)
Static Load Balancing
• Four processors, four tasks: each processor is assigned one task
• The tasks create subtasks during execution
![Page 9: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/9.jpg)
Dynamic Load Balancing
• Four processors, four tasks
• Subtasks created during execution are redistributed among the processors
![Page 10: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/10.jpg)
Task sharing
• Check condition: work done? If yes: done
• Acquire task: try to get a task from the task set
• Got task? No: retry. Yes: perform task
• New tasks? Yes: add task to the task set. No: continue
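A minimal single-worker sketch of the loop on this slide, written in host C++ rather than CUDA. The task type (summing an integer range, where large ranges are split into two subtasks) and the names `Task` and `run_task_loop` are assumptions made for illustration. With several workers the task set would need synchronization, and a worker that finds the set empty would retry while other workers may still add tasks.

```cpp
#include <cassert>
#include <vector>

// Hypothetical task: sum the integers in the half-open range [lo, hi).
struct Task { int lo, hi; };

// Runs the task-sharing loop on one worker and returns the total sum.
long long run_task_loop(int lo, int hi, int threshold) {
    assert(threshold >= 1);
    std::vector<Task> task_set;                 // the "Task Set"
    if (lo < hi) task_set.push_back({lo, hi});
    long long total = 0;
    while (!task_set.empty()) {                 // "Work done?"
        Task t = task_set.back();               // "Try to get task"
        task_set.pop_back();
        if (t.hi - t.lo > threshold) {          // "New tasks?" yes: split
            int mid = t.lo + (t.hi - t.lo) / 2;
            task_set.push_back({t.lo, mid});    // "Add task"
            task_set.push_back({mid, t.hi});
        } else {                                // "Perform task"
            for (int i = t.lo; i < t.hi; ++i) total += i;
        }
    }
    return total;                               // "Done"
}
```

For example, `run_task_loop(0, 100, 8)` keeps splitting the range until every piece holds at most 8 integers and then sums the pieces.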
![Page 11: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/11.jpg)
System Model
• CUDA
• Global Memory
• Gather and scatter
• Compare-And-Swap
• Fetch-And-Inc
• Multiprocessors
• Maximum number of concurrent thread blocks
[Diagram: three multiprocessors, each running three thread blocks, all connected to global memory]
![Page 12: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/12.jpg)
Synchronization
• Blocking
• Uses mutual exclusion to only allow one process at a time to access the object.
• Lock-free
• Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.
• Wait-free
• Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.
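The difference between the last two definitions can be illustrated with the two primitives listed in the system model. This is a sketch in host C++, with `std::atomic` standing in for the GPU atomics; the function names `lock_free_max` and `next_ticket` are assumptions made for illustration.

```cpp
#include <atomic>
#include <cassert>

// Lock-free: a Compare-And-Swap retry loop. A CAS only fails when another
// thread changed `value` in between, so some operation always completes,
// but this particular caller may have to retry an unbounded number of times.
void lock_free_max(std::atomic<int>& value, int candidate) {
    int seen = value.load();
    while (candidate > seen &&
           !value.compare_exchange_strong(seen, candidate)) {
        // A failed CAS stored the current value in `seen`; re-check it.
    }
}

// Wait-free (assuming the hardware provides Fetch-And-Inc natively):
// every caller finishes in a bounded number of steps and gets a unique value.
int next_ticket(std::atomic<int>& counter) {
    return counter.fetch_add(1);
}
```

A blocking version would instead wrap the read-compare-write sequence in a lock, so that a thread delayed while holding the lock stalls every other thread.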
![Page 13: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/13.jpg)
Load Balancing Methods
• Blocking Task Queue
• Non-blocking Task Queue
• Task Stealing
• Static Task List
![Page 14: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/14.jpg)
Blocking queue
• Thread blocks TB 1, TB 2, ..., TB n share one queue
• Queue state: lock (Free), Head, Tail
• Animation (slides 14 to 18): task T1 is added to the queue
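A minimal sketch of such a lock-protected queue, in host C++. The ring-buffer layout, the `int` task type, and the name `BlockingTaskQueue` are assumptions for illustration; the lock is a flag acquired with Compare-And-Swap, matching the "Free" state shown on the slide.

```cpp
#include <atomic>
#include <cassert>
#include <vector>

// Sketch of a task queue over a fixed-size ring buffer, guarded by one lock.
class BlockingTaskQueue {
public:
    explicit BlockingTaskQueue(int capacity)
        : slots_(capacity), head_(0), tail_(0), count_(0), lock_(0) {
        assert(capacity > 0);
    }

    // Returns false if the queue is full.
    bool enqueue(int task) {
        acquire();
        bool ok = count_ < static_cast<int>(slots_.size());
        if (ok) {
            slots_[tail_] = task;
            tail_ = (tail_ + 1) % static_cast<int>(slots_.size());
            ++count_;
        }
        release();
        return ok;
    }

    // Returns false if the queue is empty.
    bool dequeue(int& task) {
        acquire();
        bool ok = count_ > 0;
        if (ok) {
            task = slots_[head_];
            head_ = (head_ + 1) % static_cast<int>(slots_.size());
            --count_;
        }
        release();
        return ok;
    }

private:
    // Spin until this caller is the one that changes the flag from 0 (free) to 1.
    void acquire() {
        int expected = 0;
        while (!lock_.compare_exchange_strong(expected, 1)) expected = 0;
    }
    void release() { lock_.store(0); }

    std::vector<int> slots_;
    int head_, tail_, count_;
    std::atomic<int> lock_;
};
```

Every enqueue and dequeue passes through the same lock, so all thread blocks serialize on it regardless of which end of the queue they use.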
![Page 19: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/19.jpg)
Non-blocking Queue
• Thread blocks TB 1, TB 2, ..., TB n access the queue concurrently through Head and Tail
• Animation (slides 19 to 24): the queue holds T1 to T4, then T5 is added
• Reference: P. Tsigas and Y. Zhang, A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems [SPAA01]
![Page 25: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/25.jpg)
Task stealing
• Each thread block (TB 1, TB 2, ..., TB n) has its own task queue
• Animation (slides 25 to 31): one queue grows from T1 to T1 T4 T5 and is then emptied again; after that, T3 is taken from the other queue (T3 T2), leaving T2
• Reference: Arora N. S., Blumofe R. D., Plaxton C. G., Thread Scheduling for Multiprogrammed Multiprocessors [SPAA 98]
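A sketch of a bounded work-stealing deque in the style of the Arora, Blumofe and Plaxton reference, in host C++ with sequentially consistent atomics. The `int` task type, the fixed capacity, and the names `StealDeque`, `push_bottom`, `pop_bottom` and `steal` are assumptions for illustration and are not taken from the slides.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// The owner pushes and pops at the bottom; other thread blocks steal from the top.
class StealDeque {
public:
    explicit StealDeque(int capacity) : slots_(capacity), bottom_(0), age_(0) {
        assert(capacity > 0);
    }

    // Owner only. Returns false if the array is full.
    bool push_bottom(int task) {
        std::uint32_t b = bottom_.load();
        if (b == slots_.size()) return false;
        slots_[b].store(task);
        bottom_.store(b + 1);
        return true;
    }

    // Owner only. Returns false if the deque is empty or a thief took the last task.
    bool pop_bottom(int& task) {
        std::uint32_t b = bottom_.load();
        if (b == 0) return false;
        --b;
        bottom_.store(b);
        int candidate = slots_[b].load();
        std::uint64_t old_age = age_.load();
        if (b > top_of(old_age)) { task = candidate; return true; }
        // At most one task was left: reset the deque and bump the tag so that
        // a thief holding a stale view of `age_` fails its CAS.
        bottom_.store(0);
        std::uint64_t new_age = pack(tag_of(old_age) + 1, 0);
        if (b == top_of(old_age) &&
            age_.compare_exchange_strong(old_age, new_age)) {
            task = candidate;
            return true;
        }
        age_.store(new_age);  // a thief got there first
        return false;
    }

    // Any thread. Returns false if the deque is empty or the steal lost a race.
    bool steal(int& task) {
        std::uint64_t old_age = age_.load();
        std::uint32_t b = bottom_.load();
        std::uint32_t top = top_of(old_age);
        if (b <= top) return false;
        int candidate = slots_[top].load();
        std::uint64_t new_age = pack(tag_of(old_age), top + 1);
        if (!age_.compare_exchange_strong(old_age, new_age)) return false;
        task = candidate;
        return true;
    }

private:
    // `age_` packs a tag (high 32 bits) and the top index (low 32 bits).
    static std::uint32_t top_of(std::uint64_t age) { return static_cast<std::uint32_t>(age); }
    static std::uint32_t tag_of(std::uint64_t age) { return static_cast<std::uint32_t>(age >> 32); }
    static std::uint64_t pack(std::uint32_t tag, std::uint32_t top) {
        return (static_cast<std::uint64_t>(tag) << 32) | top;
    }

    std::vector<std::atomic<int>> slots_;
    std::atomic<std::uint32_t> bottom_;
    std::atomic<std::uint64_t> age_;
};
```

The owner only needs a Compare-And-Swap when it takes the last remaining task; in the common case its push and pop are plain loads and stores, which is what keeps contention low as long as each thread block has work of its own.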
![Page 32: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/32.jpg)
Static Task List
• In: T1, T2, T3, T4, one task per thread block (TB 1 to TB 4)
• Out: new tasks T5, T6, T7 are appended as they are created
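One way to read these slides, sketched in host C++: the In list is only read during a pass, each new task reserves a slot in the Out list with Fetch-And-Inc, and Out becomes the In list of the next pass. The task type (a tree node that creates two children until depth 0), the sizing of `out`, and the names `StaticTaskList` and `run_static_list` are assumptions for illustration. The inner loop runs sequentially here; on the GPU each index would be handled by its own thread block.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

struct StaticTaskList {
    std::vector<int> in;         // read-only during a pass
    std::vector<int> out;        // new tasks are written here
    std::atomic<int> out_count;  // next free slot in `out`

    StaticTaskList() : out_count(0) {}

    // Fetch-And-Inc hands every caller a distinct slot, so concurrent
    // adders never write to the same position.
    void add_task(int task) {
        int slot = out_count.fetch_add(1);
        assert(slot < static_cast<int>(out.size()));
        out[slot] = task;
    }
};

// Hypothetical workload: a task is a tree node at some depth; a node with
// depth > 0 creates two children of depth - 1. Returns the tasks processed.
int run_static_list(int root_depth) {
    StaticTaskList list;
    list.in.push_back(root_depth);
    int processed = 0;
    while (!list.in.empty()) {
        list.out.assign(2 * list.in.size(), 0);  // each task adds at most two
        list.out_count.store(0);
        for (std::size_t i = 0; i < list.in.size(); ++i) {
            int depth = list.in[i];
            ++processed;
            if (depth > 0) {
                list.add_task(depth - 1);
                list.add_task(depth - 1);
            }
        }
        list.out.resize(list.out_count.load());
        list.in.swap(list.out);                  // Out becomes the next In
    }
    return processed;
}
```

Because nothing is ever removed concurrently, the only synchronization needed is the single atomic counter; the cost is one pass (one kernel launch, in the GPU setting) per generation of tasks.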
![Page 38: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/38.jpg)
Octree Partitioning
• Bandwidth bound
![Page 42: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/42.jpg)
Four-in-a-row
• Computation intensive
![Page 43: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/43.jpg)
Graphics Processors
8800GT
• 14 Multiprocessors
• 57 GB/sec bandwidth
9600GT
• 8 Multiprocessors
• 57 GB/sec bandwidth
![Page 44: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/44.jpg)
Blocking Queue – Octree/9600GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128)]
![Page 45: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/45.jpg)
Blocking Queue – Octree/8800GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128)]
![Page 46: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/46.jpg)
Blocking Queue – Four-in-a-row
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128)]
![Page 47: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/47.jpg)
Non-blocking Queue – Octree/9600GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128)]
![Page 48: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/48.jpg)
Non-blocking Queue – Octree/8800GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128)]
![Page 49: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/49.jpg)
Non-blocking Queue – Four-in-a-row
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128)]
![Page 50: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/50.jpg)
Task stealing – Octree/9600GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128)]
![Page 51: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/51.jpg)
Task stealing – Octree/8800GT
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128)]
![Page 52: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/52.jpg)
Task stealing – Four-in-a-row
[Figure: Time (ms) plotted against Threads (16 to 128) and Blocks (16 to 128)]
![Page 53: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/53.jpg)
Static List
[Figure: Time (ms) plotted against Threads/Block (8 to 128); series: Octree 9600GT, Octree 8800GTS, Four-in-a-row]
![Page 54: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/54.jpg)
Octree Comparison
[Figure: Time (ms) plotted against Particles (thousands, 100 to 500); series: Blocking Queue, Non-Blocking Queue, Static List, Work Stealing]
![Page 55: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/55.jpg)
Previous work
• Korch M., Rauber T., A comparison of task pools for dynamic load balancing of irregular algorithms, Concurrency and Computation: Practice & Experience, 16, 2003
• Heirich A., Arvo J., A competitive analysis of load balancing strategies for parallel ray tracing, Journal of Supercomputing, 12, 1998
• Foley T., Sugerman J., KD-tree acceleration structures for a GPU raytracer, Graphics Hardware 2005
![Page 56: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/56.jpg)
Conclusion
• Synchronization plays a significant role in dynamic load-balancing
• Lock-free data structures and synchronization scale well and look promising for general-purpose GPU programming as well
• Locks perform poorly
• It is good that operations such as CAS and FAA have been introduced in newer GPUs
• Work stealing could outperform static load balancing
![Page 57: Dynamic Load-balancing On Graphics Processors](https://reader035.fdocuments.in/reader035/viewer/2022081413/545799c7b1af9fba5d8b49cd/html5/thumbnails/57.jpg)
Thank you!
http://www.cs.chalmers.se/~dcs